MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Zseq: An Approach for Preprocessing Next-Generation Sequencing Data

Internal identifier: 000D99 (Pmc/Corpus); previous: 000D98; next: 000E00

Authors: Abedalrhman Alkhateeb; Luis Rueda

Source :

RBID : PMC:5563921

Abstract

Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold.

The Zseq algorithm provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with promising results.


Url:
DOI: 10.1089/cmb.2017.0021
PubMed: 28414515
PubMed Central: 5563921

Links to Exploration step

PMC:5563921

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Zseq: An Approach for Preprocessing Next-Generation Sequencing Data</title>
<author>
<name sortKey="Alkhateeb, Abedalrhman" sort="Alkhateeb, Abedalrhman" uniqKey="Alkhateeb A" first="Abedalrhman" last="Alkhateeb">Abedalrhman Alkhateeb</name>
</author>
<author>
<name sortKey="Rueda, Luis" sort="Rueda, Luis" uniqKey="Rueda L" first="Luis" last="Rueda">Luis Rueda</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28414515</idno>
<idno type="pmc">5563921</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5563921</idno>
<idno type="RBID">PMC:5563921</idno>
<idno type="doi">10.1089/cmb.2017.0021</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000D99</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000D99</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Zseq: An Approach for Preprocessing Next-Generation Sequencing Data</title>
<author>
<name sortKey="Alkhateeb, Abedalrhman" sort="Alkhateeb, Abedalrhman" uniqKey="Alkhateeb A" first="Abedalrhman" last="Alkhateeb">Abedalrhman Alkhateeb</name>
</author>
<author>
<name sortKey="Rueda, Luis" sort="Rueda, Luis" uniqKey="Rueda L" first="Luis" last="Rueda">Luis Rueda</name>
</author>
</analytic>
<series>
<title level="j">Journal of Computational Biology</title>
<idno type="ISSN">1066-5277</idno>
<idno type="eISSN">1557-8666</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<p>
<bold>Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique</bold>
<italic>k</italic>
<bold>-mers in each sequence as its corresponding score and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in</bold>
<italic>k</italic>
<bold>-mers. Based on a</bold>
<italic>z</italic>
<bold>-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold.</bold>
</p>
<p>
<bold>The Zseq algorithm provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as
<italic>de novo</italic>
assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover,
<italic>de novo</italic>
assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with promising results.</bold>
</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, S F" uniqKey="Altschul S">S.F. Altschul</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brown, T" uniqKey="Brown T">T. Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cheadle, C" uniqKey="Cheadle C">C. Cheadle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, Y C" uniqKey="Chen Y">Y.-C. Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cortes, C" uniqKey="Cortes C">C. Cortes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabherr, M G" uniqKey="Grabherr M">M.G. Grabherr</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ishiguro, H" uniqKey="Ishiguro H">H. Ishiguro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kannan, K" uniqKey="Kannan K">K. Kannan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D. Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, J H" uniqKey="Kim J">J.H. Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lavezzo, E" uniqKey="Lavezzo E">E. Lavezzo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, H" uniqKey="Liu H">H. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mackinnon, M J" uniqKey="Mackinnon M">M.J. Mackinnon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Margulies, M" uniqKey="Margulies M">M. Margulies</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morgulis, A" uniqKey="Morgulis A">A. Morgulis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pozzoli, U" uniqKey="Pozzoli U">U. Pozzoli</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quail, M A" uniqKey="Quail M">M.A. Quail</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmieder, R" uniqKey="Schmieder R">R. Schmieder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Traish, A M" uniqKey="Traish A">A.M. Traish</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C. Trapnell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vogel, F" uniqKey="Vogel F">F. Vogel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Waszak, S M" uniqKey="Waszak S">S.M. Waszak</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wuitschick, J" uniqKey="Wuitschick J">J. Wuitschick</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yakovchuk, P" uniqKey="Yakovchuk P">P. Yakovchuk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Z" uniqKey="Zhao Z">Z. Zhao</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">J Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">J. Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">cmb</journal-id>
<journal-title-group>
<journal-title>Journal of Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1066-5277</issn>
<issn pub-type="epub">1557-8666</issn>
<publisher>
<publisher-name>Mary Ann Liebert, Inc.</publisher-name>
<publisher-loc>140 Huguenot Street, 3rd Floor, New Rochelle, NY 10801, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28414515</article-id>
<article-id pub-id-type="pmc">5563921</article-id>
<article-id pub-id-type="publisher-id">10.1089/cmb.2017.0021</article-id>
<article-id pub-id-type="doi">10.1089/cmb.2017.0021</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Zseq: An Approach for Preprocessing Next-Generation Sequencing Data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Alkhateeb</surname>
<given-names>Abedalrhman</given-names>
</name>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Rueda</surname>
<given-names>Luis</given-names>
</name>
</contrib>
<aff id="aff1">School of Computer Science,
<institution>University of Windsor</institution>
, Windsor,
<country>Canada</country>
.</aff>
</contrib-group>
<author-notes>
<corresp>
<addr-line>Address correspondence to:</addr-line>
<addr-line>
<italic>Prof. Luis Rueda</italic>
</addr-line>
<addr-line>
<italic>School of Computer Science</italic>
</addr-line>
<institution>
<italic>University of Windsor</italic>
</institution>
<addr-line>
<italic>401 Sunset Avenue</italic>
</addr-line>
<addr-line>
<italic>Windsor ON N9B 3P4</italic>
</addr-line>
<country>Canada</country>
<break></break>
<italic>E-mail:</italic>
<email xlink:href="mailto:alkhate@uwindsor.ca">alkhate@uwindsor.ca</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>01</day>
<month>8</month>
<year>2017</year>
<pmc-comment>string-date: August 2017</pmc-comment>
</pub-date>
<pub-date pub-type="epub">
<day>01</day>
<month>8</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>01</day>
<month>8</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>24</volume>
<issue>8</issue>
<fpage>746</fpage>
<lpage>755</lpage>
<permissions>
<copyright-statement>© Abedalrhman Alkhateeb and Luis Rueda, 2017. Published by Mary Ann Liebert, Inc.</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="open-access">
<license-p>This Open Access article is distributed under the terms of the Creative Commons Attribution Noncommercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
) which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="cmb.2017.0021.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<p>
<bold>Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique</bold>
<italic>k</italic>
<bold>-mers in each sequence as its corresponding score and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in</bold>
<italic>k</italic>
<bold>-mers. Based on a</bold>
<italic>z</italic>
<bold>-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold.</bold>
</p>
<p>
<bold>The Zseq algorithm provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as
<italic>de novo</italic>
assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover,
<italic>de novo</italic>
assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with promising results.</bold>
</p>
</abstract>
<kwd-group kwd-group-type="author">
<title>
<bold>Keywords:</bold>
</title>
<kwd>machine learning</kwd>
<kwd>next-generation sequencing</kwd>
<kwd>preprocessing</kwd>
<kwd>RNA-SEQ analysis</kwd>
</kwd-group>
<counts>
<fig-count count="9"></fig-count>
<table-count count="5"></table-count>
<equation-count count="3"></equation-count>
<ref-count count="25"></ref-count>
<page-count count="10"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s001">
<title>1. Introduction</title>
<p>I
<sc>n the last decade</sc>
, next-generation sequencing (NGS) technology has evolved rapidly, reducing the cost of genome sequencing and influencing the progression of cancer research and other fields. The main purpose of NGS studies is to find clues to gene and protein structures and functions in the sequenced reads. However, this advanced technology can also produce unexpected artifacts (Waszak et al.,
<xref rid="B22" ref-type="bibr">2014</xref>
; Lavezzo et al.,
<xref rid="B11" ref-type="bibr">2016</xref>
). Some of these artifacts come from cDNA library preparation, namely repetitive low-complexity regions that appear in the sequenced reads (Mackinnon et al.,
<xref rid="B13" ref-type="bibr">2009</xref>
). High GC content is also a common bias introduced by cDNA library preparation, because GC-rich fragments tend to persist longer through the preparation process (Yakovchuk et al.,
<xref rid="B24" ref-type="bibr">2006</xref>
). GC-content bias in reads is also known to complicate genome assembly and may therefore result in poor assemblies. In addition, the sequencing procedure itself can produce low-complexity repetitive regions, such as runs of ambiguous nucleotides. In general, it is not clear to what extent GC-content bias affects genome assembly (Chen et al.,
<xref rid="B4" ref-type="bibr">2013</xref>
).</p>
<p>A low-complexity sequence has a highly biased distribution of nucleotides, so it contains few unique
<italic>k</italic>
-mers. The lower the complexity of a sequence, the more likely it is to map to several different parts of the genome. In other words, a low-complexity sequence is less likely to align uniquely to one specific part of the genome. This uncertainty about the true position of the sequence makes it less desirable to use.</p>
<p>Poly A/Poly T is a chain of A or T nucleotides used to prime the 3′ and 5′ sites in a genome sequence during cDNA library preparation (Brown,
<xref rid="B2" ref-type="bibr">2012</xref>
). Poly A/T sequences may cause bias in the reads. The intronic Poly A/T tails tend to be spliced out rather than remaining between coding exons (Zhao et al.,
<xref rid="B25" ref-type="bibr">2014</xref>
). The GC content represents the proportion of G-C pairs in the genome sequence. Stop codons show a significantly high ratio of A-T nucleotides (Wuitschick and Karrer,
<xref rid="B23" ref-type="bibr">1999</xref>
), while coding codons have a higher GC content (Pozzoli et al.,
<xref rid="B16" ref-type="bibr">2008</xref>
). The GC content of a gene plays an important role in carrying the genetic information. The GC content of the human genome varies among different chromosomes. However, the average GC content of the human genome is 41% (Vogel,
<xref rid="B21" ref-type="bibr">1997</xref>
). The representation of A+T sequences can be significantly lower, because in the preparation of a standard library, a gel slice is used and heated up to 50°C, thereby increasing the bias of the GC content (Quail et al.,
<xref rid="B17" ref-type="bibr">2008</xref>
).</p>
<p>Several techniques attempt to remove sequences with low-complexity patterns from samples. Morgulis et al. (
<xref rid="B15" ref-type="bibr">2006</xref>
) presented the symmetric DUST method, which masks low-complexity regions in a sequence to overcome context sensitivity in calculating the complexity score. Schmieder and Edwards (
<xref rid="B18" ref-type="bibr">2011</xref>
) proposed two methods to evaluate sequence complexity. The first method uses entropy as the measure. The second method, a variant of the DUST algorithm based on BLAST search, filters out sequences with low complexity scores. Both methods consider each triplet of nucleotides as a word.</p>
<p>One downside of the previous methods is that they focus only on the complexity of the sequences. This can be misleading in some cases because of the highly biased nature of the sequences. In this article, we propose a novel method called Zseq, which decreases the uniqueness score of highly biased regions, thereby filtering out both highly biased and low-complexity sequences.</p>
</sec>
<sec id="s002">
<title>2. Methods</title>
<p>The
<italic>z</italic>
-score measurement has been used in different applications in bioinformatics (Cheadle et al.,
<xref rid="B3" ref-type="bibr">2003</xref>
; Margulies et al.,
<xref rid="B14" ref-type="bibr">2005</xref>
). Chopping sequences into
<italic>k</italic>
-mers is an essential technique in read assembly. We present the Zseq algorithm that uses the
<italic>z</italic>
-score measurement based on uniqueness scores of all reads. The uniqueness score is the normalized number of unique
<italic>k</italic>
-mers in each read, adjusted to take low-complexity regions into account.
<xref ref-type="fig" rid="f1">Figure 1</xref>
depicts the process of finding reads with improved quality. Each module is explained in detail in the next few paragraphs.</p>
<fig id="f1" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 1.</bold>
</label>
<caption>
<p>Schematic representation of the process for filtering reads using the Zseq method.</p>
</caption>
<graphic xlink:href="fig-1"></graphic>
</fig>
<p>In the first step, Zseq scans all the reads and calculates a uniqueness score for each one. The uniqueness score of a read is equal to the number of unique
<italic>k</italic>
-mers in that read. Zseq considers the default
<italic>k</italic>
-mer size,
<italic>w</italic>
, as 4 (4-mers), which gives a vocabulary over the four nucleotides (A, T, C, G) of
<inline-formula>
<tex-math id="eq1" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${4^4} = 256$$ \end{document}</tex-math>
</inline-formula>
words. Because long reads may contain thousands of nucleotides, the 3-mer size is not sufficient to measure the complexity of the reads: a 3-mer word can occur many times in the same read without being counted as unique, even when it is surrounded by different nucleotides each time. Zseq excludes the 5-mers of the low-complexity/biased artifacts, such as ambiguous bases (N), PolyA/T, and GC content, from being unique by decreasing the uniqueness score of the read by one for each
<inline-formula>
<tex-math id="eq2" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$2w$$ \end{document}</tex-math>
</inline-formula>
to reduce the chances of selecting this sequence later. The uniqueness score of each read is then normalized by dividing it by the length of the read. The normalized uniqueness scores of all reads are stored in a vector, in the same order as the reads in the input file.</p>
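<p>The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bias test (a window that contains N, is a pure A or pure T run, or consists only of G and C), and the penalty of one point per biased window are all assumptions made for illustration, since the exact penalty schedule is not fully specified in the text.</p>
<preformat><![CDATA[
```python
def is_biased(kmer):
    """Hypothetical bias test (an assumption, not taken from the paper):
    ambiguous base, pure poly-A or poly-T window, or pure G/C window."""
    letters = set(kmer)
    return (
        "N" in letters
        or letters == {"A"}
        or letters == {"T"}
        or letters.issubset({"G", "C"})
    )


def normalized_uniqueness(read, w=4):
    """Count distinct unbiased w-mers, subtract one per biased window
    (assumed penalty), and divide by the read length."""
    read = read.upper()
    if len(read) < w:
        return 0.0
    unique = set()
    penalty = 0
    for i in range(len(read) - w + 1):
        kmer = read[i:i + w]
        if is_biased(kmer):
            penalty += 1       # biased windows never count as unique
        else:
            unique.add(kmer)   # set membership keeps only distinct words
    return (len(unique) - penalty) / len(read)


# Scores are kept in the same order as the reads in the input.
reads = ["ACGTACGT", "AAAAAAAA", "NNNN"]
scores = [normalized_uniqueness(r) for r in reads]
```
]]></preformat>
<p>Each read is visited once and each window is handled in constant time for a fixed w, so the sketch stays linear in the total number of nucleotides.</p>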
<p>
<xref ref-type="fig" rid="f2">Figure 2</xref>
shows the distribution of the normalized uniqueness scores for all reads for sample SRR202054 from the prostate cancer data set used in the study of Kim et al. (
<xref rid="B10" ref-type="bibr">2011</xref>
). The
<italic>x</italic>
-axis shows the normalized uniqueness scores, while the
<italic>y</italic>
-axis shows the number of reads. As shown in the figure, the penalized sequences have very small scores, down to −30. These scores come from reads that contain long PolyA/T sequences, very high GC content, or a very high number of ambiguous nucleotides (N).</p>
<fig id="f2" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 2.</bold>
</label>
<caption>
<p>Distribution of the normalized uniqueness scores for all reads in sample (SRR202054) (
<inline-formula>
<tex-math id="eq3" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu = 25.8169 , \sigma = 7.1681$$ \end{document}</tex-math>
</inline-formula>
).</p>
</caption>
<graphic xlink:href="fig-2"></graphic>
</fig>
<p>In the next step, Zseq calculates the mean and standard deviation of the normalized uniqueness scores. The mean of the normalized uniqueness scores of all reads is calculated in the first loop. The variance is also calculated in linear time, using a naive algorithm to reduce the cost of this step. The standard deviation is then obtained from the variance of the vector of normalized uniqueness scores.</p>
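<p>A single-pass computation of the mean and standard deviation might look like the sketch below. It uses the textbook naive formula (sum and sum of squares); the function name and the choice of the population variance (dividing by n rather than n - 1) are assumptions, as the text does not state which variant is used.</p>
<preformat><![CDATA[
```python
import math


def mean_and_std(scores):
    """Naive one-pass mean and population standard deviation.

    Accumulates the sum and the sum of squares in one loop, then uses
    var = E[s^2] - (E[s])^2. Dividing by n is an assumption.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for s in scores:
        n += 1
        total += s
        total_sq += s * s
    if n == 0:
        raise ValueError("at least one score is required")
    mean = total / n
    # Clamp at zero: rounding can make the naive formula slightly negative.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```
]]></preformat>
<p>The naive formula is cheap but can lose precision when the variance is small relative to the mean; that trade-off is a general property of the formula, not a claim about Zseq.</p>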
<p>Next, for each normalized uniqueness score, we calculate the
<italic>z</italic>
-score using the mean,
<inline-formula>
<tex-math id="eq4" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu$$ \end{document}</tex-math>
</inline-formula>
, and the standard deviation,
<inline-formula>
<tex-math id="eq5" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sigma$$ \end{document}</tex-math>
</inline-formula>
, as follows:
<disp-formula>
<tex-math id="eq6" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} z = ( s - \mu ) / \sigma . \tag{1} \end{align*} \end{document}</tex-math>
</disp-formula>
</p>
<p>The
<italic>z</italic>
-score represents how many standard deviations the normalized uniqueness score of the read is away from the mean
<inline-formula>
<tex-math id="eq7" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu$$ \end{document}</tex-math>
</inline-formula>
for all normalized uniqueness scores. In other words, if a read has a
<italic>z</italic>
-score of 0, it means that the read has the normalized uniqueness score of
<inline-formula>
<tex-math id="eq8" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu$$ \end{document}</tex-math>
</inline-formula>
, while a
<italic>z</italic>
-score of 1 means that the normalized uniqueness score lies exactly one standard deviation above the mean
<inline-formula>
<tex-math id="eq9" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu$$ \end{document}</tex-math>
</inline-formula>
.
<xref ref-type="fig" rid="f3">Figure 3</xref>
shows the
<italic>z</italic>
-scores for all reads in the sample (SRR202054), where the
<italic>x</italic>
-axis is the
<italic>z</italic>
-score of the normalized uniqueness scores, while the
<italic>y</italic>
-axis indicates how many reads in the sample have a particular
<italic>z</italic>
-score.</p>
<fig id="f3" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 3.</bold>
</label>
<caption>
<p>Distribution of the
<italic>z</italic>
-scores of the normalized uniqueness scores corresponding to each read for sample (SRR202054).</p>
</caption>
<graphic xlink:href="fig-3"></graphic>
</fig>
<p>Finally, the user-adjustable threshold
<inline-formula>
<tex-math id="eq10" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
is used to decide whether to select each read: if the
<italic>z</italic>
-score of the normalized uniqueness score of a read is greater than or equal to
<inline-formula>
<tex-math id="eq11" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
, the read will be selected; otherwise, it will be filtered out.</p>
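<p>Equation (1) and the selection rule can be combined in a short sketch. The function name and signature are hypothetical; the default threshold of -1.5 is the value used in the experiments reported in this article, not necessarily a default of the tool, and the handling of a zero standard deviation is an assumption.</p>
<preformat><![CDATA[
```python
def zscore_filter(reads, scores, mean, std, theta=-1.5):
    """Keep the reads whose z-score, z = (s - mean) / std, is >= theta.

    `scores` holds the normalized uniqueness score of each read, in the
    same order as `reads`.
    """
    kept = []
    for read, s in zip(reads, scores):
        # Assumption: if all scores are identical (std == 0), treat z as 0.
        z = (s - mean) / std if std > 0 else 0.0
        if z >= theta:
            kept.append(read)
    return kept
```
]]></preformat>
<p>Raising theta makes the filter stricter, since fewer reads reach the required z-score.</p>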
<sec id="s003">
<title>2.1. Estimating the cutoff point</title>
<p>A data-driven method based on labeling rules is used to filter out reads with low uniqueness scores. The method automatically determines the cutoff point
<italic>c</italic>
to substitute for
<inline-formula>
<tex-math id="eq12" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
in the histogram of the reads' uniqueness scores and removes those reads whose uniqueness score is less than
<italic>c</italic>
. The labeling rules model calculates the first quartile q1 and third quartile q3 using the mean and standard deviation, both of which are computed in the first loop through the reads. The cutoff point is calculated as follows:
<disp-formula>
<tex-math id="eq13" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} c = q1 - g ( q3 - q1 ) , \tag{2} \end{align*} \end{document}</tex-math>
</disp-formula>
</p>
<p>where
<italic>g</italic>
is the g-factor that can be calculated as follows:
<disp-formula>
<tex-math id="eq14" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} g = ( h - q1 ) / h , \tag{3} \end{align*} \end{document}</tex-math>
</disp-formula>
</p>
<p>with
<italic>h</italic>
being the highest value in the histogram of reads' uniqueness scores. After calculating the cutoff point
<italic>c</italic>
, the method sweeps through the reads again and selects those that have
<inline-formula>
<tex-math id="eq15" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\text{uniqueness score} \geq c$$ \end{document}</tex-math>
</inline-formula>
.</p>
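<p>Equations (2) and (3) can be sketched as follows. Two points are assumptions rather than statements from the text: the quartiles are derived from the mean and standard deviation as if the scores were normally distributed (q1 and q3 at about 0.6745 standard deviations from the mean), and h is read as the largest uniqueness score. The function name is hypothetical.</p>
<preformat><![CDATA[
```python
# Approximate z-value of the 75th percentile of a standard normal
# distribution; using it here is an assumption about how q1 and q3
# are obtained from the mean and standard deviation.
NORMAL_QUARTILE_Z = 0.6745


def estimate_cutoff(mean, std, h):
    """Cutoff c = q1 - g * (q3 - q1), with g = (h - q1) / h.

    `h` is taken to be the highest uniqueness score (assumption).
    """
    if h == 0:
        raise ValueError("h must be non-zero")
    q1 = mean - NORMAL_QUARTILE_Z * std
    q3 = mean + NORMAL_QUARTILE_Z * std
    g = (h - q1) / h
    return q1 - g * (q3 - q1)


# Reads are then selected when their uniqueness score is >= c.
c = estimate_cutoff(mean=10.0, std=2.0, h=20.0)
selected = [s for s in [6.0, 7.5, 12.0] if s >= c]
```
]]></preformat>
<p>With these example inputs, q1 is 8.651 and q3 is 11.349, so the cutoff falls below q1 by g times the interquartile range.</p>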
</sec>
</sec>
<sec id="s004">
<title>3. Results</title>
<p>In our experiments, we used the prostate cancer data set utilized in the study by Kim et al. (
<xref rid="B10" ref-type="bibr">2011</xref>
). The data set is publicly available in NCBI Gene Expression Omnibus (GEO) under Accession No. GSE29155. It contains 11 samples in total: 7 from tumor tissues and 4 benign. We measured the GC content and the number of ambiguous bases in the output of each method, and then aligned the results of both methods to the human genome using Tophat2 as the alignment method (Kim et al.,
<xref rid="B9" ref-type="bibr">2013</xref>
).</p>
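<p>The two per-sample measurements reported below, the number of ambiguous bases and the GC content, could be computed along the following lines. This is a minimal sketch with hypothetical function names; it assumes that the GC percentage of a read is taken over all of its bases (including N) and that the reported spread is a population standard deviation, neither of which is stated in the text.</p>
<preformat>
```python
from statistics import mean, pstdev


def gc_percent(read):
    """GC content of one read, as a percentage of its full length."""
    read = read.upper()
    if not read:
        return 0.0
    return 100.0 * (read.count("G") + read.count("C")) / len(read)


def summarize_reads(reads):
    """Count ambiguous bases (N) and summarize per-read GC content."""
    gc_values = [gc_percent(read) for read in reads]
    return {
        "n_count": sum(read.upper().count("N") for read in reads),
        "gc_mean": mean(gc_values),
        "gc_sd": pstdev(gc_values),  # population SD is an assumption
    }
```
</preformat>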
<p>DUST takes a value between 0 and 100 as the complexity threshold, while Zseq takes a
<italic>z</italic>
-score value as its complexity threshold, which expresses how many standard deviations the normalized uniqueness score of a read lies from the mean. For the DUST method, we chose a threshold of 5, which means that a read is selected only if its complexity value is greater than or equal to 5; otherwise, DUST discards the read. For Zseq, we chose a threshold of −1.5, so a read is selected if its
<italic>z</italic>
-score is greater than or equal to −1.5. We selected these two thresholds because, with them, both methods filter almost the same number of reads in each sample. The reads filtered by Zseq have a lower mean GC content than the reads filtered by DUST. Their GC content also has a smaller standard deviation, so it is concentrated more tightly around the mean than with DUST.
<xref ref-type="fig" rid="f4">Figures 4</xref>
and
<xref ref-type="fig" rid="f5">5</xref>
show the GC-content distributions for both methods applied on the same sample set (SRR202058).</p>
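<p>The z-score selection rule described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the Zseq implementation: the uniqueness score is reduced to the number of distinct k-mers in a read, the k-mer length k = 3 and the function names are hypothetical, the population standard deviation is used, and the additional factors that Zseq considers (ambiguous nucleotides and GC-rich k-mers) are left out.</p>
<preformat>
```python
from statistics import mean, pstdev


def unique_kmer_count(read, k=3):
    """Number of distinct k-mers in a read (simplified uniqueness score)."""
    return len({read[i:i + k] for i in range(len(read) - k + 1)})


def zscore_filter(reads, threshold=-1.5, k=3):
    """Keep the reads whose score z-score is at least the threshold."""
    scores = [unique_kmer_count(read, k) for read in reads]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All reads have the same score, so no read stands out.
        return list(reads)
    return [
        read
        for read, score in zip(reads, scores)
        if (score - mu) / sigma >= threshold
    ]
```
</preformat>
<p>With the reads AAAAAAAA, ACGTTGCA, TGCAACGT, GATTACAG, and CTGAAGTC, the scores are 1, 6, 6, 6, and 6, so the mean is 5 and the standard deviation is 2. The homopolymer read has a z-score of −2 and is removed at a threshold of −1.5, while the other four reads (z-score 0.5) are kept.</p>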
<fig id="f4" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 4.</bold>
</label>
<caption>
<p>Histogram of the GC-content percentage of all reads filtered by Zseq, with
<inline-formula>
<tex-math id="eq16" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu = 52.63 \%$$ \end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="eq17" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sigma = 12.08 \%$$ \end{document}</tex-math>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="fig-4"></graphic>
</fig>
<fig id="f5" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 5.</bold>
</label>
<caption>
<p>Histogram of the GC-content percentage of all reads filtered by DUST, with
<inline-formula>
<tex-math id="eq18" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mu = 53.09 \%$$ \end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="eq19" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sigma = 12.36 \%$$ \end{document}</tex-math>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="fig-5"></graphic>
</fig>
<p>Zseq shows a slight improvement over DUST in GC content, mapping rate, and mapping time, while drastically reducing the number of ambiguous bases.
<xref ref-type="table" rid="T1">Table 1</xref>
shows that the number of ambiguous bases, N, remaining in the reads filtered by Zseq is lower than the number remaining after filtering with DUST in all samples, in many cases drastically. For example, the number of occurrences of N in sample SRR202054 is 19,177 for the reads filtered by DUST, but only 11,135 for the reads filtered by Zseq. The results also indicate that Zseq slightly narrowed the GC-content percentage distribution and reduced its mean. For sample SRR202055, the mean GC content is 52.48% ± 12.10% using Zseq, which is less than the 52.91% ± 12.38% obtained using DUST. Zseq also gives a better mapping rate for the filtered reads than DUST for most of the samples. For example, in sample SRR202061, the reads filtered by Zseq have a 79.20% mapping rate, which is greater than the 77.90% mapping rate for the reads filtered by DUST. The only exception is sample SRR202062, which shows the same mapping rate of 71.30% for both DUST and Zseq.</p>
<table-wrap id="T1" orientation="portrait" position="float">
<label>
<sc>Table</sc>
1.</label>
<caption>
<p>
<sc>Comparison of the Results of Applying Zseq to Samples from the Prostate Cancer Data Set with the Results of Applying DUST to the Same Samples</sc>
</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead>
<tr>
<th align="left"> </th>
<th colspan="3" align="center">
<italic>Original</italic>
</th>
<th colspan="3" align="center">
<italic>Zseq</italic>
</th>
<th colspan="3" align="center">
<italic>DUST</italic>
</th>
</tr>
<tr>
<th align="left">
<italic>Sample number</italic>
</th>
<th align="center">
<italic>Occurrences of N</italic>
</th>
<th align="center">
<italic>Mean GC content (%)</italic>
</th>
<th align="center">
<italic>Mapping rate (%)</italic>
</th>
<th align="center">
<italic>Occurrences of N</italic>
</th>
<th align="center">
<italic>Mean GC content (%)</italic>
</th>
<th align="center">
<italic>Mapping rate (%)</italic>
</th>
<th align="center">
<italic>Occurrences of N</italic>
</th>
<th align="center">
<italic>Mean GC content (%)</italic>
</th>
<th align="center">
<italic>Mapping rate (%)</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SRR202054</td>
<td align="center">40,690</td>
<td align="center">52.82 ± 14.06</td>
<td align="center">91.50</td>
<td align="center">11,135</td>
<td align="center">52.61 ± 12.20</td>
<td align="center">93.00</td>
<td align="center">19,177</td>
<td align="center">52.89 ± 12.33</td>
<td align="center">92.80</td>
</tr>
<tr>
<td align="left">SRR202055</td>
<td align="center">42,965</td>
<td align="center">53.01 ± 13.74</td>
<td align="center">91.20</td>
<td align="center">9336</td>
<td align="center">52.48 ± 12.10</td>
<td align="center">92.40</td>
<td align="center">19,470</td>
<td align="center">52.91 ± 12.38</td>
<td align="center">92.10</td>
</tr>
<tr>
<td align="left">SRR202056</td>
<td align="center">40,243</td>
<td align="center">52.94 ± 13.99</td>
<td align="center">91.40</td>
<td align="center">10,721</td>
<td align="center">52.67 ± 12.22</td>
<td align="center">92.80</td>
<td align="center">18,336</td>
<td align="center">52.95 ± 12.36</td>
<td align="center">92.60</td>
</tr>
<tr>
<td align="left">SRR202057</td>
<td align="center">42,630</td>
<td align="center">52.94 ± 13.94</td>
<td align="center">91.30</td>
<td align="center">10,403</td>
<td align="center">52.65 ± 12.22</td>
<td align="center">92.60</td>
<td align="center">20,018</td>
<td align="center">52.93 ± 12.36</td>
<td align="center">92.40</td>
</tr>
<tr>
<td align="left">SRR202058</td>
<td align="center">16,643</td>
<td align="center">53.12 ± 14.03</td>
<td align="center">91.00</td>
<td align="center">14,023</td>
<td align="center">52.63 ± 12.08</td>
<td align="center">92.40</td>
<td align="center">16,198</td>
<td align="center">53.09 ± 12.36</td>
<td align="center">92.30</td>
</tr>
<tr>
<td align="left">SRR202059</td>
<td align="center">17,741</td>
<td align="center">52.56 ± 13.88</td>
<td align="center">90.70</td>
<td align="center">14,042</td>
<td align="center">52.18 ± 12.02</td>
<td align="center">92.00</td>
<td align="center">17,091</td>
<td align="center">52.61 ± 12.28</td>
<td align="center">91.90</td>
</tr>
<tr>
<td align="left">SRR202060</td>
<td align="center">19,958</td>
<td align="center">53.44 ± 13.98</td>
<td align="center">90.90</td>
<td align="center">13,775</td>
<td align="center">53.23 ± 12.09</td>
<td align="center">92.40</td>
<td align="center">17,281</td>
<td align="center">53.51 ± 12.21</td>
<td align="center">92.30</td>
</tr>
<tr>
<td align="left">SRR202061</td>
<td align="center">2156</td>
<td align="center">50.06 ± 11.50</td>
<td align="center">77.00</td>
<td align="center">1849</td>
<td align="center">48.87 ± 9.96</td>
<td align="center">79.20</td>
<td align="center">2100</td>
<td align="center">49.95 ± 11.12</td>
<td align="center">77.90</td>
</tr>
<tr>
<td align="left">SRR202062</td>
<td align="center">5837</td>
<td align="center">52.81 ± 13.64</td>
<td align="center">69.10</td>
<td align="center">5122</td>
<td align="center">52.69 ± 11.77</td>
<td align="center">71.30</td>
<td align="center">5466</td>
<td align="center">52.91 ± 11.84</td>
<td align="center">71.30</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s005">
<title>3.1. De novo sequence validation</title>
<p>Using Trinity
<italic>de novo</italic>
assembler (Grabherr et al.,
<xref rid="B6" ref-type="bibr">2011</xref>
), transcripts were reconstructed from three sets of reads for sample SRR202058: the original reads, the reads filtered by DUST, and the reads filtered by Zseq. In the next step, all three sets of reconstructed transcripts were evaluated by searching the assembled transcripts against human genome sequences using BLAST (Altschul et al.,
<xref rid="B1" ref-type="bibr">1997</xref>
). The set of transcripts reconstructed from the reads filtered by Zseq contains a higher number of long sequences than the other two sets.
<xref ref-type="fig" rid="f6">Figures 6</xref>
,
<xref ref-type="fig" rid="f7">7</xref>
, and
<xref ref-type="fig" rid="f8">8</xref>
show the meaningful sequences for each set. Some of the sequences built from the reads filtered by Zseq have a length of 1000 bp or more along with a high alignment score, whereas the sequence length is only slightly more than 300 bp for the reads filtered by DUST and 200 bp for the original, unfiltered reads.</p>
<fig id="f6" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 6.</bold>
</label>
<caption>
<p>Biologically meaningful human genomic sequences found using BLAST.
<italic>De novo</italic>
assembled transcripts using original reads.</p>
</caption>
<graphic xlink:href="fig-6"></graphic>
</fig>
<fig id="f7" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 7.</bold>
</label>
<caption>
<p>Biologically meaningful human genomic sequences found using BLAST.
<italic>De novo</italic>
assembled transcripts using reads filtered by DUST.</p>
</caption>
<graphic xlink:href="fig-7"></graphic>
</fig>
<fig id="f8" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 8.</bold>
</label>
<caption>
<p>Biologically meaningful human genomic sequences found using BLAST.
<italic>De novo</italic>
assembled transcripts using reads filtered by Zseq.</p>
</caption>
<graphic xlink:href="fig-8"></graphic>
</fig>
</sec>
<sec id="s006">
<title>3.2. Machine learning validation</title>
<p>In another experiment, we used an independent data set containing 12 samples (six tumors and six matched normal) (Kannan et al.,
<xref rid="B8" ref-type="bibr">2011</xref>
). Using these samples, three data sets were generated: one from the original reads, one by applying DUST to the reads, and one by applying Zseq to the reads, for all samples. In the next step, all reads in each data set were aligned to the human genome (hg19) using TopHat2 (Kim et al.,
<xref rid="B9" ref-type="bibr">2013</xref>
), and the transcripts were assembled with Cufflinks (Trapnell et al.,
<xref rid="B20" ref-type="bibr">2012</xref>
) using default parameters. Cufflinks also estimates the abundance of each transcript, measured as an FPKM value (fragments per kilobase of exon per million mapped reads).
<xref ref-type="table" rid="T2">Table 2</xref>
shows the average mapping rate of reads filtered by each method.</p>
<table-wrap id="T2" orientation="portrait" position="float">
<label>
<sc>Table</sc>
2.</label>
<caption>
<p>
<sc>Average Mapping Rate of Transcripts Using the Data Set Generated by the Original Reads, Reads Filtered by DUST, and Reads Filtered by Zseq</sc>
</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead>
<tr>
<th align="left">
<italic>Original</italic>
</th>
<th align="center">
<italic>DUST</italic>
</th>
<th align="center">
<italic>Zseq</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">88.90%</td>
<td align="center">90.10%</td>
<td align="center">90.40%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Each generated data set using filtered reads has 43,497 features (transcripts) with FPKM values. Also, each of the 12 samples was labeled as
<italic>cancer</italic>
or
<italic>matched benign</italic>
. The FPKM value equals 0 if the transcript is not present in that sample. We counted the transcripts that can individually separate all cancer samples from all normal samples perfectly, that is, with 100% accuracy. In other words, for each method we computed the number of transcripts, generated from the reads filtered by that method, whose FPKM values in the cancer samples can be separated from their FPKM values in the normal samples.
<xref ref-type="fig" rid="f9">Figure 9</xref>
depicts two transcripts; transcript
<italic>a</italic>
has clearly separable FPKM values, while in transcript
<italic>b</italic>
, the FPKM values cannot be separated accurately.</p>
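<p>The counting step can be sketched as follows. The text does not spell out the exact separability criterion, so this sketch assumes that a transcript is perfectly separating when a single FPKM threshold places every cancer sample on one side and every normal sample on the other; the function names and the input layout (a dictionary that maps each transcript to its FPKM values in sample order) are hypothetical.</p>
<preformat>
```python
def separates_perfectly(cancer_fpkm, normal_fpkm):
    """True if the two groups of FPKM values do not overlap at all."""
    return (min(cancer_fpkm) > max(normal_fpkm)
            or min(normal_fpkm) > max(cancer_fpkm))


def count_discriminative(fpkm_by_transcript, labels):
    """Count transcripts whose FPKM values split cancer from normal."""
    count = 0
    for values in fpkm_by_transcript.values():
        cancer = [v for v, lab in zip(values, labels) if lab == "cancer"]
        normal = [v for v, lab in zip(values, labels) if lab == "normal"]
        if separates_perfectly(cancer, normal):
            count += 1
    return count
```
</preformat>
<p>For example, a transcript with cancer FPKM values 9.0 and 7.5 and normal values 1.0 and 0.0 is counted, whereas one with cancer values 5.0 and 1.0 and normal values 3.0 and 0.0 is not, because the two ranges overlap.</p>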
<fig id="f9" fig-type="figure" orientation="portrait" position="float">
<label>
<bold>FIG. 9.</bold>
</label>
<caption>
<p>An example of two transcripts, one with separable FPKM values
<bold>(a)</bold>
, and the other with inseparable FPKM values
<bold>(b)</bold>
.</p>
</caption>
<graphic xlink:href="fig-9"></graphic>
</fig>
<p>
<xref ref-type="table" rid="T3">Table 3</xref>
shows the number of transcripts that have separable FPKM values. These results indicate that, when the reads are filtered by Zseq, the alignment tool and the assembler quantify more meaningful transcripts that can discriminate cancer from normal samples than with the reads filtered by DUST or with the original reads.</p>
<table-wrap id="T3" orientation="portrait" position="float">
<label>
<sc>Table</sc>
3.</label>
<caption>
<p>
<sc>The Number of Discriminative Transcripts For Each of the Three Data Sets</sc>
</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead>
<tr>
<th align="left">
<italic>Data set</italic>
</th>
<th align="center">
<italic>No. of discriminative transcripts</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Original</td>
<td align="center">167</td>
</tr>
<tr>
<td align="left">Filtered by DUST</td>
<td align="center">159</td>
</tr>
<tr>
<td align="left">Filtered by Zseq</td>
<td align="center">231</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Moreover, applying the chi2 (Liu and Setiono,
<xref rid="B12" ref-type="bibr">1995</xref>
) statistical test to the 231 discriminative transcripts from the Zseq data set showed that the NM_001145410 transcript, which corresponds to the NONO gene, was the most significant transcript among all transcripts in all three data sets. NONO is known to be involved in different types of cancer, such as breast and prostate cancer (Traish et al.,
<xref rid="B19" ref-type="bibr">1997</xref>
; Ishiguro et al.,
<xref rid="B7" ref-type="bibr">2003</xref>
). Next, a support vector machine (SVM) with a linear kernel was applied to the three data sets using this transcript as the feature. An SVM is a supervised learning method that tries to find an optimal separating hyperplane between classes (Cortes and Vapnik,
<xref rid="B5" ref-type="bibr">1995</xref>
). Using a leave-two-out cross-validation scheme, the classifier achieves 100% accuracy on the Zseq data set and 91.66% on the DUST data set, while accuracy drops to 83.33% on the original-read data set.</p>
</sec>
<sec id="s007">
<title>3.3. Results of the estimated cutoff point Zseq (EC-Zseq)</title>
<p>The results of the estimated cutoff point Zseq (EC-Zseq), shown in
<xref ref-type="table" rid="T4">Tables 4</xref>
and
<xref ref-type="table" rid="T5">5</xref>
, suggest that the method does not find the optimal cutoff point. The result of Zseq on the prostate cancer data set using the threshold
<inline-formula>
<tex-math id="eq20" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
 = −1.5 in the previous section outperformed the result of EC-Zseq. Despite having a better mapping rate, EC-Zseq falls short of Zseq with
<inline-formula>
<tex-math id="eq21" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
 = −1.5 in mean GC content, of both DUST and Zseq with
<inline-formula>
<tex-math id="eq22" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
 = −1.5 in the number of ambiguous nucleotides, and of Zseq with
<inline-formula>
<tex-math id="eq23" notation="LaTeX">\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document}</tex-math>
</inline-formula>
 = −1.5 in the number of decisive transcripts. However, EC-Zseq still gives better results than the original data set or the data set preprocessed with the DUST method.</p>
<table-wrap id="T4" orientation="portrait" position="float">
<label>
<sc>Table</sc>
4.</label>
<caption>
<p>
<sc>Some Artifact Measurements of Prostate Cancer Samples That Were Preprocessed by EC-Zseq</sc>
</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead>
<tr>
<th align="left"> </th>
<th colspan="3" align="center">
<italic>EC-Zseq</italic>
</th>
</tr>
<tr>
<th align="left">
<italic>Sample number</italic>
</th>
<th align="center">
<italic>Occurrences of N</italic>
</th>
<th align="center">
<italic>Mean GC content (%)</italic>
</th>
<th align="center">
<italic>Mapping rate (%)</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SRR202054</td>
<td align="center">33,124</td>
<td align="center">52.71 ± 13.40</td>
<td align="center">93.40</td>
</tr>
<tr>
<td align="left">SRR202055</td>
<td align="center">27,890</td>
<td align="center">52.91 ± 12.76</td>
<td align="center">93.30</td>
</tr>
<tr>
<td align="left">SRR202056</td>
<td align="center">34,453</td>
<td align="center">52.82 ± 13.07</td>
<td align="center">93.50</td>
</tr>
<tr>
<td align="left">SRR202057</td>
<td align="center">30,321</td>
<td align="center">52.68 ± 12.52</td>
<td align="center">93.40</td>
</tr>
<tr>
<td align="left">SRR202058</td>
<td align="center">14,760</td>
<td align="center">52.87 ± 13.43</td>
<td align="center">92.90</td>
</tr>
<tr>
<td align="left">SRR202059</td>
<td align="center">15,203</td>
<td align="center">52.18 ± 12.62</td>
<td align="center">92.80</td>
</tr>
<tr>
<td align="left">SRR202060</td>
<td align="center">16,704</td>
<td align="center">53.31 ± 12.09</td>
<td align="center">92.70</td>
</tr>
<tr>
<td align="left">SRR202061</td>
<td align="center">1926</td>
<td align="center">49.11 ± 10.62</td>
<td align="center">79.70</td>
</tr>
<tr>
<td align="left">SRR202062</td>
<td align="center">5484</td>
<td align="center">52.70 ± 12.47</td>
<td align="center">72.10</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5" orientation="portrait" position="float">
<label>
<sc>Table</sc>
5.</label>
<caption>
<p>
<sc>The Number of Decisive Transcripts for the Data Set That Was Preprocessed by EC-Zseq</sc>
</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead>
<tr>
<th align="left">
<italic>Data set</italic>
</th>
<th align="center">
<italic>No. of decisive transcripts</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Preprocessed by EC-Zseq</td>
<td align="center">222</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s008">
<title>4. Conclusion</title>
<p>We have presented a novel method for filtering reads that reduces the number of biased, duplicate, or ambiguous sequences. Our method measures the complexity of the sequences by assigning a uniqueness score to each read. The user can then filter out the reads whose score is less than a user-defined threshold. Applying the proposed method to real samples shows that the Zseq algorithm is statistically sound and provides a better mapping rate, while significantly reducing the number of ambiguous bases in comparison with other state-of-the-art methods. Estimating the cutoff point using labeling rules gives good, although not optimal, results. The Zseq method is publicly available and can be accessed using the following link:
<uri xlink:type="simple" xlink:href="http://sourceforge.net/projects/zseq">http://sourceforge.net/projects/zseq</uri>
.</p>
</sec>
</body>
<back>
<sec id="s009" sec-type="ack">
<title>Acknowledgment</title>
<p>This work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.</p>
</sec>
<sec id="s010" sec-type="COI-statement">
<title>Author Disclosure Statement</title>
<p>No competing financial interests exist.</p>
</sec>
<ref-list content-type="parsed">
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>S.F.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Madden</surname>
<given-names>T.L.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Schäffer</surname>
<given-names>A.A.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>1997</year>
<article-title>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</article-title>
.
<source>Nucleic Acids Res</source>
.
<volume>25</volume>
,
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="pmid">9254694</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="web">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>T.</given-names>
</name>
</person-group>
<source>Introduction to Genetics: A Molecular Approach</source>
. Garland Science,
<year>2012</year>
ISBN 9780815365099. Available at: URL
<uri xlink:type="simple" xlink:href="http://books.google.ca/books?id=TsvKPQAACAAJ">http://books.google.ca/books?id=TsvKPQAACAAJ</uri>
Last viewed on
<season>Jan.</season>
<day>20</day>
,
<year>2017</year>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheadle</surname>
<given-names>C.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Vawter</surname>
<given-names>M.P.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Freed</surname>
<given-names>W.J.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2003</year>
<article-title>Analysis of microarray data using z score transformation</article-title>
.
<source>J. Mol. Diagn</source>
.
<volume>5</volume>
,
<fpage>73</fpage>
<lpage>81</lpage>
<pub-id pub-id-type="pmid">12707371</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.-C.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>T.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>C.-H.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2013</year>
<article-title>Effects of GC bias in next-generation-sequencing data on
<italic>de novo</italic>
genome assembly</article-title>
.
<source>PLoS One</source>
<volume>8</volume>
,
<fpage>e62856</fpage>
<pub-id pub-id-type="pmid">23638157</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cortes</surname>
<given-names>C.</given-names>
</name>
</person-group>
, and
<person-group person-group-type="author">
<name>
<surname>Vapnik</surname>
<given-names>V.</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>Support-vector networks</article-title>
.
<source>Machine Learn</source>
.
<volume>20</volume>
,
<fpage>273</fpage>
<lpage>297</lpage>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grabherr</surname>
<given-names>M.G.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Haas</surname>
<given-names>B.J.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Yassour</surname>
<given-names>M.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2011</year>
<article-title>Full-length transcriptome assembly from RNA-Seq data without a reference genome</article-title>
.
<source>Nat. Biotechnol</source>
.
<volume>29</volume>
,
<fpage>644</fpage>
<lpage>652</lpage>
<pub-id pub-id-type="pmid">21572440</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ishiguro</surname>
<given-names>H.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Uemura</surname>
<given-names>H.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Fujinami</surname>
<given-names>K.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2003</year>
<article-title>55 kDa nuclear matrix protein (nmt55) mRNA is expressed in human prostate cancer tissue and is associated with the androgen receptor</article-title>
.
<source>Int. J. Cancer</source>
.
<volume>105</volume>
,
<fpage>26</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="pmid">12672026</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kannan</surname>
<given-names>K.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2011</year>
<article-title>Recurrent chimeric RNAs enriched in human prostate cancer identified by deep sequencing</article-title>
.
<source>Proc. Natl Acad. Sci. U. S. A.</source>
<volume>108</volume>
,
<fpage>9172</fpage>
<lpage>9177</lpage>
<pub-id pub-id-type="pmid">21571633</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>D.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Pertea</surname>
<given-names>G.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Trapnell</surname>
<given-names>C.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2013</year>
<article-title>TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions</article-title>
.
<source>Genome Biol</source>
.
<volume>14</volume>
,
<fpage>R36</fpage>
<pub-id pub-id-type="pmid">23618408</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>J.H.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Dhanasekaran</surname>
<given-names>S.M.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Prensner</surname>
<given-names>J.R.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2011</year>
<article-title>Deep sequencing reveals distinct patterns of DNA methylation in prostate cancer</article-title>
.
<source>Genome Res</source>
.
<volume>21</volume>
,
<fpage>1028</fpage>
<lpage>1041</lpage>
<pub-id pub-id-type="pmid">21724842</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lavezzo</surname>
<given-names>E.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Barzon</surname>
<given-names>L.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Toppo</surname>
<given-names>S.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2016</year>
<article-title>Third generation sequencing technologies applied to diagnostic microbiology: Benefits and challenges in applications and data analysis</article-title>
.
<source>Expert Rev. Mol. Diagn</source>
.
<volume>16</volume>
,
<fpage>1011</fpage>
<lpage>1023</lpage>
<pub-id pub-id-type="pmid">27453996</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
</person-group>
, and
<person-group person-group-type="author">
<name>
<surname>Setiono</surname>
<given-names>R.</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>Chi2: Feature selection and discretization of numeric attributes</article-title>
.
<conf-name>Presented at the 7th IEEE International Conference on Tools with Artificial Intelligence</conf-name>
,
<publisher-name>IEEE Computer Society</publisher-name>
,
<publisher-loc>Herndon, VA, USA</publisher-loc>
pp.
<fpage>388</fpage>
<lpage>388</lpage>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mackinnon</surname>
<given-names>M.J.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Mok</surname>
<given-names>S.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2009</year>
<article-title>Comparative transcriptional and genomic analysis of
<italic>Plasmodium falciparum</italic>
field isolates</article-title>
.
<source>PLoS Pathog</source>
.
<volume>5</volume>
,
<fpage>e1000644</fpage>
<pub-id pub-id-type="pmid">19898609</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Margulies</surname>
<given-names>M.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Egholm</surname>
<given-names>M.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Altman</surname>
<given-names>W.E.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2005</year>
<article-title>Genome sequencing in microfabricated high-density picolitre reactors</article-title>
.
<source>Nature</source>
<volume>437</volume>
,
<fpage>376</fpage>
<lpage>380</lpage>
<pub-id pub-id-type="pmid">16056220</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morgulis</surname>
<given-names>A.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Gertz</surname>
<given-names>E.M.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Schäffer</surname>
<given-names>A.A.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2006</year>
<article-title>A fast and symmetric DUST implementation to mask low-complexity DNA sequences</article-title>
.
<source>J. Comput. Biol</source>
.
<volume>13</volume>
,
<fpage>1028</fpage>
<lpage>1040</lpage>
<pub-id pub-id-type="pmid">16796549</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pozzoli</surname>
<given-names>U.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Menozzi</surname>
<given-names>G.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Fumagalli</surname>
<given-names>M.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2008</year>
<article-title>Both selective and neutral processes drive GC content evolution in the human genome</article-title>
.
<source>BMC Evol. Biol</source>
.
<volume>8</volume>
,
<fpage>99</fpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quail</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Kozarewa</surname>
<given-names>I.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Smith</surname>
<given-names>F.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2008</year>
<article-title>A large genome center's improvements to the Illumina sequencing system</article-title>
.
<source>Nat. Methods</source>
.
<volume>5</volume>
,
<fpage>1005</fpage>
<lpage>1010</lpage>
<pub-id pub-id-type="pmid">19034268</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmieder</surname>
<given-names>R.</given-names>
</name>
</person-group>
, and
<person-group person-group-type="author">
<name>
<surname>Edwards</surname>
<given-names>R.</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Quality control and preprocessing of metagenomic datasets</article-title>
.
<source>Bioinformatics</source>
<volume>27</volume>
,
<fpage>863</fpage>
<lpage>864</lpage>
<pub-id pub-id-type="pmid">21278185</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Traish</surname>
<given-names>A.M.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>Y.-H.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Ashba</surname>
<given-names>J.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>1997</year>
<article-title>Loss of expression of a 55 kDa nuclear protein (nmt55) in estrogen receptor-negative human breast cancer</article-title>
.
<source>Diagn. Mol. Pathol</source>
.
<volume>6</volume>
,
<fpage>209</fpage>
<lpage>221</lpage>
<pub-id pub-id-type="pmid">9360842</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Trapnell</surname>
<given-names>C.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Roberts</surname>
<given-names>A.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Goff</surname>
<given-names>L.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2012</year>
<article-title>Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks</article-title>
.
<source>Nat. Protoc</source>
.
<volume>7</volume>
,
<fpage>562</fpage>
<lpage>578</lpage>
<pub-id pub-id-type="pmid">22383036</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Vogel</surname>
<given-names>F.</given-names>
</name>
</person-group>
<year>1997</year>
<source>Vogel and Motulsky's Human Genetics: Problems and Approaches</source>
, Volume
<volume>878</volume>
<publisher-name>Springer</publisher-name>
:
<publisher-loc>London, New York</publisher-loc>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Waszak</surname>
<given-names>S.M.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Kilpinen</surname>
<given-names>H.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Gschwind</surname>
<given-names>A.R.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2014</year>
<article-title>Identification and removal of low-complexity sites in allele-specific analysis of ChIP-seq data</article-title>
.
<source>Bioinformatics</source>
<volume>30</volume>
,
<fpage>165</fpage>
<lpage>171</lpage>
<pub-id pub-id-type="pmid">24255646</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wuitschick</surname>
<given-names>J.</given-names>
</name>
</person-group>
, and
<person-group person-group-type="author">
<name>
<surname>Karrer</surname>
<given-names>K.</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Analysis of genomic G+C content, codon usage, initiator codon context and translation termination sites in
<italic>Tetrahymena thermophila</italic>
</article-title>
.
<source>J. Eukaryot. Microbiol</source>
.
<volume>46</volume>
,
<fpage>239</fpage>
<lpage>247</lpage>
<pub-id pub-id-type="pmid">10377985</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yakovchuk</surname>
<given-names>P.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Protozanova</surname>
<given-names>E.</given-names>
</name>
</person-group>
, and
<person-group person-group-type="author">
<name>
<surname>Frank-Kamenetskii</surname>
<given-names>M.D.</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Base-stacking and base-pairing contributions into thermal stability of the DNA double helix</article-title>
.
<source>Nucleic Acids Res</source>
.
<volume>34</volume>
,
<fpage>564</fpage>
<lpage>574</lpage>
<pub-id pub-id-type="pmid">16449200</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Z.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
</person-group>
,
<person-group person-group-type="author">
<name>
<surname>Kumar</surname>
<given-names>P.K.</given-names>
</name>
,
<etal>et al.</etal>
</person-group>
<year>2014</year>
<article-title>Bioinformatics analysis of alternative polyadenylation in green alga
<italic>Chlamydomonas reinhardtii</italic>
using transcriptome sequences from three different sequencing platforms</article-title>
.
<source>G3</source>
<volume>4</volume>
,
<fpage>871</fpage>
<lpage>883</lpage>
<pub-id pub-id-type="pmid">24626288</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D99 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000D99 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5563921
   |texte=   Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:28414515" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021