MERS Exploration Server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 0010619 (Pmc/Corpus); previous: 0010618; next: 0010620



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Reference-Free Validation of Short Read Data</title>
<author>
<name sortKey="Schroder, Jan" sort="Schroder, Jan" uniqKey="Schroder J" first="Jan" last="Schröder">Jan Schröder</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bailey, James" sort="Bailey, James" uniqKey="Bailey J" first="James" last="Bailey">James Bailey</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Conway, Thomas" sort="Conway, Thomas" uniqKey="Conway T" first="Thomas" last="Conway">Thomas Conway</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zobel, Justin" sort="Zobel, Justin" uniqKey="Zobel J" first="Justin" last="Zobel">Justin Zobel</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">20877643</idno>
<idno type="pmc">2943903</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943903</idno>
<idno type="RBID">PMC:2943903</idno>
<idno type="doi">10.1371/journal.pone.0012681</idno>
<date when="2010">2010</date>
<idno type="wicri:Area/Pmc/Corpus">001061</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001061</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Reference-Free Validation of Short Read Data</title>
<author>
<name sortKey="Schroder, Jan" sort="Schroder, Jan" uniqKey="Schroder J" first="Jan" last="Schröder">Jan Schröder</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bailey, James" sort="Bailey, James" uniqKey="Bailey J" first="James" last="Bailey">James Bailey</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Conway, Thomas" sort="Conway, Thomas" uniqKey="Conway T" first="Thomas" last="Conway">Thomas Conway</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zobel, Justin" sort="Zobel, Justin" uniqKey="Zobel J" first="Justin" last="Zobel">Justin Zobel</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve use of three different measures: analysis of base calls; analysis of
<italic>k</italic>
-mers; and analysis of distributions of
<italic>k</italic>
-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Sanger, F" uniqKey="Sanger F">F Sanger</name>
</author>
<author>
<name sortKey="Nicklen, S" uniqKey="Nicklen S">S Nicklen</name>
</author>
<author>
<name sortKey="Coulson, Ar" uniqKey="Coulson A">AR Coulson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Von Bubnoff, A" uniqKey="Von Bubnoff A">A von Bubnoff</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Johnson, Ds" uniqKey="Johnson D">DS Johnson</name>
</author>
<author>
<name sortKey="Mortazavi, A" uniqKey="Mortazavi A">A Mortazavi</name>
</author>
<author>
<name sortKey="Myers, Rm" uniqKey="Myers R">RM Myers</name>
</author>
<author>
<name sortKey="Wold, B" uniqKey="Wold B">B Wold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mortazavi, A" uniqKey="Mortazavi A">A Mortazavi</name>
</author>
<author>
<name sortKey="Williams, Ba" uniqKey="Williams B">BA Williams</name>
</author>
<author>
<name sortKey="Mccue, K" uniqKey="Mccue K">K McCue</name>
</author>
<author>
<name sortKey="Schaeffer, L" uniqKey="Schaeffer L">L Schaeffer</name>
</author>
<author>
<name sortKey="Wold, B" uniqKey="Wold B">B Wold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author>
<name sortKey="Wang, W" uniqKey="Wang W">W Wang</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Tian, G" uniqKey="Tian G">G Tian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wheeler, Da" uniqKey="Wheeler D">DA Wheeler</name>
</author>
<author>
<name sortKey="Srinivasan, M" uniqKey="Srinivasan M">M Srinivasan</name>
</author>
<author>
<name sortKey="Egholm, M" uniqKey="Egholm M">M Egholm</name>
</author>
<author>
<name sortKey="Shen, Y" uniqKey="Shen Y">Y Shen</name>
</author>
<author>
<name sortKey="Chen, L" uniqKey="Chen L">L Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hernandez, D" uniqKey="Hernandez D">D Hernandez</name>
</author>
<author>
<name sortKey="Francois, P" uniqKey="Francois P">P François</name>
</author>
<author>
<name sortKey="Farinelli, L" uniqKey="Farinelli L">L Farinelli</name>
</author>
<author>
<name sortKey=" Ster S, M" uniqKey=" Ster S M">M Østerås</name>
</author>
<author>
<name sortKey="Schrenzel, J" uniqKey="Schrenzel J">J Schrenzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author>
<name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dohm, Jc" uniqKey="Dohm J">JC Dohm</name>
</author>
<author>
<name sortKey="Lottaz, C" uniqKey="Lottaz C">C Lottaz</name>
</author>
<author>
<name sortKey="Borodina, T" uniqKey="Borodina T">T Borodina</name>
</author>
<author>
<name sortKey="Himmelbauer, H" uniqKey="Himmelbauer H">H Himmelbauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Harismendy, O" uniqKey="Harismendy O">O Harismendy</name>
</author>
<author>
<name sortKey="Ng, P" uniqKey="Ng P">P Ng</name>
</author>
<author>
<name sortKey="Strausberg, R" uniqKey="Strausberg R">R Strausberg</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Stockwell, T" uniqKey="Stockwell T">T Stockwell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Erlich, Y" uniqKey="Erlich Y">Y Erlich</name>
</author>
<author>
<name sortKey="Mitra, Pp" uniqKey="Mitra P">PP Mitra</name>
</author>
<author>
<name sortKey="Delabastide, M" uniqKey="Delabastide M">M delaBastide</name>
</author>
<author>
<name sortKey="Mccombie, Wr" uniqKey="Mccombie W">WR McCombie</name>
</author>
<author>
<name sortKey="Hannon, Gj" uniqKey="Hannon G">GJ Hannon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kircher, M" uniqKey="Kircher M">M Kircher</name>
</author>
<author>
<name sortKey="Stenzel, U" uniqKey="Stenzel U">U Stenzel</name>
</author>
<author>
<name sortKey="Kelso, J" uniqKey="Kelso J">J Kelso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rougemont, J" uniqKey="Rougemont J">J Rougemont</name>
</author>
<author>
<name sortKey="Amzallag, A" uniqKey="Amzallag A">A Amzallag</name>
</author>
<author>
<name sortKey="Iseli, C" uniqKey="Iseli C">C Iseli</name>
</author>
<author>
<name sortKey="Farinelli, L" uniqKey="Farinelli L">L Farinelli</name>
</author>
<author>
<name sortKey="Xenarios, I" uniqKey="Xenarios I">I Xenarios</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, Mj" uniqKey="Chaisson M">MJ Chaisson</name>
</author>
<author>
<name sortKey="Brinza, D" uniqKey="Brinza D">D Brinza</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qu, W" uniqKey="Qu W">W Qu</name>
</author>
<author>
<name sortKey="Hashimoto, Si" uniqKey="Hashimoto S">Si Hashimoto</name>
</author>
<author>
<name sortKey="Morishita, S" uniqKey="Morishita S">S Morishita</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ewing, B" uniqKey="Ewing B">B Ewing</name>
</author>
<author>
<name sortKey="Green, P" uniqKey="Green P">P Green</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">20877643</article-id>
<article-id pub-id-type="pmc">2943903</article-id>
<article-id pub-id-type="publisher-id">10-PONE-RA-17552R1</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0012681</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline">
<subject>Computational Biology</subject>
<subject>Genetics and Genomics</subject>
<subject>Computational Biology/Comparative Sequence Analysis</subject>
<subject>Genetics and Genomics/Bioinformatics</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Reference-Free Validation of Short Read Data</article-title>
<alt-title alt-title-type="running-head">Read Quality Control</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Schröder</surname>
<given-names>Jan</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bailey</surname>
<given-names>James</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Conway</surname>
<given-names>Thomas</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zobel</surname>
<given-names>Justin</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>NICTA Victoria Research Laboratory, Parkville, Victoria, Australia</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Aramayo</surname>
<given-names>Rodolfo</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">Texas A&M University, United States of America</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>schroder@csse.unimelb.edu.au</email>
</corresp>
<fn fn-type="con">
<p>Conceived and designed the experiments: JS JB JZ. Performed the experiments: JS. Analyzed the data: JS JB TC JZ. Contributed reagents/materials/analysis tools: JS. Wrote the paper: JS JB JZ.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2010</year>
</pub-date>
<pub-date pub-type="epub">
<day>22</day>
<month>9</month>
<year>2010</year>
</pub-date>
<volume>5</volume>
<issue>9</issue>
<elocation-id>e12681</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>3</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>8</month>
<year>2010</year>
</date>
</history>
<permissions>
<copyright-statement>Schröder et al.</copyright-statement>
<copyright-year>2010</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve use of three different measures: analysis of base calls; analysis of
<italic>k</italic>
-mers; and analysis of distributions of
<italic>k</italic>
-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.</p>
</sec>
</abstract>
<counts>
<page-count count="11"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>High-throughput or next-generation sequencing techniques are now cheap and widely available. Current machines, which produce short reads of up to around 500 bases, are emerging as a fundamental tool in biology and medicine. Short-read sequencing has replaced Sanger sequencing
<xref rid="pone.0012681-Sanger1" ref-type="bibr">[1]</xref>
,
<xref rid="pone.0012681-vonBubnoff1" ref-type="bibr">[2]</xref>
for applications involving long sequences such as chromosomes or whole genomes, and the availability of short-read sequencing has given rise to ambitious projects such as the
<italic>1000 genomes project</italic>
(
<ext-link ext-link-type="uri" xlink:href="http://www.1000genomes.org">www.1000genomes.org</ext-link>
), which is using the technology to generate a detailed map of the genetic variation in humans. Applications of next-generation sequencing platforms include identification of DNA-protein interactions by ChIP-seq
<xref rid="pone.0012681-Johnson1" ref-type="bibr">[3]</xref>
, transcriptome analysis with RNA-seq
<xref rid="pone.0012681-Mortazavi1" ref-type="bibr">[4]</xref>
, or whole genome assembly
<xref rid="pone.0012681-Wang1" ref-type="bibr">[5]</xref>
,
<xref rid="pone.0012681-Wheeler1" ref-type="bibr">[6]</xref>
of both known and novel organisms.</p>
<p>Short-read sequencing generates vast amounts of data, using the shotgun process. DNA molecules are amplified and then fragmented using techniques such as sonication or nebulisation. Portions of the DNA fragments are then sequenced by an iterative process involving fluorescence, digital photography, and image analysis, yielding
<italic>short reads</italic>
of some fixed length, from around 35 to 100 bases (for Illumina and SOLiD) and up to 500 (for Roche 454). Each of these steps, chemical and digital, may introduce biases and errors.</p>
<p>The short read data produced by sequencing machines is analysed using bioinformatics tools such as resequencers and assemblers. When using such tools, simplifying assumptions are commonly made: for example, that reads are evenly spread over the sequenced genome
<xref rid="pone.0012681-Hernandez1" ref-type="bibr">[7]</xref>
–
<xref rid="pone.0012681-Zerbino1" ref-type="bibr">[9]</xref>
; and that errors occur at random within each read, in the form of random substitutions. A richer assumption is that errors are more likely towards the end of the read, but are otherwise random as to base and location. However, as we demonstrate in this paper, such assumptions do not appear to apply to the data generated by one of the main current sequencing platforms; we find that the biases are more extreme and more complex than has previously been suspected.</p>
<p>Identification of bias in short-read data has been explored in other work, such as that of Dohm et al.
<xref rid="pone.0012681-Dohm1" ref-type="bibr">[10]</xref>
and Harismendy et al.
<xref rid="pone.0012681-Harismendy1" ref-type="bibr">[11]</xref>
. Dohm et al.
<xref rid="pone.0012681-Dohm1" ref-type="bibr">[10]</xref>
focus on the Solexa 1G sequencing platform and measure aspects such as error rates, regional coverage, and biases towards particular sequences. The approach of Harismendy et al.
<xref rid="pone.0012681-Harismendy1" ref-type="bibr">[11]</xref>
focuses on comparison of different sequencing technologies, analysing similarities and differences in their genome coverage. Erlich et al., Kircher et al., and Rougemont et al. independently published works in 2008–09
<xref rid="pone.0012681-Erlich1" ref-type="bibr">[12]</xref>
–
<xref rid="pone.0012681-Rougemont1" ref-type="bibr">[14]</xref>
that focus on the base-calling process, offering alternatives to Bustard, the base caller provided by Solexa (Illumina). The work of Kircher et al.
<xref rid="pone.0012681-Kircher1" ref-type="bibr">[13]</xref>
points out that systematic errors can be made by the standard software, arising from chemical and optical issues.</p>
<p>In contrast, the methodology that we present focuses on identification of biases in the short-read data itself. In particular, we explore detection of bias with respect to the distributions of bases and
<italic>k</italic>
-mers at distinct positions in the reads. An advantage of this approach is that it does not require the availability of any reference sequence, which also means it is less sensitive to errors caused by polymorphisms in the organism being investigated.</p>
<p>Our methodology consists of three related aspects: first, counting and comparing the frequency of bases at specific positions in the reads; second, counting and comparing the frequency of
<italic>k</italic>
-mers at specific positions in the reads; and third, evaluating and comparing the distribution of
<italic>k</italic>
-mers at specific positions in the reads.</p>
<p>We apply our methodology to short-read sequencing data generated by the Illumina platform for the 1000 Genomes Project. This project is notable due to its public profile and ambitious goals, and aims to produce data of high quality. Our analysis identifies strong, complex biases in the reads from this dataset. For example, the base A is significantly overrepresented at the start of reads, while
<italic>k</italic>
-mers such as poly-T are dramatically underrepresented; the
<italic>k</italic>
-mers in the middle few bases of the reads are distributed differently to the other
<italic>k</italic>
-mers. This analysis raises questions about the quality of the data and the ways in which the data might be correctly interpreted.</p>
<p>We do not attempt to use our analysis to identify the causes of the biases in the test cases we examine. Our primary aim is to develop general techniques that can be used for processing any short read data (though in particular data intended for assembly rather than peak analysis), and each experiment will have its own characteristics; that is, we propose a first stage of analysis that should be applied before any further processing is undertaken. These methods could be used to track down sources of bias, but could equally be used, say, to choose between data sets; for example, we have observed that different data sets – even different lanes from the same sequencing run – may have different characteristics. A further application is that they could be used in assembly or resequencing to augment other information such as quality scores.</p>
<p>We have developed a software package that allows the user to examine biases in a set of short-read data. In the following, we focus on whole genome sequencing data from the Illumina platform, but our methodology is generally applicable to other sequencing techniques and platforms, as is our
<italic>quarc</italic>
package (introduced in the next section).</p>
<p>The paper is structured as follows. We outline a mathematical model for describing the common assumptions that have been made about short-read sequencing data and present three simple, yet powerful, techniques to analyse and assess bias. We then apply our techniques to data from the 1000 Genomes Project, and demonstrate the presence of data quality issues.</p>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>Materials and Methods</title>
<p>We present three simple, yet powerful, techniques for assessing a collection of short reads. They can be easily applied to any kind of read data and do not require any information about the organism. A reference sequence is not required, but is useful if available. We begin with an overview of each of our techniques.</p>
<sec id="s2a">
<title>Technique 1: Analysing base calls</title>
<p>This simple measure is a count of how many times each specific base (A, C, G, T, or N) was called at a given position in any read. Representing these base counts graphically can then provide hints about the general state of the data. If no bias is present, then one would expect equal counts at each position for a fixed base, and equal amounts within the two couples A&T and C&G, as there should be no bias between the forward and reverse strands in DNA. In general, the counts should directly reflect the C&G content of the sequenced material. Deviation from this expected outcome suggests possible biases in the data and where (in the reads) they may be present.</p>
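<p>As an illustration only, a per-position base count can be sketched in a few lines of Python. The function name base_counts_by_position and the toy reads are hypothetical and are not taken from the quarc package; the sketch assumes reads are plain strings over A, C, G, T and N.</p>

```python
from collections import Counter


def base_counts_by_position(reads):
    """Count how often each base is called at each read position.

    Returns a list of Counters; entry i holds the base counts at position i.
    Illustrative sketch only, not the authors' implementation.
    """
    read_length = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(read_length)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            counts[pos][base] += 1
    return counts


if __name__ == "__main__":
    toy_reads = ["ACGTN", "AAGTC", "ACGTT"]  # made-up data for demonstration
    for pos, c in enumerate(base_counts_by_position(toy_reads)):
        # A Counter returns 0 for bases that never occur at this position.
        print(pos, {b: c[b] for b in "ACGTN"})
```

<p>Plotting each base's count against read position then gives the graphical view described above; under the no-bias expectation each curve would be roughly flat.</p>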
</sec>
<sec id="s2b">
<title>Technique 2: Comparing occurrences of
<italic>k</italic>
-mers in the reads</title>
<p>Using a background model, not necessarily based on a reference sequence, we estimate the expected number of counts for given
<italic>k</italic>
-mers and then compare to their actual counts found at varying positions in the reads. In our experiments we consider
<italic>k</italic>
-mers of different lengths:
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e001.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, 4, 5, and 6. If no bias is present, one would expect relatively equal counts for a given
<italic>k</italic>
-mer at different read positions, and the count for each
<italic>k</italic>
-mer should reflect its content in the sequenced DNA. Again, deviation from this expected outcome suggests possible biases in the reads, allowing one to identify
<italic>k</italic>
-mers that are biased towards appearing at specific positions. If the background model is based on a reference sequence, then one can also identify which
<italic>k</italic>
-mers are generally under- or over-represented in the data. Note that technique 1 is a special case of technique 2 with
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e002.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
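<p>A minimal sketch of this comparison follows, under a loudly stated assumption: in place of the paper's background model, the expected count of a k-mer at any position is taken to be its mean count over all read positions, which needs no reference. The function names kmer_counts_by_position and position_ratios are hypothetical.</p>

```python
from collections import Counter


def kmer_counts_by_position(reads, k):
    """Entry i of the result is a Counter of the k-mers starting at read position i."""
    n_positions = max((len(r) for r in reads), default=0) - k + 1
    counts = [Counter() for _ in range(max(n_positions, 0))]
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[i][read[i:i + k]] += 1
    return counts


def position_ratios(counts, kmer):
    """Observed count at each position divided by the mean count over positions.

    Assumption: the mean over positions stands in for the background model.
    A ratio far from 1.0 hints at a positional bias for this k-mer.
    """
    observed = [c[kmer] for c in counts]
    if not observed:
        return []
    mean = sum(observed) / len(observed)
    return [o / mean if mean > 0 else 0.0 for o in observed]


if __name__ == "__main__":
    toy_reads = ["AAAC", "AAGT", "CAAA"]  # made-up data for demonstration
    counts = kmer_counts_by_position(toy_reads, k=2)
    print(position_ratios(counts, "AA"))
```

<p>With k set to 1 the same functions reduce to the base counts of Technique 1.</p>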
</sec>
<sec id="s2c">
<title>Technique 3: Analysing and comparing distributions of
<italic>k</italic>
-mers in the read set</title>
<p>In this method, at each particular read position we compute the distribution of the frequencies of all possible
<italic>k</italic>
-mers (of fixed length). We then assess bias by comparing distributions at different read positions, using the Kullback-Leibler divergence measure known from information theory. This yields an overall dissimilarity score between the two distributions of
<italic>k</italic>
-mers, with higher values indicating higher dissimilarity. We can then compare this score with an expected value obtained using a bootstrap approach; the method is explained in detail in the results section. If no bias is present, then we expect distributions from different read positions to have a low dissimilarity score that is close to the expected value.</p>
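<p>The divergence computation can be sketched as follows. This is an illustrative reading, not the authors' code: the add-one pseudocount used to avoid zero frequencies, the restriction to k-mers over A, C, G and T, and the function names are all assumptions, and the bootstrap estimate of the expected value is not shown.</p>

```python
import math
from collections import Counter
from itertools import product


def kmer_distribution(reads, pos, k, pseudocount=1.0):
    """Smoothed frequency of every possible k-mer over ACGT at read position pos.

    Assumption: a pseudocount keeps every frequency positive so that the
    divergence is finite; k-mers containing N are ignored.
    """
    counts = Counter(r[pos:pos + k] for r in reads if len(r) >= pos + k)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; p and q share the same keys."""
    return sum(p[m] * math.log2(p[m] / q[m]) for m in p)


if __name__ == "__main__":
    toy_reads = ["AACGT", "AAGGT", "ACCGT", "AATGT"]  # made-up data
    first = kmer_distribution(toy_reads, pos=0, k=2)
    last = kmer_distribution(toy_reads, pos=3, k=2)
    print(kl_divergence(first, last))
```

<p>Note that the divergence is not symmetric, so the order of the two read positions matters when scores are compared.</p>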
</sec>
<sec id="s2d">
<title>Modelling</title>
<p>Note that our methodology is generally applicable to any kind of sequencing data. The interpretation of the results, however, is specific to the kind of experiment through which the data was acquired. The following assumptions apply to whole genome sequencing data only. When interpreting other experiments, such as ChIP-seq or RNA-seq, different hypotheses have to be formulated and applied accordingly.</p>
<p>Our methods can be used to test hypotheses about the data, which we formalise as follows. Let
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e003.jpg" mimetype="image"></inline-graphic>
</inline-formula>
be the genome length,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e004.jpg" mimetype="image"></inline-graphic>
</inline-formula>
the read length, and
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e005.jpg" mimetype="image"></inline-graphic>
</inline-formula>
the number of reads. For a substring
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e006.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, we define
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e007.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to be a random variable describing the number of reads that contain
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e008.jpg" mimetype="image"></inline-graphic>
</inline-formula>
at position
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e009.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. We can now model two standard assumptions.</p>
<sec id="s2d1">
<title>Uniform distribution of reads in the genome</title>
<p>Under this assumption, each position in the genome is equally likely to be sampled with a read by the sequencing machine. This has the following consequence: if there are
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e010.jpg" mimetype="image"></inline-graphic>
</inline-formula>
occurrences of a (short)
<italic>k</italic>
-mer
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e011.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in the genome, every read has a probability
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e012.jpg" mimetype="image"></inline-graphic>
</inline-formula>
of starting with
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e013.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. More precisely, let
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e014.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, then the number of reads starting with
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e015.jpg" mimetype="image"></inline-graphic>
</inline-formula>
should follow the binomial distribution below. This forms a null “uniform sampling” hypothesis:
<disp-formula>
<graphic xlink:href="pone.0012681.e016.jpg" mimetype="image" position="float"></graphic>
<label>(1)</label>
</disp-formula>
</p>
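<p>As a rough illustration of the uniform sampling hypothesis, the following Python sketch counts the occurrences of a k-mer in a toy genome and derives the mean and standard deviation of the binomial count of reads starting with it. The per-read probability used here, c/(g - k + 1), as well as the function names and the toy sequence, are assumptions made for illustration only; the probability in the model is the one defined in the text above.</p>

```python
import math

def count_occurrences(genome, kmer):
    """Count (possibly overlapping) occurrences of kmer in genome."""
    k = len(kmer)
    return sum(1 for i in range(len(genome) - k + 1) if genome[i:i + k] == kmer)

def start_count_model(genome, kmer, n_reads):
    """Return (p, mean, sd) of the binomial count of reads starting with kmer.

    Assumption: p = c / (g - k + 1), that is, every k-mer start position in
    the genome is equally likely to be the first base of a read.
    """
    g, k = len(genome), len(kmer)
    c = count_occurrences(genome, kmer)
    p = c / (g - k + 1)
    mean = n_reads * p
    sd = math.sqrt(n_reads * p * (1.0 - p))
    return p, mean, sd
```

<p>For example, <monospace>start_count_model("ACGTACGTAAACGT", "ACG", 1000)</monospace> uses a hypothetical 14-base sequence in which the 3-mer occurs three times among twelve possible start positions, so the sketch assigns it a per-read probability of 0.25.</p>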
<p>An implication of this assumption is that
<italic>k</italic>
-mers occurring in reads are equally likely to occur at any position within the reads. We can state an associated second null “position independence” hypothesis:
<disp-formula>
<graphic xlink:href="pone.0012681.e017.jpg" mimetype="image" position="float"></graphic>
<label>(2)</label>
</disp-formula>
In other words, the random variable
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e018.jpg" mimetype="image"></inline-graphic>
</inline-formula>
should be uncorrelated with the parameter
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e019.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. If the probability of a read starting with
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e020.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e021.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, the same probability applies for all reads to have
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e022.jpg" mimetype="image"></inline-graphic>
</inline-formula>
at the second position, the third, and so forth; for convenience we neglect the special cases at the start and end of the genome.</p>
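<p>The position independence hypothesis can be explored with a simple tally of how often a given k-mer is seen at each read offset. The sketch below is a minimal Python illustration under the assumption that reads are available as plain strings; the function names are hypothetical and are not part of any published package.</p>

```python
from collections import Counter

def positional_kmer_counts(reads, kmer):
    """Count, for every read offset, how many reads contain kmer at that offset."""
    k = len(kmer)
    counts = Counter()
    for read in reads:
        for pos in range(len(read) - k + 1):
            if read[pos:pos + k] == kmer:
                counts[pos] += 1
    return counts

def max_relative_deviation(counts, n_positions):
    """Largest relative deviation of a per-position count from the mean count."""
    values = [counts.get(pos, 0) for pos in range(n_positions)]
    mean = sum(values) / n_positions
    if mean == 0:
        return 0.0
    return max(abs(v - mean) for v in values) / mean
```

<p>Under the null hypothesis the per-offset counts should be roughly flat, so a large relative deviation is a hint (not a formal test) that the k-mer frequency depends on read position.</p>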
</sec>
<sec id="s2d2">
<title>Random substitution error model</title>
<p>Under this assumption, errors made by the sequencing machine need not occur uniformly across all positions in the reads but, where they do occur, they are assumed to be random substitutions. We model this assumption as follows. Let
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e023.jpg" mimetype="image"></inline-graphic>
</inline-formula>
be a random variable representing a substitution at an error position in a read, then we can state a third null hypothesis:
<disp-formula>
<graphic xlink:href="pone.0012681.e024.jpg" mimetype="image" position="float"></graphic>
<label>(3)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e025.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the probability distribution for the substitution of a base.</p>
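<p>One way to confront observed substitutions with such a model is a chi-square statistic. The Python sketch below assumes, purely for illustration, that a miscalled base is replaced by each of the three other bases with probability 1/3; the actual substitution distribution is the one defined in the text, and the function name and input layout are hypothetical.</p>

```python
def substitution_chi_square(sub_counts):
    """Chi-square statistic of observed substitutions against a uniform model.

    sub_counts maps (true_base, called_base) to an observed count.
    Assumption: under the null model, a miscalled base is replaced by each of
    the three other bases with probability 1/3.
    """
    bases = "ACGT"
    stat = 0.0
    for true_base in bases:
        others = [b for b in bases if b != true_base]
        observed = [sub_counts.get((true_base, b), 0) for b in others]
        total = sum(observed)
        if total == 0:
            continue  # no substitutions observed for this base
        expected = total / 3.0
        stat += sum((o - expected) ** 2 / expected for o in observed)
    return stat
```

<p>Each true base with at least one observed substitution contributes two degrees of freedom, so the statistic would be compared against a chi-square distribution with up to eight degrees of freedom.</p>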
</sec>
</sec>
<sec id="s2e">
<title>Datasets</title>
<p>We evaluate our techniques on a range of data sets from different sequencing platforms and organisms. The analysis includes a total of
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e026.jpg" mimetype="image"></inline-graphic>
</inline-formula>
reads of various lengths. The full list of read sets used can be found in
<xref ref-type="supplementary-material" rid="pone.0012681.s001">Table S1</xref>
. For the sake of consistency and comparability, we will present only one of the data sets in the main manuscript, a publicly available read set from the
<italic>1000 genomes project</italic>
(
<ext-link ext-link-type="uri" xlink:href="http://www.1000genomes.org">www.1000genomes.org</ext-link>
).</p>
<p>The data
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e027.jpg" mimetype="image"></inline-graphic>
</inline-formula>
was generated by a recent edition of Illumina's Genome Analyzer II and is a union of the components
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e028.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e029.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e030.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, and
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e031.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where each component is extracted from the NA10847 dataset (available at
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA10847/sequence_read/">ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA10847/sequence_read/</ext-link>
). The data was generated with the
<italic>SBS v2</italic>
kit and processed by the
<italic>Pipeline v1.3</italic>
software package. This set is 6.6 GB of data in FASTQ format, consisting of approximately 52 million reads of length 51. Memory restrictions on our test hardware prevented us from using the entire set of data from the
<italic>NA10847</italic>
folder. Note that the volume of data we processed and the size of the effects observed ensure that results such as the differences in proportions of observed bases are statistically significant.</p>
<p>The reference for the human genome used in this paper is hg18 NCBI Build 36.1 from March 2006, available from genome.ucsc.edu/cgi-bin/hgTracks.</p>
<p>We have also successfully applied our analysis techniques to other datasets. Some results are included in the supplementary material and referenced in our discussions; we refer to this data as data sets
<italic>D1</italic>
to
<italic>D7</italic>
. The data presented in the main manuscript,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e032.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, corresponds to
<italic>D5</italic>
in
<xref ref-type="supplementary-material" rid="pone.0012681.s001">Table S1</xref>
.</p>
<p>We filtered the data for artefacts present in the collection of reads, in order to exclude artificial biases. We filtered out poly-A fragments which, according to user reports, may occur frequently at the peripheral areas of the flow cells in the Genome Analyzer due to reflections. We also filtered out a sequencing primer starting with “
<named-content content-type="gene">GATTACAGGCATGAGC</named-content>
”, which we were able to identify after k-mer analysis of the data set.</p>
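<p>A filter of this kind can be sketched in a few lines of Python. The primer prefix is the one quoted above; the poly-A criterion (at least 90% of the bases being A) and the function names are assumptions chosen for illustration, not the thresholds used in our experiments.</p>

```python
PRIMER_PREFIX = "GATTACAGGCATGAGC"  # primer start quoted in the text

def is_poly_a(read, min_fraction=0.9):
    """Flag reads that consist almost entirely of A (threshold is an assumption)."""
    if not read:
        return False
    return read.count("A") / len(read) >= min_fraction

def filter_reads(reads, primer=PRIMER_PREFIX, min_fraction=0.9):
    """Yield reads that are neither poly-A fragments nor start with the primer."""
    for read in reads:
        if is_poly_a(read, min_fraction):
            continue
        if read.startswith(primer):
            continue
        yield read
```

<p>Because <monospace>filter_reads</monospace> is a generator, it can be applied to a stream of reads without holding the whole read set in memory.</p>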
</sec>
<sec id="s2f">
<title>Software</title>
<p>Publicly available software for our analysis methods can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://www.genomics.csse.unimelb.edu.au/quarc">www.genomics.csse.unimelb.edu.au/quarc</ext-link>
. The package is called
<italic>quarc</italic>
(Quality Analysis and Read Control).</p>
</sec>
</sec>
<sec id="s3">
<title>Results and Discussion</title>
<p>We applied our three analysis techniques to our subset of the 1000 Genomes Project data, and used the model and null hypotheses proposed above to interpret the significance of our results.</p>
<sec id="s3a">
<title>Analysing base call frequencies</title>
<p>Our first approach to analysing the data was inspired by observations made by Dohm et al.
<xref rid="pone.0012681-Dohm1" ref-type="bibr">[10]</xref>
and the base-call analysis routines provided by Illumina (
<ext-link ext-link-type="uri" xlink:href="http://www.illumina.com/">http://www.illumina.com/</ext-link>
).
<xref ref-type="fig" rid="pone-0012681-g001">Figure 1</xref>
shows a base-call graph for the dataset, and
<xref ref-type="fig" rid="pone-0012681-g002">Figure 2</xref>
shows the accompanying quality values for these base calls. The following observations can immediately be made:</p>
<list list-type="bullet">
<list-item>
<p>A common assumption about short-read data is that base call frequencies should be independent of the position in the read.
<xref ref-type="fig" rid="pone-0012681-g001">Figure 1</xref>
, however, indicates that this assumption holds only from about base 10 onwards. The first bases of the reads (positions 1 to 9) deviate strongly from the expected behaviour.</p>
</list-item>
<list-item>
<p>The deviant behaviour we observe across the initial bases cannot be attributed to the internal quality measures used by the sequencing machine.
<xref ref-type="fig" rid="pone-0012681-g002">Figure 2</xref>
shows that there is no significant drop in quality score values across the first ten positions of the reads, indicating good reliability of the base frequencies at these positions.</p>
</list-item>
<list-item>
<p>
<xref ref-type="fig" rid="pone-0012681-g002">Figure 2</xref>
has strong similarity to a graph presented by Dohm et al.
<xref rid="pone.0012681-Dohm1" ref-type="bibr">[10]</xref>
, showing a fall in quality scores toward the end of the reads. This result is also consistent with the observations made by Chaisson et al. in
<xref rid="pone.0012681-Chaisson1" ref-type="bibr">[15]</xref>
.</p>
</list-item>
<list-item>
<p>The presence of major biases in the starting locations, or possibly the presence of sequencing primers left in the input data, could be responsible for the shape of the base call graph in
<xref ref-type="fig" rid="pone-0012681-g001">Figure 1</xref>
. We observed this behaviour (see
<xref ref-type="supplementary-material" rid="pone.0012681.s002">Figures S1</xref>
and
<xref ref-type="supplementary-material" rid="pone.0012681.s003">S2</xref>
as examples) in all the Genome Analyzer outputs we investigated.</p>
</list-item>
<list-item>
<p>There is a noticeable enrichment of As over Ts and Cs over Gs in the base call frequencies. This is also true for all of the Illumina Genome Analyzer II read sets investigated in the supplementary data (except one that is not representative of a regular sequencing run). Such differences between complementary bases could be explained by strand sampling biases. However, a strand bias cannot explain the consistent preference for one base over the other. Poly-A fragments (mentioned in the data set section) that have not been removed from the data could play a role in the oversampling of As, but there is no analogous phenomenon that would explain the difference between Cs and Gs. An amplification and sampling bias may be the cause of the observation.</p>
</list-item>
<list-item>
<p>Note that the same observation can be made for the 454 data with respect to A and T, but the roles are reversed for C and G (see
<xref ref-type="supplementary-material" rid="pone.0012681.s009">Figure S8</xref>
).</p>
</list-item>
</list>
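<p>A base-call profile of the kind discussed above can be computed directly from the reads, without a reference. The following Python sketch is a minimal illustration that assumes the reads are held as plain strings; the function name is hypothetical.</p>

```python
from collections import Counter

def base_call_frequencies(reads):
    """Return a list with one dict per read position: base -> relative frequency."""
    n_positions = max((len(r) for r in reads), default=0)
    counters = [Counter() for _ in range(n_positions)]
    for read in reads:
        for pos, base in enumerate(read):
            counters[pos][base] += 1
    profile = []
    for counter in counters:
        total = sum(counter.values())
        profile.append({base: n / total for base, n in counter.items()})
    return profile
```

<p>Under position independence, the dictionaries returned for different positions should be approximately equal; systematic differences across the first positions correspond to the deviation described in the list above.</p>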
<fig id="pone-0012681-g001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Basecalls for each position in the reads for dataset
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e033.jpg" mimetype="image"></inline-graphic>
</inline-formula>
from the 1000 genomes project (see Section “Datasets”).</title>
</caption>
<graphic xlink:href="pone.0012681.g001"></graphic>
</fig>
<fig id="pone-0012681-g002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Average Phred quality scores for
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e034.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</title>
<p>Bars represent the mean quality score for each position in a read (see
<xref rid="pone.0012681-Ewing1" ref-type="bibr">[17]</xref>
for details).</p>
</caption>
<graphic xlink:href="pone.0012681.g002"></graphic>
</fig>
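<p>Mean quality scores per position, as plotted in Figure 2, can be obtained from the quality strings of a FASTQ file. The Python sketch below assumes that the quality strings have already been extracted and that the ASCII offset of the encoding is known; the default of 33 is an assumption, and files written by some older Illumina pipeline versions use an offset of 64 instead.</p>

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position.

    offset is the ASCII offset of the FASTQ quality encoding (assumed to be
    33 here; pass 64 for files that use the older Illumina encoding).
    """
    sums, counts = [], []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(sums):
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(char) - offset
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]
```

<p>Reads of different lengths are handled by averaging, at each position, only over the reads that are long enough to cover it.</p>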
<p>The base calls exhibit this behaviour even after our filtering of the dataset for primers and other artefacts, as described in Section “Datasets”. We found this kind of pattern to be a common characteristic of base calls from Illumina sequencers; see the supplementary data for further analysis.</p>
<p>The data sets
<italic>D1</italic>
and
<italic>D4</italic>
in the supplementary data have been mapped to their reference genomes to ensure that only valid reads are retained. Furthermore, the reads of
<italic>D4*</italic>
have been quality filtered, so that we only retained reads in which all bases have a high quality score. Our aim is to develop methods that do not require a reference, but it is plausible to hypothesise that the biases are due to poor reads; this mapping and filtering eliminates poor reads as an explanation for the biases.</p>
<p>Analysis of
<italic>D3</italic>
(another
<italic>1000 genomes</italic>
dataset) revealed another striking anomaly in base call frequencies, shown in
<xref ref-type="supplementary-material" rid="pone.0012681.s004">Figure S3</xref>
, with base frequency varying wildly with position and significant falloff in calling of A towards the end of the read. We believe it is important to be aware of such characteristics before undertaking genome assembly.</p>
<p>Other sequencing platforms show different error characteristics, which are in turn reflected in the base call frequencies. Data set
<italic>D7</italic>
was created with Roche's pyrosequencing technology 454; its base call progression can be reviewed in
<xref ref-type="supplementary-material" rid="pone.0012681.s009">Figure S8</xref>
. There is a noticeable increase of A nucleotides towards the end of reads, whereas the occurrences of Cs decline. Note, however, that this read set is composed of reads of various lengths, and the observed behaviour could be explained by sampling biases as well as error biases.</p>
<p>Being aware of these characteristics of the input data can help to interpret them better, which demonstrates the value of applying simple statistics of this kind to the data before trying to make use of it. For instance, base call progressions at the end of reads, such as those in the data sets
<italic>D1</italic>
,
<italic>D4</italic>
,
<italic>D5</italic>
(
<italic>D</italic>
), and
<italic>D7</italic>
(see
<xref ref-type="supplementary-material" rid="pone.0012681.s001">Table S1</xref>
), suggest that users should make use of trimming techniques like that presented by Qu et al.
<xref rid="pone.0012681-Qu1" ref-type="bibr">[16]</xref>
.</p>
</sec>
<sec id="s3b">
<title>Analysing occurrences of
<italic>k</italic>
-mers</title>
<p>Our second technique examines the frequency of
<italic>k</italic>
-mer occurrences against a background model. Having observed the anomalies at the start of reads described in Section “Analysing base call frequencies”, we generalised the concept and compared the frequencies of
<italic>k</italic>
-mers for varying
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e035.jpg" mimetype="image"></inline-graphic>
</inline-formula>
at the read start against their overall frequencies in the reference sequence or, if no reference is available, against a background model derived from the rest of the reads. As formalised in the “Modelling” section, we would expect each position in the genome to be equally likely to be sampled by a read, and thus the
<italic>k</italic>
-mers at the start of a read should follow the distribution of the
<italic>k</italic>
-mers in the reference or background (methylation issues excluded). However, we found large discrepancies from this assumption in our analysis, with
<italic>k</italic>
-mers at the start of a read being under- or over-represented with respect to the background model by orders of magnitude.</p>
<p>In more detail: we obtain the probability distribution for all
<italic>k</italic>
-mers from a reference genome as a background model. No reference is ideal; some genomes are CG-rich to a greater extent than others, for example, while others have such a high proportion of coding region (such as Drosophila) that other forms of non-randomness may be observed. However, a reference from a similar organism should have reasonably similar statistical properties. As noted below, the pool of reads themselves can be used as a background model.</p>
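<p>When no reference is available, a background model can be estimated from the reads themselves. The Python sketch below is one possible illustration: it builds relative k-mer frequencies over all read offsets and compares them with the frequencies observed at the read start. Using all offsets (including the start) as the background, and the function names, are assumptions made for this example.</p>

```python
from collections import Counter

def kmer_distribution(sequences, k, start_only=False):
    """Relative k-mer frequencies over all offsets, or over offset 0 only."""
    counts = Counter()
    for seq in sequences:
        n_offsets = max(len(seq) - k + 1, 0)
        if start_only:
            n_offsets = min(n_offsets, 1)
        for pos in range(n_offsets):
            counts[seq[pos:pos + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

def start_enrichment(reads, k):
    """Ratio of read-start frequency to background frequency for each k-mer."""
    background = kmer_distribution(reads, k)
    starts = kmer_distribution(reads, k, start_only=True)
    return {kmer: starts.get(kmer, 0.0) / p for kmer, p in background.items()}
```

<p>A ratio far above or below 1 marks a k-mer as over- or under-represented at the read start relative to this background; the same function can be applied to a reference sequence by passing it as a single-element list to <monospace>kmer_distribution</monospace>.</p>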
<p>According to our null hypothesis
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e036.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, the distribution of any
<italic>k</italic>
-mer should be binomial
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e037.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. Since
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e038.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is large and
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e039.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is small for this kind of data, we calculate
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e040.jpg" mimetype="image"></inline-graphic>
</inline-formula>
values of the distribution using the Stirling approximation (that is, we approximate the factorials for large numbers as
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e041.jpg" mimetype="image"></inline-graphic>
</inline-formula>
). We can then obtain p-values by approximating the cumulative probabilities for the given binomial distribution taking the observed values; for values far from the distribution's mean we estimate the tail by
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e042.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e043.jpg" mimetype="image"></inline-graphic>
</inline-formula>
the mean and
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e044.jpg" mimetype="image"></inline-graphic>
</inline-formula>
the binomial probability. The p-values obtained for certain
<italic>k</italic>
-mers show highly significant differences from the expected values at the
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e045.jpg" mimetype="image"></inline-graphic>
</inline-formula>
confidence level. This provides strong evidence to reject the
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e046.jpg" mimetype="image"></inline-graphic>
</inline-formula>
hypothesis.</p>
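<p>To illustrate how binomial probabilities can be evaluated for very large numbers of reads, the following Python sketch works in log space and uses the common form of the Stirling approximation, ln n! ≈ n ln n - n + 0.5 ln(2πn). This form, the function names and the example values are assumptions for illustration; the sketch does not reproduce the tail estimate described above.</p>

```python
import math

def log_factorial_stirling(n):
    """Stirling approximation of ln(n!): n ln n - n + 0.5 ln(2 pi n)."""
    if n == 0:
        return 0.0
    return n * math.log(n) - n + 0.5 * math.log(2.0 * math.pi * n)

def log_binomial_pmf(x, n, p):
    """Approximate ln P(X = x) for X ~ Binomial(n, p), p strictly between 0 and 1."""
    log_choose = (log_factorial_stirling(n)
                  - log_factorial_stirling(x)
                  - log_factorial_stirling(n - x))
    return log_choose + x * math.log(p) + (n - x) * math.log(1.0 - p)
```

<p>Working with logarithms avoids overflow in the factorials and underflow in the probabilities. Where available, <monospace>math.lgamma</monospace> gives the log-factorial without the approximation error and can be used as a cross-check.</p>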
<p>We identified some of these anomalies as PCR primers contained in the reads (as for example in the NA06985 read set presented in the supplementary data). Other anomalous
<italic>k</italic>
-mers could not be explained as easily and, more curiously, showed consistently unusual behaviour across completely unrelated datasets. The polynucleotide sequences are a notable example. Our experiments with several datasets showed an under-representation of poly-C, poly-G, and poly-T sequences at the read start (and in general compared to the reference genome), whereas the reverse complement of poly-T, namely poly-A, occurred well within the expected range. The calculated p-values for poly-C, poly-G, and poly-T
<italic>6-mers</italic>
are all significant at the
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e047.jpg" mimetype="image"></inline-graphic>
</inline-formula>
significance level. We also analysed where such polynucleotide sequences occurred in the reads.
<xref ref-type="fig" rid="pone-0012681-g003">Figures 3</xref>
and
<xref ref-type="fig" rid="pone-0012681-g004">4</xref>
show the frequency for
<italic>k</italic>
-mers of repeated nucleotides at each position in a read, with the dotted line showing their expected frequency based on the reference genome. Note the interesting and significant difference of representation of identical strings in complementary form (poly-A versus poly-T and poly-C versus poly-G).</p>
<fig id="pone-0012681-g003" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Occurrences of poly-A and poly-T sequences of different length depending on position in the read and their expected values for read set
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e048.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</title>
<p>The y-axis shows the total number of occurrences (log scale). Dotted lines represent the expected occurrences for the respective
<italic>k</italic>
-mer lengths. Poly-A sequences displayed as
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e049.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, poly-T as
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e050.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="pone.0012681.g003"></graphic>
</fig>
<fig id="pone-0012681-g004" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.g004</object-id>
<label>Figure 4</label>
<caption>
<title>Occurrences of poly-C and poly-G sequences of different length depending on position in the read and their expected values for read set
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e051.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</title>
<p>The y-axis shows the total number of occurrences (log scale). Dotted lines represent the expected occurrences for the respective
<italic>k</italic>
-mer lengths. Poly-C sequences displayed as
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e052.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, poly-G as
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e053.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="pone.0012681.g004"></graphic>
</fig>
<p>The frequencies of these anomalous
<italic>k</italic>
-mers show significant correlation with their position of occurrence in a read. Squared correlation coefficients and their p-values are shown in
<xref ref-type="table" rid="pone-0012681-t001">Table 1</xref>
. The statistics were generated using the PASW Statistics 17.0 software; a cubic model was found to provide the best fit when relating read position to frequency for a given
<italic>k</italic>
-mer. We summarise our observations as follows:</p>
<list list-type="bullet">
<list-item>
<p>The highly significant p-values in
<xref ref-type="table" rid="pone-0012681-t001">Table 1</xref>
suggest that frequencies of certain
<italic>k</italic>
-mers are not independent of their position in a read. We thus reject the null hypothesis
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e054.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and hence also reject
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e055.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>All the polynucleotide sequences show increased frequency towards the end of reads.</p>
</list-item>
<list-item>
<p>Unusual behaviour is found at the start of reads, where the selected
<italic>k</italic>
-mers have unusually high or low frequencies (see
<xref ref-type="fig" rid="pone-0012681-g003">Figure 3</xref>
). We found this kind of behaviour repeated across different datasets.</p>
</list-item>
<list-item>
<p>The majority of
<italic>k</italic>
-mers (that is, those other than the polynucleotide strings) do not show this kind of behaviour, and their frequency distributions are consistent with the null hypothesis.</p>
</list-item>
<list-item>
<p>Comparing our results from
<xref ref-type="fig" rid="pone-0012681-g004">Figure 4</xref>
with
<xref ref-type="fig" rid="pone-0012681-g001">Figure 1</xref>
shows that there is a dramatic increase of poly-G 6-mers at the end of reads, even though the count of base calls for single G bases remains stable towards the end of reads.</p>
</list-item>
<list-item>
<p>Polynucleotide sequences consisting of C or G are significantly under-represented in the reads compared to the reference genome. This stands in contrast to the observations made by Dohm et al.
<xref rid="pone.0012681-Harismendy1" ref-type="bibr">[11]</xref>
, who detected an enriched representation with higher CG content. This bias, however, was observed in Solexa 1G sequencers. Other unpublished experiments we have undertaken show that this bias is not prevalent in later editions of the Illumina sequencing platform.</p>
</list-item>
<list-item>
<p>Experiments show that the higher occurrence of poly-A, poly-T, poly-C, or poly-G sequences at the end of reads is not a reflection of the material being sequenced but is due to a systematic introduction of error, which turns quasi-polynucleotide sequences into actual ones through minor changes. We verified this by mapping the reads to a reference genome and identifying erroneous bases in uniquely mapping reads: the majority of poly-G sequences in the read data were seen to arise from systematic sequencing errors, violating the assumption
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e056.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. This contradicts our error model, because the errors are not context-free.</p>
</list-item>
</list>
<table-wrap id="pone-0012681-t001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.t001</object-id>
<label>Table 1</label>
<caption>
<title>Cubic correlation coefficients for polynucleotide sequences of different length for the read set
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e057.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</title>
</caption>
<alternatives>
<graphic id="pone-0012681-t001-1" xlink:href="pone.0012681.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">polyA</td>
<td align="left" rowspan="1" colspan="1">polyC</td>
<td align="left" rowspan="1" colspan="1">polyG</td>
<td align="left" rowspan="1" colspan="1">polyT</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="5" align="left" rowspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e058.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Correlation
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e059.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e060.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e061.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<xref ref-type="table-fn" rid="nt101">*</xref>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e062.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e063.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<xref ref-type="table-fn" rid="nt101">*</xref>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">p-value</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e064.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e065.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<xref ref-type="table-fn" rid="nt101">*</xref>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e066.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e067.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<xref ref-type="table-fn" rid="nt101">*</xref>
</td>
</tr>
<tr>
<td colspan="5" align="left" rowspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e068.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Correlation
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e069.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e070.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e071.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e072.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e073.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">p-value</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e074.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e075.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e076.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e077.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td colspan="5" align="left" rowspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e078.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Correlation
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e079.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e080.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e081.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e082.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e083.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">p-value</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e084.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e085.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e086.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e087.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td colspan="5" align="left" rowspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e088.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Correlation
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e089.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e090.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e091.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e092.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e093.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">p-value</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e094.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e095.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e096.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e097.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label></label>
<p>*Values for linear correlation (better fit).</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Although we show results here using a reference sequence, this method, like the others, can be applied without a reference. The reference has been used to validate our methodology, not as an essential step to obtain results. Note that, in the absence of a reference, the pool of reads themselves can be used to obtain a background model, and that the positional analysis of
<italic>k</italic>
-mer frequencies is entirely independent of a reference in the first place.</p>
<p>As was the case for the first method we presented, this second simple statistic can provide useful quality information on a set of reads, and help guide later computational or chemical analysis. For example, our results illustrate that the biases are probably due to the chemical processing rather than the sample preparation, as the data sets were prepared in different laboratories. Although the choice of data to use as a background model may lead to apparent biases, our results here suggest there are other more significant causes. This is confirmed by the Kullback-Leibler analysis, as we next explain.</p>
</sec>
<sec id="s3c">
<title>Distribution analysis</title>
<p>Our third proposed analysis assesses the data in more depth. Given the observation in
<xref ref-type="fig" rid="pone-0012681-g003">Figure 3</xref>
and
<xref ref-type="fig" rid="pone-0012681-g004">4</xref>
and the uneven distribution over the read positions of distinct sequences, we analyse the overall
<italic>k</italic>
-mer distributions for each position in a read. We then compare these distributions using the Kullback-Leibler (KL) divergence to gauge how different they are.</p>
<p>Given two probability distributions
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e098.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e099.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, the KL divergence of a set of
<italic>k</italic>
-mers
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e100.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is:
<disp-formula>
<graphic xlink:href="pone.0012681.e101.jpg" mimetype="image" position="float"></graphic>
<label>(4)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e102.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e103.jpg" mimetype="image"></inline-graphic>
</inline-formula>
denote the probability of
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e104.jpg" mimetype="image"></inline-graphic>
</inline-formula>
under
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e105.jpg" mimetype="image"></inline-graphic>
</inline-formula>
or
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e106.jpg" mimetype="image"></inline-graphic>
</inline-formula>
respectively. Intuitively, this function measures how different two distributions are, with a higher value implying greater divergence. In information theory, the KL divergence measures the average number of bits wasted per symbol when using distribution
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e107.jpg" mimetype="image"></inline-graphic>
</inline-formula>
for encoding when symbols are in fact distributed according to
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e108.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
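<p>As a minimal illustrative sketch, not the implementation used for the results reported here, the divergence can be computed as follows. The sketch assumes the standard definition KL(P‖Q) = Σ P(m) log(P(m)/Q(m)) with the logarithm taken to base 2, so that the result is in bits, and it assumes that each distribution is supplied as a mapping from k-mer to probability. The function name kl_divergence and the dictionary representation are hypothetical choices made for illustration.</p>

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in bits for two distributions given as dicts: k-mer to probability."""
    total = 0.0
    for kmer, p_x in p.items():
        if p_x == 0.0:
            continue  # terms with P(x) = 0 contribute nothing to the sum
        q_x = q.get(kmer, 0.0)
        if q_x == 0.0:
            return math.inf  # P puts mass on a k-mer that Q never produces
        total += p_x * math.log2(p_x / q_x)
    return total


# Toy distributions over two symbols, hypothetical values for illustration.
uniform = {"A": 0.5, "C": 0.5}
skewed = {"A": 1.0}
bits_wasted = kl_divergence(skewed, uniform)
```

<p>Note that the measure is not symmetric: in this toy case the divergence of the skewed distribution from the uniform one is finite, whereas the reverse direction is infinite because the skewed distribution assigns zero probability to one symbol.</p>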
<p>Let
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e109.jpg" mimetype="image"></inline-graphic>
</inline-formula>
be the distribution for all
<italic>6-mers</italic>
at position
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e110.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. We then compute the KL divergence between each possible pair of positions:
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e111.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e112.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e113.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of
<italic>6-mers</italic>
in a read, with
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e114.jpg" mimetype="image"></inline-graphic>
</inline-formula>
the read length.
<xref ref-type="fig" rid="pone-0012681-g005">Figure 5</xref>
shows a graphical representation of the divergence profile.</p>
<list list-type="bullet">
<list-item>
<p>Divergence is high when comparing the first position's distribution against any other. This might imply biases in the starting positions of reads and thus the existence of biased
<italic>6-mers</italic>
in the first bases.</p>
</list-item>
<list-item>
<p>Divergence is high when comparing the first with the last position's distribution. This observation is valid across all analysed datasets – and expected as explained later.</p>
</list-item>
<list-item>
<p>The main area of the graph contains small divergence measures of around
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e115.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. This ‘plain’ of small divergences seems to confirm the claim made earlier in the “Modelling” section, that for the majority of
<italic>6-mers</italic>
, their occurrences are consistent with the null hypothesis
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e116.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, thus also implying
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e117.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>There is a small but significant “bump” in the divergence when comparing any positions with those around 25 to 30. This can be seen as lighter coloured stripes crossing horizontally and vertically through the middle of
<xref ref-type="fig" rid="pone-0012681-g005">Figure 5</xref>
. We believe it to be caused by artefacts left in the data. Factors such as primers occur in a particular reading frame and can cause biases in the distributions at particular loci of the reads.</p>
</list-item>
<list-item>
<p>Besides the obvious extreme values for divergence stated above, there is a more subtle but clearly visible decline in divergence from the first few positions towards the end of the reads: in general, divergence is higher for the distributions at early positions than for those at later positions. This is consistent with the figure for the bootstrap experiment presented shortly.</p>
</list-item>
<list-item>
<p>Across several datasets, the general shape of the graph representing the KL divergences was similar: maximal divergence occurred for the first position and was high compared to any other position. See
<xref ref-type="supplementary-material" rid="pone.0012681.s010">Figures S9</xref>
to
<xref ref-type="supplementary-material" rid="pone.0012681.s017">S16</xref>
for illustration.</p>
</list-item>
</list>
<fig id="pone-0012681-g005" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.g005</object-id>
<label>Figure 5</label>
<caption>
<title>3d plot of the KL matrix for
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e118.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</title>
<p>Data points correspond to
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e119.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e120.jpg" mimetype="image"></inline-graphic>
</inline-formula>
shown on the x-axis,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e121.jpg" mimetype="image"></inline-graphic>
</inline-formula>
on the y-axis.</p>
</caption>
<graphic xlink:href="pone.0012681.g005"></graphic>
</fig>
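<p>The construction of the divergence profile can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used to produce Figure 5: all reads are assumed to have the same length, the logarithm is taken to base 2, and a k-mer that is absent at one position but present at another yields an infinite divergence. The function names and the toy reads are hypothetical.</p>

```python
import math
from collections import Counter


def positional_distributions(reads, k):
    """Estimate one k-mer distribution per start position from equal-length reads."""
    n = len(reads[0]) - k + 1  # number of k-mers per read, n = l - k + 1
    counts = [Counter() for _ in range(n)]
    for read in reads:
        for i in range(n):
            counts[i][read[i:i + k]] += 1
    dists = []
    for c in counts:
        total = sum(c.values())
        dists.append({kmer: cnt / total for kmer, cnt in c.items()})
    return dists


def kl_bits(p, q):
    """KL(P || Q) in bits; infinite if P has mass on a k-mer that Q lacks."""
    total = 0.0
    for kmer, p_x in p.items():
        q_x = q.get(kmer, 0.0)
        if q_x == 0.0:
            return math.inf
        total += p_x * math.log2(p_x / q_x)
    return total


def kl_matrix(dists):
    """Entry [i][j] holds KL(P_i || P_j) for every ordered pair of positions."""
    return [[kl_bits(p, q) for q in dists] for p in dists]


# Toy reads with k = 1, hypothetical data for illustration only.
reads = ["AC", "AC", "CA", "AA"]
matrix = kl_matrix(positional_distributions(reads, k=1))
```

<p>The diagonal of the resulting matrix is zero by construction, so only the off-diagonal entries carry information about positional differences.</p>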
<p>To assess the observed divergences, we use a form of bootstrap. We use the
<italic>maq</italic>
(maq.sourceforge.net) simulation tool to generate multiple synthetic read sets from the human genome, of the same size as our natural data set, using the assumptions underlying
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e122.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. We first train the simulator using the quality scores from our test data and then generate 100 different sets of reads (each with around 52 million reads) from the reference genome based on the adopted quality scores. For each read set, the KL divergence values are computed and then an average is taken over all read sets. With this approach we simulate the expected value and the variance of the KL divergence values under the uniform sampling null hypothesis
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e123.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
<p>The central limit theorem then implies that the divergence measure should be normally distributed around the distribution's mean. Mean and variance are derived directly from the simulated distribution.
<xref ref-type="fig" rid="pone-0012681-g006">Figure 6</xref>
shows the expected values for the KL divergence under the null hypothesis; as can be seen, these are much smaller than the values observed for the real data. (Note that the vertical axes are on different scales.)
<xref ref-type="table" rid="pone-0012681-t002">Table 2</xref>
shows further results; instead of p-values we provide effect size in distance from the mean as multiples of the standard deviation of the respective distribution.</p>
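<p>The effect size described above can be sketched as follows. This is a minimal illustration, assuming that the sample standard deviation of the simulated values is used; the function name effect_size and all numerical values are hypothetical and are not taken from our experiments.</p>

```python
import statistics


def effect_size(observed, simulated):
    """Distance of the observed value from the simulated mean, in standard deviations."""
    mean = statistics.mean(simulated)
    sd = statistics.stdev(simulated)  # sample standard deviation (an assumption)
    return (observed - mean) / sd


# Hypothetical KL values for one pair of positions (i, j); not taken from the paper.
simulated_kl = [0.010, 0.012, 0.011, 0.009, 0.013]
observed_kl = 0.050
z = effect_size(observed_kl, simulated_kl)
```

<p>In the analysis, one such value would be obtained for each pair of positions, using the simulated read sets to supply the mean and standard deviation for that pair.</p>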
<fig id="pone-0012681-g006" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.g006</object-id>
<label>Figure 6</label>
<caption>
<title>3d plot of the mean KL values for the bootstrap approach (simulated data).</title>
<p>Data points correspond to
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e124.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e125.jpg" mimetype="image"></inline-graphic>
</inline-formula>
shown on the x-axis,
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e126.jpg" mimetype="image"></inline-graphic>
</inline-formula>
on the y-axis. Note that the scale is not the same as used in
<xref ref-type="fig" rid="pone-0012681-g005">Figure 5</xref>
.</p>
</caption>
<graphic xlink:href="pone.0012681.g006"></graphic>
</fig>
<table-wrap id="pone-0012681-t002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0012681.t002</object-id>
<label>Table 2</label>
<caption>
<title>Statistics for the bootstrap approach and comparison with the read data.</title>
</caption>
<alternatives>
<graphic id="pone-0012681-t002-2" xlink:href="pone.0012681.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">Statistic</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Average standard deviation from mean</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e127.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">95% confidence interval for</td>
<td align="left" rowspan="1" colspan="1">(lower bound)</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e128.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">mean standard deviation</td>
<td align="left" rowspan="1" colspan="1">(upper bound)</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e129.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td colspan="2" align="left" rowspan="1">Avg. distance of observed values from expected value
<xref ref-type="table-fn" rid="nt102">*</xref>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e130.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td colspan="2" align="left" rowspan="1">Avg. effect size for 1st position distr. from expected values
<xref ref-type="table-fn" rid="nt102">*</xref>
</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e131.jpg" mimetype="image"></inline-graphic>
</inline-formula>
</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt102">
<label></label>
<p>*In standard deviations.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The general ‘stingray’ shape of the graph in
<xref ref-type="fig" rid="pone-0012681-g006">Figure 6</xref>
is initially surprising, but is a direct consequence of the error model adopted by the simulation tool. Recall that we trained the simulator with the quality scores of the dataset (see
<xref ref-type="fig" rid="pone-0012681-g002">Figure 2</xref>
). The higher probability of errors at the end of reads leads to a higher diversity of the 6-mer distributions and thus to the observed graph; note that the distribution of 6-mers in the human genome is biased, so that introducing errors using random substitutions tends to make the distribution more even towards the end of the reads. Note further that increasing error rates under the assumption of
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e132.jpg" mimetype="image"></inline-graphic>
</inline-formula>
makes the
<italic>k</italic>
-mer distributions converge towards a uniform distribution, since eventually every position in a read is replaced by an error with equal probability for each base. Even though the error model adopted here does not capture all of the errors in real data, it does reflect the notion of increasing error rate towards the ends of reads. Thus, we expect higher divergence between early and late distributions, because the errors corrupt the pattern of 6-mers observed.</p>
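<p>The flattening effect of random errors can be illustrated with a small calculation. The sketch below is a simplification under stated assumptions: it works on single symbols rather than 6-mers, and it models an error as redrawing the symbol uniformly at random, so that an error rate of 1 yields the uniform distribution. The function names and the biased starting distribution are hypothetical.</p>

```python
import math


def mix_with_uniform(dist, error_rate):
    """Keep each symbol with probability 1 - error_rate, otherwise redraw it uniformly."""
    u = 1.0 / len(dist)
    return {s: (1.0 - error_rate) * p + error_rate * u for s, p in dist.items()}


def kl_from_uniform(dist):
    """KL divergence in bits between dist and the uniform distribution on its symbols."""
    u = 1.0 / len(dist)
    return sum(p * math.log2(p / u) for p in dist.values() if p > 0)


# Hypothetical biased base composition, for illustration only.
biased = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
profile = [kl_from_uniform(mix_with_uniform(biased, e)) for e in (0.0, 0.5, 1.0)]
```

<p>The values in profile decrease as the error rate grows and reach zero at an error rate of 1, which mirrors the convergence towards a uniform distribution described above.</p>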
<p>The divergence measure can be applied for any
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e133.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. Large
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e134.jpg" mimetype="image"></inline-graphic>
</inline-formula>
will result in low sampling of each
<italic>k</italic>
-mer and thus lower the statistical significance. Also, some
<italic>k</italic>
-mers might never be sampled at some positions in the read, whilst being contained at other positions, resulting in difficulties in calculating the KL divergence. Smaller
<inline-formula>
<inline-graphic xlink:href="pone.0012681.e135.jpg" mimetype="image"></inline-graphic>
</inline-formula>
results in little specificity of the
<italic>k</italic>
-mers to a region in the genome, and thus reduces the power of the method to discover regional biases. The choice of length six for this analysis was, however, arbitrary.</p>
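<p>The sampling trade-off can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that all 4^k possible k-mers are equally likely at a given read position, which is not true of the human genome, so the result is only an average; the figure of 52 million reads corresponds to the size of the read sets used in the bootstrap. The function name is hypothetical.</p>

```python
def mean_count_per_kmer(num_reads, k):
    """Average count per k-mer at one read position if all 4**k k-mers were equally likely."""
    return num_reads / 4 ** k


# Print k, the number of possible k-mers, and the average count per k-mer.
for k in (4, 6, 8, 10, 12):
    print(k, 4 ** k, round(mean_count_per_kmer(52_000_000, k), 1))
```

<p>Under this assumption the average count per k-mer falls by a factor of four for every additional base, which is why large values of k leave many k-mers poorly sampled, or not sampled at all, at individual positions.</p>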
</sec>
<sec id="s3d">
<title>Discussion</title>
<p>DNA sequencing is a complex process combining several stages of preparation, chemistry, and computational analysis. Biases for distinct
<italic>k</italic>
-mers or fragment lengths can be introduced at many points during this process: PCR can favour certain
<italic>k</italic>
-mers, for example, as can DNA fragmentation. The chemistry used inside the sequencing hardware and the interpretation of the optical reactions are sensitive to interference of many kinds, such as light and temperature. Our observations imply that some unexpected, complex biases are present in data from the 1000 Genomes Project, and that these may affect how the data is interpreted.</p>
<p>
<xref ref-type="table" rid="pone-0012681-t001">Table 1</xref>
shows that some sequences' occurrences are highly correlated with the position in the reads, contradicting assumptions of how reads are obtained from a DNA sequence. This correlation could be due to the preparation steps of the sequencing library or biases in the sequencing step, or it could be a systematic error in the interpretation of the reaction in the sequencer. These kinds of errors are noted by Harismendy et al.
<xref rid="pone.0012681-Harismendy1" ref-type="bibr">[11]</xref>
and in more detail by Kircher et al.
<xref rid="pone.0012681-Kircher1" ref-type="bibr">[13]</xref>
. For example, this includes the tendency of G to be confused with T, and also a general T accumulation along the reads. The latter was observed in some but not all of our experiments (see
<xref ref-type="supplementary-material" rid="pone.0012681.s002">Figures S1</xref>
to
<xref ref-type="supplementary-material" rid="pone.0012681.s009">S8</xref>
).</p>
<p>On the other hand, it could be a correct image of the data and caused by biases in the starting positions of the reads. This is confirmed by results presented in the supplementary data: the data set
<italic>D4*</italic>
was mapped to the reference and quality filtered to retain only high-quality reads that, with high probability, stem from the actual organism.
<xref ref-type="supplementary-material" rid="pone.0012681.s008">Figures S7</xref>
and
<xref ref-type="supplementary-material" rid="pone.0012681.s016">S15</xref>
show the graphs for base call frequencies and the KL divergence measure. The results show no improvement in the observed biases.</p>
<p>Biases in the starting positions of reads become apparent when looking at the other analyses.
<xref ref-type="fig" rid="pone-0012681-g005">Figure 5</xref>
, representing the Kullback-Leibler divergence of different positions' distributions, shows that the starting positions of reads do not coincide with the general null hypothesis or with the general shape of distributions at other positions. One has to be careful about interpreting the possible biases, because adaptor sequences or any fragments that appear in a distinct reading frame in a read may lead to this observation. Quality filtering however suggests that this is not the case.
<xref ref-type="supplementary-material" rid="pone.0012681.s016">Figure S15</xref>
looks slightly improved over
<xref ref-type="supplementary-material" rid="pone.0012681.s012">Figure S11</xref>
, with smaller divergence for late positions in reads. The divergence at the start of reads however remains present. We thus believe that the underlying issues are not simple sequencing errors or fragments left in the data, but rather systematic biases in site selection in the read generation process.</p>
<p>However, we did filter the data (recall the “Datasets” section) to get a clearer picture of its state, and the observation persisted. We also note that, even if primers or other artefacts were somehow left in the data, the shape of the graph should look different if the reads were unbiased in their starting positions: a primer that occurs at the start of reads massively biases the distribution for the first position, but it also does the same for the second, third, and so on, for a large number of positions, as adaptor sequences or primers are typically long. That is, if the graph's shape were due to this kind of phenomenon, the high divergence should stretch further into the reads.</p>
<p>The same argument applies to
<xref ref-type="fig" rid="pone-0012681-g001">Figure 1</xref>
, where the unusual
<italic>k</italic>
-mers should certainly exceed 10 bp. Thus use of our techniques can give insight into these biases in read starting positions: analysing the over- and under-represented sequences at the starts of reads by calculating p-values as described in Section “Analysing occurrences of k-mers” might indicate favoured and avoided positions in a genome on a sequence level.</p>
<p>A criticism of the analysis in Section “Distribution analysis” could be that the comparison to a reference might not be fair: the actual read data could be biased due to initial sample preparation from the genome, and the sequence might simply be different from the reference. However, these issues should not affect the overall distributions significantly and, in particular, should not affect the general shape of the graph at all, which is determined only by our assumptions and not by the sequence the reads stem from. Recall that in this step we do not compare the read data with the reference genome, but rather the distributions along the reads themselves. Using the reference for the bootstrap, however, ensures that the same genome complexity and coverage ratio are maintained as in the test data.</p>
<p>Practical experience demonstrates that short-read data is suitable for the common tasks of re-sequencing or assembly
<xref rid="pone.0012681-Wang1" ref-type="bibr">[5]</xref>
,
<xref rid="pone.0012681-Wheeler1" ref-type="bibr">[6]</xref>
. Yet we need to be aware of possible biases and try to better understand the underlying characteristics of short-read data to make the most of the information contained in it. Doing so may aid in the construction of longer contigs with greater coverage, or in the accurate determination of genome regions involved in gene expression.</p>
<p>Our statistical tests have practical implications for a wide variety of biological investigations. For example,</p>
<list list-type="bullet">
<list-item>
<p>Combining the results from base call and distribution analyses, reads can be trimmed in a guided manner: the trimming points can be chosen so as to retain as much sequencing material as possible while minimising errors. The results of Qu et al.
<xref rid="pone.0012681-Qu1" ref-type="bibr">[16]</xref>
show that a significant volume of errors can be removed this way. This increases the mappability of the data in applications such as resequencing and RNA-seq.</p>
</list-item>
<list-item>
<p>Based on the same observations, a guided k-mer selection for k-mer-based assembly algorithms can improve performance in de novo assembly applications. Avoiding regions of reads that contain high error rates and bias will benefit assembly quality and performance: avoiding errors makes assembly of the short read data easier, and it drastically reduces the memory consumption of assembly tools, one of the main problems when sequencing larger genomes.</p>
</list-item>
<list-item>
<p>With the results from distribution and
<italic>k-mer</italic>
analyses, a more accurate coverage estimation for quantitative analysis such as ChIP-seq or RNA-seq can be achieved. The statistical tests that are used for this kind of experiment are highly sensitive, and rely on accurate estimations of the gene (or RNA) coverage. Evaluating sampling biases and normalising for them could greatly improve the accuracy of gene expression studies with NGS data.</p>
</list-item>
</list>
<p>As new uses of short-read data continue to appear, we expect that precise knowledge of the data's statistical properties will continue to be of importance.</p>
</sec>
<sec id="s3e">
<title>Conclusions</title>
<p>We have presented strong evidence that the common assumptions made about short-read sequencing data are inaccurate. Chemical or mechanical biases appear to persist in the process and lead to surprising effects, such as overrepresentation of some k-mers in the middle of reads. We have to be aware of these biases when working with the data. When analysing methylation or expression characteristics, for example, biases in coverage can lead to misinterpreted results if ignored. In terms of sequence assembly, a notion of locality of
<italic>k</italic>
-mers stemming from particular positions could help improve the quality.</p>
<p>We presented new, simple tests and demonstrated that they provide insight into the sequencing data's state. The results pose questions about the quality and characteristics of high throughput sequencing data, and that of the 1000 Genomes Project in particular. We therefore recommend application of our techniques to maximise the use of information contained in the data and to better understand experimental results.</p>
<p>The base call analysis is easiest to apply and can give a good first impression of the data's state. A smooth graph will indicate the desired characteristics of the read data, while fragmented patterns indicate a problem. Counting occurrences of
<italic>k</italic>
-mers can help to identify such artefacts and filter them, and can also aid understanding of more complex characteristics of the sequencing data such as positional biases. Applying the Kullback-Leibler measure helps to assess the state of the read data in more depth; a ‘smooth’ set of divergence values implies a homogeneous read set, while any conspicuous patterns in the divergences identify biases and can help to direct further chemical and computational analysis.</p>
</sec>
</sec>
<sec sec-type="supplementary-material" id="s4">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0012681.s001">
<label>Table S1</label>
<caption>
<p>(0.07 MB PDF)</p>
</caption>
<media xlink:href="pone.0012681.s001.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s002">
<label>Figure S1</label>
<caption>
<p>Basecalls for the read set D2 (NA06895) from the 1000 Genomes Project. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(0.63 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s002.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s003">
<label>Figure S2</label>
<caption>
<p>Basecalls for the read set D6 (NA12272) from the 1000 Genomes Project. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(0.95 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s003.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s004">
<label>Figure S3</label>
<caption>
<p>Basecalls for the read set D3 (NA11829) from the 1000 Genomes Project. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(0.72 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s004.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s005">
<label>Figure S4</label>
<caption>
<p>Basecalls for the read set D4 (NA12155) from the 1000 Genomes Project. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(1.63 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s005.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s006">
<label>Figure S5</label>
<caption>
<p>Basecalls for the read set D1 (SRX005986) from NCBI's Sequence Read Archive. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(1.78 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s006.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s007">
<label>Figure S6</label>
<caption>
<p>Basecalls for the read set D5 (NA10847) from the 1000 Genomes Project. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(2.87 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s007.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s008">
<label>Figure S7</label>
<caption>
<p>Basecalls for the read set D4* (NA12155) from the 1000 Genomes Project. X-axis showing the position in the read, y-axis the relative base frequency.</p>
<p>(1.57 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s008.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s009">
<label>Figure S8</label>
<caption>
<p>Basecalls for the read set D7 (SRX017210) from NCBI's Sequence Read Archive. X-axis showing the position in the read, y-axis the relative base frequency. Note that the graph is cut off at position 361, because only a very small number of reads exceeds this read length.</p>
<p>(3.46 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s009.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s010">
<label>Figure S9</label>
<caption>
<p>Kullback-Leibler divergence for the read set D1 (SRX005986) from NCBI's Short Read Archive. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), x-axis indexing the first distribution, y-axis the latter. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read. Note that the graph has been trimmed of the last position's distribution because of the high error rates.</p>
<p>(6.32 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s010.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s011">
<label>Figure S10</label>
<caption>
<p>Kullback-Leibler divergence for the read set D2 (NA06985) from the 1000 Genomes Project. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), x-axis indexing the first distribution, y-axis the latter. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read.</p>
<p>(3.12 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s011.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s012">
<label>Figure S11</label>
<caption>
<p>Kullback-Leibler divergence for the read set D4 (NA12155) from the 1000 Genomes Project. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), x-axis indexing the first distribution, y-axis the latter. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read. Note that the graph has been trimmed of the last position's distribution because of the high error rates.</p>
<p>(6.32 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s012.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s013">
<label>Figure S12</label>
<caption>
<p>Kullback-Leibler divergence for the read set D5 (NA10847) from the 1000 Genomes Project. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), with the x-axis indexing the first distribution and the y-axis the second. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read.</p>
<p>(3.52 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s013.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s014">
<label>Figure S13</label>
<caption>
<p>Kullback-Leibler divergence for the read set D6 (NA12272) from the 1000 Genomes Project. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), with the x-axis indexing the first distribution and the y-axis the second. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read.</p>
<p>(3.44 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s014.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s015">
<label>Figure S14</label>
<caption>
<p>Kullback-Leibler divergence for a ChIP-seq data set. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), with the x-axis indexing the first distribution and the y-axis the second. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read.</p>
<p>(2.91 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s015.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s016">
<label>Figure S15</label>
<caption>
<p>Kullback-Leibler divergence for the read set D4* (NA12155) from the 1000 Genomes Project. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), with the x-axis indexing the first distribution and the y-axis the second. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read.</p>
<p>(6.32 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s016.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0012681.s017">
<label>Figure S16</label>
<caption>
<p>Kullback-Leibler divergence for the read set D7 (SRX017210) from NCBI's Short Read Archive. Data points represent KL(P
<sub>i</sub>
∥P
<sub>j</sub>
), with the x-axis indexing the first distribution and the y-axis the second. P
<sub>i</sub>
corresponds to the distribution of 6-mers at the ith position in a read. Note that the graph is displayed only up to position 250, because the very low number of reads exceeding this read length makes comparing distributions difficult and not very meaningful.</p>
<p>(6.32 MB TIF)</p>
</caption>
<media xlink:href="pone.0012681.s017.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="COI-statement">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="financial-disclosure">
<p>
<bold>Funding: </bold>
This work was supported by the Australian Research Council, and by the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="pone.0012681-Sanger1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sanger</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Nicklen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Coulson</surname>
<given-names>AR</given-names>
</name>
</person-group>
<year>1977</year>
<article-title>DNA sequencing with chain-terminating inhibitors.</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>74</volume>
<fpage>5463</fpage>
<lpage>5467</lpage>
<pub-id pub-id-type="pmid">271968</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-vonBubnoff1">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>von Bubnoff</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Next-generation sequencing: The race is on.</article-title>
<source>Cell</source>
<volume>132</volume>
<fpage>721</fpage>
<lpage>723</lpage>
<pub-id pub-id-type="pmid">18329356</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Johnson1">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Johnson</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Mortazavi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Wold</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Genome-wide mapping of in vivo protein-DNA interactions.</article-title>
<source>Science (New York, NY)</source>
<volume>316</volume>
<fpage>1497</fpage>
<lpage>1502</lpage>
</element-citation>
</ref>
<ref id="pone.0012681-Mortazavi1">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mortazavi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>McCue</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Schaeffer</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Wold</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Mapping and quantifying mammalian transcriptomes by RNA-Seq.</article-title>
<source>Nature Methods</source>
<volume>5</volume>
<fpage>621</fpage>
<lpage>628</lpage>
</element-citation>
</ref>
<ref id="pone.0012681-Wang1">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tian</surname>
<given-names>G</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>The diploid genome sequence of an Asian individual.</article-title>
<source>Nature</source>
<volume>456</volume>
<fpage>60</fpage>
<lpage>65</lpage>
<pub-id pub-id-type="pmid">18987735</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Wheeler1">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wheeler</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Egholm</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>The complete genome of an individual by massively parallel DNA sequencing.</article-title>
<source>Nature</source>
<volume>452</volume>
<fpage>872</fpage>
<lpage>876</lpage>
<pub-id pub-id-type="pmid">18421352</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Hernandez1">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hernandez</surname>
<given-names>D</given-names>
</name>
<name>
<surname>François</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Farinelli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Østerås</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schrenzel</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer.</article-title>
<source>Genome Research</source>
<volume>18</volume>
<fpage>802</fpage>
<lpage>809</lpage>
<pub-id pub-id-type="pmid">18332092</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Schrder1">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Shrec: A short-read error correction method.</article-title>
<source>Bioinformatics (Oxford, England)</source>
<fpage>btp379+</fpage>
</element-citation>
</ref>
<ref id="pone.0012681-Zerbino1">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs.</article-title>
<source>Genome Research</source>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Dohm1">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dohm</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Lottaz</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Borodina</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Himmelbauer</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.</article-title>
<source>Nucl Acids Res</source>
<fpage>gkn425+</fpage>
</element-citation>
</ref>
<ref id="pone.0012681-Harismendy1">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Harismendy</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Strausberg</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Stockwell</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<year>2009</year>
<article-title>Evaluation of next generation sequencing platforms for population targeted sequencing studies.</article-title>
<source>Genome Biology</source>
<volume>10</volume>
<fpage>R32+</fpage>
<pub-id pub-id-type="pmid">19327155</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Erlich1">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Erlich</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Mitra</surname>
<given-names>PP</given-names>
</name>
<name>
<surname>delaBastide</surname>
<given-names>M</given-names>
</name>
<name>
<surname>McCombie</surname>
<given-names>WR</given-names>
</name>
<name>
<surname>Hannon</surname>
<given-names>GJ</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Alta-cyclic: a self-optimizing base caller for next-generation sequencing.</article-title>
<source>Nature Methods</source>
<volume>5</volume>
<fpage>679</fpage>
<lpage>682</lpage>
<pub-id pub-id-type="pmid">18604217</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Kircher1">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kircher</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stenzel</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Kelso</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Improved base calling for the Illumina Genome Analyzer using machine learning strategies.</article-title>
<source>Genome Biology</source>
<volume>10</volume>
<fpage>R83+</fpage>
<pub-id pub-id-type="pmid">19682367</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Rougemont1">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rougemont</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Amzallag</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Iseli</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Farinelli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Xenarios</surname>
<given-names>I</given-names>
</name>
<etal></etal>
</person-group>
<year>2008</year>
<article-title>Probabilistic base calling of Solexa sequencing data.</article-title>
<source>BMC Bioinformatics</source>
<volume>9</volume>
<fpage>431+</fpage>
<pub-id pub-id-type="pmid">18851737</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Chaisson1">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chaisson</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Brinza</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>De novo fragment assembly with short mate-paired reads: Does the read length matter?</article-title>
<source>Genome Research</source>
<volume>19</volume>
<fpage>336</fpage>
<lpage>346</lpage>
<pub-id pub-id-type="pmid">19056694</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Qu1">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qu</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hashimoto</surname>
<given-names>Si</given-names>
</name>
<name>
<surname>Morishita</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing.</article-title>
<source>Genome Research</source>
<volume>19</volume>
<fpage>1309</fpage>
<lpage>1315</lpage>
<pub-id pub-id-type="pmid">19439514</pub-id>
</element-citation>
</ref>
<ref id="pone.0012681-Ewing1">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ewing</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>P</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Base-calling of automated sequencer traces using phred. II. error probabilities.</article-title>
<source>Genome Research</source>
<volume>8</volume>
<fpage>186</fpage>
<lpage>194</lpage>
<pub-id pub-id-type="pmid">9521922</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
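
The captions of Figures S9 to S16 in the record above describe plots of KL(P_i ∥ P_j), where P_i is the distribution of 6-mers at the ith position in a read. As a rough illustration only, and not the authors' implementation, the following Python sketch computes such a matrix from a list of reads. The function names, the add-one pseudocount used to avoid zero probabilities, the base-2 logarithm, and the 0-based positions are all assumptions made for this example; the record does not state how these details were handled.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def positional_kmer_distribution(reads, position, k=6, pseudocount=1.0):
    """Smoothed distribution of k-mers starting at a 0-based read position.

    The pseudocount is an assumption of this sketch, added so that no
    k-mer has probability zero and the divergence stays finite.
    """
    counts = Counter()
    for read in reads:
        kmer = read[position:position + k]
        # Skip reads that are too short here or contain ambiguous bases (e.g. N).
        if len(kmer) == k and set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    all_kmers = ["".join(chars) for chars in product(ALPHABET, repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)
    return {kmer: (counts[kmer] + pseudocount) / total for kmer in all_kmers}


def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions over the same k-mers."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)


def kl_matrix(reads, num_positions, k=6):
    """Matrix whose entry [i][j] is KL(P_i || P_j) for positions 0..num_positions-1."""
    dists = [positional_kmer_distribution(reads, i, k) for i in range(num_positions)]
    return [[kl_divergence(pi, pj) for pj in dists] for pi in dists]
```

Under these assumptions, `kl_matrix(reads, 30)` would return a 30 by 30 matrix that could be rendered as a heat map. With few reads the pseudocount dominates the estimates, so the values are only meaningful for large read sets.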

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0010619 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0010619 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021