Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Mining statistically-solid k-mers for accurate NGS error correction

Identifieur interne : 000300 ( Pmc/Corpus ); précédent : 000299; suivant : 000301

Mining statistically-solid k-mers for accurate NGS error correction

Auteurs : Liang Zhao ; Jin Xie ; Lin Bai ; Wen Chen ; Mingju Wang ; Zhonglei Zhang ; Yiqi Wang ; Zhe Zhao ; Jinyan Li

Source :

RBID : PMC:6311904

Abstract

Background

NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance.

Results

We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers which measures the number of standard deviations a k-mer’s frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets.

Conclusion

The z-score is adequate to distinguish solid k-mers from weak k-mers, particularly useful for pinpointing out solid k-mers having very low frequency. Applying z-score on k-mer can markedly improve the error correction accuracy.


Url:
DOI: 10.1186/s12864-018-5272-y
PubMed: 30598110
PubMed Central: 6311904

Links to Exploration step

PMC:6311904

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Mining statistically-solid
<italic>k</italic>
-mers for accurate NGS error correction</title>
<author>
<name sortKey="Zhao, Liang" sort="Zhao, Liang" uniqKey="Zhao L" first="Liang" last="Zhao">Liang Zhao</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xie, Jin" sort="Xie, Jin" uniqKey="Xie J" first="Jin" last="Xie">Jin Xie</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bai, Lin" sort="Bai, Lin" uniqKey="Bai L" first="Lin" last="Bai">Lin Bai</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chen, Wen" sort="Chen, Wen" uniqKey="Chen W" first="Wen" last="Chen">Wen Chen</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wang, Mingju" sort="Wang, Mingju" uniqKey="Wang M" first="Mingju" last="Wang">Mingju Wang</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Zhonglei" sort="Zhang, Zhonglei" uniqKey="Zhang Z" first="Zhonglei" last="Zhang">Zhonglei Zhang</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wang, Yiqi" sort="Wang, Yiqi" uniqKey="Wang Y" first="Yiqi" last="Wang">Yiqi Wang</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhao, Zhe" sort="Zhao, Zhe" uniqKey="Zhao Z" first="Zhe" last="Zhao">Zhe Zhao</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Li, Jinyan" sort="Li, Jinyan" uniqKey="Li J" first="Jinyan" last="Li">Jinyan Li</name>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 7611</institution-id>
<institution-id institution-id-type="GRID">grid.117476.2</institution-id>
<institution>Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney,</institution>
</institution-wrap>
NSW 2007, Australia</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30598110</idno>
<idno type="pmc">6311904</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311904</idno>
<idno type="RBID">PMC:6311904</idno>
<idno type="doi">10.1186/s12864-018-5272-y</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000300</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000300</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Mining statistically-solid
<italic>k</italic>
-mers for accurate NGS error correction</title>
<author>
<name sortKey="Zhao, Liang" sort="Zhao, Liang" uniqKey="Zhao L" first="Liang" last="Zhao">Liang Zhao</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xie, Jin" sort="Xie, Jin" uniqKey="Xie J" first="Jin" last="Xie">Jin Xie</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bai, Lin" sort="Bai, Lin" uniqKey="Bai L" first="Lin" last="Bai">Lin Bai</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chen, Wen" sort="Chen, Wen" uniqKey="Chen W" first="Wen" last="Chen">Wen Chen</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wang, Mingju" sort="Wang, Mingju" uniqKey="Wang M" first="Mingju" last="Wang">Mingju Wang</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Zhonglei" sort="Zhang, Zhonglei" uniqKey="Zhang Z" first="Zhonglei" last="Zhang">Zhonglei Zhang</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wang, Yiqi" sort="Wang, Yiqi" uniqKey="Wang Y" first="Yiqi" last="Wang">Yiqi Wang</name>
<affiliation>
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhao, Zhe" sort="Zhao, Zhe" uniqKey="Zhao Z" first="Zhe" last="Zhao">Zhe Zhao</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Li, Jinyan" sort="Li, Jinyan" uniqKey="Li J" first="Jinyan" last="Li">Jinyan Li</name>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 7611</institution-id>
<institution-id institution-id-type="GRID">grid.117476.2</institution-id>
<institution>Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney,</institution>
</institution-wrap>
NSW 2007, Australia</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is a
<italic>k</italic>
-mer frequently occurring in NGS reads. The other
<italic>k</italic>
-mers are called weak
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer does not likely contain errors, while a weak
<italic>k</italic>
-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff
<italic>f</italic>
<sub>0</sub>
to balance the numbers of solid and weak
<italic>k</italic>
-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid
<italic>k</italic>
-mers that are likely to contain errors, and (ii) add a small subset of weak
<italic>k</italic>
-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of
<italic>k</italic>
-mers can improve the correction performance.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose to use a Gamma distribution to model the frequencies of erroneous
<italic>k</italic>
-mers and a mixture of Gaussian distributions to model correct
<italic>k</italic>
-mers, and combine them to determine
<italic>f</italic>
<sub>0</sub>
. To identify the two special subsets of
<italic>k</italic>
-mers, we use the
<italic>z</italic>
-score of
<italic>k</italic>
-mers which measures the number of standard deviations a
<italic>k</italic>
-mer’s frequency is from the mean. Then these statistically-solid
<italic>k</italic>
-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The
<italic>z</italic>
-score is adequate to distinguish solid
<italic>k</italic>
-mers from weak
<italic>k</italic>
-mers, particularly useful for pinpointing out solid
<italic>k</italic>
-mers having very low frequency. Applying
<italic>z</italic>
-score on
<italic>k</italic>
-mer can markedly improve the error correction accuracy.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Alic, As" uniqKey="Alic A">AS Alic</name>
</author>
<author>
<name sortKey="Ruzafa, D" uniqKey="Ruzafa D">D Ruzafa</name>
</author>
<author>
<name sortKey="Dopazo, J" uniqKey="Dopazo J">J Dopazo</name>
</author>
<author>
<name sortKey="Blanquer, I" uniqKey="Blanquer I">I Blanquer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hackl, T" uniqKey="Hackl T">T Hackl</name>
</author>
<author>
<name sortKey="Hedrich, R" uniqKey="Hedrich R">R Hedrich</name>
</author>
<author>
<name sortKey="Schultz, J" uniqKey="Schultz J">J Schultz</name>
</author>
<author>
<name sortKey="Forster, F" uniqKey="Forster F">F Förster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goodwin, S" uniqKey="Goodwin S">S Goodwin</name>
</author>
<author>
<name sortKey="Gurtowski, J" uniqKey="Gurtowski J">J Gurtowski</name>
</author>
<author>
<name sortKey="Ethe Sayers, S" uniqKey="Ethe Sayers S">S Ethe-Sayers</name>
</author>
<author>
<name sortKey="Deshpande, P" uniqKey="Deshpande P">P Deshpande</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Mccombie, Wr" uniqKey="Mccombie W">WR McCombie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Zimin, A" uniqKey="Zimin A">A Zimin</name>
</author>
<author>
<name sortKey="Puiu, D" uniqKey="Puiu D">D Puiu</name>
</author>
<author>
<name sortKey="Magoc, T" uniqKey="Magoc T">T Magoc</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Treangen, Tj" uniqKey="Treangen T">TJ Treangen</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Z" uniqKey="Zhao Z">Z Zhao</name>
</author>
<author>
<name sortKey="Yin, J" uniqKey="Yin J">J Yin</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Xiong, W" uniqKey="Xiong W">W Xiong</name>
</author>
<author>
<name sortKey="Zhan, Y" uniqKey="Zhan Y">Y Zhan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Dorman, Ks" uniqKey="Dorman K">KS Dorman</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
<author>
<name sortKey="Maskell, Dl" uniqKey="Maskell D">DL Maskell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Molnar, M" uniqKey="Molnar M">M Molnar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, L" uniqKey="Song L">L Song</name>
</author>
<author>
<name sortKey="Florea, L" uniqKey="Florea L">L Florea</name>
</author>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Greenfield, P" uniqKey="Greenfield P">P Greenfield</name>
</author>
<author>
<name sortKey="Kx, D" uniqKey="Kx D">D Kx</name>
</author>
<author>
<name sortKey="Ax, P" uniqKey="Ax P">P Ax</name>
</author>
<author>
<name sortKey="Cx, Bd" uniqKey="Cx B">BD Cx</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heo, Y" uniqKey="Heo Y">Y Heo</name>
</author>
<author>
<name sortKey="Ramachandran, A" uniqKey="Ramachandran A">A Ramachandran</name>
</author>
<author>
<name sortKey="Hwu, W M" uniqKey="Hwu W">W-M Hwu</name>
</author>
<author>
<name sortKey="Ma, J" uniqKey="Ma J">J Ma</name>
</author>
<author>
<name sortKey="Chen, D" uniqKey="Chen D">D Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xiao, C L" uniqKey="Xiao C">C-L Xiao</name>
</author>
<author>
<name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
<author>
<name sortKey="Xie, S Q" uniqKey="Xie S">S-Q Xie</name>
</author>
<author>
<name sortKey="Chen, K N" uniqKey="Chen K">K-N Chen</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Han, Y" uniqKey="Han Y">Y Han</name>
</author>
<author>
<name sortKey="Luo, F" uniqKey="Luo F">F Luo</name>
</author>
<author>
<name sortKey="Xie, Z" uniqKey="Xie Z">Z Xie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author>
<name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Fazayeli, F" uniqKey="Fazayeli F">F Fazayeli</name>
</author>
<author>
<name sortKey="Ilie, S" uniqKey="Ilie S">S Ilie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author>
<name sortKey="Weese, D" uniqKey="Weese D">D Weese</name>
</author>
<author>
<name sortKey="Holtgrewe, M" uniqKey="Holtgrewe M">M Holtgrewe</name>
</author>
<author>
<name sortKey="Dimitrova, V" uniqKey="Dimitrova V">V Dimitrova</name>
</author>
<author>
<name sortKey="Niu, S" uniqKey="Niu S">S Niu</name>
</author>
<author>
<name sortKey="Reinert, K" uniqKey="Reinert K">K Reinert</name>
</author>
<author>
<name sortKey="Hx, R" uniqKey="Hx R">R Hx</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kao, W C" uniqKey="Kao W">W-C Kao</name>
</author>
<author>
<name sortKey="Chan, Ah" uniqKey="Chan A">AH Chan</name>
</author>
<author>
<name sortKey="Song, Ys" uniqKey="Song Y">YS Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, L" uniqKey="Zhao L">L Zhao</name>
</author>
<author>
<name sortKey="Chen, Q" uniqKey="Chen Q">Q Chen</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Jiang, P" uniqKey="Jiang P">P Jiang</name>
</author>
<author>
<name sortKey="Wong, L" uniqKey="Wong L">L Wong</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ross, Mg" uniqKey="Ross M">MG Ross</name>
</author>
<author>
<name sortKey="Russ, C" uniqKey="Russ C">C Russ</name>
</author>
<author>
<name sortKey="Costello, M" uniqKey="Costello M">M Costello</name>
</author>
<author>
<name sortKey="Hollinger, A" uniqKey="Hollinger A">A Hollinger</name>
</author>
<author>
<name sortKey="Lennon, Nj" uniqKey="Lennon N">NJ Lennon</name>
</author>
<author>
<name sortKey="Hegarty, R" uniqKey="Hegarty R">R Hegarty</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C Nusbaum</name>
</author>
<author>
<name sortKey="Jaffe, Db" uniqKey="Jaffe D">DB Jaffe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Kokot, M" uniqKey="Kokot M">M Kokot</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author>
<name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, Bh" uniqKey="Bloom B">BH Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Myers, Jr" uniqKey="Myers J">JR Myers</name>
</author>
<author>
<name sortKey="Marth, Gt" uniqKey="Marth G">GT Marth</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2164</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30598110</article-id>
<article-id pub-id-type="pmc">6311904</article-id>
<article-id pub-id-type="publisher-id">5272</article-id>
<article-id pub-id-type="doi">10.1186/s12864-018-5272-y</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Mining statistically-solid
<italic>k</italic>
-mers for accurate NGS error correction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhao</surname>
<given-names>Liang</given-names>
</name>
<address>
<email>s080011@e.ntu.edu.sg</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Xie</surname>
<given-names>Jin</given-names>
</name>
<address>
<email>252589791@qq.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bai</surname>
<given-names>Lin</given-names>
</name>
<address>
<email>bailin@gxu.edu.cn</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chen</surname>
<given-names>Wen</given-names>
</name>
<address>
<email>taiheren007@163.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Mingju</given-names>
</name>
<address>
<email>wangmingju@taihehospital.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Zhonglei</given-names>
</name>
<address>
<email>zlzhang@taihehospital.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Yiqi</given-names>
</name>
<address>
<email>806643897@qq.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhao</surname>
<given-names>Zhe</given-names>
</name>
<address>
<email>zhaoyinzhao@126.com</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Li</surname>
<given-names>Jinyan</given-names>
</name>
<address>
<email>jinyan.li@uts.edu.au</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 7611</institution-id>
<institution-id institution-id-type="GRID">grid.117476.2</institution-id>
<institution>Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney,</institution>
</institution-wrap>
NSW 2007, Australia</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>31</day>
<month>12</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>31</day>
<month>12</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>19</volume>
<issue>Suppl 10</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. YZ was not involved in the review of their own authored paper. The Supplement Editors declare that they have no other competing interests.</issue-sponsor>
<elocation-id>912</elocation-id>
<permissions>
<copyright-statement>© The Author(s) 2018</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is a
<italic>k</italic>
-mer frequently occurring in NGS reads. The other
<italic>k</italic>
-mers are called weak
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer does not likely contain errors, while a weak
<italic>k</italic>
-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff
<italic>f</italic>
<sub>0</sub>
to balance the numbers of solid and weak
<italic>k</italic>
-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid
<italic>k</italic>
-mers that are likely to contain errors, and (ii) add a small subset of weak
<italic>k</italic>
-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of
<italic>k</italic>
-mers can improve the correction performance.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose to use a Gamma distribution to model the frequencies of erroneous
<italic>k</italic>
-mers and a mixture of Gaussian distributions to model correct
<italic>k</italic>
-mers, and combine them to determine
<italic>f</italic>
<sub>0</sub>
. To identify the two special subsets of
<italic>k</italic>
-mers, we use the
<italic>z</italic>
-score of
<italic>k</italic>
-mers which measures the number of standard deviations a
<italic>k</italic>
-mer’s frequency is from the mean. Then these statistically-solid
<italic>k</italic>
-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The
<italic>z</italic>
-score is adequate to distinguish solid
<italic>k</italic>
-mers from weak
<italic>k</italic>
-mers, particularly useful for pinpointing out solid
<italic>k</italic>
-mers having very low frequency. Applying
<italic>z</italic>
-score on
<italic>k</italic>
-mer can markedly improve the error correction accuracy.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Error correction</kwd>
<kwd>Next-generation sequencing</kwd>
<kwd>
<italic>z</italic>
-score</kwd>
</kwd-group>
<conference xlink:href="http://datamining-web.it.uts.edu.au/giw2018/">
<conf-name>29th International Conference on Genome Informatics</conf-name>
<conf-loc>Yunnan, China</conf-loc>
<conf-date>3-5 December 2018</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>The massively parallel next-generation sequencing (NGS) technology is revolutionizing a wide range of medical and biological research areas as well as their application domains, such as medical diagnosis, biotechnologies, virology, etc [
<xref ref-type="bibr" rid="CR1">1</xref>
]. It has been shown that the NGS data is so informative and powerful that some ever thorny problems can be effectively tackled through this technology, e.g., the genome wide association study [
<xref ref-type="bibr" rid="CR2">2</xref>
].</p>
<p>The information contained in NGS data is deep and broad, but the raw data is still error prone. Various kinds of errors exist in the raw sequencing data, including substitution, insertion and deletion. The substitution error rate can be as high as 1 to 2.5% for the data produced by the Illumina platform [
<xref ref-type="bibr" rid="CR3">3</xref>
]; and the collective insertion and deletion error rate can be as high as 10 to 40% for the PacBio and Oxford Nanopore platforms [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR5">5</xref>
]. It has been widely recognized that correcting these sequencing errors is the first and critical step for many downstream data analyses, such as de novo genome assembly [
<xref ref-type="bibr" rid="CR6">6</xref>
], variants calling from genome re-sequencing [
<xref ref-type="bibr" rid="CR7">7</xref>
], identification of single nucleotide polymorphism as well as sequence mapping [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. For instance, the number of nodes of the De Bruijn graph generated from the HapMap sample NA12878 (
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra/ERR091571/">https://www.ncbi.nlm.nih.gov/sra/ERR091571/</ext-link>
) is 6.92 billion; however, this number can be reduced to only 1.98 billion after error correction. This reduction significantly alleviates the burden of graph manipulation.</p>
<p>Owing to the importance of error correction, dozens of approaches have been proposed to cope with various types of errors. Depending on the key ideas that have been used, existing approaches can be categorized into three major approaches: (i) the
<italic>k</italic>
-spectrum-based approach, including Quake [
<xref ref-type="bibr" rid="CR3">3</xref>
], Reptile [
<xref ref-type="bibr" rid="CR9">9</xref>
], DecGPU [
<xref ref-type="bibr" rid="CR10">10</xref>
], SGA [
<xref ref-type="bibr" rid="CR11">11</xref>
], RACER [
<xref ref-type="bibr" rid="CR12">12</xref>
], Musket [
<xref ref-type="bibr" rid="CR13">13</xref>
], Lighter [
<xref ref-type="bibr" rid="CR14">14</xref>
], Blue [
<xref ref-type="bibr" rid="CR15">15</xref>
], BFC [
<xref ref-type="bibr" rid="CR16">16</xref>
], BLESS2 [
<xref ref-type="bibr" rid="CR17">17</xref>
], MECAT [
<xref ref-type="bibr" rid="CR18">18</xref>
] (ii) the suffix tree/array-based approach, including SHREC [
<xref ref-type="bibr" rid="CR19">19</xref>
], HSHREC [
<xref ref-type="bibr" rid="CR20">20</xref>
], HiTEC [
<xref ref-type="bibr" rid="CR21">21</xref>
], Fiona [
<xref ref-type="bibr" rid="CR22">22</xref>
] and; (iii) the multiple sequence alignment-based approach, including ECHO [
<xref ref-type="bibr" rid="CR23">23</xref>
], Coral [
<xref ref-type="bibr" rid="CR8">8</xref>
], CloudRS [
<xref ref-type="bibr" rid="CR24">24</xref>
], MEC [
<xref ref-type="bibr" rid="CR25">25</xref>
]. Among these approaches, the most advanced ones are the
<italic>k</italic>
-spectrum-based. It provides a very good scalability and competitive performance. Scalability is crucial for NGS data analysis since the input volume is usually huge.</p>
<p>The performance of
<italic>k</italic>
-spectrum-based approach heavily depends on the selection of solid
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is a
<italic>k</italic>
-mer frequently occurring in NGS reads. The other
<italic>k</italic>
-mers are called weak
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer often does not contain any sequencing error, but a weak
<italic>k</italic>
-mer often contains sequencing errors. An intensively investigated problem is to find a good frequency cutoff
<italic>f</italic>
<sub>0</sub>
to balance the numbers of solid and weak
<italic>k</italic>
-mers,
<italic>cf.</italic>
Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. It is clear that even a very carefully determined
<italic>f</italic>
<sub>0</sub>
cannot tidily differentiate erroneous
<italic>k</italic>
-mers from those
<italic>k</italic>
-mers that do not contain any error bases. The reason is that there are very often a small portion of solid
<italic>k</italic>
-mers that contain errors and there are very often a tiny portion of weak
<italic>k</italic>
-mers that do not have errors,
<italic>cf.</italic>
the shaded part in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. This discrepancy is caused by the skewed distribution of the coverage of the sequencing reads. For instance, Ross et al. [
<xref ref-type="bibr" rid="CR26">26</xref>
] has reported that the coverage of GC rich and poor regions is markedly lower than the average coverage. That is, the
<italic>k</italic>
-mers from these regions very likely have low frequency, even lower than
<italic>f</italic>
<sub>0</sub>
.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Frequency distribution of both error-free and error-containing
<italic>k</italic>
-mers for a NGS data set. The frequency distribution of erroneous
<italic>k</italic>
-mers is represented by the dash orange line, while the distribution of the correct ones is shown as the dash sky-blue line. The solid black line is the distribution of all the k-mers. The
<italic>α</italic>
-labeled area is the proportion of correct
<italic>k</italic>
-mers having frequency less than
<italic>f</italic>
<sub>0</sub>
, while the
<italic>β</italic>
-labeled area is the proportion of erroneous
<italic>k</italic>
-mers having frequency greater than
<italic>f</italic>
<sub>0</sub>
</p>
</caption>
<graphic xlink:href="12864_2018_5272_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>In this research, we focus on a more challenging but less-studied problem: (i) remove a small subset of solid
<italic>k</italic>
-mers that are likely to contain errors, and (ii) add a small subset of weak
<italic>k</italic>
-mers that are likely to contain no errors, into the set of solid
<italic>k</italic>
-mers. This is achieved by using
<italic>f</italic>
<sub>0</sub>
as well as z-score of
<italic>k</italic>
-mer,
<italic>z</italic>
(
<italic>κ</italic>
). With the purified set of solid
<italic>k</italic>
-mers, the correction performance can be much improved.</p>
<p>Our approach starts with counting
<italic>k</italic>
-mer frequencies by using KMC2 [
<xref ref-type="bibr" rid="CR27">27</xref>
], then calculates the
<italic>z</italic>
-scores of
<italic>k</italic>
-mers. Later, the statistically-solid
<italic>k</italic>
-mers are mined by considering both frequency and
<italic>z</italic>
-score. After that, the Bloom filter is constructed by the statistically-solid
<italic>k</italic>
-mers, and the weak
<italic>k</italic>
-mers are corrected. The newly proposed approach is named as ZEC, short for
<italic>z</italic>
-score-based
<italic>e</italic>
rror
<italic>c</italic>
orrector.</p>
</sec>
<sec id="Sec2">
<title>Algorithm: mining statistically-solid
<italic>k</italic>
-mers</title>
<p>A solid
<italic>k</italic>
-mer is conventionally defined as a
<italic>k</italic>
-mer which occurs in a data set of NGS reads with high frequency. A solid
<italic>k</italic>
-mer is usually considered error-free, and taken as the template for error correction. If a
<italic>k</italic>
-mer is not solid, then it is defined as a weak
<italic>k</italic>
-mer considered as error-containing. Existing
<italic>k</italic>
-mer-based approaches use a frequency cutoff,
<italic>f</italic>
<sub>0</sub>
, to identify solid and weak
<italic>k</italic>
-mers from NGS reads, e.g., BLESS2 [
<xref ref-type="bibr" rid="CR17">17</xref>
], Musket [
<xref ref-type="bibr" rid="CR13">13</xref>
], and BFC [
<xref ref-type="bibr" rid="CR16">16</xref>
]. The main difference of these methods is how the
<italic>f</italic>
<sub>0</sub>
is determined.</p>
<p>In fact, a solid
<italic>k</italic>
-mer is not definitely error-free. Sometimes, it may contain errors with a small chance. It is also true for the weak
<italic>k</italic>
-mers — a weak
<italic>k</italic>
-mer can be absolutely error-free. The reason that a solid
<italic>k</italic>
-mer is not always correct is that the coverage is not under uniform distribution. Thus the cutoff
<italic>f</italic>
<sub>0</sub>
itself is unable to perfectly distinct correct
<italic>k</italic>
-mers from erroneous
<italic>k</italic>
-mers;
<italic>cf.</italic>
the part labeled as
<italic>α</italic>
and
<italic>β</italic>
in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. However, the purpose of the research is to obtain correct
<italic>k</italic>
-mers as many as possible.</p>
<p>In this study, we present a time and memory efficient algorithm to purify the solid
<italic>k</italic>
-mer set as well as the weak
<italic>k</italic>
-mer set, so that more correct
<italic>k</italic>
-mers can be identified.</p>
<p>Let
<italic>R</italic>
be the input set of NGS reads, and
<italic>K</italic>
be the set of
<italic>k</italic>
-mers contained in
<italic>R</italic>
. To determine whether a
<italic>k</italic>
-mer, say
<italic>κ</italic>
, of
<italic>K</italic>
is correct or not, the following metrics are examined:
<list list-type="bullet">
<list-item>
<p>
<italic>f</italic>
(
<italic>κ</italic>
), the frequency of
<italic>κ</italic>
;</p>
</list-item>
<list-item>
<p>
<italic>z</italic>
(
<italic>κ</italic>
), the z-score of
<italic>κ</italic>
.</p>
</list-item>
</list>
</p>
<sec id="Sec3">
<title>Calculating
<italic>f</italic>
(
<italic>κ</italic>
)</title>
<p>The straightforward approach to determine
<italic>f</italic>
(
<italic>κ</italic>
) is as follows: (i) scan each read
<italic>r</italic>
of
<italic>R</italic>
from the beginning to the end; (ii) sum over the occurrence that
<italic>κ</italic>
appears. Then the summation is
<italic>f</italic>
(
<italic>κ</italic>
). This approach works for one
<italic>k</italic>
-mer, but it cannot be applied to all the
<italic>k</italic>
-mers simultaneously as the number of
<italic>k</italic>
-mers can be very large, demanding a huge size of memory.</p>
<p>In this study, we make use of the k-mer counting algorithm, KMC2 [
<xref ref-type="bibr" rid="CR27">27</xref>
], to solve this problem. KMC2 can remarkably reduce the memory usage because: (i) it is disk-based; (ii) it uses (
<italic>k</italic>
,
<italic>x</italic>
)-mer; and (iii) it applies the minimizer idea to deal with
<italic>k</italic>
-mer.</p>
</sec>
<sec id="Sec4">
<title>Computing
<italic>z</italic>
(
<italic>κ</italic>
)</title>
<p>Given a
<italic>k</italic>
-mer
<italic>κ</italic>
, we define the neighbor of
<italic>κ</italic>
,
<italic>N</italic>
(
<italic>κ</italic>
), as
<disp-formula id="Equa">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${N}(\kappa) = \left\{\kappa^{\prime}: {D}\left(\kappa, \kappa^{\prime}\right) \leq d_{0}, \kappa^{\prime} \in {K} \right\}, $$ \end{document}</tex-math>
<mml:math id="M2">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced close="}" open="{" separators="">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>κ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>:</mml:mo>
<mml:mi>D</mml:mi>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mi>κ</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>κ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>κ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo></mml:mo>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>D</italic>
(
<italic>κ</italic>
,
<italic>κ</italic>
<sup></sup>
) is the edit distance between
<italic>κ</italic>
and
<italic>κ</italic>
<sup></sup>
, and the
<italic>d</italic>
<sub>0</sub>
is the predefined maximum distance. The default value of
<italic>d</italic>
<sub>0</sub>
is 1 as used in this study, but user can adjust this value to any reasonable integer.</p>
<p>The
<italic>k</italic>
-mer cluster centered at
<italic>κ</italic>
is defined as
<disp-formula id="Equb">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${C}(\kappa) = \{\kappa\}\cup {N}(\kappa), $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>}</mml:mo>
<mml:mo></mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
and the set of frequencies associated with these
<italic>k</italic>
-mers is defined as
<disp-formula id="Equc">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${F}(\kappa) = \{f(\kappa): \kappa \in {C}(\kappa)\}. $$ \end{document}</tex-math>
<mml:math id="M6">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>:</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>The
<italic>z</italic>
-score of
<italic>κ</italic>
,
<italic>z</italic>
(
<italic>κ</italic>
), is computed by
<disp-formula id="Equd">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$z(\kappa) = \frac{f(\kappa) - \mu}{\sigma}, $$ \end{document}</tex-math>
<mml:math id="M8">
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mi>μ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>μ</italic>
is the averaged frequency of
<italic>F</italic>
(
<italic>κ</italic>
) and
<italic>σ</italic>
is the standard deviation of
<italic>F</italic>
(
<italic>κ</italic>
).</p>
<p>It is straightforward to calculate the
<italic>z</italic>
-score of each
<italic>k</italic>
-mer given the frequency of the
<italic>k</italic>
-mer as well as that of its neighbor that have been determined by the aforementioned approach.</p>
</sec>
<sec id="Sec5">
<title>Determining
<italic>f</italic>
<sub>0</sub>
</title>
<p>Unlike existing approaches that determining solid
<italic>k</italic>
-mers based on their frequency only, we examine their
<italic>z</italic>
-scores as well.</p>
<p>Traditionally, an optimal
<italic>f</italic>
<sub>0</sub>
is used to distinct weak and solid
<italic>k</italic>
-mers, which is determined as the count minimizing misclassification rates (see misclassified parts labeled as
<italic>α</italic>
and
<italic>β</italic>
in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
). To learn the optimal value, we model the frequency of erroneous
<italic>k</italic>
-mers by a Gamma distribution
<italic>P</italic>
<sub>
<italic>G</italic>
</sub>
(
<italic>X</italic>
), and those correct ones by a mixture of Gaussian distributions
<italic>P</italic>
<sub>
<italic>N</italic>
</sub>
(
<italic>X</italic>
). A Gamma distribution is defined as:
<disp-formula id="Eque">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$P_{G}(X=x; k, \theta) = \frac{1}{\Gamma(k)\theta^{k}}x^{k-1}e^{-\frac{x}{\theta}} $$ \end{document}</tex-math>
<mml:math id="M10">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>θ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>Γ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>θ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:msup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>θ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Eque.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>k</italic>
accounts for the shape of the distribution,
<italic>θ</italic>
is for the scale of the distribution, i.e., how the data spread out,
<italic>Γ</italic>
(
<italic>k</italic>
) is the Gamma function evaluated at
<italic>k</italic>
;
<italic>cf</italic>
. the dash sky-blue line in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. While a mixture of Gaussian distributions is
<disp-formula id="Equf">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$P_{N}(X=x; \pi, \mu, \sigma) = \sum\limits_{i=1}^{K} \pi_{i} \cdot \mathcal{N}(\mu_{i}, \sigma_{i}), $$ \end{document}</tex-math>
<mml:math id="M12">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>μ</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>σ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big"></mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:mi mathvariant="script">N</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>μ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equf.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>π</italic>
<sub>
<italic>i</italic>
</sub>
is the mixture parameter,
<italic>μ</italic>
<sub>
<italic>i</italic>
</sub>
and
<italic>σ</italic>
<sub>
<italic>i</italic>
</sub>
represent the mean and standard deviation of the component
<italic>i</italic>
, and
<italic>K</italic>
is the number of Gaussian components. In this study,
<italic>K</italic>
is set as 2, with one accounting for
<italic>k</italic>
-mers that are from GC rich or poor regions, and the other for the rest correct
<italic>k</italic>
-mers.</p>
<p>The two distributions are estimated by using EM algorithm based on the frequencies of
<italic>k</italic>
-mers. An example of the two distributions are shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
, i.e., the sky-blue dash line and the orange dash line. Based on the two distributions, we can determine the threshold
<italic>f</italic>
<sub>0</sub>
, such that it can minimize the area marked as
<italic>α</italic>
and
<italic>β</italic>
. Note that, the threshold
<italic>f</italic>
<sub>0</sub>
determined in this way may not be the intersection point of the two density functions.</p>
</sec>
<sec id="Sec6">
<title>Mining solid
<italic>k</italic>
-mers</title>
<p>It is clear that the optimal
<italic>f</italic>
<sub>0</sub>
cannot perfectly distinct the solid
<italic>k</italic>
-mers from the weak
<italic>k</italic>
-mers. Taking Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
, the
<italic>k</italic>
-mers marked by
<italic>α</italic>
will be wrongly corrected although they do not have errors but just because their frequencies are lower than
<italic>f</italic>
<sub>0</sub>
; likely, the ones marked by
<italic>β</italic>
will keep unchanged although they have errors because they have high frequency. To further refine the
<italic>purity</italic>
as well as the
<italic>completeness</italic>
of solid k-mers, we borrow the statistical idea of using
<italic>z</italic>
-score to solve the problem. The
<italic>purity</italic>
is defined as
<disp-formula id="Equg">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\text{purity}=1-p^{2}_{\text{correct}}-p^{2}_{\text{erroneous}}~, $$ \end{document}</tex-math>
<mml:math id="M14">
<mml:mrow>
<mml:mtext>purity</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>correct</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo></mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>erroneous</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mspace width="1em"></mml:mspace>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equg.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>p</italic>
<sub>correct</sub>
is the proportion of correct
<italic>k</italic>
-mers in the solid
<italic>k</italic>
-mers, and
<italic>p</italic>
<sub>erroneous</sub>
is the proportion of erroneous
<italic>k</italic>
-mers in the solid
<italic>k</italic>
-mers. The
<italic>completeness</italic>
is calculated as
<disp-formula id="Equh">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\text{completeness}=N^{\text{solid}}_{\text{correct}}/N_{\text{correct}}~, $$ \end{document}</tex-math>
<mml:math id="M16">
<mml:mrow>
<mml:mtext>completeness</mml:mtext>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>correct</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>solid</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mo>/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>correct</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mspace width="1em"></mml:mspace>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equh.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$N^{\text {solid}}_{\text {correct}}$\end{document}</tex-math>
<mml:math id="M18">
<mml:msubsup>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>correct</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>solid</mml:mtext>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12864_2018_5272_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the number of correct
<italic>k</italic>
-mers in the solid
<italic>k</italic>
-mers, and
<italic>N</italic>
<sub>correct</sub>
is the total number of correct
<italic>k</italic>
-mers.</p>
<p>The
<italic>z</italic>
-score as well as the frequency are collectively incorporated into solid
<italic>k</italic>
-mer identification through the following two situations:
<list list-type="bullet">
<list-item>
<p>If
<italic>f</italic>
(
<italic>κ</italic>
)<
<italic>f</italic>
<sub>0</sub>
and
<italic>z</italic>
(
<italic>κ</italic>
)≥
<italic>z</italic>
<sub>0</sub>
, then
<italic>κ</italic>
is removed from the weak
<italic>k</italic>
-mers and added to the solid
<italic>k</italic>
-mers, i.e., increases the completeness.</p>
</list-item>
<list-item>
<p>If
<italic>f</italic>
(
<italic>κ</italic>
)≥
<italic>f</italic>
<sub>0</sub>
and
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$z(\kappa) < z^{'}_{0}$\end{document}</tex-math>
<mml:math id="M20">
<mml:mi>z</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>κ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo><</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12864_2018_5272_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
, then
<italic>κ</italic>
is removed from the solid
<italic>k</italic>
-mers and added to the weak
<italic>k</italic>
-mers, i.e., improves the purity.</p>
</list-item>
</list>
</p>
<p>The
<italic>f</italic>
<sub>0</sub>
is the minimum frequency that has been determined, while the
<italic>z</italic>
<sub>0</sub>
and
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$z^{'}_{0}$\end{document}</tex-math>
<mml:math id="M22">
<mml:msubsup>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12864_2018_5272_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
are the maximum
<italic>z</italic>
-score and minimum
<italic>z</italic>
-score for weak
<italic>k</italic>
-mers and solid
<italic>k</italic>
-mers, respectively.</p>
<p>The
<italic>z</italic>
<sub>0</sub>
and
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$z^{'}_{0}$\end{document}</tex-math>
<mml:math id="M24">
<mml:msubsup>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12864_2018_5272_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
are learned from the
<italic>z</italic>
-score distribution automatically. To obtain the optimal
<italic>z</italic>
<sub>0</sub>
, the z-scores of the
<italic>k</italic>
-mers having frequency less than
<italic>f</italic>
<sub>0</sub>
are collected. Later, the distribution of these
<italic>z</italic>
-scores is estimated and
<italic>z</italic>
<sub>0</sub>
is set as the value having the lowest density between two peaks (viz. the trough of the bimodal; see results for more details). Analogously,
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$z^{'}_{0}$\end{document}</tex-math>
<mml:math id="M26">
<mml:msubsup>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12864_2018_5272_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
is determined on the
<italic>z</italic>
-scores of
<italic>k</italic>
-mers having frequency greater than
<italic>f</italic>
<sub>0</sub>
.</p>
</sec>
</sec>
<sec id="Sec7">
<title>Methods</title>
<p>Our error correction model contains two main steps: (i) build Bloom filter from solid
<italic>k</italic>
-mers and; (ii) correct errors in weak
<italic>k</italic>
-mers by the Bloom filter.</p>
<sec id="Sec8">
<title>Build bloom filter</title>
<p>Bloom filter [
<xref ref-type="bibr" rid="CR28">28</xref>
] is a probabilistic data structure that can check whether an item is contained in a set of items with very frugal memory consumption. Instead of storing each item as is, the Bloom filter maps the item into several bits of a bit vector. Each bit can be reused by many items, and the mapping is achieved by hash functions. To check whether an item exists in a set of items, one only need to check whether all the mapped bits are “1”s. In case any one of them is “0”, it indicates that the item is definitely not contained in the set. Since each bit can be reused, it is possible that an item is not contained in the set but all of its mapped bits are “1”s. The probability that it happens is false positive rate. The relation between the number of hash function
<italic>h</italic>
, the false positive rate
<italic>p</italic>
, the size of the bit vector
<italic>n</italic>
, and the actual number of elements
<italic>m</italic>
is
<disp-formula id="Equi">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$p = \left(1-\left(1-\frac{1}{n}\right)^{hm}\right)^{h} \approx \left(1-e^{-h\frac{m}{n}}\right)^{h}. $$ \end{document}</tex-math>
<mml:math id="M28">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">hm</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mi>h</mml:mi>
<mml:mfrac>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2018_5272_Article_Equi.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>In our study,
<italic>m</italic>
is the number of solid
<italic>k</italic>
-mers that have been determined from all the
<italic>k</italic>
-mers by means of the aforementioned algorithm. Per existing approaches,
<italic>p</italic>
is set to 1%. One can also tune
<italic>p</italic>
,
<italic>h</italic>
and
<italic>n</italic>
to fit the real hardware limitations.</p>
<p>It has been reported that the Bloom filter has been successfully used to correct NGS errors, such as BLESS2 [
<xref ref-type="bibr" rid="CR17">17</xref>
] and BFC [
<xref ref-type="bibr" rid="CR16">16</xref>
]. The major difference between our model and the existing models is that we dedicate to efficiently refine the solid
<italic>k</italic>
-mers that are used to construct Bloom filter, which directly improves the error correction performance in theory. Note that, the solid
<italic>k</italic>
-mers play the key role in error correction, as all the rest
<italic>k</italic>
-mers (viz. the weak
<italic>k</italic>
-mers) are to be corrected based on the
<italic>solid</italic>
ones.</p>
<p>Figure 
<xref rid="Fig2" ref-type="fig">2</xref>
illustrates the forward search and backward search.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Illustration of the forward and backward search to correct sequencing errors. The forward search starts from the first
<italic>k</italic>
-mer to the last
<italic>k</italic>
-mer. At each step the last base of the
<italic>k</italic>
-mer is substituted by its alternatives to check the solidity. Inversely, the backward search starts from the last
<italic>k</italic>
-mer to the first
<italic>k</italic>
-mer. On the contrary to the forward search, the first base of the
<italic>k</italic>
-mers are altered other than the last one</p>
</caption>
<graphic xlink:href="12864_2018_5272_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
</sec>
<sec id="Sec9">
<title>Correct errors</title>
<p>By using Bloom filter, the errors contained in each read can be correct as follows: (i) check the existence of each
<italic>k</italic>
-mer of the read from the beginning to the end sequentially. (ii) partition the
<italic>k</italic>
-mers into groups that each group contains only solid
<italic>k</italic>
-mers or weak
<italic>k</italic>
-mers, deemed as solid group
<italic>G</italic>
<sub>
<italic>s</italic>
</sub>
or weak group
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
, respectively. The order of the groups is kept according to their appearance in the read. (iii) correct the errors causing the weak group
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
according to the following situations:
<list list-type="order">
<list-item>
<p>If
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
is the first group and there exists a successive group
<italic>G</italic>
<sub>
<italic>s</italic>
</sub>
that is solid, we iteratively change the first base of each
<italic>k</italic>
-mer of
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
to its alternatives and check the existence of the
<italic>k</italic>
-mers against the Bloom filter. Once there exist a solution that makes all the weak
<italic>k</italic>
-mers solid, the amendment of the bases is accepted, thus the correction of the error. This process is applied to the
<italic>k</italic>
-mers of
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
from the last one to the first one. In case the number of
<italic>k</italic>
-mers contained in
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
is less than a predefined value, say
<italic>τ</italic>
, the processive solid
<italic>k</italic>
-mers that are extended from the corrected
<italic>k</italic>
-mers will be generated until the total number of
<italic>k</italic>
-mers in
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
is
<italic>τ</italic>
. If this criterion cannot be satisfied, the solution is abandoned. On the other hand, if
<italic>G</italic>
<sub>
<italic>s</italic>
</sub>
does not exist, we will alter the bases to their alternatives of all the
<italic>k</italic>
-mers iteratively until a solution that make all the
<italic>k</italic>
-mers solid can be found.</p>
</list-item>
<list-item>
<p>If
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
has a solid processive group
<italic>G</italic>
<sub>
<italic>s</italic>
</sub>
and a solid successive group
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$G_{s}^{'}$\end{document}</tex-math>
<mml:math id="M30">
<mml:msubsup>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12864_2018_5272_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
, we substitute the last base of each
<italic>k</italic>
-mer in
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
by its alternatives from the first
<italic>k</italic>
-mer to the last
<italic>k</italic>
-mer, namely the forward search. Solutions that make all the
<italic>k</italic>
-mers solid till the current substitution are recorded. Similarly, the backward search is conducted on the first base of the
<italic>k</italic>
-mers from the last one to the first one. A solution is accepted if the forward search and the backward search meet and the
<italic>k</italic>
-mers contained in both of them are solid. In case the number of
<italic>k</italic>
-mers in
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
is less than
<italic>k</italic>
, we will only alter the last base of the first
<italic>k</italic>
-mer.</p>
</list-item>
<list-item>
<p>If
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
is the last group and there exists a solid processive group
<italic>G</italic>
<sub>
<italic>s</italic>
</sub>
, we will apply the backward search to obtain the solution. Analogously to the first situation, if the number of
<italic>k</italic>
-mers of
<italic>G</italic>
<sub>
<italic>w</italic>
</sub>
is less than
<italic>τ</italic>
, we will extend the
<italic>k</italic>
-mers toward their downstream until the number is satisfied. In case
<italic>G</italic>
<sub>
<italic>s</italic>
</sub>
does not exist, it is the same as the second part of the first situation, thus the same approach is applied.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec id="Sec10" sec-type="results">
<title>Results</title>
<sec id="Sec11">
<title>Datasets</title>
<p>We collected six data sets to test the performance of our proposed method in comparison with the state-of-art methods. Four of the six data sets are the NGS reads produced by the Illumina platform, including Staphylococcus aureus (S. aureus), Rhodobacter sphaeroides (R. sphaeroides), Human Chromosome 14 (H. chromosome 14) and Bombus impatiens (B. impatiens). These data sets are the gold standards used by GAGE [
<xref ref-type="bibr" rid="CR6">6</xref>
] for NGS data analysis. Besides these real data sets, two synthesized data sets have been generated by using ART [
<xref ref-type="bibr" rid="CR29">29</xref>
] based on the genomes H. chromosome 14 and B. impatiens. The two synthetic data sets contain exactly the same number of reads as the real ones. They are included because the ground truth of the synthesized errors are known, i.e., the positions of the errors as well as their bases are available. On the contrary, such information is unavailable for the real data sets. Typically, the raw reads of the real data sets are mapped to the corresponding reference, and those mapped are kept for performance evaluation. Although this is arguable as various deleterious situations can emerge from the mapping, e.g., unmapped reads, multi-mapped reads, wrongly mapped reads, it is necessary to carry out the mapping as only in this way can we perform the evaluation directly. This is another reason that the synthetic data should be included. Details of these data sets are shown in Table 
<xref rid="Tab1" ref-type="table">1</xref>
.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>The data sets that are used for evaluating the performance of error correction models</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Data set</th>
<th align="left">Genome name</th>
<th align="left">Genome size (bp)</th>
<th align="left">Error rate (%)</th>
<th align="left">Read length (bp)</th>
<th align="left">Coverage</th>
<th align="left">Number of reads</th>
<th align="left">Insert length</th>
<th align="left">Is sythetic</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">R1</td>
<td align="left">S. aueus</td>
<td align="left">2,821,361</td>
<td align="left">1.28</td>
<td align="left">101</td>
<td align="left">46.3 ×</td>
<td align="left">1,294,104</td>
<td align="left">180</td>
<td align="left">No</td>
</tr>
<tr>
<td align="left">R2</td>
<td align="left">R. sphaeroides</td>
<td align="left">4,603,110</td>
<td align="left">1.08</td>
<td align="left">101</td>
<td align="left">45.0 ×</td>
<td align="left">2,050,868</td>
<td align="left">180</td>
<td align="left">No</td>
</tr>
<tr>
<td align="left">R3</td>
<td align="left">H. chromosome 14</td>
<td align="left">88,218,286</td>
<td align="left">0.52</td>
<td align="left">101</td>
<td align="left">41.8 ×</td>
<td align="left">36,504,800</td>
<td align="left">155</td>
<td align="left">No</td>
</tr>
<tr>
<td align="left">R4</td>
<td align="left">B. impatiens</td>
<td align="left">249,185,056</td>
<td align="left">0.86</td>
<td align="left">124</td>
<td align="left">150.8 ×</td>
<td align="left">303,118,594</td>
<td align="left">400</td>
<td align="left">No</td>
</tr>
<tr>
<td align="left">S1</td>
<td align="left">H. chromosome 14</td>
<td align="left">88,218,286</td>
<td align="left">0.97</td>
<td align="left">101</td>
<td align="left">41.8 ×</td>
<td align="left">36,504,800</td>
<td align="left">180</td>
<td align="left">Yes</td>
</tr>
<tr>
<td align="left">S2</td>
<td align="left">B. impatiens</td>
<td align="left">249,185,056</td>
<td align="left">0.98</td>
<td align="left">124</td>
<td align="left">150.8 ×</td>
<td align="left">303,118,594</td>
<td align="left">400</td>
<td align="left">Yes</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec12">
<title>Performance evaluation</title>
<p>The error correction performance is evaluated through the widely accepted procedure implemented by [
<xref ref-type="bibr" rid="CR30">30</xref>
]. Metrics that are considered include
<italic>gain</italic>
,
<italic>recall</italic>
,
<italic>precision</italic>
and per base error rate (pber). Gain is defined as (
<italic>T</italic>
<italic>P</italic>
<italic>F</italic>
<italic>P</italic>
)/(
<italic>T</italic>
<italic>P</italic>
+
<italic>F</italic>
<italic>N</italic>
), recall is
<italic>T</italic>
<italic>P</italic>
/(
<italic>T</italic>
<italic>P</italic>
+
<italic>F</italic>
<italic>N</italic>
), precision is
<italic>T</italic>
<italic>P</italic>
/(
<italic>T</italic>
<italic>P</italic>
+
<italic>F</italic>
<italic>P</italic>
) and pber is
<italic>N</italic>
<sup>
<italic>e</italic>
</sup>
/
<italic>N</italic>
, where
<italic>TP</italic>
stands for the number of corrected bases that are truly erroneous bases,
<italic>FP</italic>
represents the number of corrected bases that are not sequencing errors intrinsically,
<italic>FN</italic>
is the number of erroneous bases that remain untouched,
<italic>N</italic>
<sup>
<italic>e</italic>
</sup>
is the number of erroneous bases and
<italic>N</italic>
is the total number of bases. Among these metrics,
<italic>gain</italic>
is the most informative.</p>
<p>All experiments are carried out on a cluster having eight Intel Xeon E7 CPUs and 1Tb RAM. Each CPU has eight cores.</p>
<p>
<italic>Overall Performance of ZEC</italic>
. The experimental results of ZEC are presented in Table 
<xref rid="Tab2" ref-type="table">2</xref>
. ZEC performs well on both of the real data sets and the synthetic data sets. Comparing the performance on H. chromosome 14 and B. impatiens, ZEC has a much better performance on S. aueus and R. sphaeroides. This is consistent with our understanding that the genomes of the former two data sets are much more complicated than the latter two, where the errors introduced in complicated genomes are more difficult to correct.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Error-correction performance comparison between ZEC, Lighter, Racer, BLESS2, Musket, BFC, SGA and MEC</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Data</th>
<th align="left">Corrector</th>
<th align="left">Gain</th>
<th align="left">Reca</th>
<th align="left">Prec</th>
<th align="left">Pber(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">R1</td>
<td align="left">ZEC</td>
<td align="left">0.908</td>
<td align="left">0.912</td>
<td align="left">0.996</td>
<td align="left">0.102</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Lighter</td>
<td align="left">0.839</td>
<td align="left">0.845</td>
<td align="left">0.994</td>
<td align="left">0.163</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Racer</td>
<td align="left">0.760</td>
<td align="left">0.822</td>
<td align="left">0.929</td>
<td align="left">0.190</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BLESS2</td>
<td align="left">0.189</td>
<td align="left">0.409</td>
<td align="left">0.650</td>
<td align="left">0.879</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Musket</td>
<td align="left">0.499</td>
<td align="left">0.628</td>
<td align="left">0.830</td>
<td align="left">0.448</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">SGA</td>
<td align="left">0.746</td>
<td align="left">0.815</td>
<td align="left">0.922</td>
<td align="left">0.202</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BFC</td>
<td align="left">0.753</td>
<td align="left">0.817</td>
<td align="left">0.927</td>
<td align="left">0.196</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MEC</td>
<td align="left">
<bold>0.909</bold>
</td>
<td align="left">0.911</td>
<td align="left">0.998</td>
<td align="left">0.102</td>
</tr>
<tr>
<td align="left">R2</td>
<td align="left">ZEC</td>
<td align="left">
<bold>0.584</bold>
</td>
<td align="left">0.663</td>
<td align="left">0.894</td>
<td align="left">0.537</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Lighter</td>
<td align="left">0.226</td>
<td align="left">0.329</td>
<td align="left">0.762</td>
<td align="left">1.076</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Racer</td>
<td align="left">0.364</td>
<td align="left">0.450</td>
<td align="left">0.839</td>
<td align="left">0.780</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BLESS2</td>
<td align="left">0.318</td>
<td align="left">0.405</td>
<td align="left">0.806</td>
<td align="left">0.890</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Musket</td>
<td align="left">0.265</td>
<td align="left">0.364</td>
<td align="left">0.786</td>
<td align="left">0.984</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">SGA</td>
<td align="left">0.331</td>
<td align="left">0.423</td>
<td align="left">0.822</td>
<td align="left">0.843</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BFC</td>
<td align="left">0.306</td>
<td align="left">0.400</td>
<td align="left">0.811</td>
<td align="left">0.893</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MEC</td>
<td align="left">0.570</td>
<td align="left">0.631</td>
<td align="left">0.912</td>
<td align="left">0.541</td>
</tr>
<tr>
<td align="left">R3</td>
<td align="left">ZEC</td>
<td align="left">
<bold>0.802</bold>
</td>
<td align="left">0.923</td>
<td align="left">0.884</td>
<td align="left">0.087</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Lighter</td>
<td align="left">0.445</td>
<td align="left">0.764</td>
<td align="left">0.706</td>
<td align="left">0.256</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Racer</td>
<td align="left">0.562</td>
<td align="left">0.814</td>
<td align="left">0.764</td>
<td align="left">0.196</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BLESS2</td>
<td align="left">0.130</td>
<td align="left">0.641</td>
<td align="left">0.556</td>
<td align="left">0.438</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Musket</td>
<td align="left">0.533</td>
<td align="left">0.802</td>
<td align="left">0.749</td>
<td align="left">0.211</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">SGA</td>
<td align="left">0.567</td>
<td align="left">0.818</td>
<td align="left">0.765</td>
<td align="left">0.194</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BFC</td>
<td align="left">0.603</td>
<td align="left">0.833</td>
<td align="left">0.783</td>
<td align="left">0.176</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MEC</td>
<td align="left">0.788</td>
<td align="left">0.852</td>
<td align="left">0.930</td>
<td align="left">0.117</td>
</tr>
<tr>
<td align="left">R4</td>
<td align="left">ZEC</td>
<td align="left">
<bold>0.746</bold>
</td>
<td align="left">0.833</td>
<td align="left">0.905</td>
<td align="left">0.137</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Lighter</td>
<td align="left">0.126</td>
<td align="left">0.408</td>
<td align="left">0.591</td>
<td align="left">0.688</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Racer</td>
<td align="left">0.313</td>
<td align="left">0.541</td>
<td align="left">0.703</td>
<td align="left">0.484</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BLESS2</td>
<td align="left">-0.517</td>
<td align="left">0.018</td>
<td align="left">0.003</td>
<td align="left">0.862</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Musket</td>
<td align="left">0.502</td>
<td align="left">0.660</td>
<td align="left">0.807</td>
<td align="left">0.320</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">SGA</td>
<td align="left">0.542</td>
<td align="left">0.690</td>
<td align="left">0.823</td>
<td align="left">0.289</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BFC</td>
<td align="left">0.195</td>
<td align="left">0.457</td>
<td align="left">0.636</td>
<td align="left">0.607</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MEC</td>
<td align="left">0.705</td>
<td align="left">0.806</td>
<td align="left">0.889</td>
<td align="left">0.201</td>
</tr>
<tr>
<td align="left">S1</td>
<td align="left">ZEC</td>
<td align="left">
<bold>0.918</bold>
</td>
<td align="left">0.935</td>
<td align="left">0.982</td>
<td align="left">0.056</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Lighter</td>
<td align="left">0.791</td>
<td align="left">0.851</td>
<td align="left">0.934</td>
<td align="left">0.130</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Racer</td>
<td align="left">0.882</td>
<td align="left">0.916</td>
<td align="left">0.964</td>
<td align="left">0.071</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BLESS2</td>
<td align="left">0.634</td>
<td align="left">0.740</td>
<td align="left">0.875</td>
<td align="left">0.243</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Musket</td>
<td align="left">0.819</td>
<td align="left">0.871</td>
<td align="left">0.944</td>
<td align="left">0.111</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">SGA</td>
<td align="left">0.810</td>
<td align="left">0.865</td>
<td align="left">0.940</td>
<td align="left">0.117</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BFC</td>
<td align="left">0.866</td>
<td align="left">0.903</td>
<td align="left">0.961</td>
<td align="left">0.081</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MEC</td>
<td align="left">0.899</td>
<td align="left">0.916</td>
<td align="left">0.982</td>
<td align="left">0.063</td>
</tr>
<tr>
<td align="left">S2</td>
<td align="left">ZEC</td>
<td align="left">
<bold>0.853</bold>
</td>
<td align="left">0.894</td>
<td align="left">0.956</td>
<td align="left">0.109</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Lighter</td>
<td align="left">0.058</td>
<td align="left">0.329</td>
<td align="left">0.548</td>
<td align="left">0.891</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Racer</td>
<td align="left">0.168</td>
<td align="left">0.408</td>
<td align="left">0.630</td>
<td align="left">0.720</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BLESS2</td>
<td align="left">0.311</td>
<td align="left">0.509</td>
<td align="left">0.719</td>
<td align="left">0.543</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Musket</td>
<td align="left">0.232</td>
<td align="left">0.453</td>
<td align="left">0.672</td>
<td align="left">0.636</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">SGA</td>
<td align="left">0.075</td>
<td align="left">0.342</td>
<td align="left">0.562</td>
<td align="left">0.862</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">BFC</td>
<td align="left">0.751</td>
<td align="left">0.822</td>
<td align="left">0.920</td>
<td align="left">0.157</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MEC</td>
<td align="left">0.849</td>
<td align="left">0.887</td>
<td align="left">0.959</td>
<td align="left">0.122</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The numbers in bold face are the best gain achieved for each data set</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>
<italic>Relation with GC-content</italic>
. A previous study by Ross et al. [
<xref ref-type="bibr" rid="CR26">26</xref>
] shows that the GC-content (GC poor and GC rich) regions have direct influence on the low sequencing coverage of NGS data. Hence, the
<italic>k</italic>
-mers obtained from the reads sequenced from these regions are more likely to be treated as weak. Figure 
<xref rid="Fig3" ref-type="fig">3</xref>
highlights an example of the relation between GC-content (GC poor and GC rich) and
<italic>k</italic>
-mer frequency derived from H. chromosome 14. It can be seen that the
<italic>k</italic>
-mers having a low frequency can spread out wider than those having a high frequency, and the wide range is coincident with the GC content. This result is in accordance with the performance shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
, meanwhile it also consolidates our intuition that refining the set of solid
<italic>k</italic>
-mers is necessary, particularly for the subset of
<italic>k</italic>
-mers that have a low frequency. More importantly, it empirically supports our idea of using mixture model to treat solid
<italic>k</italic>
-mers and weak
<italic>k</italic>
-mers separately.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>A relation between
<italic>k</italic>
-mer frequency and GC-content. The bottom left panel shows the smoothed scatter plot between
<italic>k</italic>
-mer frequency and GC-content, the top left is the distribution of
<italic>k</italic>
-mer frequency, and the bottom right is the distribution of GC-content. It is clear that GC-content
<italic>k</italic>
-mers have relatively low frequency. The data shown in this example is obtained from the H. chromosome 14 with
<italic>k</italic>
-mer size of 25</p>
</caption>
<graphic xlink:href="12864_2018_5272_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>
<italic>Comparison with State-of-the-art</italic>
. The performance of ZEC is much superior to the state-of-the-art methods, including Lighter [
<xref ref-type="bibr" rid="CR14">14</xref>
], Racer [
<xref ref-type="bibr" rid="CR12">12</xref>
], BLESS2 [
<xref ref-type="bibr" rid="CR17">17</xref>
], Musket [
<xref ref-type="bibr" rid="CR13">13</xref>
], SGA [
<xref ref-type="bibr" rid="CR11">11</xref>
], BFC [
<xref ref-type="bibr" rid="CR16">16</xref>
]. See Table 
<xref rid="Tab2" ref-type="table">2</xref>
. ZEC markedly outperforms the existing error correctors in terms of the most informative evaluation metric—
<italic>gain</italic>
. For instance, on the dataset R4, the gain of ZEC is 0.746, while the best performance produced by the other methods is 0.705. For the synthetic datasets, ZEC also has higher gain than other methods. For example, on the dataset S2, the gain of ZEC is 0.853, while the best and worst gain generated by the other methods are 0.849 and 0.058, respectively. The lowest average per-base error rate of ZEC also consolidates its effectiveness.</p>
</sec>
<sec id="Sec13">
<title>Distinguishbility of z-score</title>
<p>The key to the performance improvement is the idea of using
<italic>z</italic>
-score for identifying the two special subsets of
<italic>k</italic>
-mers from the sets of solid
<italic>k</italic>
-mers and weak
<italic>k</italic>
-mers. An example of
<italic>z</italic>
-score distribution pertaining to
<italic>k</italic>
-mer frequency is shown in Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
, which is derived from B. impatiens. The highlighted
<italic>k</italic>
-mers shown in the figure have relatively low frequencies—less than 9, while the
<italic>z</italic>
-scores are pretty high—greater than 1. Interestingly, almost all the solid
<italic>k</italic>
-mers (the top right region) have the similar level of
<italic>z</italic>
-scores comparing to these highlighted ones. These observations indicate that the highlighted k-mers are very likely to be correct
<italic>k</italic>
-mers instead of erroneous
<italic>k</italic>
-mers although their frequencies are very low. The
<italic>z</italic>
-score distribution pertaining to the other three real data sets has similar patterns compared to the one shown here.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>A relation between
<italic>z</italic>
-score and
<italic>k</italic>
-mer frequency. The level of shade represents the density of the distribution. The darker the color is, the more
<italic>k</italic>
-mers are presented. The frequencies of the
<italic>k</italic>
-mers highlighted in the red box are less than nine, which are very likely to be treated as weak for all existing
<italic>k</italic>
-mer based approaches. However, the very high
<italic>z</italic>
-score reflects that they should be treated as solid
<italic>k</italic>
-mers. The data shown here is obtained from B. impatiens with
<italic>k</italic>
-mer size of 25</p>
</caption>
<graphic xlink:href="12864_2018_5272_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>By exploring the four real data sets, we found that the proportion of
<italic>k</italic>
-mers that can be refined comparing to the solely frequency determined
<italic>k</italic>
-mers are 12.3%, 14.2%, 11.4%, 7.1% for the real data R1, R2, R3 and R4, respectively; see Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
. These refinement are the major contributions of the performance improvement.
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>The proportion of
<italic>k</italic>
-mers refined by
<italic>z</italic>
-score. The refinements come from two folds: weak
<italic>k</italic>
-mers having high
<italic>z</italic>
-score (moved to the solid
<italic>k</italic>
-mer set), and solid
<italic>k</italic>
-mers having low
<italic>z</italic>
-score (excluded from the solid
<italic>k</italic>
-mer set)</p>
</caption>
<graphic xlink:href="12864_2018_5272_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
</sec>
<sec id="Sec14">
<title>Efficiency of
<italic>z</italic>
-score calculation</title>
<p>Calculating z-score of
<italic>k</italic>
-mers is not trivial for very large data sets, as the
<italic>k</italic>
-mers and their frequencies are usually too large to be hold by a main memory of a moderate computer. We designed a novel algorithm and solved this problem. The efficiency of the algorithm in terms of the memory usage and running speed are studied.</p>
<p>Figure 
<xref rid="Fig6" ref-type="fig">6</xref>
shows the relation between memory saving ratio and the percentage of input (
<italic>k</italic>
-mers as well as their frequencies) that can be held by only one bit vector. The memory saving ratio is calibrated as the ratio between the real memory allocation and the input data volume. For instance, the ratio of 0.01 pertaining to the data R4 means that the allocated memory is one percent of the input size of R4. That is, 70Mb memory is allocated for holding the 6.97Gb data. It is promising that, with one percent memory allocation, around 22 percent input data can be hold by only one bit vector. When the memory allocation increased to 2.5 percent, 30 percent input data and even more can be held by one bit vector. Obviously, keep increasing the size of allocated memory does not guarantee the linear scale of holding the input data. Based on the experiments, we set the memory allocation ratio to 2.5 percent through the whole study. Typically, three bit vectors are constructed for holding all the input. Note that, the size of bit vector decreases along with the reduced size of input. The ratios between the input and the total allocated memory are 20.0, 13.4, 7.0 and 7.9 for the four real datasets, respectively.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Memory saving analysis on the six data sets. The x-axis shows the memory saving ratio between the size of real memory allocation and raw input, while the y-axis shows how much proportion of an input held by a bit vector</p>
</caption>
<graphic xlink:href="12864_2018_5272_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
<p>Regarding the running speed, this algorithm is linearly scaled. Since locating each
<italic>k</italic>
-mer in a bit vector is O(1) pertaining to time complexity by using hash, this algorithm is pretty fast. For instance, based on our computing power, it only takes 387 s to construct the bit vectors and calculate the z-scores of all the
<italic>k</italic>
-mers of R4—the largest data set.</p>
<p>Since a Bloom Filter has false positives, this may cause the
<italic>z</italic>
-score of a
<italic>k</italic>
-mer different from its genuine value. However, the false positive rate is pretty small, usually less than 1%, thus this impact can be neglected.</p>
</sec>
</sec>
<sec id="Sec15" sec-type="discussion">
<title>Discussion</title>
<p>Our model effectively pinpoints out correct
<italic>k</italic>
-mers having low frequency, achieving an improvement of 11.25% on weak
<italic>k</italic>
-mers. However, some issues still remain further exploration, including neighbor inclusion and neighbor retrieval.</p>
<p>Neighbor inclusion means how neighbor
<italic>k</italic>
-mers are determined given a
<italic>k</italic>
-mer of interest, say
<italic>κ</italic>
. Our current approach takes
<italic>k</italic>
-mers having edit distance of 1 as neighbors of
<italic>κ</italic>
, but there still has a small chance that a true neighbor having edit distance larger than 1. Suppose the error rate is
<italic>e</italic>
, the probability of a
<italic>k</italic>
-mer having exactly one error is
<italic>k</italic>
·
<italic>e</italic>
(1−
<italic>e</italic>
)
<sup>
<italic>k</italic>
−1</sup>
/
<italic>k</italic>
·
<italic>e</italic>
=(1−
<italic>e</italic>
)
<sup>
<italic>k</italic>
−1</sup>
. When
<italic>e</italic>
=1% and
<italic>k</italic>
=1, the probability is (1−0.01)
<sup>31−1</sup>
=73.97
<italic>%</italic>
. That been said, about 26% real neighbors are excluded. However, even extending the minimum edit distance from 1 to 2 significantly elongates running time. This is because the number of candidate
<italic>k</italic>
-mers increases from 3∗
<italic>k</italic>
to 3∗
<italic>k</italic>
∗3∗(
<italic>k</italic>
−1).</p>
<p>Neighbor retrieval is another issue to be considered. Usually, the size of counted
<italic>k</italic>
-mers is too large to fit into a main memory. Hence, a more sophisticated approach is required to solve this problem. We use Bloom Filter to overcome the limitation. For
<italic>k</italic>
-mers having small count, say 5, we use classical Bloom Filters to save them, each Bloom Filter saves
<italic>k</italic>
-mers having the same count. For
<italic>k</italic>
-mers having large count, we use coupled-Bloom Filter to save them. One Bloom Filter for
<italic>k</italic>
-mer encoding, while the other is for count representation. This approach significantly reduces memory usage while achieving constant time complexity of
<italic>k</italic>
-mer retrieval. However, it may cause false positives although the probability is small. Hence, more effort is required to handle this problem.</p>
</sec>
<sec id="Sec16" sec-type="conclusion">
<title>Conclusions</title>
<p>We have proposed a novel method for correcting the NGS errors. The novel idea is the use of statistically-solid
<italic>k</italic>
-mers to construct the Bloom filter. These
<italic>k</italic>
-mers are mined from all the
<italic>k</italic>
-mers of a NGS data set by considering both their frequency and
<italic>z</italic>
-score, particular the latter one that can effectively fishing out the solid
<italic>k</italic>
-mers having low frequency. Pinpointing out such
<italic>k</italic>
-mers has been a very challenging problem. The experimental results show that our approach markedly outperforms the existing state-of-the-art methods in terms of error correction performance.</p>
</sec>
</body>
<back>
<ack>
<p>We thank the anonymous reviewers for their valuable comments and insightful suggestions.</p>
<sec id="d29e3993">
<title>Funding</title>
<p>This study is collectively supported by the National Natural Science Foundation of China (No. 31501070), the Natural Science Foundation of Hubei (No. 2017CFB137) and Guangxi (No. 2016GXNSFCA380006), the Scientific Research Foundation of GuangXi University (No. XGZ150316) and Taihe hospital (No. 2016JZ11), and the Australia Research Council (ARC) Discovery Project 180100120. Publication costs are funded by the National Natural Science Foundation of China (No. 31501070).</p>
</sec>
<sec id="d29e3998">
<title>Availability of data and materials</title>
<p>The source codes are available at
<ext-link ext-link-type="uri" xlink:href="https://www.github.com/lzhlab/zec/">github.com/lzhlab/zec/</ext-link>
.</p>
</sec>
<sec id="d29e4008">
<title>About this supplement</title>
<p>This article has been published as part of
<italic>BMC Genomics Volume 19 Supplement 10, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): genomics</italic>
. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="https://bmcgenomics.biomedcentral.com/articles/supplements/volume-19-supplement-10">https://bmcgenomics.biomedcentral.com/articles/supplements/volume-19-supplement-10</ext-link>
.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>LZ conceived and designed the experiments, LZ and JL wrote the manuscript. Program coding: LZ, YW and ZZ. Data analyses: LZ, JX, LB, WC, MW and ZZ. All authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec>
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alic</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Ruzafa</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Dopazo</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Blanquer</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Objective review of de novo stand-alone error correction methods for ngs data</article-title>
<source>WIREs Comput Mol Sci</source>
<year>2016</year>
<volume>6</volume>
<fpage>111</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="doi">10.1002/wcms.1239</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<mixed-citation publication-type="other">The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010; 467:1061–73.</mixed-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Quake: Quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol</source>
<year>2010</year>
<volume>11</volume>
<issue>11</issue>
<fpage>116</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2010-11-11-r116</pub-id>
<pub-id pub-id-type="pmid">20423531</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hackl</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hedrich</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schultz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Förster</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>proovread: large-scale high-accuracy pacbio correction through iterative short read consensus</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>21</issue>
<fpage>3004</fpage>
<lpage>11</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu392</pub-id>
<pub-id pub-id-type="pmid">25015988</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goodwin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gurtowski</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ethe-Sayers</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Deshpande</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>McCombie</surname>
<given-names>WR</given-names>
</name>
</person-group>
<article-title>Oxford nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome</article-title>
<source>Genome Res</source>
<year>2015</year>
<volume>25</volume>
<issue>11</issue>
<fpage>1750</fpage>
<lpage>1756</lpage>
<pub-id pub-id-type="doi">10.1101/gr.191395.115</pub-id>
<pub-id pub-id-type="pmid">26447147</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Zimin</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Puiu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Magoc</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Treangen</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Yorke</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>GAGE: A critical evaluation of genome assemblies and assembly algorithms</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>22</volume>
<issue>3</issue>
<fpage>557</fpage>
<lpage>67</lpage>
<pub-id pub-id-type="doi">10.1101/gr.131383.111</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Zhan</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>An Efficient Hybrid Approach to Correcting Errors in Short Reads</article-title>
<source>Modeling Decision for Artificial Intelligence: 8th International Conference, MDAI 2011, Changsha, Hunan, China, July 28-30, 2011, Proceedings</source>
<year>2011</year>
<publisher-loc>Berlin</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Correcting errors in short reads by multiple alignments</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>11</issue>
<fpage>1455</fpage>
<lpage>61</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr170</pub-id>
<pub-id pub-id-type="pmid">21471014</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Dorman</surname>
<given-names>KS</given-names>
</name>
<name>
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Reptile: Representative tiling for short read error correction</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>2526</fpage>
<lpage>33</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq468</pub-id>
<pub-id pub-id-type="pmid">20834037</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Maskell</surname>
<given-names>DL</given-names>
</name>
</person-group>
<article-title>DecGPU: Distributed error correction on massively parallel graphics processing units using CUDA and MPI</article-title>
<source>BMC Bioinforma</source>
<year>2011</year>
<volume>12</volume>
<fpage>85</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-85</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Efficient de novo assembly of large genomes using compressed data structures</article-title>
<source>Genome Res</source>
<year>2012</year>
<volume>22</volume>
<issue>3</issue>
<fpage>549</fpage>
<lpage>56</lpage>
<pub-id pub-id-type="doi">10.1101/gr.126953.111</pub-id>
<pub-id pub-id-type="pmid">22156294</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Molnar</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Racer: Rapid and accurate correction of errors in reads</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>19</issue>
<fpage>2490</fpage>
<lpage>3</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt407</pub-id>
<pub-id pub-id-type="pmid">23853064</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Musket: A multistage k-mer spectrum-based error corrector for Illumina sequence data</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>3</issue>
<fpage>308</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts690</pub-id>
<pub-id pub-id-type="pmid">23202746</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Florea</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Lighter: fast and memory-efficient sequencing error correction without counting</article-title>
<source>Genome Biol</source>
<year>2014</year>
<volume>15</volume>
<issue>11</issue>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1186/s13059-014-0509-9</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Greenfield</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kx</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ax</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Cx</surname>
<given-names>BD</given-names>
</name>
</person-group>
<article-title>Blue: correcting sequencing errors using consensus and context</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>19</issue>
<fpage>2723</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu368</pub-id>
<pub-id pub-id-type="pmid">24919879</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<mixed-citation publication-type="other">Li H. Correcting Illumina sequencing errors for human data. arXiv preprint. 2015;:arXiv:1502.03744.</mixed-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heo</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Ramachandran</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hwu</surname>
<given-names>W-M</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>BLESS 2: accurate, memory-efficient and fast error correction method</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>15</issue>
<fpage>2369</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw146</pub-id>
<pub-id pub-id-type="pmid">27153708</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xiao</surname>
<given-names>C-L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>S-Q</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K-N</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads</article-title>
<source>Nat Methods</source>
<year>2017</year>
<volume>14</volume>
<fpage>1072</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.4432</pub-id>
<pub-id pub-id-type="pmid">28945707</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>SHREC: A short-read error correction method</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>17</issue>
<fpage>2157</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp379</pub-id>
<pub-id pub-id-type="pmid">19542152</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Correction of sequencing errors in a mixed set of reads</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>12</issue>
<fpage>1284</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq151</pub-id>
<pub-id pub-id-type="pmid">20378555</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Fazayeli</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ilie</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>HiTEC: Accurate error correction in high-throughput sequencing data</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>3</issue>
<fpage>295</fpage>
<lpage>302</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq653</pub-id>
<pub-id pub-id-type="pmid">21115437</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schulz</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>Weese</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Holtgrewe</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Dimitrova</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Niu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Hx</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Fiona: a parallel and automatic strategy for read error correction</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>17</issue>
<fpage>356</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu440</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kao</surname>
<given-names>W-C</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>YS</given-names>
</name>
</person-group>
<article-title>ECHO: A reference-free short-read error correction algorithm</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<fpage>1181</fpage>
<lpage>92</lpage>
<pub-id pub-id-type="doi">10.1101/gr.111351.110</pub-id>
<pub-id pub-id-type="pmid">21482625</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<mixed-citation publication-type="other">Chen CC, Chang YJ, Chung WC, Lee DT, Ho JM. CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework. In: Big Data, 2013 IEEE International Conference On: 2013. p. 717–22.</mixed-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>MapReduce for accurate error correction of next-generation sequencing data</article-title>
<source>Bioinformatics</source>
<year>2017</year>
<volume>33</volume>
<issue>23</issue>
<fpage>3844</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btx089</pub-id>
<pub-id pub-id-type="pmid">28205674</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ross</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Russ</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Costello</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hollinger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lennon</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Hegarty</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Jaffe</surname>
<given-names>DB</given-names>
</name>
</person-group>
<article-title>Characterizing and measuring bias in sequence data</article-title>
<source>Genome Biol</source>
<year>2013</year>
<volume>14</volume>
<issue>5</issue>
<fpage>51</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2013-14-5-r51</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kokot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grabowski</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>KMC 2: fast and resource-frugal k-mer counting</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>10</issue>
<fpage>1569</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv022</pub-id>
<pub-id pub-id-type="pmid">25609798</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bloom</surname>
<given-names>BH</given-names>
</name>
</person-group>
<article-title>Space/Time Trade-offs in Hash Coding with Allowable Errors</article-title>
<source>Commun ACM</source>
<year>1970</year>
<volume>13</volume>
<issue>7</issue>
<fpage>422</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1145/362686.362692</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Marth</surname>
<given-names>GT</given-names>
</name>
</person-group>
<article-title>ART: a next-generation sequencing read simulator</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>4</issue>
<fpage>593</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr708</pub-id>
<pub-id pub-id-type="pmid">22199392</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<mixed-citation publication-type="other">Molnar M, Ilie L. Correcting illumina data. Brief Bioinform. 2014.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/doi:10.1093/bib/bbu029">https://doi.org/doi:10.1093/bib/bbu029</ext-link>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000300 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000300 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:6311904
   |texte=   Mining statistically-solid k-mers for accurate NGS error correction
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:30598110" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021