MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Mining statistically-solid k-mers for accurate NGS error correction

Internal identifier: 000300 (Pmc/Curation); previous: 000299; next: 000301

Authors: Liang Zhao [People's Republic of China]; Jin Xie [People's Republic of China]; Lin Bai [People's Republic of China]; Wen Chen [People's Republic of China]; Mingju Wang [People's Republic of China]; Zhonglei Zhang [People's Republic of China]; Yiqi Wang [People's Republic of China]; Zhe Zhao [People's Republic of China]; Jinyan Li [Australia]

Source :

RBID : PMC:6311904

Abstract

Background

NGS data contains many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, whereas a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
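
The split into solid and weak k-mers at a frequency cutoff can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `count_kmers` and `split_by_cutoff`, the toy reads, the values of `k` and `f0`, and the choice of `>=` as the comparison are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, f0):
    """Label k-mers with frequency >= f0 as solid and the rest as weak."""
    solid = {kmer for kmer, freq in counts.items() if freq >= f0}
    weak = set(counts) - solid
    return solid, weak


# Toy data (hypothetical): three identical reads plus one read
# carrying a single substitution at its fourth base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGAACGT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f0=3)
```

In this toy input the four 4-mers that span the substituted base each occur once and land in `weak`, while the 4-mers of the unmodified read land in `solid`.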

Results

We propose using a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model those of correct k-mers, and we combine the two to determine f0. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer’s frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.
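
One plausible way to combine the two models is to take f0 as the lowest frequency at which the density of correct k-mers overtakes the density of erroneous k-mers. The abstract does not say how the models are fitted or combined, so the crossover rule, the function names, and every parameter value below are assumptions for illustration only.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution at x > 0."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))


def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def crossover_cutoff(err_weight, err_shape, err_scale, components, max_freq):
    """Smallest integer frequency at which the weighted Gaussian mixture
    (correct k-mers) exceeds the weighted Gamma density (erroneous k-mers).

    components is a list of (weight, mean, sd) tuples.
    Returns None if no crossover is found up to max_freq.
    """
    for f in range(1, max_freq + 1):
        err = err_weight * gamma_pdf(f, err_shape, err_scale)
        ok = sum(w * normal_pdf(f, m, s) for w, m, s in components)
        if ok > err:
            return f
    return None


# Hypothetical, hand-picked parameters (not fitted to any data set).
f0 = crossover_cutoff(err_weight=0.7, err_shape=1.0, err_scale=1.5,
                      components=[(0.3, 30.0, 6.0)], max_freq=100)
```

With these made-up parameters the crossover falls at a frequency of 12. A real pipeline would first estimate the weights and distribution parameters from the observed k-mer frequency histogram.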

Conclusion

The z-score is adequate for distinguishing solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
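
The z-score selection and the Bloom filter construction can be sketched together as follows. The abstract does not state which population the mean and standard deviation are taken over; purely for illustration, this sketch assumes the group formed by a k-mer and its observed Hamming-distance-1 neighbours. The thresholds `z_drop` and `z_add`, the `BloomFilter` class, and all counts are hypothetical.

```python
import hashlib
import statistics


def neighbours(kmer):
    """Yield every k-mer at Hamming distance 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def z_score(kmer, counts):
    """z-score of kmer's count within {kmer} plus its observed neighbours
    (an assumed reference group, not taken from the source)."""
    group = [counts[kmer]] + [counts[n] for n in neighbours(kmer) if n in counts]
    sd = statistics.pstdev(group)
    if sd == 0:
        return 0.0
    return (counts[kmer] - statistics.mean(group)) / sd


def statistically_solid(counts, f0, z_drop, z_add):
    """Keep solid k-mers unless z <= z_drop; add weak k-mers with z >= z_add."""
    selected = set()
    for kmer, freq in counts.items():
        z = z_score(kmer, counts)
        if (freq >= f0 and z > z_drop) or (freq < f0 and z >= z_add):
            selected.add(kmer)
    return selected


class BloomFilter:
    """Bit-array Bloom filter; hash positions are derived from SHA-256."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical counts: ACGT is below the cutoff but dominates its neighbours;
# GGGG is above the cutoff but dwarfed by its neighbour GGGC.
counts = {"TTTT": 40, "TTTA": 2, "ACGT": 4, "ACGA": 1, "ACGC": 1,
          "GGGG": 6, "GGGC": 60}
selected = statistically_solid(counts, f0=5, z_drop=-0.5, z_add=1.0)

bloom = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in selected:
    bloom.add(kmer)
```

Here `ACGT` is rescued despite its low count and `GGGG` is dropped despite passing the cutoff, so `selected` contains `TTTT`, `ACGT`, and `GGGC`. A Bloom filter never reports a false negative, so every selected k-mer tests as present, while an unselected k-mer may occasionally test as present too.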


Url:
DOI: 10.1186/s12864-018-5272-y
PubMed: 30598110
PubMed Central: 6311904

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Mining statistically-solid
<italic>k</italic>
-mers for accurate NGS error correction</title>
<author>
<name sortKey="Zhao, Liang" sort="Zhao, Liang" uniqKey="Zhao L" first="Liang" last="Zhao">Liang Zhao</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nanning</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Xie, Jin" sort="Xie, Jin" uniqKey="Xie J" first="Jin" last="Xie">Jin Xie</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Bai, Lin" sort="Bai, Lin" uniqKey="Bai L" first="Lin" last="Bai">Lin Bai</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nanning</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Chen, Wen" sort="Chen, Wen" uniqKey="Chen W" first="Wen" last="Chen">Wen Chen</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wang, Mingju" sort="Wang, Mingju" uniqKey="Wang M" first="Mingju" last="Wang">Mingju Wang</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Zhonglei" sort="Zhang, Zhonglei" uniqKey="Zhang Z" first="Zhonglei" last="Zhang">Zhonglei Zhang</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wang, Yiqi" sort="Wang, Yiqi" uniqKey="Wang Y" first="Yiqi" last="Wang">Yiqi Wang</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhao, Zhe" sort="Zhao, Zhe" uniqKey="Zhao Z" first="Zhe" last="Zhao">Zhe Zhao</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nanning</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Li, Jinyan" sort="Li, Jinyan" uniqKey="Li J" first="Jinyan" last="Li">Jinyan Li</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 7611</institution-id>
<institution-id institution-id-type="GRID">grid.117476.2</institution-id>
<institution>Advanced Analytics Institute, Faculty of Engineering &amp; IT, University of Technology Sydney,</institution>
</institution-wrap>
NSW 2007, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
<wicri:regionArea>NSW 2007</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30598110</idno>
<idno type="pmc">6311904</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311904</idno>
<idno type="RBID">PMC:6311904</idno>
<idno type="doi">10.1186/s12864-018-5272-y</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000300</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000300</idno>
<idno type="wicri:Area/Pmc/Curation">000300</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000300</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Mining statistically-solid
<italic>k</italic>
-mers for accurate NGS error correction</title>
<author>
<name sortKey="Zhao, Liang" sort="Zhao, Liang" uniqKey="Zhao L" first="Liang" last="Zhao">Liang Zhao</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nanning</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Xie, Jin" sort="Xie, Jin" uniqKey="Xie J" first="Jin" last="Xie">Jin Xie</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Bai, Lin" sort="Bai, Lin" uniqKey="Bai L" first="Lin" last="Bai">Lin Bai</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nanning</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Chen, Wen" sort="Chen, Wen" uniqKey="Chen W" first="Wen" last="Chen">Wen Chen</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wang, Mingju" sort="Wang, Mingju" uniqKey="Wang M" first="Mingju" last="Wang">Mingju Wang</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Zhonglei" sort="Zhang, Zhonglei" uniqKey="Zhang Z" first="Zhonglei" last="Zhang">Zhonglei Zhang</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wang, Yiqi" sort="Wang, Yiqi" uniqKey="Wang Y" first="Yiqi" last="Wang">Yiqi Wang</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhao, Zhe" sort="Zhao, Zhe" uniqKey="Zhao Z" first="Zhe" last="Zhao">Zhe Zhao</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nanning</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Li, Jinyan" sort="Li, Jinyan" uniqKey="Li J" first="Jinyan" last="Li">Jinyan Li</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 7611</institution-id>
<institution-id institution-id-type="GRID">grid.117476.2</institution-id>
<institution>Advanced Analytics Institute, Faculty of Engineering &amp; IT, University of Technology Sydney,</institution>
</institution-wrap>
NSW 2007, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
<wicri:regionArea>NSW 2007</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>NGS data contains many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is a
<italic>k</italic>
-mer that occurs frequently in NGS reads; all other
<italic>k</italic>
-mers are called weak
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is unlikely to contain errors, whereas a weak
<italic>k</italic>
-mer most likely does. An intensively investigated problem is to find a good frequency cutoff
<italic>f</italic>
<sub>0</sub>
to balance the numbers of solid and weak
<italic>k</italic>
-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove a small subset of solid
<italic>k</italic>
-mers that are likely to contain errors, and (ii) add a small subset of weak
<italic>k</italic>
-mers that are likely to contain no errors to the remaining set of solid
<italic>k</italic>
-mers. Identifying these two subsets of
<italic>k</italic>
-mers can improve correction performance.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose using a Gamma distribution to model the frequencies of erroneous
<italic>k</italic>
-mers and a mixture of Gaussian distributions to model those of correct
<italic>k</italic>
-mers, and we combine the two to determine
<italic>f</italic>
<sub>0</sub>
. To identify the two special subsets of
<italic>k</italic>
-mers, we use the
<italic>z</italic>
-score of each
<italic>k</italic>
-mer, which measures how many standard deviations the
<italic>k</italic>
-mer’s frequency lies from the mean. These statistically-solid
<italic>k</italic>
-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The
<italic>z</italic>
-score is adequate for distinguishing solid
<italic>k</italic>
-mers from weak
<italic>k</italic>
-mers and is particularly useful for pinpointing solid
<italic>k</italic>
-mers that have very low frequency. Applying the
<italic>z</italic>
-score to
<italic>k</italic>
-mers can markedly improve error correction accuracy.</p>
</sec>
</div>
</front>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2164</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30598110</article-id>
<article-id pub-id-type="pmc">6311904</article-id>
<article-id pub-id-type="publisher-id">5272</article-id>
<article-id pub-id-type="doi">10.1186/s12864-018-5272-y</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Mining statistically-solid
<italic>k</italic>
-mers for accurate NGS error correction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhao</surname>
<given-names>Liang</given-names>
</name>
<address>
<email>s080011@e.ntu.edu.sg</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Xie</surname>
<given-names>Jin</given-names>
</name>
<address>
<email>252589791@qq.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bai</surname>
<given-names>Lin</given-names>
</name>
<address>
<email>bailin@gxu.edu.cn</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chen</surname>
<given-names>Wen</given-names>
</name>
<address>
<email>taiheren007@163.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Mingju</given-names>
</name>
<address>
<email>wangmingju@taihehospital.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Zhonglei</given-names>
</name>
<address>
<email>zlzhang@taihehospital.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Yiqi</given-names>
</name>
<address>
<email>806643897@qq.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhao</surname>
<given-names>Zhe</given-names>
</name>
<address>
<email>zhaoyinzhao@126.com</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Li</surname>
<given-names>Jinyan</given-names>
</name>
<address>
<email>jinyan.li@uts.edu.au</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2254 5798</institution-id>
<institution-id institution-id-type="GRID">grid.256609.e</institution-id>
<institution>School of Computing and Electronic Information, Guangxi University,</institution>
</institution-wrap>
Nanning, China</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 7611</institution-id>
<institution-id institution-id-type="GRID">grid.117476.2</institution-id>
<institution>Advanced Analytics Institute, Faculty of Engineering &amp; IT, University of Technology Sydney,</institution>
</institution-wrap>
NSW 2007, Australia</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>31</day>
<month>12</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>31</day>
<month>12</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>19</volume>
<issue>Suppl 10</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. YZ was not involved in the review of their own authored paper. The Supplement Editors declare that they have no other competing interests.</issue-sponsor>
<elocation-id>912</elocation-id>
<permissions>
<copyright-statement>© The Author(s) 2018</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>NGS data contains many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is a
<italic>k</italic>
-mer frequently occurring in NGS reads. The other
<italic>k</italic>
-mers are called weak
<italic>k</italic>
-mers. A solid
<italic>k</italic>
-mer is unlikely to contain errors, while a weak
<italic>k</italic>
-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff
<italic>f</italic>
<sub>0</sub>
to balance the numbers of solid and weak
<italic>k</italic>
-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid
<italic>k</italic>
-mers that are likely to contain errors, and (ii) add a small subset of weak
<italic>k</italic>
-mers that are likely to contain no errors into the remaining set of solid
<italic>k</italic>
-mers. Identification of these two subsets of
<italic>k</italic>
-mers can improve the correction performance.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose to use a Gamma distribution to model the frequencies of erroneous
<italic>k</italic>
-mers and a mixture of Gaussian distributions to model correct
<italic>k</italic>
-mers, and combine them to determine
<italic>f</italic>
<sub>0</sub>
. To identify the two special subsets of
<italic>k</italic>
-mers, we use the
<italic>z</italic>
-score of
<italic>k</italic>
-mers, which measures the number of standard deviations a
<italic>k</italic>
-mer’s frequency is from the mean. Then these statistically-solid
<italic>k</italic>
-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to state-of-the-art methods when tested on both real and synthetic NGS data sets.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The
<italic>z</italic>
-score is adequate to distinguish solid
<italic>k</italic>
-mers from weak
<italic>k</italic>
-mers, and is particularly useful for pinpointing solid
<italic>k</italic>
-mers that have very low frequency. Applying the
<italic>z</italic>
-score to
<italic>k</italic>
-mers can markedly improve error-correction accuracy.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Error correction</kwd>
<kwd>Next-generation sequencing</kwd>
<kwd>
<italic>z</italic>
-score</kwd>
</kwd-group>
<conference xlink:href="http://datamining-web.it.uts.edu.au/giw2018/">
<conf-name>29th International Conference on Genome Informatics</conf-name>
<conf-loc>Yunnan, China</conf-loc>
<conf-date>3-5 December 2018</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
</pmc>
</record>
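
The abstract above describes scoring each k-mer by the number of standard deviations its frequency lies from the mean. The following is only a minimal sketch of that idea, not the authors' implementation. The function names `count_kmers` and `z_scores` are hypothetical, and the choice of reference population (here, the frequencies of all observed k-mers) is an assumption, since the abstract does not say which mean and standard deviation are used.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def z_scores(counts):
    """Return (frequency - mean) / std for each k-mer.

    ASSUMPTION: mean and std are taken over all observed k-mer
    frequencies; the abstract does not specify the population.
    """
    freqs = list(counts.values())
    mu = mean(freqs)
    sigma = pstdev(freqs)  # population standard deviation
    if sigma == 0:
        # All k-mers are equally frequent, so no k-mer deviates.
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (f - mu) / sigma for kmer, f in counts.items()}


# Toy usage: three short reads, k = 3.
toy_reads = ["ACGT", "ACGT", "ACGA"]
toy_counts = count_kmers(toy_reads, 3)
toy_z = z_scores(toy_counts)
```

With a single global population the z-score is just a rescaled frequency, so this sketch only illustrates the arithmetic. Selecting the two special subsets described in the abstract would additionally need the cutoff f0 and the authors' reference population, neither of which is given in this record.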

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000300 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000300 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:6311904
   |texte=   Mining statistically-solid k-mers for accurate NGS error correction
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:30598110" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021