<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A random forest classifier for detecting rare variants in NGS data from viral populations</title>
<author><name sortKey="Malhotra, Raunaq" sort="Malhotra, Raunaq" uniqKey="Malhotra R" first="Raunaq" last="Malhotra">Raunaq Malhotra</name>
<affiliation><nlm:aff id="af0005">The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Jha, Manjari" sort="Jha, Manjari" uniqKey="Jha M" first="Manjari" last="Jha">Manjari Jha</name>
<affiliation><nlm:aff id="af0005">The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Poss, Mary" sort="Poss, Mary" uniqKey="Poss M" first="Mary" last="Poss">Mary Poss</name>
<affiliation><nlm:aff id="af0010">Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Acharya, Raj" sort="Acharya, Raj" uniqKey="Acharya R" first="Raj" last="Acharya">Raj Acharya</name>
<affiliation><nlm:aff id="af0015">School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">28819548</idno>
<idno type="pmc">5548337</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5548337</idno>
<idno type="RBID">PMC:5548337</idno>
<idno type="doi">10.1016/j.csbj.2017.07.001</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000B96</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B96</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">A random forest classifier for detecting rare variants in NGS data from viral populations</title>
<author><name sortKey="Malhotra, Raunaq" sort="Malhotra, Raunaq" uniqKey="Malhotra R" first="Raunaq" last="Malhotra">Raunaq Malhotra</name>
<affiliation><nlm:aff id="af0005">The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Jha, Manjari" sort="Jha, Manjari" uniqKey="Jha M" first="Manjari" last="Jha">Manjari Jha</name>
<affiliation><nlm:aff id="af0005">The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Poss, Mary" sort="Poss, Mary" uniqKey="Poss M" first="Mary" last="Poss">Mary Poss</name>
<affiliation><nlm:aff id="af0010">Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Acharya, Raj" sort="Acharya, Raj" uniqKey="Acharya R" first="Raj" last="Acharya">Raj Acharya</name>
<affiliation><nlm:aff id="af0015">School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Computational and Structural Biotechnology Journal</title>
<idno type="eISSN">2001-0370</idno>
<imprint><date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of <italic>k</italic>
-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies <italic>k</italic>
-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of <italic>k</italic>
-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that <italic>k</italic>
-mers of a given size constitute a frame.</p>
<p>We evaluate MultiRes on simulated and real viral population datasets, which contain many low-frequency variants, and compare it to the error detection methods used in error correction tools known in the literature. MultiRes has 4 to 500 times fewer false positive <italic>k</italic>
-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for their <italic>de-novo</italic>
 assembly. It has high recall of the true <italic>k</italic>
-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at <ext-link ext-link-type="uri" xlink:href="https://github.com/raunaq-m/MultiRes" id="ir0005">https://github.com/raunaq-m/MultiRes</ext-link>
.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Nguyen, D X" uniqKey="Nguyen D">D.X. Nguyen</name>
</author>
<author><name sortKey="Massague, J" uniqKey="Massague J">J. Massagué</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mcelroy, K" uniqKey="Mcelroy K">K. McElroy</name>
</author>
<author><name sortKey="Thomas, T" uniqKey="Thomas T">T. Thomas</name>
</author>
<author><name sortKey="Luciani, F" uniqKey="Luciani F">F. Luciani</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Beerenwinkel, N" uniqKey="Beerenwinkel N">N. Beerenwinkel</name>
</author>
<author><name sortKey="Gunthard, H F" uniqKey="Gunthard H">H.F. Gunthard</name>
</author>
<author><name sortKey="Roth, V" uniqKey="Roth V">V. Roth</name>
</author>
<author><name sortKey="Metzner, K J" uniqKey="Metzner K">K.J. Metzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schirmer, M" uniqKey="Schirmer M">M. Schirmer</name>
</author>
<author><name sortKey="Ijaz, U Z" uniqKey="Ijaz U">U.Z. Ijaz</name>
</author>
<author><name sortKey="D More, R" uniqKey="D More R">R. D’Amore</name>
</author>
<author><name sortKey="Hall, N" uniqKey="Hall N">N. Hall</name>
</author>
<author><name sortKey="Sloan, W T" uniqKey="Sloan W">W.T. Sloan</name>
</author>
<author><name sortKey="Quince, C" uniqKey="Quince C">C. Quince</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Meacham, F" uniqKey="Meacham F">F. Meacham</name>
</author>
<author><name sortKey="Boffelli, D" uniqKey="Boffelli D">D. Boffelli</name>
</author>
<author><name sortKey="Dhahbi, J" uniqKey="Dhahbi J">J. Dhahbi</name>
</author>
<author><name sortKey="Martin, D I" uniqKey="Martin D">D.I. Martin</name>
</author>
<author><name sortKey="Singer, M" uniqKey="Singer M">M. Singer</name>
</author>
<author><name sortKey="Pachter, L" uniqKey="Pachter L">L. Pachter</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Topfer, A" uniqKey="Topfer A">A. Töpfer</name>
</author>
<author><name sortKey="Zagordi, O" uniqKey="Zagordi O">O. Zagordi</name>
</author>
<author><name sortKey="Prabhakaran, S" uniqKey="Prabhakaran S">S. Prabhakaran</name>
</author>
<author><name sortKey="Roth, V" uniqKey="Roth V">V. Roth</name>
</author>
<author><name sortKey="Halperin, E" uniqKey="Halperin E">E. Halperin</name>
</author>
<author><name sortKey="Beerenwinkel, N" uniqKey="Beerenwinkel N">N. Beerenwinkel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zagordi, O" uniqKey="Zagordi O">O. Zagordi</name>
</author>
<author><name sortKey="Bhattacharya, A" uniqKey="Bhattacharya A">A. Bhattacharya</name>
</author>
<author><name sortKey="Eriksson, N" uniqKey="Eriksson N">N. Eriksson</name>
</author>
<author><name sortKey="Beerenwinkel, N" uniqKey="Beerenwinkel N">N. Beerenwinkel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mangul, S" uniqKey="Mangul S">S. Mangul</name>
</author>
<author><name sortKey="Wu, N C" uniqKey="Wu N">N.C. Wu</name>
</author>
<author><name sortKey="Mancuso, N" uniqKey="Mancuso N">N. Mancuso</name>
</author>
<author><name sortKey="Zelikovsky, A" uniqKey="Zelikovsky A">A. Zelikovsky</name>
</author>
<author><name sortKey="Sun, R" uniqKey="Sun R">R. Sun</name>
</author>
<author><name sortKey="Eskin, E" uniqKey="Eskin E">E. Eskin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yang, X" uniqKey="Yang X">X. Yang</name>
</author>
<author><name sortKey="Charlebois, P" uniqKey="Charlebois P">P. Charlebois</name>
</author>
<author><name sortKey="Macalalad, A" uniqKey="Macalalad A">A. Macalalad</name>
</author>
<author><name sortKey="Henn, M" uniqKey="Henn M">M. Henn</name>
</author>
<author><name sortKey="Zody, M" uniqKey="Zody M">M. Zody</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wilm, A" uniqKey="Wilm A">A. Wilm</name>
</author>
<author><name sortKey="Aw, P P K" uniqKey="Aw P">P.P.K. Aw</name>
</author>
<author><name sortKey="Bertrand, D" uniqKey="Bertrand D">D. Bertrand</name>
</author>
<author><name sortKey="Yeo, G H T" uniqKey="Yeo G">G.H.T. Yeo</name>
</author>
<author><name sortKey="Ong, S H" uniqKey="Ong S">S.H. Ong</name>
</author>
<author><name sortKey="Wong, C H" uniqKey="Wong C">C.H. Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Topfer, A" uniqKey="Topfer A">A. Töpfer</name>
</author>
<author><name sortKey="Marschall, T" uniqKey="Marschall T">T. Marschall</name>
</author>
<author><name sortKey="Bull, R A" uniqKey="Bull R">R.A. Bull</name>
</author>
<author><name sortKey="Luciani, F" uniqKey="Luciani F">F. Luciani</name>
</author>
<author><name sortKey="Schonhuth, A" uniqKey="Schonhuth A">A. Schönhuth</name>
</author>
<author><name sortKey="Beerenwinkel, N" uniqKey="Beerenwinkel N">N. Beerenwinkel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kelley, D R" uniqKey="Kelley D">D.R. Kelley</name>
</author>
<author><name sortKey="Schatz, M C" uniqKey="Schatz M">M.C. Schatz</name>
</author>
<author><name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Heo, Y" uniqKey="Heo Y">Y. Heo</name>
</author>
<author><name sortKey="Wu, X L" uniqKey="Wu X">X.-L. Wu</name>
</author>
<author><name sortKey="Chen, D" uniqKey="Chen D">D. Chen</name>
</author>
<author><name sortKey="Ma, J" uniqKey="Ma J">J. Ma</name>
</author>
<author><name sortKey="Hwu, W M" uniqKey="Hwu W">W.-M. Hwu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Liu, Y" uniqKey="Liu Y">Y. Liu</name>
</author>
<author><name sortKey="Schroder, J" uniqKey="Schroder J">J. Schröder</name>
</author>
<author><name sortKey="Schmidt, B" uniqKey="Schmidt B">B. Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Medvedev, P" uniqKey="Medvedev P">P. Medvedev</name>
</author>
<author><name sortKey="Scott, E" uniqKey="Scott E">E. Scott</name>
</author>
<author><name sortKey="Kakaradov, B" uniqKey="Kakaradov B">B. Kakaradov</name>
</author>
<author><name sortKey="Pevzner, P A" uniqKey="Pevzner P">P.A. Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Skums, P" uniqKey="Skums P">P. Skums</name>
</author>
<author><name sortKey="Dimitrova, Z" uniqKey="Dimitrova Z">Z. Dimitrova</name>
</author>
<author><name sortKey="Campo, D S" uniqKey="Campo D">D.S. Campo</name>
</author>
<author><name sortKey="Vaughan, G" uniqKey="Vaughan G">G. Vaughan</name>
</author>
<author><name sortKey="Rossi, L" uniqKey="Rossi L">L. Rossi</name>
</author>
<author><name sortKey="Forbi, J C" uniqKey="Forbi J">J.C. Forbi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rizk, G" uniqKey="Rizk G">G. Rizk</name>
</author>
<author><name sortKey="Lavenier, D" uniqKey="Lavenier D">D. Lavenier</name>
</author>
<author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S. Deorowicz</name>
</author>
<author><name sortKey="Kokot, M" uniqKey="Kokot M">M. Kokot</name>
</author>
<author><name sortKey="Grabowski, S" uniqKey="Grabowski S">S. Grabowski</name>
</author>
<author><name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A. Debudaj-Grabysz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
<author><name sortKey="Medvedev, P" uniqKey="Medvedev P">P. Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Feng, S" uniqKey="Feng S">S. Feng</name>
</author>
<author><name sortKey="Lo, C C" uniqKey="Lo C">C.-C. Lo</name>
</author>
<author><name sortKey="Li, P E" uniqKey="Li P">P.-E. Li</name>
</author>
<author><name sortKey="Chain, P S" uniqKey="Chain P">P.S. Chain</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ding, J" uniqKey="Ding J">J. Ding</name>
</author>
<author><name sortKey="Bashashati, A" uniqKey="Bashashati A">A. Bashashati</name>
</author>
<author><name sortKey="Roth, A" uniqKey="Roth A">A. Roth</name>
</author>
<author><name sortKey="Oloumi, A" uniqKey="Oloumi A">A. Oloumi</name>
</author>
<author><name sortKey="Tse, K" uniqKey="Tse K">K. Tse</name>
</author>
<author><name sortKey="Zeng, T" uniqKey="Zeng T">T. Zeng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Poplin, R" uniqKey="Poplin R">R. Poplin</name>
</author>
<author><name sortKey="Newburger, D" uniqKey="Newburger D">D. Newburger</name>
</author>
<author><name sortKey="Dijamco, J" uniqKey="Dijamco J">J. Dijamco</name>
</author>
<author><name sortKey="Nguyen, N" uniqKey="Nguyen N">N. Nguyen</name>
</author>
<author><name sortKey="Loy, D" uniqKey="Loy D">D. Loy</name>
</author>
<author><name sortKey="Gross, S S" uniqKey="Gross S">S.S. Gross</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ferreira, P" uniqKey="Ferreira P">P. Ferreira</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Duffin, R J" uniqKey="Duffin R">R.J. Duffin</name>
</author>
<author><name sortKey="Schaeffer, A C" uniqKey="Schaeffer A">A.C. Schaeffer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Daubechies, I" uniqKey="Daubechies I">I. Daubechies</name>
</author>
<author><name sortKey="Grossmann, A" uniqKey="Grossmann A">A. Grossmann</name>
</author>
<author><name sortKey="Meyer, Y" uniqKey="Meyer Y">Y. Meyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Daubechies, I" uniqKey="Daubechies I">I. Daubechies</name>
</author>
<author><name sortKey="Han, B" uniqKey="Han B">B. Han</name>
</author>
<author><name sortKey="Ron, A" uniqKey="Ron A">A. Ron</name>
</author>
<author><name sortKey="Shen, Z" uniqKey="Shen Z">Z. Shen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Unser, M" uniqKey="Unser M">M. Unser</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ron, A" uniqKey="Ron A">A. Ron</name>
</author>
<author><name sortKey="Shen, Z" uniqKey="Shen Z">Z. Shen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Nikolenko, S I" uniqKey="Nikolenko S">S.I. Nikolenko</name>
</author>
<author><name sortKey="Korobeynikov, A I" uniqKey="Korobeynikov A">A.I. Korobeynikov</name>
</author>
<author><name sortKey="Alekseyev, M A" uniqKey="Alekseyev M">M.A. Alekseyev</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Le, H S" uniqKey="Le H">H.-S. Le</name>
</author>
<author><name sortKey="Schulz, M H" uniqKey="Schulz M">M.H. Schulz</name>
</author>
<author><name sortKey="Mccauley, B M" uniqKey="Mccauley B">B.M. McCauley</name>
</author>
<author><name sortKey="Hinman, V F" uniqKey="Hinman V">V.F. Hinman</name>
</author>
<author><name sortKey="Bar Joseph, Z" uniqKey="Bar Joseph Z">Z. Bar-Joseph</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kaiser, G" uniqKey="Kaiser G">G. Kaiser</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hussein, N" uniqKey="Hussein N">N. Hussein</name>
</author>
<author><name sortKey="Zekri, A Rn" uniqKey="Zekri A">A-RN Zekri</name>
</author>
<author><name sortKey="Abouelhoda, M" uniqKey="Abouelhoda M">M. Abouelhoda</name>
</author>
<author><name sortKey="El Din, H M A" uniqKey="El Din H">H.M.A. El-din</name>
</author>
<author><name sortKey="Ghamry, A A" uniqKey="Ghamry A">A.A. Ghamry</name>
</author>
<author><name sortKey="Amer, M A" uniqKey="Amer M">M.A. Amer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Angly, F E" uniqKey="Angly F">F.E. Angly</name>
</author>
<author><name sortKey="Willner, D" uniqKey="Willner D">D. Willner</name>
</author>
<author><name sortKey="Rohwer, F" uniqKey="Rohwer F">F. Rohwer</name>
</author>
<author><name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P. Hugenholtz</name>
</author>
<author><name sortKey="Tyson, G W" uniqKey="Tyson G">G.W. Tyson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Giallonardo, F D" uniqKey="Giallonardo F">F.D. Giallonardo</name>
</author>
<author><name sortKey="Topfer, A" uniqKey="Topfer A">A. Töpfer</name>
</author>
<author><name sortKey="Rey, M" uniqKey="Rey M">M. Rey</name>
</author>
<author><name sortKey="Prabhakaran, S" uniqKey="Prabhakaran S">S. Prabhakaran</name>
</author>
<author><name sortKey="Duport, Y" uniqKey="Duport Y">Y. Duport</name>
</author>
<author><name sortKey="Leemann, C" uniqKey="Leemann C">C. Leemann</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yang, X" uniqKey="Yang X">X. Yang</name>
</author>
<author><name sortKey="Charlebois, P" uniqKey="Charlebois P">P. Charlebois</name>
</author>
<author><name sortKey="Gnerre, S" uniqKey="Gnerre S">S. Gnerre</name>
</author>
<author><name sortKey="Coole, M G" uniqKey="Coole M">M.G. Coole</name>
</author>
<author><name sortKey="Lennon, N J" uniqKey="Lennon N">N.J. Lennon</name>
</author>
<author><name sortKey="Levin, J Z" uniqKey="Levin J">J.Z. Levin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bankevich, A" uniqKey="Bankevich A">A. Bankevich</name>
</author>
<author><name sortKey="Nurk, S" uniqKey="Nurk S">S. Nurk</name>
</author>
<author><name sortKey="Antipov, D" uniqKey="Antipov D">D. Antipov</name>
</author>
<author><name sortKey="Gurevich, A A" uniqKey="Gurevich A">A.A. Gurevich</name>
</author>
<author><name sortKey="Dvorkin, M" uniqKey="Dvorkin M">M. Dvorkin</name>
</author>
<author><name sortKey="Kulikov, A S" uniqKey="Kulikov A">A.S. Kulikov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z. Iqbal</name>
</author>
<author><name sortKey="Caccamo, M" uniqKey="Caccamo M">M. Caccamo</name>
</author>
<author><name sortKey="Turner, I" uniqKey="Turner I">I. Turner</name>
</author>
<author><name sortKey="Flicek, P" uniqKey="Flicek P">P. Flicek</name>
</author>
<author><name sortKey="Mcvean, G" uniqKey="Mcvean G">G. McVean</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Comput Struct Biotechnol J</journal-id>
<journal-id journal-id-type="iso-abbrev">Comput Struct Biotechnol J</journal-id>
<journal-title-group><journal-title>Computational and Structural Biotechnology Journal</journal-title>
</journal-title-group>
<issn pub-type="epub">2001-0370</issn>
<publisher><publisher-name>Research Network of Computational and Structural Biotechnology</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">28819548</article-id>
<article-id pub-id-type="pmc">5548337</article-id>
<article-id pub-id-type="publisher-id">S2001-0370(17)30039-9</article-id>
<article-id pub-id-type="doi">10.1016/j.csbj.2017.07.001</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>A random forest classifier for detecting rare variants in NGS data from viral populations</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Malhotra</surname>
<given-names>Raunaq</given-names>
</name>
<email>raunaq.123@gmail.com</email>
<xref rid="af0005" ref-type="aff">a</xref>
<xref rid="cr0005" ref-type="corresp">*</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Jha</surname>
<given-names>Manjari</given-names>
</name>
<xref rid="af0005" ref-type="aff">a</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Poss</surname>
<given-names>Mary</given-names>
</name>
<xref rid="af0010" ref-type="aff">b</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Acharya</surname>
<given-names>Raj</given-names>
</name>
<xref rid="af0015" ref-type="aff">c</xref>
</contrib>
</contrib-group>
<aff id="af0005"><label>a</label>
The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA</aff>
<aff id="af0010"><label>b</label>
Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA</aff>
<aff id="af0015"><label>c</label>
School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA</aff>
<author-notes><corresp id="cr0005"><label>*</label>
Corresponding author. <email>raunaq.123@gmail.com</email>
</corresp>
</author-notes>
<pub-date pub-type="pmc-release"><day>19</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on .</pmc-comment>
      <pub-date pub-type="collection"><year>2017</year>
</pub-date>
<pub-date pub-type="epub"><day>19</day>
<month>7</month>
<year>2017</year>
</pub-date>
<volume>15</volume>
<fpage>388</fpage>
<lpage>395</lpage>
<history><date date-type="received"><day>14</day>
<month>3</month>
<year>2017</year>
</date>
<date date-type="rev-recd"><day>1</day>
<month>7</month>
<year>2017</year>
</date>
<date date-type="accepted"><day>3</day>
<month>7</month>
<year>2017</year>
</date>
</history>
<permissions><copyright-statement>© 2017 The Authors</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="CC BY" xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).</license-p>
</license>
</permissions>
<abstract id="ab0005"><p>We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of <italic>k</italic>
-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies <italic>k</italic>
-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of <italic>k</italic>
-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that <italic>k</italic>
-mers of a given size constitute a frame.</p>
<p>We evaluate MultiRes on simulated and real viral population datasets, which contain many low-frequency variants, and compare it to the error detection methods used in error correction tools known in the literature. MultiRes has 4 to 500 times fewer false positive <italic>k</italic>
-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for their <italic>de-novo</italic>
 assembly. It has high recall of the true <italic>k</italic>
-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at <ext-link ext-link-type="uri" xlink:href="https://github.com/raunaq-m/MultiRes" id="ir0005">https://github.com/raunaq-m/MultiRes</ext-link>
.</p>
</abstract>
<kwd-group id="ks0005"><title>Keywords</title>
<kwd>Sequencing error detection</kwd>
<kwd>Reference free methods</kwd>
<kwd>Next-generation sequencing</kwd>
<kwd>Viral populations</kwd>
<kwd>Multi-resolution frames</kwd>
<kwd>Random forest classifier</kwd>
</kwd-group>
</article-meta>
</front>
<body><sec id="s0005"><label>1</label>
<title>Introduction</title>
<p>The sequence diversity present in a population of closely related genomes is important for their survival under environmental pressures. A viral population within a host is an example of such a population of closely related genomes, where some viral strains survive even when large segments of their genome are deleted. Sequence variants that occur at low frequency in the population, also known as rare variants, are known to affect the population's survival, and understanding their prevalence is important for drug design and therapeutics <xref rid="bb0005" ref-type="bibr">[1]</xref>
.</p>
<p>However, detection of rare variants from Next Generation Sequencing (NGS) data remains a challenge, because rare variants are confounded with sequencing errors that occur at a similar prevalence <xref rid="bb0010" ref-type="bibr">[2]</xref>
, <xref rid="bb0015" ref-type="bibr">[3]</xref>
. NGS technologies are error prone, and even though their error profiles are well studied <xref rid="bb0020" ref-type="bibr">[4]</xref>
, <xref rid="bb0025" ref-type="bibr">[5]</xref>
, removing sequencing errors is essential before downstream processing of NGS data such as assembly of haplotypes in viral populations <xref rid="bb0010" ref-type="bibr">[2]</xref>
, <xref rid="bb0030" ref-type="bibr">[6]</xref>
, <xref rid="bb0035" ref-type="bibr">[7]</xref>
, <xref rid="bb0040" ref-type="bibr">[8]</xref>
, <xref rid="bb0045" ref-type="bibr">[9]</xref>
 and variant calling for viral populations <xref rid="bb0035" ref-type="bibr">[7]</xref>
, <xref rid="bb0045" ref-type="bibr">[9]</xref>
, <xref rid="bb0050" ref-type="bibr">[10]</xref>
.</p>
<p>To remove sequencing errors from NGS data, the errors must first be distinguished from true biological sequences and then corrected to the true sequence. For NGS data obtained from a viral population, the reads are mapped to a reference genome to distinguish true variants from sequencing errors based on a probabilistic model <xref rid="bb0030" ref-type="bibr">[6]</xref>
, <xref rid="bb0035" ref-type="bibr">[7]</xref>
, <xref rid="bb0045" ref-type="bibr">[9]</xref>
, <xref rid="bb0050" ref-type="bibr">[10]</xref>
, <xref rid="bb0055" ref-type="bibr">[11]</xref>
, and then the sequencing errors are corrected to the sequence of the reference genome. However, because a viral population contains a large diversity of true sequences, accurate mapping of reads to any one reference may not be possible.</p>
<p>Alternatively, sampled reads are broken into small fixed length sub-strings called <italic>k</italic>
-mers and their counts are used for error detection (e.g. <xref rid="bb0060" ref-type="bibr">[12]</xref>
, <xref rid="bb0065" ref-type="bibr">[13]</xref>
, <xref rid="bb0070" ref-type="bibr">[14]</xref>
, <xref rid="bb0075" ref-type="bibr">[15]</xref>
, <xref rid="bb0080" ref-type="bibr">[16]</xref>
). The erroneous <italic>k</italic>
-mers are corrected by changing the minimum number of bases in the reads using the detected true <italic>k</italic>
-mers. These methods use a generative model for <italic>k</italic>
-mer counts to determine if an observed <italic>k</italic>
-mer is erroneous or a true <italic>k</italic>
-mer <xref rid="bb0060" ref-type="bibr">[12]</xref>
 based on a counts threshold <xref rid="bb0060" ref-type="bibr">[12]</xref>
, <xref rid="bb0065" ref-type="bibr">[13]</xref>
, <xref rid="bb0070" ref-type="bibr">[14]</xref>
, <xref rid="bb0085" ref-type="bibr">[17]</xref>
.</p>
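<p>The count-threshold idea described above can be illustrated with a minimal sketch. It is not the implementation of any cited tool: the function names, the toy reads and the threshold value are assumptions chosen only for illustration, and real tools derive the threshold from a model of the count distribution.</p>
<preformat>
```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, threshold):
    """Label k-mers seen at least `threshold` times as trusted, the rest as suspect."""
    trusted = {kmer for kmer, c in counts.items() if c >= threshold}
    suspect = set(counts) - trusted
    return trusted, suspect


# Hypothetical toy reads: the third read differs from the others at one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
trusted, suspect = split_by_threshold(counts, threshold=2)
```
</preformat>
<p>In this toy example the two 4-mers that span the differing base fall below the threshold. Whether they reflect a sequencing error or a rare variant cannot be decided from a single threshold alone, which is the limitation discussed in the following paragraphs.</p>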
<p>For <italic>k</italic>
-mer based error detection, the length of the <italic>k</italic>
-mer and the frequency threshold are important parameters. The size of a <italic>k</italic>
-mer can affect the performance of an error detection method, as it either decreases the evidence for a segment of the genome for a large <italic>k</italic>
, or combines evidence from multiple segments for a small <italic>k</italic>
. However, a single appropriate <italic>k</italic>
-mer size for error detection in viral populations is restrictive in nature, as a combination of different sized overlapping <italic>k</italic>
-mers, although redundant, can provide richer information.</p>
<p>The error detection part in most <italic>k</italic>
-mer based error correction tools <xref rid="bb0060" ref-type="bibr">[12]</xref>
, <xref rid="bb0065" ref-type="bibr">[13]</xref>
, <xref rid="bb0070" ref-type="bibr">[14]</xref>
, <xref rid="bb0075" ref-type="bibr">[15]</xref>
 has been designed assuming the reads are sampled from a single diploid genome and relies on a single count threshold. However, for viral populations a single threshold is not suitable, as viral strains occur at different relative frequencies. Currently, a number of time- and memory-efficient <italic>k</italic>
-mer counting algorithms are available <xref rid="bb0090" ref-type="bibr">[18]</xref>
, <xref rid="bb0095" ref-type="bibr">[19]</xref>
. Thus, choosing an appropriate size of <italic>k</italic>
-mer is possible by performing <italic>k</italic>
-mer counts at multiple sizes <xref rid="bb0100" ref-type="bibr">[20]</xref>
.</p>
<p>With the availability of large amounts of data from NGS technologies, data driven classifiers have also been used for detection of sequencing errors <xref rid="bb0105" ref-type="bibr">[21]</xref>
 and for variant calling <xref rid="bb0025" ref-type="bibr">[5]</xref>
, <xref rid="bb0060" ref-type="bibr">[12]</xref>
, <xref rid="bb0110" ref-type="bibr">[22]</xref>
, <xref rid="bb0115" ref-type="bibr">[23]</xref>
. However, identifying the features for classification of rare variants and sequencing errors is still a challenge, due to their similar characteristics in the NGS data.</p>
<p>We propose MultiRes, a reference-free <italic>k</italic>
-mer based error detection algorithm for a viral population. The algorithm uses <italic>k</italic>
-mer counts at different sizes to train a random forest classifier that classifies <italic>k</italic>
-mers as erroneous or rare variant <italic>k</italic>
-mers. We also propose a mechanism for selecting the optimal combination of <italic>k</italic>
-mer sizes. The rare variant <italic>k</italic>
-mers along with high frequency <italic>k</italic>
-mers can be used as is in downstream tools for variant calling and for de novo assembly of viral populations.</p>
<p>MultiRes uses a collection of sizes of <italic>k</italic>
-mers as features for detecting sequencing errors and rare variants. Our rationale to choose a combination of sizes for <italic>k</italic>
-mers is rooted in signal processing, where analysis of signals at different resolutions has been used for noise removal <xref rid="bb0120" ref-type="bibr">[24]</xref>
. Signals are projected onto a series of non-orthonormal basis functions, known as a <bold>frame</bold>
 <xref rid="bb0125" ref-type="bibr">[25]</xref>
, <xref rid="bb0130" ref-type="bibr">[26]</xref>
, <xref rid="bb0135" ref-type="bibr">[27]</xref>
; these projections are used for error removal and signal recovery <xref rid="bb0140" ref-type="bibr">[28]</xref>
, <xref rid="bb0145" ref-type="bibr">[29]</xref>
.</p>
<p>The classifier in MultiRes is trained on a simulated dataset that models NGS data generated from a replicating viral population. We evaluate the performance of MultiRes on simulated and real datasets, and compare it to the error detection algorithms of error correction tools BLESS <xref rid="bb0065" ref-type="bibr">[13]</xref>
, Quake <xref rid="bb0060" ref-type="bibr">[12]</xref>
, BFC <xref rid="bb0070" ref-type="bibr">[14]</xref>
, and Musket <xref rid="bb0075" ref-type="bibr">[15]</xref>
. We also compare our results to BayesHammer <xref rid="bb0150" ref-type="bibr">[30]</xref>
 and Seecer <xref rid="bb0155" ref-type="bibr">[31]</xref>
, which can handle variable sequencing coverage across the genome and polymorphisms in the RNA sequencing data respectively.</p>
<p>MultiRes has a high recall of the true <italic>k</italic>
-mers, comparable to other methods, and removes erroneous <italic>k</italic>
-mers 5 to 500 times better than other methods. Our results demonstrate that the classifier in MultiRes performs well for error detection on real sequencing data obtained from the same sequencing technology. Thus, the classifier in MultiRes is generalizable to viral population data from the same sequencing technology.</p>
<p>As MultiRes detects the rare variant <italic>k</italic>
-mers in NGS data, its output can be directly used for identifying rare variants in a viral population. Variant calling for viral populations typically relies on a single reference genome or on a consensus genome generated from the population being studied <xref rid="bb0035" ref-type="bibr">[7]</xref>
, <xref rid="bb0045" ref-type="bibr">[9]</xref>
, <xref rid="bb0050" ref-type="bibr">[10]</xref>
. We compare the rare variants detected by MultiRes to variant calling methods VPhaser-2 <xref rid="bb0045" ref-type="bibr">[9]</xref>
, LoFreq <xref rid="bb0050" ref-type="bibr">[10]</xref>
 and the outputs from haplotype reconstruction method ShoRAH <xref rid="bb0035" ref-type="bibr">[7]</xref>
. MultiRes has higher recall of true SNPs than VPhaser-2, LoFreq and ShoRAH on both simulated and real datasets, and misses the fewest true SNPs amongst all methods. This demonstrates its applicability for rare variant detection in viral populations.</p>
</sec>
<sec id="s0010"><label>2</label>
<title>Methods</title>
<p>MultiRes is a classifier for distinguishing sequencing errors from rare variants. The counts of the <italic>k</italic>
-mers along with the counts of their sub-sequences (sub <italic>k</italic>
-mers within a <italic>k</italic>
-mer) are used as features for training a classifier. The true <italic>k</italic>
-mers observed in the viral haplotypes with counts in the reads less than a threshold <italic>T</italic>
<sub><italic>High</italic>
</sub>
 are defined to be rare variant <italic>k</italic>
-mers, while the rest of <italic>k</italic>
-mers with counts less than <italic>T</italic>
<sub><italic>High</italic>
</sub>
 are erroneous <italic>k</italic>
-mers. The <italic>k</italic>
-mers that occur at counts greater than <italic>T</italic>
<sub><italic>High</italic>
</sub>
 are known as common <italic>k</italic>
-mers, as they occur frequently in the viral haplotypes. The common <italic>k</italic>
-mers are assumed to be error-free and the classifier is trained only for the erroneous and rare variant <italic>k</italic>
-mers.</p>
<p>The premise of our method is that reads sequenced from a population of genomes can be modeled as discrete spatial signals. Discrete spatial signals can be projected on to a <bold>frame</bold>
 <xref rid="bb0125" ref-type="bibr">[25]</xref>
, <xref rid="bb0130" ref-type="bibr">[26]</xref>
, <xref rid="bb0135" ref-type="bibr">[27]</xref>
, <xref rid="bb0160" ref-type="bibr">[32]</xref>
 for their representation (See Supplementary Material for details), where the coefficients of projections characterize the discrete spatial signals. Similarly, we show that <italic>k</italic>
-mers (of a given size <italic>k</italic>
) form a <bold>frame</bold>
 and the maximal projection of <italic>k</italic>
-mers corresponds to their counts in a sequencing run. Additionally, a <italic>k</italic>
-mer can be projected on to a collection of <bold>frames</bold>
, where each <bold>frame</bold>
 represents counts of <italic>k′</italic>
-mers (<italic>k′</italic>
 < <italic>k</italic>
) that are sub-strings of the given <italic>k</italic>
-mer.</p>
<p>The choice of <italic>k</italic>
 for a <bold>frame</bold>
 is important and should be large enough such that a <italic>k</italic>
-mer only occurs once in the haplotypes. On the other hand, it should be smaller than the read lengths so that <italic>k</italic>
-mer counting is still meaningful.</p>
<p>The minimum <italic>k</italic>
 can be approximated by ensuring that the probability of picking a string of length equal to the genome length (say |<italic>H</italic>
|) where all <italic>k</italic>
-mers in it occur only once is low <xref rid="bb0060" ref-type="bibr">[12]</xref>
. Thus the probability of picking approximately |<italic>H</italic>
| unique <italic>k</italic>
-mers out of a set of 4<sup><italic>k</italic>
</sup>/2
 (considering reverse complements) should be low. We set 2 ⋅|<italic>H</italic>
|/4<sup><italic>k</italic>
</sup>
 ≈ <italic>ϵ</italic>
, where <italic>ϵ</italic>
 is a small number, to determine the smallest possible choice of <italic>k</italic>
 (<italic>k</italic>
<sub><italic>min</italic>
</sub>
) for the <bold>frame</bold>
.</p>
<p>As an example, a <italic>k</italic>
-mer <italic>u</italic>
 occurring <italic>c</italic>
(<italic>u</italic>
) times in the reads when projected on a <bold>frame</bold>
 of size <italic>k</italic>
 is in-fact represented by its maximal projection <italic>c</italic>
(<italic>u</italic>
). The same <italic>k</italic>
-mer <italic>u</italic>
 can also be represented on <bold>frames</bold>
 of sizes <italic>k′</italic>
 and <italic>k</italic>
<sup><italic>′′</italic>
</sup>
 in the range [<italic>k</italic>
<sub><italic>min</italic>
</sub>
,<italic>k</italic>
]. Now the maximal projections for <italic>u</italic>
 in these <bold>frames</bold>
 are the counts of <italic>k′</italic>
-mers and <italic>k</italic>
<sup><italic>′′</italic>
</sup>
-mers present within <italic>u</italic>
. This representation of <italic>k</italic>
-mer <italic>u</italic>
 can be used to train a classifier for identifying erroneous versus rare variant <italic>k</italic>
-mers.</p>
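<p>The representation of a <italic>k</italic>-mer by its own count together with the counts of its sub <italic>k′</italic>-mers can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the MultiRes implementation: the function names <monospace>count_kmers</monospace> and <monospace>kmer_features</monospace>, the toy reads, and the small sizes 5 and 3 are hypothetical choices made for readability, and reverse complements are ignored.</p>

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_features(u, counts_by_size):
    """Feature vector for k-mer u: its own count, followed by the counts
    of every sub k'-mer inside u for each other size, left to right.

    counts_by_size maps a size to the Counter produced by count_kmers.
    """
    k = len(u)
    features = [counts_by_size[k][u]]
    for size in sorted(s for s in counts_by_size if s != k):
        table = counts_by_size[size]
        features.extend(table[u[i:i + size]] for i in range(k - size + 1))
    return features


# Toy data (hypothetical): the last read carries one substitution.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
sizes = (5, 3)
counts_by_size = {s: count_kmers(reads, s) for s in sizes}

print(kmer_features("ACGTA", counts_by_size))  # k-mer from the majority reads
print(kmer_features("ACGTT", counts_by_size))  # k-mer containing the substitution
```

<p>With <italic>k</italic> = 35 and sub-sizes 13 and 23, this layout would give 1 + 23 + 13 = 37 features per 35-mer; whether MultiRes orders or aggregates the counts in exactly this way is not stated in the text.</p>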
<sec id="s0015"><label>2.1</label>
<title>MultiRes: Classification algorithm for detecting sequencing errors and rare variants</title>
<p>We define a classifier, <italic>EC</italic>
, for classifying a <italic>k</italic>
-mer as erroneous, a rare variant, or a common <italic>k</italic>
-mer in the dataset. <xref rid="enun0005" ref-type="statement">Algorithm 1</xref>
 describes MultiRes, the proposed algorithm for detecting rare variants and sequencing errors. The algorithm takes as input the sampled reads, the classifier <italic>EC</italic>
, an ordered array (<italic>k</italic>
,<italic>k′</italic>
,<italic>k</italic>
<sup><italic>′′</italic>
</sup>
), and a threshold parameter <italic>T</italic>
<sub><italic>High</italic>
</sub>
. It outputs for every <italic>k</italic>
-mer observed in the sampled reads a status: whether the <italic>k</italic>
-mer is erroneous or a rare variant.</p>
<p>It first computes the counts of <italic>k</italic>
-mers, <italic>k′</italic>
-mers, and <italic>k</italic>
<sup><italic>′′</italic>
</sup>
-mers using the dsk <italic>k</italic>
-mer counting software <xref rid="bb0090" ref-type="bibr">[18]</xref>
. The <italic>k</italic>
-mers <italic>u</italic>
 that have counts greater than <italic>T</italic>
<sub><italic>High</italic>
</sub>
 are marked as true <italic>k</italic>
-mers while the rest of the <italic>k</italic>
-mers are classified using the classifier <italic>EC</italic>
 based on the counts of their constituent <italic>k′</italic>
-mers and <italic>k</italic>
<sup><italic>′′</italic>
</sup>
-mers.</p>
<p>The classifier <italic>EC</italic>
 captures the profile of erroneous versus rare variant <italic>k</italic>
-mers from Illumina sequencing of viral populations. We used the software dsk (version 1.6066) <xref rid="bb0090" ref-type="bibr">[18]</xref>
 for <italic>k</italic>
-mer counting, which counts <italic>k</italic>
-mers quickly on machines with limited memory and disk space. The run time of MultiRes depends linearly on the number of unique <italic>k</italic>
-mers in a dataset: once the classifier <italic>EC</italic>
 is trained, it can be reused for all datasets, and the classification is easily parallelized.</p>
<p><statement id="enun0005"><label>Algorithm 1</label>
<p>MultiRes: Error detection in the sampled reads by <bold>frame</bold>
-based classification of <italic>k</italic>
-mers<fig id="f0020"><graphic xlink:href="fx1"></graphic>
</fig>
</p>
</statement>
</p>
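<p>The flow of Algorithm 1, as described in the text, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: in-memory counting stands in for dsk, <monospace>classify</monospace> is a hypothetical callable standing in for the trained classifier <italic>EC</italic>, and the function names, toy reads and toy threshold are illustrative only.</p>

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def multires_sketch(reads, classify, sizes, t_high):
    """Return {k-mer: label} for every k-mer of size sizes[0] seen in reads.

    classify is any callable mapping a feature list to True (rare variant)
    or False (erroneous); it stands in for the trained classifier EC.
    """
    k, smaller = sizes[0], sizes[1:]
    tables = {s: count_kmers(reads, s) for s in sizes}
    labels = {}
    for u, c in tables[k].items():
        if c > t_high:
            labels[u] = "common"  # assumed error-free, never classified
            continue
        feats = [c]
        for s in smaller:
            feats += [tables[s][u[i:i + s]] for i in range(k - s + 1)]
        labels[u] = "rare variant" if classify(feats) else "erroneous"
    return labels


# Hypothetical stand-in for EC: call a k-mer a rare variant when every
# sub k'-mer count is at least 2. A real EC would be a trained model.
def toy_classifier(feats):
    return min(feats[1:]) >= 2


reads = ["ACGTACGTAC"] * 4 + ["ACGTTCGTAC"]
labels = multires_sketch(reads, toy_classifier, (5, 3), 5)
for kmer, label in sorted(labels.items()):
    print(kmer, label)
```

<p>Because each <italic>k</italic>-mer is labeled independently of the others, the loop over observed <italic>k</italic>-mers is the part that could be split across workers.</p>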
<sec id="s0020"><label>2.1.1</label>
<title>Simulated data for classifier training</title>
<p>MultiRes assumes the availability of a classifier <italic>EC</italic>
 which can distinguish between the erroneous and rare variant <italic>k</italic>
-mers. We use simulated datasets to train a series of classifiers and set <italic>EC</italic>
 to the classifier which has the highest accuracy. The simulated viral population consists of 11 haplotypes and is generated by mutating 10% of positions on a known HIV-1 reference sequence of length 9.18 kb (NC_001802). These mutations also model the evolution of a viral population under a high mutation rate. The mutations introduced are randomly and uniformly distributed across the length of the genome so that the classifier is not biased towards the distribution of true variants. This introduces a total of 195,000 ground truth unique 35-mers in the simulated HIV-1 dataset.</p>
<p>We next simulate Illumina paired-end sequencing reads using the software dwgsim (<ext-link ext-link-type="uri" xlink:href="https://github.com/nh13/DWGSIM" id="ir0015">https://github.com/nh13/DWGSIM</ext-link>
) at 400x sequencing coverage from this viral population. The status of each <italic>k</italic>
-mer in this dataset is therefore known: erroneous, rare variant, or common <italic>k</italic>
-mer. We use close to 100,000 <italic>k</italic>
-mers in the training dataset. Thus, there is a test dataset of <italic>k</italic>
-mers left for evaluating the efficacy of the classifiers.</p>
<p>In order to train a classifier, we need to choose the size of the <italic>k</italic>
-mer, the sizes of <italic>k′</italic>
-mers for computing the projections of <italic>k</italic>
-mer signals, and the number of such projections needed. The choice of the smallest of {<italic>k</italic>
,<italic>k′</italic>
,…} should be above the minimum length <italic>k</italic>
<sub><italic>min</italic>
</sub>
 to ensure that each <italic>k</italic>
-mer still corresponds to a unique location on a viral genome.</p>
<p>For HIV populations, with genome length 9180 base pairs (9.18 kbp) and taking <italic>ϵ</italic>
 = 0.001 (a small value, as mentioned before), the minimum length of <italic>k</italic>
-mer is <italic>k</italic>
<sub><italic>min</italic>
</sub>
 = ⌈log<sub>4</sub>
(2 ⋅ |<italic>H</italic>
|/<italic>ϵ</italic>
)⌉ = ⌈12.06⌉ = 13. As in the signal processing domain, we choose <italic>k</italic>
 = 35 ≈ 3 ⋅ <italic>k</italic>
<sub><italic>min</italic>
</sub>
 (close to an integral multiple of <italic>k</italic>
<sub><italic>min</italic>
</sub>
) as the largest <italic>k</italic>
-mer, and consider its projections on <bold>frames</bold>
 of sizes ranging from 13 to 35.</p>
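<p>The calculation of the minimum size can be checked with a few lines of Python. The function name <monospace>k_min</monospace> is a hypothetical helper; the inputs are the genome length and the value of <italic>ϵ</italic> quoted in the text.</p>

```python
import math


def k_min(genome_length, epsilon):
    """Smallest integer k for which 2 * genome_length / 4**k is at most epsilon."""
    return math.ceil(math.log(2 * genome_length / epsilon, 4))


# Genome length 9180 bp and epsilon = 0.001, as used for the HIV population.
print(k_min(9180, 0.001))
```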
<p>MultiRes assumes that <italic>k</italic>
-mers above the threshold count <italic>T</italic>
<sub><italic>High</italic>
</sub>
 are error-free, and only classifies the <italic>k</italic>
-mers with counts less than <italic>T</italic>
<sub><italic>High</italic>
</sub>
. The choice of <italic>T</italic>
<sub><italic>High</italic>
</sub>
 should ensure that the probability of erroneous <italic>k</italic>
-mers with counts above <italic>T</italic>
<sub><italic>High</italic>
</sub>
 is negligible. We use the gamma distribution model mentioned in the Quake error correction paper <xref rid="bb0060" ref-type="bibr">[12]</xref>
 for modeling erroneous <italic>k</italic>
-mers, as it approximates the observed distribution of errors. Based on this gamma distribution, we set <italic>T</italic>
<sub><italic>High</italic>
</sub>
 = 30 for the simulated HIV population data. The classifiers are therefore trained on 35-mers with counts less than 30.</p>
<p>Three training datasets consisting of both erroneous and rare variant 35-mers are generated. The features in the three datasets are the projection of the 35-mers onto (i) the frame of size 23, (ii) the frame of size 13, and (iii) a combination of both frames (<xref rid="f0005" ref-type="fig">Fig. 1</xref>
 a). The features in the three settings translate to the counts of the 13-mers and 23-mers observed within the 35-mer along with the counts of the 35-mer. We observed 11.9 million unique 35-mers in the simulated HIV-1 population, from which features from 76,000 erroneous 35-mers and 32,000 true variant 35-mers distributed uniformly over counts 1–30 were used for training the classifiers.<fig id="f0005"><label>Fig. 1</label>
<caption><p>Performance of classification algorithms for erroneous versus rare variant <italic>k</italic>
-mer classification. The performance of the classification algorithms for classifying 35-mers is compared over two sets of features. The 35-mers are projected onto (a) 23-mers, 13-mers, and 13 + 23-mers, or (b) 15-mers, 15 + 20-mers, 15 + 20 + 25-mers, and 15 + 20 + 25 + 30-mers. The accuracy reported is over fivefold cross validation on 35-mers extracted from the HIV viral population. Accuracy improves when 35-mers are projected onto smaller sized <italic>k′</italic>
-mers and as the number of projections increases. The Random Forest Classifier has the best accuracy of the classification algorithms compared.</p>
</caption>
<graphic xlink:href="gr1"></graphic>
</fig>
</p>
</sec>
<sec id="s0025"><label>2.1.2</label>
<title>Classifier selection</title>
<p>The classifiers Nearest Neighbor, Decision Tree, Random Forest, AdaBoost, Naive Bayes, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) are trained on the three training datasets and evaluated by their test accuracy under 5-fold cross validation. We use the implementations in the scikit-learn library (version 0.16.1) for the Python programming language (version 2.7.6). For all the classifiers, the accuracy improves when the 35-mers are projected onto 13-mers rather than 23-mers (higher resolution, smaller <italic>k′</italic>
-mers), and improves even further when 35-mers are resolved onto both 13-mers and 23-mers (<xref rid="f0005" ref-type="fig">Fig. 1</xref>
). No further feature selection was performed when 35-mers were resolved onto 13-mers and 23-mers. The Random Forest Classifier performs the best on all three datasets, where the accuracy for dataset (iii) is 98.12<italic>%</italic>
. The accuracy of the Naive Bayes and QDA classifiers is lower for all datasets, and also decreases when the projections onto both 13-mers and 23-mers are considered, indicating the inadequacy of their models for the classification of 35-mers in these projections. The performance of the other classifiers is comparable and follows similar trends.</p>
</sec>
<sec id="s0030"><label>2.1.3</label>
<title>Exploring additional feature spaces</title>
<p>Additionally, we generate a series of 4 projections of the 35-mers onto frames of sizes a) 15, b) {15 + 20}, c) {15 + 20 + 25}, and d) {15 + 20 + 25 + 30} to evaluate the effect of number of frames used for projection on the performance (<xref rid="f0005" ref-type="fig">Fig. 1</xref>
 b). Increasing the number of projections does not visibly improve the accuracy, although it increases the memory and time required to compute counts for all five values of <italic>k</italic>
. Based on this, we chose the Random Forest classifier with a resolution of 35-mers decomposed into a combination of 13-mers and 23-mers for other simulated and real datasets.</p>
</sec>
</sec>
</sec>
<sec id="s0035"><label>3</label>
<title>Results</title>
<sec id="s0040"><label>3.1</label>
<title>Error detection for reconstruction of haplotypes</title>
<p>We evaluate MultiRes on simulated HIV and HCV datasets and a laboratory mixture of HIV-1 strains. MultiRes is compared to the detection algorithms in the error correction tools Quake (last checked version Feb 2012) <xref rid="bb0060" ref-type="bibr">[12]</xref>
, BLESS (version 0.15) <xref rid="bb0065" ref-type="bibr">[13]</xref>
, Musket (last downloaded October 2015) <xref rid="bb0075" ref-type="bibr">[15]</xref>
, BFC (last downloaded October 2015) <xref rid="bb0070" ref-type="bibr">[14]</xref>
, BayesHammer (version 3.6.2) <xref rid="bb0150" ref-type="bibr">[30]</xref>
 and Seecer (version 0.1.3) <xref rid="bb0155" ref-type="bibr">[31]</xref>
. As these tools are traditionally designed for error correction, the error corrected reads or <italic>k</italic>
-mers from these methods were used for comparison with the rare variant <italic>k</italic>
-mers and common <italic>k</italic>
-mers predicted by MultiRes. ShoRAH <xref rid="bb0035" ref-type="bibr">[7]</xref>
 reconstructs a set of haplotypes as its final output rather than error corrected reads and thus was not evaluated for error correction. The error corrected reads, although available as an intermediate output, are not reported due to their low precision numbers, but ShoRAH is used for single nucleotide variant calling and comparison later in the text. Other recent error correction methods available for viral populations such as PredictHaplo <ext-link ext-link-type="uri" xlink:href="http://bmda.cs.unibas.ch/software.html" id="ir0020">http://bmda.cs.unibas.ch/software.html</ext-link>
, HaploClique <xref rid="bb0055" ref-type="bibr">[11]</xref>
, and Viral Genome Assembler (VGA) <xref rid="bb0040" ref-type="bibr">[8]</xref>
 were not evaluated in this study.</p>
<p>Three measures, defined in terms of the true and erroneous <italic>k</italic>
-mers, are used for comparing the detection algorithms in all methods. <italic>Precision</italic>
 is defined as the ratio of the known true <italic>k</italic>
-mers identified to the total number of <italic>k</italic>
-mers predicted as true variants by an algorithm. <italic>Recall</italic>
 is defined as the ratio of the true variant <italic>k</italic>
-mers identified by an algorithm to the total number of true <italic>k</italic>
-mers, and measures how well a method retains true <italic>k</italic>
-mers for a dataset. <italic>False Positives to True Positives Ratio</italic>
 (FP/TP ratio) is the ratio of the erroneous <italic>k</italic>
-mers predicted as true variants to the true variant <italic>k</italic>
-mers identified by the algorithm. The FP/TP ratio measures the number of erroneous <italic>k</italic>
-mers an algorithm reports as true for each true variant <italic>k</italic>
-mer it detects, and is a measure of the overall volume of <italic>k</italic>
-mers predicted by an algorithm.</p>
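<p>The three measures can be computed from two sets of <italic>k</italic>-mers, as in the following sketch. The function name <monospace>kmer_metrics</monospace> and the toy 3-mers are hypothetical and serve only to illustrate the definitions above.</p>

```python
def kmer_metrics(predicted, truth):
    """Precision, recall and FP/TP ratio for a set of k-mers predicted as true.

    predicted: k-mers an algorithm reports as true.
    truth: k-mers actually present in the viral haplotypes.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted.intersection(truth))  # true k-mers that were retained
    fp = len(predicted - truth)              # erroneous k-mers reported as true
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    fp_tp = fp / tp if tp else float("inf")
    return precision, recall, fp_tp


# Toy example (hypothetical 3-mers).
truth = {"AAA", "AAC", "ACG", "CGT"}
predicted = {"AAA", "AAC", "ACG", "TTT", "GGG", "CCC"}
print(kmer_metrics(predicted, truth))
```

<p>Since precision equals TP/(TP + FP), it can also be written as 1/(1 + FP/TP); for example, an FP/TP ratio of 53 corresponds to a precision of about 1.85%, which matches the uncorrected HIV 100x entry of Table 1.</p>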
<sec id="s0045"><label>3.1.1</label>
<title>HIV simulated datasets</title>
<p>We first assess the performance of MultiRes on the reads simulated from the HIV-1 population containing 11 haplotypes, generated from a single HIV-1 reference sequence (NC_001802) as mentioned before. Two datasets are generated from the simulated reads: one with an average haplotype coverage of 100x (denoted as HIV 100x), and a second with an average coverage of 400x (denoted as HIV 400x), since increasing the sequencing depth increases the absolute number of erroneous <italic>k</italic>
-mers introduced in the data.</p>
<p>The recall of MultiRes is 95% and 98% on the HIV 100x and HIV 400x datasets, respectively, indicating that the performance of MultiRes improves with increasing sequencing depth, as expected. These values are comparable to the recall of other methods, which is around 98% on the HIV 100x dataset and 94% to 99% on the HIV 400x dataset (<xref rid="t0005" ref-type="table">Table 1</xref>
).<table-wrap id="t0005" position="float"><label>Table 1</label>
<caption><p>Comparison of performance metrics of error detection on simulated HIV datasets. FP/TP ratio is the ratio of false positives to true positives, Recall measures the percentage of all true <italic>k</italic>
-mers that are predicted as true by an algorithm, and Precision measures the percentage of predicted <italic>k</italic>
-mers by an algorithm that are true <italic>k</italic>
-mers.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Algorithm</th>
<th colspan="2" align="left">FP/TP ratio</th>
<th colspan="2" align="left">Recall</th>
<th colspan="2" align="left">Precision</th>
</tr>
</thead>
<tbody><tr><td align="left"></td>
<td align="left">HIV 100x</td>
<td align="left">HIV 400x</td>
<td align="left">HIV 100x</td>
<td align="left">HIV 400x</td>
<td align="left">HIV 100x</td>
<td align="left">HIV 400x</td>
</tr>
<tr><td align="left">Uncorrected</td>
<td align="left">53</td>
<td align="left">121</td>
<td align="left">98.91</td>
<td align="left">99.67</td>
<td align="left">1.85</td>
<td align="left">0.82</td>
</tr>
<tr><td align="left">Quake</td>
<td align="left">9.26</td>
<td align="left">29.5</td>
<td align="left"><bold>98.63</bold>
</td>
<td align="left">94.84</td>
<td align="left">9.74</td>
<td align="left">3.27</td>
</tr>
<tr><td align="left">BLESS</td>
<td align="left">0.71</td>
<td align="left">76.7</td>
<td align="left">98.38</td>
<td align="left">99.36</td>
<td align="left">58.48</td>
<td align="left">1.28</td>
</tr>
<tr><td align="left">Musket</td>
<td align="left">0.46</td>
<td align="left">121</td>
<td align="left">98.46</td>
<td align="left"><bold>99.67</bold>
</td>
<td align="left">68.48</td>
<td align="left">0.82</td>
</tr>
<tr><td align="left">BFC</td>
<td align="left">2.12</td>
<td align="left">112</td>
<td align="left">98.47</td>
<td align="left">99.57</td>
<td align="left">32.01</td>
<td align="left">0.89</td>
</tr>
<tr><td align="left">BayesHammer</td>
<td align="left">0.37</td>
<td align="left">69.1</td>
<td align="left">98.47</td>
<td align="left">98.59</td>
<td align="left">73.04</td>
<td align="left">1.42</td>
</tr>
<tr><td align="left">Seecer</td>
<td align="left">12.1</td>
<td align="left">110</td>
<td align="left">98.49</td>
<td align="left">98.31</td>
<td align="left">7.65</td>
<td align="left">0.90</td>
</tr>
<tr><td align="left">MultiRes</td>
<td align="left"><bold>0.11</bold>
</td>
<td align="left"><bold>0.048</bold>
</td>
<td align="left">95.01</td>
<td align="left">98.17</td>
<td align="left"><bold>89.34</bold>
</td>
<td align="left"><bold>95.39</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>The False positive/True Positive ratios (FP/TP ratios), Recall, and Precision are compared on two HIV datasets for the methods: Quake, BLESS, Musket, BFC, BayesHammer, Seecer, and the proposed method MultiRes. The error corrected reads from each method are broken into <italic>k</italic>
-mers and compared to the true <italic>k</italic>
-mers in the HIV-1 viral populations. Uncorrected denotes the statistics when no error correction is performed. Bold in each column indicates the best method for the dataset and the metric evaluated.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>The precision of MultiRes is 89% on the HIV 100x dataset, while all other methods have lower precision (at most 73%). On the HIV 400x dataset, the precision of all other methods is less than 5% whereas the precision of MultiRes is 95%, suggesting that the precision of other methods decreases with increasing sequencing depth. As higher depth samples also contain more sequencing errors, the detection algorithms in these methods are not able to differentiate between rare variants and sequencing errors. Seecer and BayesHammer, which can handle variability in sequencing coverage, also have lower precision than the proposed method.</p>
<p>The FP/TP ratios obtained by MultiRes are 4 to 500 times better than those of other methods, and the number of <italic>k</italic>
-mers retained is close to the true set of <italic>k</italic>
-mers in the two datasets (the FP/TP ratio is close to zero and recall is 95–98%). Thus, while all methods retain the true <italic>k</italic>
-mers to the same extent, only MultiRes reduces the number of false positive <italic>k</italic>
-mers. This is important as the memory requirements for <italic>de novo</italic>
 assembly tools increase linearly with the number of <italic>k</italic>
-mers. Thus the <italic>k</italic>
-mers predicted by MultiRes could reduce memory consumption by up to 500 times for downstream <italic>de novo</italic>
 assembly tools as compared to current error correction methods.</p>
</sec>
<sec id="s0050"><label>3.1.2</label>
<title>Generalizability: Testing MultiRes on a Hepatitis C virus dataset</title>
<p>We also evaluate our method on reads simulated from viral populations consisting of the E1/E2 gene of Hepatitis C virus (HCV). The purpose of using HCV strains is to assess how well the MultiRes classifier generalizes to other viral population datasets. Two HCV populations observed in patients in previous studies are used as simulated viral populations. The first, denoted as HCV 1, consists of 36 HCV strains from the E1/E2 region, each of length 1672 bp <xref rid="bb0165" ref-type="bibr">[33]</xref>
. The second, denoted as HCV 2, consists of 44 HCV strains from the E1/E2 region of the HCV genome, each of length 1734 bp <xref rid="bb0040" ref-type="bibr">[8]</xref>
, <xref rid="bb0085" ref-type="bibr">[17]</xref>
. We simulate 500 K Illumina paired-end reads from both populations, with the reads distributed amongst the strains according to a power law (with ratio 2) <xref rid="bb0170" ref-type="bibr">[34]</xref>
. The two simulated datasets are denoted as HCV1P and HCV2P respectively. The power law distribution of reads also helps in evaluating the performance of MultiRes when more than 50% of the haplotypes are present at less than 5% relative abundances.</p>
<p>All methods have recall greater than 90% on both datasets (<xref rid="t0010" ref-type="table">Table 2</xref>
). Again, the difference between MultiRes and other methods is evident from the FP/TP ratios and precision. The FP/TP ratios for MultiRes are lower than those of other methods by at least a factor of 5 (<xref rid="t0010" ref-type="table">Table 2</xref>
). MultiRes again predicts the smallest set of <italic>k</italic>
-mers while maintaining high recall levels of true <italic>k</italic>
-mers.<table-wrap id="t0010" position="float"><label>Table 2</label>
<caption><p>Comparison of performance metrics of different methods on HCV population datasets.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Algorithm</th>
<th colspan="2" align="left">FP/TP ratio</th>
<th colspan="2" align="left">Recall</th>
<th colspan="2" align="left">Precision</th>
</tr>
</thead>
<tbody><tr><td align="left"></td>
<td align="left">HCV1P</td>
<td align="left">HCV2P</td>
<td align="left">HCV1P</td>
<td align="left">HCV2P</td>
<td align="left">HCV1P</td>
<td align="left">HCV2P</td>
</tr>
<tr><td align="left">Uncorrected</td>
<td align="left">1201</td>
<td align="left">571</td>
<td align="left">99.51</td>
<td align="left">99.88</td>
<td align="left">0.08</td>
<td align="left">0.17</td>
</tr>
<tr><td align="left">Quake</td>
<td align="left">303.3</td>
<td align="left">149</td>
<td align="left">96.41</td>
<td align="left">97.23</td>
<td align="left">0.32</td>
<td align="left">0.66</td>
</tr>
<tr><td align="left">BLESS</td>
<td align="left">202</td>
<td align="left">112</td>
<td align="left">98.35</td>
<td align="left">97.18</td>
<td align="left">0.49</td>
<td align="left">0.88</td>
</tr>
<tr><td align="left">Musket</td>
<td align="left">938</td>
<td align="left">463</td>
<td align="left">93.53</td>
<td align="left">89.17</td>
<td align="left">0.10</td>
<td align="left">0.21</td>
</tr>
<tr><td align="left">BFC</td>
<td align="left">352</td>
<td align="left">161</td>
<td align="left">99.32</td>
<td align="left">99.84</td>
<td align="left">0.28</td>
<td align="left">0.61</td>
</tr>
<tr><td align="left">BayesHammer</td>
<td align="left">699</td>
<td align="left">340</td>
<td align="left">98.12</td>
<td align="left">97.1</td>
<td align="left">0.14</td>
<td align="left">0.29</td>
</tr>
<tr><td align="left">Seecer</td>
<td align="left">1095</td>
<td align="left">528</td>
<td align="left"><bold>99.48</bold>
</td>
<td align="left"><bold>99.85</bold>
</td>
<td align="left">0.09</td>
<td align="left">0.19</td>
</tr>
<tr><td align="left">MultiRes</td>
<td align="left"><bold>37.4</bold>
</td>
<td align="left"><bold>19.54</bold>
</td>
<td align="left">96.5</td>
<td align="left">94.25</td>
<td align="left"><bold>2.6</bold>
</td>
<td align="left"><bold>4.87</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>The false positive to true positive ratios, Recall, and Precision of error correction methods on the two simulated HCV datasets are shown. Uncorrected refers to the statistics when no error correction is performed. Bold font in each column indicates the best method for each dataset on the evaluated measure.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>The recall of MultiRes is 96% and 94% on the HCV1P and HCV2P datasets, respectively, which is lower than that of Seecer, whose recall is around 99%. Seecer marks more than 90% of the observed <italic>k</italic>
-mers as true, which explains the high recall values. However, this also leads to a large number of false positive <italic>k</italic>
-mers being predicted as true <italic>k</italic>
-mers by Seecer, leading to low precision values. All other methods also achieve high recall by retaining a large fraction of the observed <italic>k</italic>
-mers, as indicated by their precision values being less than 1% and false positive to true positive ratios being greater than 100.</p>
<p>The similar performance of MultiRes on the HCV populations, which differ in genome composition from the simulated HIV-1 sequences used for training, indicates the generalizability of the Random Forest Classifier in MultiRes. The classifier captures properties of the Illumina sequencing platform and the fact that both datasets contain a large number of rare variants occurring at <italic>k</italic>
-mer counts close to those of sequencing errors. Thus, MultiRes can be used as is for error and rare variant detection in diverse datasets.</p>
<p>Because the performance of MultiRes on the HCV populations is not as strong as on the simulated HIV populations, it is important to understand the cause of this decrease. One possibility is that the decrease is tied to the large number of low-frequency variants that MultiRes misclassifies. To test this, we examine the classification by MultiRes as a function of the count of the 35-mer being classified. MultiRes predicts about one-fourth of the observed <italic>k</italic>
-mers as rare variants for <italic>k</italic>
-mer counts less than 15, and predicts more than 99% of the observed <italic>k</italic>
-mers as true for counts greater than 20 (<xref rid="f0010" ref-type="fig">Fig. 2</xref>
 (a)). This suggests that MultiRes predicts rare variant <italic>k</italic>
-mers for all observed counts and detects more rare variant <italic>k</italic>
-mers than a method based on a single threshold.<fig id="f0010"><label>Fig. 2</label>
<caption><p>Performance of MultiRes on HCV datasets under power law distributions of viral haplotypes with respect to count of <italic>k</italic>
-mer. 35-mer multiplicity plots for the HCV1P and HCV2P datasets are shown. The x-axis indicates the number of times a 35-mer was observed, while the y-axis indicates the number of 35-mers at that count. (a) The predicted true 35-mers from MultiRes (HCV1P red, HCV2P pink) compared to the uncorrected data (HCV1P blue, HCV2P green), and (b) the true positive rare variant 35-mers from MultiRes (HCV1P red, HCV2P pink) versus the ground truth 35-mers (HCV1P red, HCV2P pink). MultiRes predicts rare variant <italic>k</italic>
-mers at all counts greater than 3, with its accuracy improving as the <italic>k</italic>
-mer count increases.</p>
</caption>
<graphic xlink:href="gr2"></graphic>
</fig>
</p>
<p>Most of the <italic>k</italic>
-mer based error correction methods use a single threshold over the <italic>k</italic>
-mer counts, which will clearly lose true rare variant <italic>k</italic>
-mers (<xref rid="f0010" ref-type="fig">Fig. 2</xref>
 (b)). On the other hand, MultiRes has a recall of 50% for <italic>k</italic>
-mers observed 3 times, while still correctly identifying more than 75% of the <italic>k</italic>
-mers as erroneous. The recall of MultiRes increases to 100% as the count of the observed <italic>k</italic>
-mers increases to 35. This indicates the importance of not relying on a single threshold to distinguish sequencing errors from rare variants in viral population datasets; MultiRes avoids a single threshold by training a Random Forest classifier.</p>
</sec>
<sec id="s0055"><label>3.1.3</label>
<title>Evaluation on population of 5 HIV-1 sequences</title>
<p>We also evaluate MultiRes on a laboratory mixture of five known HIV-1 strains <xref rid="bb0175" ref-type="bibr">[35]</xref>
, which captures the variability occurring during sample preparation, errors introduced in a real sequencing project, and mutations occurring during reverse transcription of RNA samples. Five HIV-1 strains (named YU2, HXB2, NL43, 89.6, and JRCSF), each of length 9.1 kb, were pooled and sequenced using Illumina paired-end sequencing technology (see <xref rid="bb0175" ref-type="bibr">[35]</xref>
 for details). Each HIV-1 strain was also sequenced separately in their study and aligned to its known reference sequence (from GenBank) to generate a consensus sequence for each strain <xref rid="bb0175" ref-type="bibr">[35]</xref>
. This provides a dataset of actual sequence reads for which the ground truth is known, allowing us to assess the performance of MultiRes and other methods. We extracted 35-mers from the paired-end sequencing data and classified them using the Random Forest classifier of MultiRes trained on the simulated HIV sequencing data.</p>
<p>All the error correction methods and MultiRes have recall values around 97%, indicating that the performance for recovery of true <italic>k</italic>
-mers is comparable across all methods (<xref rid="t0015" ref-type="table">Table 3</xref>
). The false positive to true positive ratio for MultiRes is 13 while all other methods have ratios more than 120. MultiRes predicts 359 thousand unique <italic>k</italic>
-mers in the set of true <italic>k</italic>
-mers while all other methods predict more than 5 million unique <italic>k</italic>
-mers. Even methods that account for variance in sequencing depth while performing error correction, such as BayesHammer and Seecer, predict 6.3 million and 11.3 million unique <italic>k</italic>
-mers, respectively, which is two orders of magnitude more than the ground truth number of <italic>k</italic>
-mers in the consensus sequences of the 5 HIV-1 strains (53 thousand unique <italic>k</italic>
-mers). Thus, even considering the artifacts introduced in sequencing, MultiRes has by far the most compact set of predicted error-free <italic>k</italic>
-mers amongst all methods while retaining a high number of true <italic>k</italic>
-mers. As mentioned earlier, because the number of <italic>k</italic>
-mers linearly affects the memory requirements of downstream <italic>de novo</italic>
 assembly methods, the error detection from MultiRes would translate to a 10-fold reduction in memory.</p>
</sec>
<sec id="s0060"><label>3.1.4</label>
<title>Runtime and memory</title>
<p>MultiRes has comparable running times to BayesHammer on the five-viral mix dataset (<xref rid="f0015" ref-type="fig">Fig. 3</xref>
) on a Dell system with 8 GB of main memory and two dual-core AMD Opteron 2216 CPUs. Running times on all other datasets were similarly comparable. Additionally, while other methods have parallel implementations, the error detection classifier step in MultiRes is a single-threaded serial implementation. As the random forest classifier used by MultiRes is already trained and independent of the input <italic>k</italic>
-mers for classification, the runtime of MultiRes can be significantly improved via parallelization of the <italic>k</italic>
-mer classification step.<fig id="f0015"><label>Fig. 3</label>
<caption><p>Runtime comparison on the five-viral mix dataset. Running times of the different algorithms were measured on Dell nodes with 8 GB of memory and two dual-core AMD Opteron 2216 CPUs. The time noted for BayesHammer is only the time reported for the BayesHammer error correction step in SPAdes (version 3.6.2). The time reported for MultiRes is the combined time for <italic>k</italic>
-mer counting, classifying <italic>k</italic>
-mers as erroneous or rare variants, and generating the final output.</p>
</caption>
<graphic xlink:href="gr3"></graphic>
</fig>
</p>
</sec>
</sec>
<sec id="s0065"><label>3.2</label>
<title>Comparison of MultiRes to variant calling methods for viral populations</title>
<p>As one of the objectives in NGS studies of viruses is to identify the single nucleotide polymorphisms (SNPs) in a population <xref rid="bb0010" ref-type="bibr">[2]</xref>
, <xref rid="bb0035" ref-type="bibr">[7]</xref>
, <xref rid="bb0045" ref-type="bibr">[9]</xref>
 which is sensitive to erroneous reads, we evaluate the inference of SNPs from the <italic>k</italic>
-mers predicted by MultiRes, and compare it to known SNP profiling methods for viral populations. We first align the predicted <italic>k</italic>
-mers from MultiRes to a reference sequence of the viral population using the bwa-mem aligner. A base is called as a SNP when its relative fraction amongst the <italic>k</italic>
-mers aligned at that position is greater than 0.01, and all variants whose frequency at a position exceeds this error threshold are reported as SNPs. The choice of the reference sequence is based on the viral population data being evaluated, and the same reference sequence is used for calling the true SNPs and the SNPs predicted by a method.</p>
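<p>The calling rule described above can be illustrated with a minimal Python sketch. It assumes that the alignment has already been summarized as a per-position collection of the bases observed in the aligned <italic>k</italic>-mers; the function name call_snps, the pileup dictionary, and the min_fraction parameter are hypothetical names chosen for illustration and are not part of MultiRes or bwa.</p>

```python
from collections import Counter


def call_snps(pileup, reference, min_fraction=0.01):
    """Report non-reference bases whose fraction at a position exceeds min_fraction.

    pileup    : dict mapping a 0-based position to a string of observed bases
                (hypothetical summary of the aligned k-mers at that position)
    reference : reference sequence as a string
    Returns a dict mapping position to a sorted list of called variant bases.
    """
    calls = {}
    for pos, bases in pileup.items():
        counts = Counter(bases)
        depth = sum(counts.values())
        if depth == 0:
            continue  # nothing aligned at this position
        variants = sorted(
            base
            for base, n in counts.items()
            if base != reference[pos] and n / depth > min_fraction
        )
        if variants:
            calls[pos] = variants
    return calls


if __name__ == "__main__":
    example = {0: "A" * 99 + "G", 1: "C" * 95 + "T" * 5, 2: "G" * 100}
    print(call_snps(example, "ACG"))
```

<p>In this sketch a base seen in exactly 1% of the aligned <italic>k</italic>-mers is not reported, because the rule requires the fraction to be strictly greater than 0.01.</p>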
<p>Each SNP detected at a base position of the reference and detection of the reference base itself are treated as <italic>true positives</italic>
 for a method; thus the number of true positives can be greater than the length of the reference sequence. All the SNPs predicted by a method and the number of bases mapped to the reference sequence are known as the <italic>total SNP predictions</italic>
 of a method. We use three measures for evaluating the SNPs called by any method. <italic>Precision</italic>
 is defined as the ratio of the number of <italic>true positives</italic>
 to the <italic>total SNP predictions</italic>
 made by a method, while <italic>recall</italic>
 is defined as the ratio of the <italic>true positives</italic>
 to the total number of SNPs and reference bases in the viral population. Finally, <italic>false positive to true positive ratio</italic>
 is the ratio of the number of false SNP predictions to the number of <italic>true positives</italic>
 detected by a method.</p>
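<p>The three measures can be written out as a short Python sketch. It assumes that the total SNP predictions equal the true positives plus the false predictions, and that the total number of SNPs and reference bases in the population equals the true positives plus the false negatives; the function names are illustrative only. Under the first assumption, precision follows from the FP/TP ratio as 1 / (1 + FP/TP), which agrees with the values reported in Table 4 (for example, an FP/TP ratio of 0.201 corresponds to a precision of about 83%).</p>

```python
def snp_metrics(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and FP/TP ratio from raw counts.

    Assumes total predictions = TP + FP and total truth = TP + FN.
    """
    total_predictions = true_positives + false_positives
    total_truth = true_positives + false_negatives
    return {
        "precision": true_positives / total_predictions,
        "recall": true_positives / total_truth,
        "fp_tp_ratio": false_positives / true_positives,
    }


def precision_from_ratio(fp_tp_ratio):
    """Precision implied by an FP/TP ratio: TP / (TP + FP) = 1 / (1 + FP/TP)."""
    return 1.0 / (1.0 + fp_tp_ratio)


if __name__ == "__main__":
    print(snp_metrics(80, 20, 20))
    print(round(precision_from_ratio(0.201), 2))
```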
<p>We compare our results to state-of-the-art variant calling methods for viral populations VPhaser-2 <xref rid="bb0045" ref-type="bibr">[9]</xref>
, a rare variant calling method LoFreq <xref rid="bb0050" ref-type="bibr">[10]</xref>
, and viral haplotype reconstruction algorithm ShoRAH <xref rid="bb0035" ref-type="bibr">[7]</xref>
 using the above three measures. The reference sequence used by variant calling methods VPhaser-2 and LoFreq is the same as that used by <italic>samtools</italic>
 to determine the true SNPs, while the SNPs predicted by ShoRAH at default parameters are compared directly to the true SNPs. We evaluated only the SNP calls from VPhaser-2, as the other methods do not report length polymorphisms; the VPhaser-2 results were not penalized for them when comparing the SNPs.</p>
<p>We report results for LoFreq <xref rid="bb0050" ref-type="bibr">[10]</xref>
, VPhaser <xref rid="bb0045" ref-type="bibr">[9]</xref>
, ShoRAH <xref rid="bb0035" ref-type="bibr">[7]</xref>
 and our method MultiRes on all datasets (<xref rid="t0020" ref-type="table">Table 4</xref>
). Overall, MultiRes has recall above 94% on all datasets, with precision above 83% on all datasets except HCV1P (62.64%). LoFreq and VPhaser have comparable recall but lower precision and higher FP/TP ratios on the HCV population datasets, indicating a decrease in performance. ShoRAH has lower recall overall but 100% precision on all but the 5-viral mix dataset, suggesting that it misses true SNPs but is very accurate when it calls a base as a SNP. All methods have low FP/TP ratios compared to the error correction evaluation above, indicating that the number of false positive SNP predictions is low. MultiRes stands out in the number of true SNPs missed: it has the fewest false negatives on three of the five datasets and is close to the best method on the other two. This shows that even with the simple SNP prediction method used with MultiRes, it captures the true variation of the sampled viral population and misses few true SNPs relative to well-established methods. This demonstrates that using an error-free set of <italic>k</italic>
-mers can substantially improve variant detection in viral populations.<table-wrap id="t0015" position="float"><label>Table 3</label>
<caption><p>Comparison of performance metrics on 5-viral mix HIV-1 dataset.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Algorithm</th>
<th align="left">Recall (%)</th>
<th align="left">Precision (%)</th>
<th align="left">FP/TP ratio</th>
<th align="left"># of unique 35-mers</th>
</tr>
</thead>
<tbody><tr><td align="left">Uncorrected</td>
<td align="left">98.01</td>
<td align="left">0.2</td>
<td align="left">439</td>
<td align="left">11.4 M</td>
</tr>
<tr><td align="left">BLESS</td>
<td align="left">97.31</td>
<td align="left">0.4</td>
<td align="left">227</td>
<td align="left">5.89 M</td>
</tr>
<tr><td align="left">Musket</td>
<td align="left"><bold>97.91</bold>
</td>
<td align="left">0.3</td>
<td align="left">366</td>
<td align="left">11.2 M</td>
</tr>
<tr><td align="left">BFC</td>
<td align="left">97.55</td>
<td align="left">0.3</td>
<td align="left">316</td>
<td align="left">9.6 M</td>
</tr>
<tr><td align="left">BayesHammer</td>
<td align="left">97.49</td>
<td align="left">0.8</td>
<td align="left">122</td>
<td align="left">6.3 M</td>
</tr>
<tr><td align="left">Seecer</td>
<td align="left">97.84</td>
<td align="left">0.5</td>
<td align="left">220</td>
<td align="left">11.3 M</td>
</tr>
<tr><td align="left">MultiRes</td>
<td align="left">96.64</td>
<td align="left">7.1</td>
<td align="left"><bold>13</bold>
</td>
<td align="left"><bold>359 K</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>The recall, precision, and FP/TP ratios of each method are evaluated on the 5-viral mix HIV-1 dataset. The number of unique 35-mers indicates the number of unique 35-mers predicted by a method. There are 53 thousand true unique 35-mers in the consensus sequences of the 5 viral strains. Bold indicates the best method for the measure in each column.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="t0020" position="float"><label>Table 4</label>
<caption><p>Comparison with Variant Calling methods on all datasets.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Dataset</th>
<th align="left">Method</th>
<th align="left">Recall (%)</th>
<th align="left">FP/TP ratio</th>
<th align="left">Precision (%)</th>
<th align="left"># of False negatives</th>
<th align="left">Mapped reads (%)</th>
</tr>
</thead>
<tbody><tr><td align="left">HIV 100x</td>
<td align="left">LoFreq</td>
<td align="left">97.33</td>
<td align="left">0.004</td>
<td align="left">99.60</td>
<td align="left">444</td>
<td align="left">89.51</td>
</tr>
<tr><td align="left"></td>
<td align="left">Vphaser</td>
<td align="left">98.90</td>
<td align="left">0.007</td>
<td align="left">99.26</td>
<td align="left">183</td>
<td align="left">89.51</td>
</tr>
<tr><td align="left"></td>
<td align="left">ShoRAH</td>
<td align="left">55.21</td>
<td align="left"><bold>0</bold>
</td>
<td align="left"><bold>100</bold>
</td>
<td align="left">7746</td>
<td align="left"><bold>98.04</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">MultiRes</td>
<td align="left"><bold>99.69</bold>
</td>
<td align="left">0.011</td>
<td align="left">98.88</td>
<td align="left"><bold>51</bold>
</td>
<td align="left">97.89</td>
</tr>
<tr><td align="left">HIV 400x</td>
<td align="left">LoFreq</td>
<td align="left">84.83</td>
<td align="left">0</td>
<td align="left">99.99</td>
<td align="left">2522</td>
<td align="left">99.55</td>
</tr>
<tr><td align="left"></td>
<td align="left">Vphaser</td>
<td align="left"><bold>95.92</bold>
</td>
<td align="left">0.292</td>
<td align="left">77.37</td>
<td align="left"><bold>678</bold>
</td>
<td align="left">99.55</td>
</tr>
<tr><td align="left"></td>
<td align="left">ShoRAH</td>
<td align="left">55.21</td>
<td align="left"><bold>0</bold>
</td>
<td align="left"><bold>100</bold>
</td>
<td align="left">7746</td>
<td align="left"><bold>99.95</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">MultiRes</td>
<td align="left">95.57</td>
<td align="left">0.007</td>
<td align="left">99.33</td>
<td align="left">736</td>
<td align="left">97.34</td>
</tr>
<tr><td align="left">HCV1P</td>
<td align="left">LoFreq</td>
<td align="left"><bold>98.30</bold>
</td>
<td align="left">1.282</td>
<td align="left">43.82</td>
<td align="left"><bold>31</bold>
</td>
<td align="left"><bold>99.99</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">Vphaser</td>
<td align="left">93.51</td>
<td align="left">1.628</td>
<td align="left">38.05</td>
<td align="left">118</td>
<td align="left"><bold>99.99</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">ShoRAH</td>
<td align="left">91.92</td>
<td align="left"><bold>0</bold>
</td>
<td align="left"><bold>100</bold>
</td>
<td align="left">147</td>
<td align="left"><bold>99.99</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">MultiRes</td>
<td align="left">98.24</td>
<td align="left">0.597</td>
<td align="left">62.64</td>
<td align="left">32</td>
<td align="left">97.32</td>
</tr>
<tr><td align="left">HCV2P</td>
<td align="left">LoFreq</td>
<td align="left">97.10</td>
<td align="left">1.046</td>
<td align="left">48.87</td>
<td align="left">60</td>
<td align="left"><bold>100</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">Vphaser</td>
<td align="left">95.65</td>
<td align="left">1.492</td>
<td align="left">40.13</td>
<td align="left">90</td>
<td align="left"><bold>100</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">ShoRAH</td>
<td align="left">83.73</td>
<td align="left"><bold>0</bold>
</td>
<td align="left"><bold>100</bold>
</td>
<td align="left">337</td>
<td align="left">99.95</td>
</tr>
<tr><td align="left"></td>
<td align="left">MultiRes</td>
<td align="left"><bold>98.79</bold>
</td>
<td align="left">0.201</td>
<td align="left">83.27</td>
<td align="left"><bold>25</bold>
</td>
<td align="left">85.14</td>
</tr>
<tr><td align="left">5-viral mix</td>
<td align="left">LoFreq</td>
<td align="left">99.06</td>
<td align="left">0.085</td>
<td align="left">92.15</td>
<td align="left">101</td>
<td align="left">98.59</td>
</tr>
<tr><td align="left"></td>
<td align="left">Vphaser</td>
<td align="left">92.68</td>
<td align="left">0.039</td>
<td align="left">96.25</td>
<td align="left">789</td>
<td align="left">98.59</td>
</tr>
<tr><td align="left"></td>
<td align="left">ShoRAH</td>
<td align="left">98.66</td>
<td align="left"><bold>0.014</bold>
</td>
<td align="left"><bold>98.99</bold>
</td>
<td align="left">109</td>
<td align="left"><bold>99.3</bold>
</td>
</tr>
<tr><td align="left"></td>
<td align="left">MultiRes</td>
<td align="left"><bold>99.39</bold>
</td>
<td align="left">0.077</td>
<td align="left">92.82</td>
<td align="left"><bold>66</bold>
</td>
<td align="left">96.29</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>The Recall, false positive to true positive ratios (FP/TP), Precision, number of false negatives, and % of mapped reads by methods LoFreq, VPhaser-2, ShoRAH, and MultiRes are computed for listed datasets. All reads from a sample were aligned using bwa-mem tool for LoFreq and VPhaser-2 under default settings. ShoRAH uses its own aligner for read alignment and variant calling, while <italic>k</italic>
-mers detected by MultiRes were aligned using bwa-mem. Outputs from LoFreq (version 2.1.2), VPhaser-2 (version last downloaded in October 2015), and ShoRAH (version last downloaded in November 2013) are compared against the known variants for the simulated datasets. For the 5-viral mix, the consensus reference provided by [<xref rid="bb0175" ref-type="bibr">35</xref>
] was used to determine the ground truth variants. MultiRes variants are determined by aligning 35-mers to a reference sequence and calling bases that occur at a frequency above 0.01 as variants. Bold for each dataset indicates the best method for the performance measures.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>The percentage of reads or <italic>k</italic>
-mers aligned to the reference sequence is comparable across the methods, except for the HCV2P dataset, where MultiRes has 85% of its <italic>k</italic>
-mers mapped compared to 100% read mapping (<xref rid="t0020" ref-type="table">Table 4</xref>
). It is possible that the unmapped <italic>k</italic>
-mers correspond to length variants; this could be verified by haplotype reconstruction using the predicted <italic>k</italic>
-mers, but that was not the focus of this paper.</p>
</sec>
</sec>
<sec id="s0070"><label>4</label>
<title>Discussion and conclusions</title>
<p>We have proposed a classifier MultiRes for detecting rare variant and erroneous <italic>k</italic>
-mers obtained from Illumina sequencing of viral populations. Our method does not rely on a reference sequence and uses concepts from signal processing to justify using the counts of sets of <italic>k</italic>
-mers of different sizes. We utilize the projections of the sampled read signals onto multiple <bold>frames</bold>
 as features for our classifier MultiRes.</p>
<p>We demonstrated the performance of MultiRes on simulated HIV and HCV viral populations and real HIV viral populations containing viral haplotypes at varying relative frequencies, where it outperformed the error detection algorithms used in error correction methods in terms of recall and the total number of predicted <italic>k</italic>
-mers. Although the error detection algorithms in the evaluated error correction methods generally assume that sequenced reads originate from a single genome sequenced at uniform coverage, our method also performs better than BayesHammer, which can handle non-uniform sequencing coverage, and Seecer, which additionally incorporates methods for detecting alternative splicing and polymorphisms.</p>
<p>The error-free <italic>k</italic>
-mers predicted by MultiRes enable the usage of <italic>de novo</italic>
 assembly methods for viral genomes. A major challenge in using De Bruijn graph based methods for viral populations has been the increased complexity of the graph due to the presence of a large number of sequencing errors <xref rid="bb0180" ref-type="bibr">[36]</xref>
. Moreover, the memory footprint of a De Bruijn assembly graph increases linearly with the number of <italic>k</italic>
-mers in the NGS data. Thus the low false positives along with high recall of <italic>k</italic>
-mers predicted by MultiRes drastically reduce the memory requirements for De Bruijn graphs. An edge-centric De Bruijn graph of size <italic>k</italic>
 − 1 can be directly generated from error free <italic>k</italic>
-mers, such as in <italic>de novo</italic>
 assembly tools SPAdes and Cortex <xref rid="bb0185" ref-type="bibr">[37]</xref>
, <xref rid="bb0190" ref-type="bibr">[38]</xref>
 for reconstruction of viral haplotypes in a viral population. The graph can be used for calling structural variants in the viral population data. MultiRes has high recall of true <italic>k</italic>
-mers while outputting the least number of false positive <italic>k</italic>
-mers, thereby making <italic>de novo</italic>
 assembly graphs manageable.</p>
<p>MultiRes also can be directly used for SNP calling as the predicted error-free <italic>k</italic>
-mers can be aligned to an existing reference genome or a consensus sequence of the current viral population. The SNPs called from the MultiRes output have either the highest or the second-highest recall among the compared methods for viral population variant calling on every dataset.</p>
<p>MultiRes relies on the counts of multiple sizes of <italic>k</italic>
-mers observed in the sequenced reads, and the choice of <italic>k</italic>
-mer length is an important parameter. The minimum value of <italic>k</italic>
 chosen should be such that a <italic>k</italic>
-mer can only be sampled from a single location in the genome. This is possible in viral populations, where only small repeats are present. The number of <italic>k</italic>
-mer sizes used is another parameter: while accuracy can be improved by increasing it, additional <italic>k</italic>
-mer counting increases the number of computations. As demonstrated by our experiments, choosing three different values of <italic>k</italic>
, namely (<italic>k</italic>
,2 ⋅ <italic>k</italic>
,3 ⋅ <italic>k</italic>
) was sufficient for accurate results.</p>
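<p>Counting at the three sizes mentioned above can be sketched in a few lines of Python. This is only an illustration of multi-size <italic>k</italic>-mer counting under simplifying assumptions (reads as plain strings, no reverse complements, no quality filtering); the function names count_kmers and multi_size_counts are hypothetical, and the sketch does not reproduce the feature construction or the classifier used by MultiRes.</p>

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        # range() is empty when the read is shorter than k
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def multi_size_counts(reads, k):
    """Return k-mer counts for the three sizes k, 2k, and 3k."""
    return {size: count_kmers(reads, size) for size in (k, 2 * k, 3 * k)}


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]
    for size, counts in multi_size_counts(reads, 2).items():
        print(size, dict(counts))
```

<p>Each additional size requires another pass over the reads, which is the extra computation referred to in the text.</p>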
<p>MultiRes also has applications for studying the large scale variation in closely related genomes, such as viral populations. The complexity of De Bruijn graphs, which are useful for studying structural variants and rearrangements in the population, increases because of sequencing errors. Our method can provide a compact set of <italic>k</italic>
-mers while still retaining high recall of the true <italic>k</italic>
-mers, which can be utilized for constructing the graph. Additionally, the error-free <italic>k</italic>
-mers predicted by MultiRes can be directly used for understanding the SNPs observed in the viral population to a high degree of accuracy.</p>
<p>The classifier in MultiRes also has limitations. The model, although trained to capture the features of an Illumina sequencing machine, performs worse on different viral populations with a large number of rare variants, as is evident from its 50% recall on the HCV2P population for <italic>k</italic>
-mers observed only 3 times. Although it is able to eliminate a large number of false positive <italic>k</italic>
-mers (more than 75% of <italic>k</italic>
-mers at counts of 3), the classifier model can be improved with additional training data and an ensemble of classifier models.</p>
<p>MultiRes was primarily developed for detection of sequencing errors and rare variants in viral populations, which have small genomes. Extending our method to larger genomes may require additional tuning of the parameters via re-training of the classifier, but the concepts developed here are applicable to studying variation in closely related genomes such as cancer cell lines. The method is also applicable to understanding somatic variation in sequences, as the frequency of such variation is close to the sequencing error rate. The technique can also be explored for newer sequencing platforms, such as PacBio and Oxford Nanopore long read sequencing, where the types of sequencing errors are different but the concepts of projections of signals are still applicable. The software is available for download from GitHub (<ext-link ext-link-type="uri" xlink:href="https://github.com/raunaq-m/MultiRes" id="ir0025">https://github.com/raunaq-m/MultiRes</ext-link>
).</p>
</sec>
<sec id="s9030"><title>Conflict of interest</title>
<p>The authors declare no conflict of interest.</p>
</sec>
</body>
<back><ref-list id="bi0005"><title>References</title>
<ref id="bb0005"><label>1</label>
<element-citation publication-type="journal" id="rf0005"><person-group person-group-type="author"><name><surname>Nguyen</surname>
<given-names>D.X.</given-names>
</name>
<name><surname>Massagué</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Genetic determinants of cancer metastasis</article-title>
<source>Nat Rev Genet</source>
<volume>8</volume>
<issue>5</issue>
<year>2007</year>
<fpage>341</fpage>
<lpage>352</lpage>
<pub-id pub-id-type="pmid">17440531</pub-id>
</element-citation>
</ref>
<ref id="bb0010"><label>2</label>
<element-citation publication-type="journal" id="rf0010"><person-group person-group-type="author"><name><surname>McElroy</surname>
<given-names>K.</given-names>
</name>
<name><surname>Thomas</surname>
<given-names>T.</given-names>
</name>
<name><surname>Luciani</surname>
<given-names>F.</given-names>
</name>
</person-group>
<article-title>Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions</article-title>
<source>Microb Inform Exp</source>
<volume>4</volume>
<issue>1</issue>
<year>2014</year>
</element-citation>
</ref>
<ref id="bb0015"><label>3</label>
<element-citation publication-type="journal" id="rf0015"><person-group person-group-type="author"><name><surname>Beerenwinkel</surname>
<given-names>N.</given-names>
</name>
<name><surname>Gunthard</surname>
<given-names>H.F.</given-names>
</name>
<name><surname>Roth</surname>
<given-names>V.</given-names>
</name>
<name><surname>Metzner</surname>
<given-names>K.J.</given-names>
</name>
</person-group>
<article-title>Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data</article-title>
<source>Front Microbiol</source>
<volume>3</volume>
<issue>329</issue>
<year>2012</year>
</element-citation>
</ref>
<ref id="bb0020"><label>4</label>
<element-citation publication-type="journal" id="rf0020"><person-group person-group-type="author"><name><surname>Schirmer</surname>
<given-names>M.</given-names>
</name>
<name><surname>Ijaz</surname>
<given-names>U.Z.</given-names>
</name>
<name><surname>D’Amore</surname>
<given-names>R.</given-names>
</name>
<name><surname>Hall</surname>
<given-names>N.</given-names>
</name>
<name><surname>Sloan</surname>
<given-names>W.T.</given-names>
</name>
<name><surname>Quince</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform</article-title>
<source>Nucleic Acids Res</source>
<volume>43</volume>
<issue>6</issue>
<year>2015</year>
<fpage>e37</fpage>
<pub-id pub-id-type="pmid">25586220</pub-id>
</element-citation>
</ref>
<ref id="bb0025"><label>5</label>
<element-citation publication-type="journal" id="rf0025"><person-group person-group-type="author"><name><surname>Meacham</surname>
<given-names>F.</given-names>
</name>
<name><surname>Boffelli</surname>
<given-names>D.</given-names>
</name>
<name><surname>Dhahbi</surname>
<given-names>J.</given-names>
</name>
<name><surname>Martin</surname>
<given-names>D.I.</given-names>
</name>
<name><surname>Singer</surname>
<given-names>M.</given-names>
</name>
<name><surname>Pachter</surname>
<given-names>L.</given-names>
</name>
</person-group>
<article-title>Identification and correction of systematic error in high-throughput sequence data</article-title>
<source>BMC Bioinf</source>
<volume>12</volume>
<issue>1</issue>
<year>2011</year>
<fpage>451</fpage>
</element-citation>
</ref>
<ref id="bb0030"><label>6</label>
<element-citation publication-type="journal" id="rf0030"><person-group person-group-type="author"><name><surname>Töpfer</surname>
<given-names>A.</given-names>
</name>
<name><surname>Zagordi</surname>
<given-names>O.</given-names>
</name>
<name><surname>Prabhakaran</surname>
<given-names>S.</given-names>
</name>
<name><surname>Roth</surname>
<given-names>V.</given-names>
</name>
<name><surname>Halperin</surname>
<given-names>E.</given-names>
</name>
<name><surname>Beerenwinkel</surname>
<given-names>N.</given-names>
</name>
</person-group>
<article-title>Probabilistic inference of viral quasispecies subject to recombination</article-title>
<source>J Comput Biol</source>
<volume>20</volume>
<issue>2</issue>
<year>2013</year>
<fpage>113</fpage>
<lpage>123</lpage>
<pub-id pub-id-type="pmid">23383997</pub-id>
</element-citation>
</ref>
<ref id="bb0035"><label>7</label>
<element-citation publication-type="journal" id="rf0035"><person-group person-group-type="author"><name><surname>Zagordi</surname>
<given-names>O.</given-names>
</name>
<name><surname>Bhattacharya</surname>
<given-names>A.</given-names>
</name>
<name><surname>Eriksson</surname>
<given-names>N.</given-names>
</name>
<name><surname>Beerenwinkel</surname>
<given-names>N.</given-names>
</name>
</person-group>
<article-title>ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data</article-title>
<source>BMC Bioinf</source>
<volume>12</volume>
<issue>1</issue>
<year>2011</year>
<fpage>119</fpage>
</element-citation>
</ref>
<ref id="bb0040"><label>8</label>
<element-citation publication-type="journal" id="rf0040"><person-group person-group-type="author"><name><surname>Mangul</surname>
<given-names>S.</given-names>
</name>
<name><surname>Wu</surname>
<given-names>N.C.</given-names>
</name>
<name><surname>Mancuso</surname>
<given-names>N.</given-names>
</name>
<name><surname>Zelikovsky</surname>
<given-names>A.</given-names>
</name>
<name><surname>Sun</surname>
<given-names>R.</given-names>
</name>
<name><surname>Eskin</surname>
<given-names>E.</given-names>
</name>
</person-group>
<article-title>Accurate viral population assembly from ultra-deep sequencing data</article-title>
<source>Bioinformatics</source>
<volume>30</volume>
<issue>12</issue>
<year>2014</year>
<fpage>i329</fpage>
<lpage>i337</lpage>
<pub-id pub-id-type="pmid">24932001</pub-id>
</element-citation>
</ref>
<ref id="bb0045"><label>9</label>
<element-citation publication-type="journal" id="rf0045"><person-group person-group-type="author"><name><surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name><surname>Charlebois</surname>
<given-names>P.</given-names>
</name>
<name><surname>Macalalad</surname>
<given-names>A.</given-names>
</name>
<name><surname>Henn</surname>
<given-names>M.</given-names>
</name>
<name><surname>Zody</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>V-phaser 2: variant inference for viral populations</article-title>
<source>BMC Genomics</source>
<volume>14</volume>
<issue>1</issue>
<year>2013</year>
<fpage>674</fpage>
<pub-id pub-id-type="pmid">24088188</pub-id>
</element-citation>
</ref>
<ref id="bb0050"><label>10</label>
<element-citation publication-type="journal" id="rf0050"><person-group person-group-type="author"><name><surname>Wilm</surname>
<given-names>A.</given-names>
</name>
<name><surname>Aw</surname>
<given-names>P.P.K.</given-names>
</name>
<name><surname>Bertrand</surname>
<given-names>D.</given-names>
</name>
<name><surname>Yeo</surname>
<given-names>G.H.T.</given-names>
</name>
<name><surname>Ong</surname>
<given-names>S.H.</given-names>
</name>
<name><surname>Wong</surname>
<given-names>C.H.</given-names>
</name>
</person-group>
<article-title>LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
</element-citation>
</ref>
<ref id="bb0055"><label>11</label>
<element-citation publication-type="journal" id="rf0055"><person-group person-group-type="author"><name><surname>Töpfer</surname>
<given-names>A.</given-names>
</name>
<name><surname>Marschall</surname>
<given-names>T.</given-names>
</name>
<name><surname>Bull</surname>
<given-names>R.A.</given-names>
</name>
<name><surname>Luciani</surname>
<given-names>F.</given-names>
</name>
<name><surname>Schönhuth</surname>
<given-names>A.</given-names>
</name>
<name><surname>Beerenwinkel</surname>
<given-names>N.</given-names>
</name>
</person-group>
<article-title>Viral quasispecies assembly via maximal clique enumeration</article-title>
<source>PLoS Comput Biol</source>
<volume>10</volume>
<issue>3</issue>
<year>2014</year>
</element-citation>
</ref>
<ref id="bb0060"><label>12</label>
<element-citation publication-type="journal" id="rf0060"><person-group person-group-type="author"><name><surname>Kelley</surname>
<given-names>D.R.</given-names>
</name>
<name><surname>Schatz</surname>
<given-names>M.C.</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol</source>
<volume>11</volume>
<issue>11</issue>
<year>2010</year>
<fpage>R116</fpage>
<pub-id pub-id-type="pmid">21114842</pub-id>
</element-citation>
</ref>
<ref id="bb0065"><label>13</label>
<element-citation publication-type="journal" id="rf0065"><person-group person-group-type="author"><name><surname>Heo</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Wu</surname>
<given-names>X.-L.</given-names>
</name>
<name><surname>Chen</surname>
<given-names>D.</given-names>
</name>
<name><surname>Ma</surname>
<given-names>J.</given-names>
</name>
<name><surname>Hwu</surname>
<given-names>W.-M.</given-names>
</name>
</person-group>
<article-title>BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads</article-title>
<source>Bioinformatics</source>
<year>2014</year>
</element-citation>
</ref>
<ref id="bb0070"><label>14</label>
<element-citation publication-type="journal" id="rf0070"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group>
<article-title>BFC: correcting Illumina sequencing errors</article-title>
<source>Bioinformatics</source>
<year>2015</year>
</element-citation>
</ref>
<ref id="bb0075"><label>15</label>
<element-citation publication-type="journal" id="rf0075"><person-group person-group-type="author"><name><surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Schröder</surname>
<given-names>J.</given-names>
</name>
<name><surname>Schmidt</surname>
<given-names>B.</given-names>
</name>
</person-group>
<article-title>Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data</article-title>
<source>Bioinformatics</source>
<volume>29</volume>
<issue>3</issue>
<year>2013</year>
<fpage>308</fpage>
<lpage>315</lpage>
<pub-id pub-id-type="pmid">23202746</pub-id>
</element-citation>
</ref>
<ref id="bb0080"><label>16</label>
<element-citation publication-type="journal" id="rf0080"><person-group person-group-type="author"><name><surname>Medvedev</surname>
<given-names>P.</given-names>
</name>
<name><surname>Scott</surname>
<given-names>E.</given-names>
</name>
<name><surname>Kakaradov</surname>
<given-names>B.</given-names>
</name>
<name><surname>Pevzner</surname>
<given-names>P.A.</given-names>
</name>
</person-group>
<article-title>Error correction of high-throughput sequencing datasets with non-uniform coverage</article-title>
<source>Bioinformatics [ISMB/ECCB]</source>
<volume>27</volume>
<issue>13</issue>
<year>2011</year>
<fpage>137</fpage>
<lpage>141</lpage>
</element-citation>
</ref>
<ref id="bb0085"><label>17</label>
<element-citation publication-type="journal" id="rf0085"><person-group person-group-type="author"><name><surname>Skums</surname>
<given-names>P.</given-names>
</name>
<name><surname>Dimitrova</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Campo</surname>
<given-names>D.S.</given-names>
</name>
<name><surname>Vaughan</surname>
<given-names>G.</given-names>
</name>
<name><surname>Rossi</surname>
<given-names>L.</given-names>
</name>
<name><surname>Forbi</surname>
<given-names>J.C.</given-names>
</name>
</person-group>
<article-title>Efficient error correction for next-generation sequencing of viral amplicons</article-title>
<source>BMC Bioinf</source>
<volume>13</volume>
<issue>Suppl 10</issue>
<year>2012</year>
<fpage>S6</fpage>
</element-citation>
</ref>
<ref id="bb0090"><label>18</label>
<element-citation publication-type="journal" id="rf0090"><person-group person-group-type="author"><name><surname>Rizk</surname>
<given-names>G.</given-names>
</name>
<name><surname>Lavenier</surname>
<given-names>D.</given-names>
</name>
<name><surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
</person-group>
<article-title>Dsk: k-mer counting with very low memory usage</article-title>
<source>Bioinformatics</source>
<volume>29</volume>
<issue>5</issue>
<year>2013</year>
<fpage>652</fpage>
<lpage>653</lpage>
<pub-id pub-id-type="pmid">23325618</pub-id>
</element-citation>
</ref>
<ref id="bb0095"><label>19</label>
<element-citation publication-type="journal" id="rf0095"><person-group person-group-type="author"><name><surname>Deorowicz</surname>
<given-names>S.</given-names>
</name>
<name><surname>Kokot</surname>
<given-names>M.</given-names>
</name>
<name><surname>Grabowski</surname>
<given-names>S.</given-names>
</name>
<name><surname>Debudaj-Grabysz</surname>
<given-names>A.</given-names>
</name>
</person-group>
<article-title>Kmc 2: fast and resource-frugal k-mer counting</article-title>
<source>Bioinformatics</source>
<volume>31</volume>
<issue>10</issue>
<year>2015</year>
<fpage>1569</fpage>
<lpage>1576</lpage>
<pub-id pub-id-type="pmid">25609798</pub-id>
</element-citation>
</ref>
<ref id="bb0100"><label>20</label>
<element-citation publication-type="journal" id="rf0100"><person-group person-group-type="author"><name><surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
<name><surname>Medvedev</surname>
<given-names>P.</given-names>
</name>
</person-group>
<article-title>Informed and automated k-mer size selection for genome assembly</article-title>
<source>Bioinformatics</source>
<year>2013</year>
</element-citation>
</ref>
<ref id="bb0105"><label>21</label>
<element-citation publication-type="journal" id="rf0105"><person-group person-group-type="author"><name><surname>Feng</surname>
<given-names>S.</given-names>
</name>
<name><surname>Lo</surname>
<given-names>C.-C.</given-names>
</name>
<name><surname>Li</surname>
<given-names>P.-E.</given-names>
</name>
<name><surname>Chain</surname>
<given-names>P.S.</given-names>
</name>
</person-group>
<article-title>Adept, a dynamic next generation sequencing data error-detection program with trimming</article-title>
<source>BMC Bioinf</source>
<volume>17</volume>
<issue>1</issue>
<year>2016</year>
<fpage>109</fpage>
</element-citation>
</ref>
<ref id="bb0110"><label>22</label>
<element-citation publication-type="journal" id="rf0110"><person-group person-group-type="author"><name><surname>Ding</surname>
<given-names>J.</given-names>
</name>
<name><surname>Bashashati</surname>
<given-names>A.</given-names>
</name>
<name><surname>Roth</surname>
<given-names>A.</given-names>
</name>
<name><surname>Oloumi</surname>
<given-names>A.</given-names>
</name>
<name><surname>Tse</surname>
<given-names>K.</given-names>
</name>
<name><surname>Zeng</surname>
<given-names>T.</given-names>
</name>
</person-group>
<article-title>Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data</article-title>
<source>Bioinformatics</source>
<volume>28</volume>
<issue>2</issue>
<year>2011</year>
<fpage>167</fpage>
<lpage>175</lpage>
<pub-id pub-id-type="pmid">22084253</pub-id>
</element-citation>
</ref>
<ref id="bb0115"><label>23</label>
<element-citation publication-type="journal" id="rf0115"><person-group person-group-type="author"><name><surname>Poplin</surname>
<given-names>R.</given-names>
</name>
<name><surname>Newburger</surname>
<given-names>D.</given-names>
</name>
<name><surname>Dijamco</surname>
<given-names>J.</given-names>
</name>
<name><surname>Nguyen</surname>
<given-names>N.</given-names>
</name>
<name><surname>Loy</surname>
<given-names>D.</given-names>
</name>
<name><surname>Gross</surname>
<given-names>S.S.</given-names>
</name>
</person-group>
<article-title>Creating a universal SNP and small indel variant caller with deep neural networks</article-title>
<source>BioRxiv</source>
<year>2016</year>
<fpage>092890</fpage>
</element-citation>
</ref>
<ref id="bb0120"><label>24</label>
<element-citation publication-type="journal" id="rf0120"><person-group person-group-type="author"><name><surname>Ferreira</surname>
<given-names>P.</given-names>
</name>
</person-group>
<article-title>Mathematics for multimedia signal processing II: discrete finite frames and signal reconstruction</article-title>
<source>Nato ASI Series of Computer and Systems Sciences</source>
<volume>174</volume>
<year>1999</year>
<fpage>35</fpage>
<lpage>54</lpage>
</element-citation>
</ref>
<ref id="bb0125"><label>25</label>
<element-citation publication-type="journal" id="rf0125"><person-group person-group-type="author"><name><surname>Duffin</surname>
<given-names>R.J.</given-names>
</name>
<name><surname>Schaeffer</surname>
<given-names>A.C.</given-names>
</name>
</person-group>
<article-title>A class of nonharmonic Fourier series</article-title>
<source>Trans Am Math Soc</source>
<year>1952</year>
<fpage>341</fpage>
<lpage>366</lpage>
</element-citation>
</ref>
<ref id="bb0130"><label>26</label>
<element-citation publication-type="journal" id="rf0130"><person-group person-group-type="author"><name><surname>Daubechies</surname>
<given-names>I.</given-names>
</name>
<name><surname>Grossmann</surname>
<given-names>A.</given-names>
</name>
<name><surname>Meyer</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<article-title>Painless nonorthogonal expansions</article-title>
<source>J Math Phys</source>
<volume>27</volume>
<issue>5</issue>
<year>1986</year>
<fpage>1271</fpage>
<lpage>1283</lpage>
</element-citation>
</ref>
<ref id="bb0135"><label>27</label>
<element-citation publication-type="journal" id="rf0135"><person-group person-group-type="author"><name><surname>Daubechies</surname>
<given-names>I.</given-names>
</name>
<name><surname>Han</surname>
<given-names>B.</given-names>
</name>
<name><surname>Ron</surname>
<given-names>A.</given-names>
</name>
<name><surname>Shen</surname>
<given-names>Z.</given-names>
</name>
</person-group>
<article-title>Framelets: MRA-based constructions of wavelet frames</article-title>
<source>Appl Comput Harmon Anal</source>
<volume>14</volume>
<issue>1</issue>
<year>2003</year>
<fpage>1</fpage>
<lpage>46</lpage>
</element-citation>
</ref>
<ref id="bb0140"><label>28</label>
<element-citation publication-type="journal" id="rf0140"><person-group person-group-type="author"><name><surname>Unser</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Texture classification and segmentation using wavelet frames</article-title>
<source>IEEE Trans Image Process</source>
<volume>4</volume>
<issue>11</issue>
<year>1995</year>
<fpage>1549</fpage>
<lpage>1560</lpage>
<pub-id pub-id-type="pmid">18291987</pub-id>
</element-citation>
</ref>
<ref id="bb0145"><label>29</label>
<element-citation publication-type="journal" id="rf0145"><person-group person-group-type="author"><name><surname>Ron</surname>
<given-names>A.</given-names>
</name>
<name><surname>Shen</surname>
<given-names>Z.</given-names>
</name>
</person-group>
<article-title>Frames and stable bases for shift-invariant subspaces of <italic>L</italic>
<sup>2</sup>
(<italic>R</italic>
<sup><italic>d</italic>
</sup>
)</article-title>
<source>Can J Math</source>
<volume>47</volume>
<issue>5</issue>
<year>1995</year>
<fpage>1051</fpage>
<lpage>1094</lpage>
</element-citation>
</ref>
<ref id="bb0150"><label>30</label>
<element-citation publication-type="journal" id="rf0150"><person-group person-group-type="author"><name><surname>Nikolenko</surname>
<given-names>S.I.</given-names>
</name>
<name><surname>Korobeynikov</surname>
<given-names>A.I.</given-names>
</name>
<name><surname>Alekseyev</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
<article-title>BayesHammer: Bayesian clustering for error correction in single-cell sequencing</article-title>
<source>BMC Genomics</source>
<volume>14</volume>
<issue>Suppl 1</issue>
<year>2013</year>
<fpage>S7</fpage>
</element-citation>
</ref>
<ref id="bb0155"><label>31</label>
<element-citation publication-type="journal" id="rf0155"><person-group person-group-type="author"><name><surname>Le</surname>
<given-names>H.-S.</given-names>
</name>
<name><surname>Schulz</surname>
<given-names>M.H.</given-names>
</name>
<name><surname>McCauley</surname>
<given-names>B.M.</given-names>
</name>
<name><surname>Hinman</surname>
<given-names>V.F.</given-names>
</name>
<name><surname>Bar-Joseph</surname>
<given-names>Z.</given-names>
</name>
</person-group>
<article-title>Probabilistic error correction for RNA sequencing</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
</element-citation>
</ref>
<ref id="bb0160"><label>32</label>
<element-citation publication-type="book" id="rf0160"><person-group person-group-type="author"><name><surname>Kaiser</surname>
<given-names>G.</given-names>
</name>
</person-group>
<chapter-title>A friendly guide to wavelets</chapter-title>
<year>2010</year>
<publisher-name>Springer Science &amp; Business Media</publisher-name>
</element-citation>
</ref>
<ref id="bb0165"><label>33</label>
<element-citation publication-type="journal" id="rf0165"><person-group person-group-type="author"><name><surname>Hussein</surname>
<given-names>N.</given-names>
</name>
<name><surname>Zekri</surname>
<given-names>A-RN</given-names>
</name>
<name><surname>Abouelhoda</surname>
<given-names>M.</given-names>
</name>
<name><surname>El-din</surname>
<given-names>H.M.A.</given-names>
</name>
<name><surname>Ghamry</surname>
<given-names>A.A.</given-names>
</name>
<name><surname>Amer</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
<article-title>New insight into HCV E1/E2 region of genotype 4a</article-title>
<source>Virol J</source>
<volume>11</volume>
<issue>1</issue>
<year>2014</year>
<fpage>2512</fpage>
</element-citation>
</ref>
<ref id="bb0170"><label>34</label>
<element-citation publication-type="journal" id="rf0170"><person-group person-group-type="author"><name><surname>Angly</surname>
<given-names>F.E.</given-names>
</name>
<name><surname>Willner</surname>
<given-names>D.</given-names>
</name>
<name><surname>Rohwer</surname>
<given-names>F.</given-names>
</name>
<name><surname>Hugenholtz</surname>
<given-names>P.</given-names>
</name>
<name><surname>Tyson</surname>
<given-names>G.W.</given-names>
</name>
</person-group>
<article-title>Grinder: a versatile amplicon and shotgun sequence simulator</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/content/early/2012/03/19/nar.gks251.abstract" id="ir0060">http://nar.oxfordjournals.org/content/early/2012/03/19/nar.gks251.abstract</ext-link>
</element-citation>
</ref>
<ref id="bb0175"><label>35</label>
<element-citation publication-type="journal" id="rf0175"><person-group person-group-type="author"><name><surname>Giallonardo</surname>
<given-names>F.D.</given-names>
</name>
<name><surname>Töpfer</surname>
<given-names>A.</given-names>
</name>
<name><surname>Rey</surname>
<given-names>M.</given-names>
</name>
<name><surname>Prabhakaran</surname>
<given-names>S.</given-names>
</name>
<name><surname>Duport</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Leemann</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations</article-title>
<source>Nucleic Acids Res</source>
<volume>42</volume>
<issue>14</issue>
<year>2014</year>
<fpage>e115</fpage>
<comment>arXiv: <ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/content/42/14/e115.abstract" id="ir0070">http://nar.oxfordjournals.org/content/42/14/e115.abstract</ext-link>
</comment>
<pub-id pub-id-type="pmid">24972832</pub-id>
</element-citation>
</ref>
<ref id="bb0180"><label>36</label>
<element-citation publication-type="journal" id="rf0180"><person-group person-group-type="author"><name><surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name><surname>Charlebois</surname>
<given-names>P.</given-names>
</name>
<name><surname>Gnerre</surname>
<given-names>S.</given-names>
</name>
<name><surname>Coole</surname>
<given-names>M.G.</given-names>
</name>
<name><surname>Lennon</surname>
<given-names>N.J.</given-names>
</name>
<name><surname>Levin</surname>
<given-names>J.Z.</given-names>
</name>
</person-group>
<article-title>De novo assembly of highly diverse viral populations</article-title>
<source>BMC Genomics</source>
<volume>13</volume>
<issue>1</issue>
<year>2012</year>
<fpage>475</fpage>
<pub-id pub-id-type="pmid">22974120</pub-id>
</element-citation>
</ref>
<ref id="bb0185"><label>37</label>
<element-citation publication-type="journal" id="rf0185"><person-group person-group-type="author"><name><surname>Bankevich</surname>
<given-names>A.</given-names>
</name>
<name><surname>Nurk</surname>
<given-names>S.</given-names>
</name>
<name><surname>Antipov</surname>
<given-names>D.</given-names>
</name>
<name><surname>Gurevich</surname>
<given-names>A.A.</given-names>
</name>
<name><surname>Dvorkin</surname>
<given-names>M.</given-names>
</name>
<name><surname>Kulikov</surname>
<given-names>A.S.</given-names>
</name>
</person-group>
<article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
<source>J Comput Biol</source>
<volume>19</volume>
<issue>5</issue>
<year>2012</year>
<fpage>455</fpage>
<lpage>477</lpage>
<pub-id pub-id-type="pmid">22506599</pub-id>
</element-citation>
</ref>
<ref id="bb0190"><label>38</label>
<element-citation publication-type="journal" id="rf0190"><person-group person-group-type="author"><name><surname>Iqbal</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Caccamo</surname>
<given-names>M.</given-names>
</name>
<name><surname>Turner</surname>
<given-names>I.</given-names>
</name>
<name><surname>Flicek</surname>
<given-names>P.</given-names>
</name>
<name><surname>McVean</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>De novo assembly and genotyping of variants using colored de Bruijn graphs</article-title>
<source>Nat Genet</source>
<year>2012</year>
</element-citation>
</ref>
</ref-list>
<sec id="s0075" sec-type="supplementary-material"><label>Appendix A</label>
<title>Supplementary data</title>
<p><supplementary-material content-type="local-data" id="ec0005"><caption><p>Additional Methods.</p>
</caption>
<media xlink:href="mmc1.pdf"></media>
</supplementary-material>
</p>
</sec>
<ack id="ac0005"><title>Acknowledgments</title>
<p>This research was funded by the National Science Foundation (<funding-source id="gts0005">NSF</funding-source>) Award #1421908.</p>
</ack>
<fn-group><fn id="s0080" fn-type="supplementary-material"><label>Appendix A</label>
<p>A supplementary document describing the frame-based representation for reads and the rationale for using maximal projections is attached. Supplementary data to this article can be found online at <ext-link ext-link-type="doi" xlink:href="10.1016/j.csbj.2017.07.001" id="ir0030">http://dx.doi.org/10.1016/j.csbj.2017.07.001</ext-link>.</p>
</fn>
</fn-group>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B96  | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B96  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021
