MERS exploration server

A random forest classifier for detecting rare variants in NGS data from viral populations.

Internal identifier: 000B94 (PubMed/Corpus); previous: 000B93; next: 000B95

Authors: Raunaq Malhotra; Manjari Jha; Mary Poss; Raj Acharya

Source: Computational and structural biotechnology journal [ISSN 2001-0370]; 2017

RBID : pubmed:28819548

Abstract

We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes has 4 to 500 times fewer false-positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false-positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.

DOI: 10.1016/j.csbj.2017.07.001
PubMed: 28819548
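
The abstract describes features built from counts of k-mers of several lengths. The sketch below is only an illustration of that general idea, not the MultiRes implementation: the function names, the toy reads, the chosen sizes (3 and 5), and the use of the minimum sub-k-mer count as a feature are all assumptions made for this example. It uses the Python standard library only and stops before any classifier training.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multi_length_features(kmer, counts_by_k):
    """Hypothetical feature vector for one k-mer.

    For each size k in counts_by_k (ascending), take the smallest count
    among the k-length substrings of `kmer`. A k-mer that contains a
    rarely seen sub-k-mer therefore gets a low value at that size.
    """
    features = []
    for k in sorted(counts_by_k):
        if k > len(kmer):
            continue  # this size does not fit inside the k-mer
        subs = [kmer[i:i + k] for i in range(len(kmer) - k + 1)]
        features.append(min(counts_by_k[k][s] for s in subs))
    return features


# Toy reads (assumed data): the second read differs in its last base.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts_by_k = {k: count_kmers(reads, k) for k in (3, 5)}

for kmer in ("TACGT", "TACGA"):
    print(kmer, multi_length_features(kmer, counts_by_k))
```

In a pipeline of the kind the abstract outlines, vectors like these would be passed, together with labels, to a random forest; that step is omitted here because the abstract does not specify the exact features or training setup.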

Links to Exploration step

pubmed:28819548

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A random forest classifier for detecting rare variants in NGS data from viral populations.</title>
<author>
<name sortKey="Malhotra, Raunaq" sort="Malhotra, Raunaq" uniqKey="Malhotra R" first="Raunaq" last="Malhotra">Raunaq Malhotra</name>
<affiliation>
<nlm:affiliation>The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Jha, Manjari" sort="Jha, Manjari" uniqKey="Jha M" first="Manjari" last="Jha">Manjari Jha</name>
<affiliation>
<nlm:affiliation>The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Poss, Mary" sort="Poss, Mary" uniqKey="Poss M" first="Mary" last="Poss">Mary Poss</name>
<affiliation>
<nlm:affiliation>Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Acharya, Raj" sort="Acharya, Raj" uniqKey="Acharya R" first="Raj" last="Acharya">Raj Acharya</name>
<affiliation>
<nlm:affiliation>School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.</nlm:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2017">2017</date>
<idno type="RBID">pubmed:28819548</idno>
<idno type="pmid">28819548</idno>
<idno type="doi">10.1016/j.csbj.2017.07.001</idno>
<idno type="wicri:Area/PubMed/Corpus">000B94</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000B94</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A random forest classifier for detecting rare variants in NGS data from viral populations.</title>
<author>
<name sortKey="Malhotra, Raunaq" sort="Malhotra, Raunaq" uniqKey="Malhotra R" first="Raunaq" last="Malhotra">Raunaq Malhotra</name>
<affiliation>
<nlm:affiliation>The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Jha, Manjari" sort="Jha, Manjari" uniqKey="Jha M" first="Manjari" last="Jha">Manjari Jha</name>
<affiliation>
<nlm:affiliation>The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Poss, Mary" sort="Poss, Mary" uniqKey="Poss M" first="Mary" last="Poss">Mary Poss</name>
<affiliation>
<nlm:affiliation>Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Acharya, Raj" sort="Acharya, Raj" uniqKey="Acharya R" first="Raj" last="Acharya">Raj Acharya</name>
<affiliation>
<nlm:affiliation>School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.</nlm:affiliation>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Computational and structural biotechnology journal</title>
<idno type="ISSN">2001-0370</idno>
<imprint>
<date when="2017" type="published">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We propose a random forest classifier for detecting rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method utilizes counts of varying length of
<i>k</i>
-mers from the reads of a viral population to train a Random forest classifier, called MultiRes, that classifies
<i>k</i>
-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of
<i>k</i>
-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that
<i>k</i>
-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes has 4 to 500 times less false positives
<i>k</i>
-mer predictions compared to other methods, essential for accurate estimation of viral population diversity and their
<i>de-novo</i>
assembly. It has high recall of the true
<i>k</i>
-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting higher number of rare variants compared to other variant calling methods for viral populations. The software is available freely from the GitHub link https://github.com/raunaq-m/MultiRes.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
<PMID Version="1">28819548</PMID>
<DateRevised>
<Year>2019</Year>
<Month>11</Month>
<Day>20</Day>
</DateRevised>
<Article PubModel="Electronic-eCollection">
<Journal>
<ISSN IssnType="Print">2001-0370</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>15</Volume>
<PubDate>
<Year>2017</Year>
</PubDate>
</JournalIssue>
<Title>Computational and structural biotechnology journal</Title>
<ISOAbbreviation>Comput Struct Biotechnol J</ISOAbbreviation>
</Journal>
<ArticleTitle>A random forest classifier for detecting rare variants in NGS data from viral populations.</ArticleTitle>
<Pagination>
<MedlinePgn>388-395</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1016/j.csbj.2017.07.001</ELocationID>
<Abstract>
<AbstractText>We propose a random forest classifier for detecting rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method utilizes counts of varying length of
<i>k</i>
-mers from the reads of a viral population to train a Random forest classifier, called MultiRes, that classifies
<i>k</i>
-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of
<i>k</i>
-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that
<i>k</i>
-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes has 4 to 500 times less false positives
<i>k</i>
-mer predictions compared to other methods, essential for accurate estimation of viral population diversity and their
<i>de-novo</i>
assembly. It has high recall of the true
<i>k</i>
-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting higher number of rare variants compared to other variant calling methods for viral populations. The software is available freely from the GitHub link https://github.com/raunaq-m/MultiRes.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Malhotra</LastName>
<ForeName>Raunaq</ForeName>
<Initials>R</Initials>
<AffiliationInfo>
<Affiliation>The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Jha</LastName>
<ForeName>Manjari</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Poss</LastName>
<ForeName>Mary</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Acharya</LastName>
<ForeName>Raj</ForeName>
<Initials>R</Initials>
<AffiliationInfo>
<Affiliation>School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2017</Year>
<Month>07</Month>
<Day>19</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>Netherlands</Country>
<MedlineTA>Comput Struct Biotechnol J</MedlineTA>
<NlmUniqueID>101585369</NlmUniqueID>
<ISSNLinking>2001-0370</ISSNLinking>
</MedlineJournalInfo>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">Multi-resolution frames</Keyword>
<Keyword MajorTopicYN="N">Next-generation sequencing</Keyword>
<Keyword MajorTopicYN="N">Random forest classifier</Keyword>
<Keyword MajorTopicYN="N">Reference free methods</Keyword>
<Keyword MajorTopicYN="N">Sequencing error detection</Keyword>
<Keyword MajorTopicYN="N">Viral populations</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2017</Year>
<Month>03</Month>
<Day>14</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="revised">
<Year>2017</Year>
<Month>07</Month>
<Day>01</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2017</Year>
<Month>07</Month>
<Day>03</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2017</Year>
<Month>8</Month>
<Day>19</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2017</Year>
<Month>8</Month>
<Day>19</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2017</Year>
<Month>8</Month>
<Day>19</Day>
<Hour>6</Hour>
<Minute>1</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">28819548</ArticleId>
<ArticleId IdType="doi">10.1016/j.csbj.2017.07.001</ArticleId>
<ArticleId IdType="pii">S2001-0370(17)30039-9</ArticleId>
<ArticleId IdType="pmc">PMC5548337</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Nat Rev Genet. 2007 May;8(5):341-52</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17440531</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>IEEE Trans Image Process. 1995;4(11):1549-60</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18291987</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2010;11(11):R116</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21114842</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2011 Apr 26;12:119</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21521499</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Jul 1;27(13):i137-41</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21685062</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2012 Jan 15;28(2):167-75</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22084253</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2011 Nov 21;12:451</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22099972</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Genet. 2012 Jan 08;44(2):226-32</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22231483</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2012 Jul;40(12):e94</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22434876</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Comput Biol. 2012 May;19(5):455-77</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22506599</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2012 Jun 25;13 Suppl 10:S6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22759430</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Front Microbiol. 2012 Sep 11;3:329</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22973268</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2012 Sep 13;13:475</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22974120</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2012 Dec;40(22):11189-201</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23066108</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2013 Feb 1;29(3):308-15</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23202746</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2013 Mar 1;29(5):652-3</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23325618</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2013;14 Suppl 1:S7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23368723</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Comput Biol. 2013 Feb;20(2):113-23</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23383997</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jan 1;30(1):31-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23732276</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2013 Oct 03;14:674</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24088188</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Microb Inform Exp. 2014 Jan 15;4(1):1</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24428920</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 May 15;30(10):1354-62</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24451628</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS Comput Biol. 2014 Mar 27;10(3):e1003515</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24675810</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jun 15;30(12):i329-37</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24932001</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2014 Aug;42(14):e115</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24972832</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Virol J. 2014 Dec 30;11:231</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25547228</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2015 Mar 31;43(6):e37</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25586220</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2015 May 15;31(10):1569-76</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25609798</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2015 Sep 1;31(17):2885-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25953801</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2016 Feb 29;17:109</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">26928302</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2018 Nov;36(10):983-987</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">30247488</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>
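
As a rough illustration of how the fields in this record can be read programmatically, the sketch below parses a trimmed copy of the `<pubmed>` subtree with Python's standard `xml.etree.ElementTree`. The fragment and the `summarize` helper are assumptions made for this example. The full record as displayed here uses the `nlm:` and `wicri:` prefixes without namespace declarations, so a strict XML parser would reject it unless those prefixes are declared; the example therefore stays within the prefix-free `<pubmed>` part.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <pubmed> subtree of the record (assumed excerpt).
PUBMED_FRAGMENT = """
<pubmed>
<MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
<PMID Version="1">28819548</PMID>
<Article PubModel="Electronic-eCollection">
<ArticleTitle>A random forest classifier for detecting rare variants in NGS data from viral populations.</ArticleTitle>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>Malhotra</LastName><ForeName>Raunaq</ForeName></Author>
<Author ValidYN="Y"><LastName>Jha</LastName><ForeName>Manjari</ForeName></Author>
<Author ValidYN="Y"><LastName>Poss</LastName><ForeName>Mary</ForeName></Author>
<Author ValidYN="Y"><LastName>Acharya</LastName><ForeName>Raj</ForeName></Author>
</AuthorList>
</Article>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="pubmed">28819548</ArticleId>
<ArticleId IdType="doi">10.1016/j.csbj.2017.07.001</ArticleId>
<ArticleId IdType="pmc">PMC5548337</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
"""


def summarize(xml_text):
    """Return PMID, title, author names, and article identifiers."""
    root = ET.fromstring(xml_text.strip())
    citation = root.find("MedlineCitation")
    # Only the ArticleIdList directly under PubmedData describes this
    # article; the lists nested under ReferenceList describe cited works.
    ids = {
        item.get("IdType"): item.text
        for item in root.findall("PubmedData/ArticleIdList/ArticleId")
    }
    authors = [
        f"{a.findtext('ForeName')} {a.findtext('LastName')}"
        for a in citation.iter("Author")
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "authors": authors,
        "ids": ids,
    }


info = summarize(PUBMED_FRAGMENT)
print(info["pmid"], info["ids"]["doi"], "; ".join(info["authors"]))
```

The explicit path `PubmedData/ArticleIdList/ArticleId` is a deliberate choice: a recursive search for `ArticleId` would also pick up the identifiers of every cited reference.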

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B94 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 000B94 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:28819548
   |texte=   A random forest classifier for detecting rare variants in NGS data from viral populations.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:28819548" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021