Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.

Identifieur interne : 001048 ( PubMed/Corpus ); précédent : 001047; suivant : 001049

Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.

Auteurs : Troy Hernandez ; Jie Yang

Source :

RBID : pubmed:27409298

English descriptors

Abstract

The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. That is followed by checking the candidate families with the pairwise sequence alignment tool for similar species. The submitter's judgment is then used to determine the most likely species classification. The aim of this article is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, that maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We analyze five different characterizations of genome similarity using k-nearest neighbor classification and evaluate these on two collections of viruses totaling over 10,000 viruses. We show that our proposed method performs better than, or as well as, other methods at every level of the phylogenetic hierarchy. The data and R code is available upon request.

DOI: 10.1089/cmb.2013.0132
PubMed: 27409298

Links to Exploration step

pubmed:27409298

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.</title>
<author>
<name sortKey="Hernandez, Troy" sort="Hernandez, Troy" uniqKey="Hernandez T" first="Troy" last="Hernandez">Troy Hernandez</name>
<affiliation>
<nlm:affiliation>1 Mathematical Sciences Center, Tsinghua University , Beijing, China .</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Yang, Jie" sort="Yang, Jie" uniqKey="Yang J" first="Jie" last="Yang">Jie Yang</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2016">2016</date>
<idno type="RBID">pubmed:27409298</idno>
<idno type="pmid">27409298</idno>
<idno type="doi">10.1089/cmb.2013.0132</idno>
<idno type="wicri:Area/PubMed/Corpus">001048</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001048</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.</title>
<author>
<name sortKey="Hernandez, Troy" sort="Hernandez, Troy" uniqKey="Hernandez T" first="Troy" last="Hernandez">Troy Hernandez</name>
<affiliation>
<nlm:affiliation>1 Mathematical Sciences Center, Tsinghua University , Beijing, China .</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Yang, Jie" sort="Yang, Jie" uniqKey="Yang J" first="Jie" last="Yang">Jie Yang</name>
</author>
</analytic>
<series>
<title level="j">Journal of computational biology : a journal of computational molecular cell biology</title>
<idno type="eISSN">1557-8666</idno>
<imprint>
<date when="2016" type="published">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology (methods)</term>
<term>Genome, Viral</term>
<term>Phylogeny</term>
<term>Sequence Alignment (methods)</term>
<term>Software</term>
<term>Viruses (classification)</term>
<term>Viruses (genetics)</term>
</keywords>
<keywords scheme="MESH" qualifier="classification" xml:lang="en">
<term>Viruses</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Viruses</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Alignment</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Genome, Viral</term>
<term>Phylogeny</term>
<term>Software</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. That is followed by checking the candidate families with the pairwise sequence alignment tool for similar species. The submitter's judgment is then used to determine the most likely species classification. The aim of this article is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, that maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We analyze five different characterizations of genome similarity using k-nearest neighbor classification and evaluate these on two collections of viruses totaling over 10,000 viruses. We show that our proposed method performs better than, or as well as, other methods at every level of the phylogenetic hierarchy. The data and R code is available upon request. </div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">27409298</PMID>
<DateCompleted>
<Year>2017</Year>
<Month>10</Month>
<Day>31</Day>
</DateCompleted>
<DateRevised>
<Year>2017</Year>
<Month>10</Month>
<Day>31</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1557-8666</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>23</Volume>
<Issue>10</Issue>
<PubDate>
<Year>2016</Year>
<Month>Oct</Month>
</PubDate>
</JournalIssue>
<Title>Journal of computational biology : a journal of computational molecular cell biology</Title>
<ISOAbbreviation>J. Comput. Biol.</ISOAbbreviation>
</Journal>
<ArticleTitle>Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.</ArticleTitle>
<Pagination>
<MedlinePgn>810-20</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1089/cmb.2013.0132</ELocationID>
<Abstract>
<AbstractText>The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. That is followed by checking the candidate families with the pairwise sequence alignment tool for similar species. The submitter's judgment is then used to determine the most likely species classification. The aim of this article is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, that maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We analyze five different characterizations of genome similarity using k-nearest neighbor classification and evaluate these on two collections of viruses totaling over 10,000 viruses. We show that our proposed method performs better than, or as well as, other methods at every level of the phylogenetic hierarchy. The data and R code is available upon request. </AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Hernandez</LastName>
<ForeName>Troy</ForeName>
<Initials>T</Initials>
<AffiliationInfo>
<Affiliation>1 Mathematical Sciences Center, Tsinghua University , Beijing, China .</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Yang</LastName>
<ForeName>Jie</ForeName>
<Initials>J</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2016</Year>
<Month>07</Month>
<Day>13</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>J Comput Biol</MedlineTA>
<NlmUniqueID>9433358</NlmUniqueID>
<ISSNLinking>1066-5277</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016679" MajorTopicYN="Y">Genome, Viral</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D010802" MajorTopicYN="Y">Phylogeny</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016415" MajorTopicYN="N">Sequence Alignment</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014780" MajorTopicYN="N">Viruses</DescriptorName>
<QualifierName UI="Q000145" MajorTopicYN="Y">classification</QualifierName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">alignment-free</Keyword>
<Keyword MajorTopicYN="N">classification</Keyword>
<Keyword MajorTopicYN="N">machine learning</Keyword>
<Keyword MajorTopicYN="N">phylogenetics</Keyword>
<Keyword MajorTopicYN="N">virology</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2016</Year>
<Month>7</Month>
<Day>14</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2016</Year>
<Month>7</Month>
<Day>14</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2017</Year>
<Month>11</Month>
<Day>1</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">27409298</ArticleId>
<ArticleId IdType="doi">10.1089/cmb.2013.0132</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001048 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 001048 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:27409298
   |texte=   Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:27409298" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021