MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Nephele: genotyping via complete composition vectors and MapReduce.

Internal identifier: 001E51 (PubMed/Curation); previous: 001E50; next: 001E52

Authors: Marc E. Colosimo [United States]; Matthew W. Peterson; Scott Mardis; Lynette Hirschman

Source: Source code for biology and medicine, vol. 6, article 13 (2011)

RBID : pubmed:21851626

Abstract

Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.

DOI: 10.1186/1751-0473-6-13
PubMed: 21851626

Links to Exploration step

pubmed:21851626

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Nephele: genotyping via complete composition vectors and MapReduce.</title>
<author>
<name sortKey="Colosimo, Marc E" sort="Colosimo, Marc E" uniqKey="Colosimo M" first="Marc E" last="Colosimo">Marc E. Colosimo</name>
<affiliation wicri:level="1">
<nlm:affiliation>The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA. mcolosimo@mitre.org.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Peterson, Matthew W" sort="Peterson, Matthew W" uniqKey="Peterson M" first="Matthew W" last="Peterson">Matthew W. Peterson</name>
</author>
<author>
<name sortKey="Mardis, Scott" sort="Mardis, Scott" uniqKey="Mardis S" first="Scott" last="Mardis">Scott Mardis</name>
</author>
<author>
<name sortKey="Hirschman, Lynette" sort="Hirschman, Lynette" uniqKey="Hirschman L" first="Lynette" last="Hirschman">Lynette Hirschman</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2011">2011</date>
<idno type="RBID">pubmed:21851626</idno>
<idno type="pmid">21851626</idno>
<idno type="doi">10.1186/1751-0473-6-13</idno>
<idno type="wicri:Area/PubMed/Corpus">001E51</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001E51</idno>
<idno type="wicri:Area/PubMed/Curation">001E51</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001E51</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Nephele: genotyping via complete composition vectors and MapReduce.</title>
<author>
<name sortKey="Colosimo, Marc E" sort="Colosimo, Marc E" uniqKey="Colosimo M" first="Marc E" last="Colosimo">Marc E. Colosimo</name>
<affiliation wicri:level="1">
<nlm:affiliation>The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA. mcolosimo@mitre.org.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Peterson, Matthew W" sort="Peterson, Matthew W" uniqKey="Peterson M" first="Matthew W" last="Peterson">Matthew W. Peterson</name>
</author>
<author>
<name sortKey="Mardis, Scott" sort="Mardis, Scott" uniqKey="Mardis S" first="Scott" last="Mardis">Scott Mardis</name>
</author>
<author>
<name sortKey="Hirschman, Lynette" sort="Hirschman, Lynette" uniqKey="Hirschman L" first="Lynette" last="Hirschman">Lynette Hirschman</name>
</author>
</analytic>
<series>
<title level="j">Source code for biology and medicine</title>
<idno type="eISSN">1751-0473</idno>
<imprint>
<date when="2011" type="published">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
<PMID Version="1">21851626</PMID>
<DateCompleted>
<Year>2011</Year>
<Month>11</Month>
<Day>10</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>11</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Electronic">
<Journal>
<ISSN IssnType="Electronic">1751-0473</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>6</Volume>
<PubDate>
<Year>2011</Year>
<Month>Aug</Month>
<Day>18</Day>
</PubDate>
</JournalIssue>
<Title>Source code for biology and medicine</Title>
<ISOAbbreviation>Source Code Biol Med</ISOAbbreviation>
</Journal>
<ArticleTitle>Nephele: genotyping via complete composition vectors and MapReduce.</ArticleTitle>
<Pagination>
<MedlinePgn>13</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1186/1751-0473-6-13</ELocationID>
<Abstract>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Colosimo</LastName>
<ForeName>Marc E</ForeName>
<Initials>ME</Initials>
<AffiliationInfo>
<Affiliation>The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA. mcolosimo@mitre.org.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Peterson</LastName>
<ForeName>Matthew W</ForeName>
<Initials>MW</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Mardis</LastName>
<ForeName>Scott</ForeName>
<Initials>S</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Hirschman</LastName>
<ForeName>Lynette</ForeName>
<Initials>L</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2011</Year>
<Month>08</Month>
<Day>18</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>Source Code Biol Med</MedlineTA>
<NlmUniqueID>101276533</NlmUniqueID>
<ISSNLinking>1751-0473</ISSNLinking>
</MedlineJournalInfo>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2011</Year>
<Month>04</Month>
<Day>05</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2011</Year>
<Month>08</Month>
<Day>18</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2011</Year>
<Month>8</Month>
<Day>20</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2011</Year>
<Month>8</Month>
<Day>20</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2011</Year>
<Month>8</Month>
<Day>20</Day>
<Hour>6</Hour>
<Minute>1</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">21851626</ArticleId>
<ArticleId IdType="pii">1751-0473-6-13</ArticleId>
<ArticleId IdType="doi">10.1186/1751-0473-6-13</ArticleId>
<ArticleId IdType="pmc">PMC3182884</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Nucleic Acids Res. 2006 Mar 23;34(6):1692-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16556910</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Gen Virol. 2000 Jan;81(Pt 1):67-74</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10640543</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Curr Protoc Bioinformatics. 2003 Feb;Chapter 6:Unit 6.4</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18428704</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Mol Biol. 2000 Sep 8;302(1):205-17</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10964570</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Vaccine. 2001 Aug 14;19(31):4385-95</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11483263</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Mol Biol Evol. 1987 Jul;4(4):406-25</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">3447015</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W394-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16845035</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Science. 2002 Jun 14;296(5575):1976-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12004075</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Sci China C Life Sci. 2007 Oct;50(5):587-99</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17879055</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Biomol Struct Dyn. 1986 Aug;4(1):11-21</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">3078230</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2001 Jul;17(7):662-3</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11448888</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2004 Jul 8;430(6996):209-13</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15241415</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2004 Aug 19;5:113</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15318951</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2005 Feb;15(2):330-40</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15687296</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2009 Jan;37(Database issue):D499-508</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18835847</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2007 Jul 15;23(14):1744-52</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17495995</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Infect Dis. 1997 Jun;175(6):1285-93</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">9180165</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Science. 2007 Feb 16;315(5814):972-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17218491</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2010 Sep;20(9):1297-303</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20644199</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2002 Nov 26;99(24):15687-92</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12429852</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Biol Phys. 2002 Sep;28(3):439-47</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23345787</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2004 Apr 29;5:48</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15117420</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS Biol. 2005 Sep;3(9):e300</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16026181</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Source Code Biol Med. 2007 Oct 31;2:7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17974028</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2010 Jan 18;11 Suppl 1:S15</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20122186</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2009 Jun 1;25(11):1363-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19357099</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Syst Biol. 2007 Apr;56(2):321-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17464886</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Methods Mol Biol. 2000;132:243-58</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10547839</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Gen Virol. 2008 Jan;89(Pt 1):48-59</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18089728</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2002 Jan;18(1):109-14</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11836218</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Cladistics. 2001 Mar;17(1 Pt 2):S60-70</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12240678</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2006 Nov 16;7:293</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17109759</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2008 May 29;453(7195):615-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18418375</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Rev Sci Tech. 2006 Apr;25(1):329-39</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16796058</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Appl Environ Microbiol. 2005 May;71(5):2209-13</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15870301</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2005 May 26;435(7041):423-4</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15917781</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Curr Opin Struct Biol. 2006 Jun;16(3):368-73</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16679011</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 1994 Nov 11;22(22):4673-80</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">7984417</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2008 Jun 13;9:279</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18554404</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Biosens Bioelectron. 2007 Apr 15;22(9-10):1853-60</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16891109</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2007 Jul;35(Web Server issue):W275-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17537820</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2007 Sep 15;23(18):2368-75</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17623701</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Mol Biol Evol. 2004 Jan;21(1):200-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">14595102</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>
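
The RESULTS abstract in the record above describes representing each sequence as a vector derived from its k-mers and grouping sequences by a distance over those vectors. The following is a minimal Python sketch of that general idea only, not Nephele's implementation. It assumes raw overlapping k-mer counts and a cosine distance, whereas the abstract names the complete composition vector algorithm and does not specify the distance measure. The function names, the default `k`, and the toy sequences are hypothetical, and the affinity propagation clustering step is not shown.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in seq (a plain k-mer count vector).

    Assumption: raw counts stand in for the complete composition vector.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A sequence shorter than k has no k-mers; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(seqs, k: int = 3):
    """Pairwise distances between sequences, with no alignment step."""
    vectors = [kmer_vector(s, k) for s in seqs]
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
    for row in distance_matrix(toy):
        print(["%.3f" % d for d in row])
```

A matrix like this could then be passed to a clustering or tree-building step. Each vector depends on a single sequence, so the vectors can be computed independently of one another.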

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001E51 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 001E51 | SxmlIndent | more
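
Once a record has been selected, its identifiers can also be read programmatically. The sketch below is a hedged Python example that assumes the selected text has the same shape as the XML record shown on this page. It parses only the `<pubmed>` subtree, because the `<TEI>` part uses `wicri:` and `nlm:` prefixes that are not declared in the record and that a strict XML parser would reject. The function name `extract_ids` and the shortened `sample` string are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def extract_ids(record_xml: str) -> dict:
    """Return the article's own ArticleId values from the <pubmed> subtree."""
    # Isolate <pubmed>...</pubmed>; the undeclared prefixes live in <TEI>.
    match = re.search(r"<pubmed>.*?</pubmed>", record_xml, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <pubmed> element found")
    root = ET.fromstring(match.group(0))
    # Ids under ReferenceList belong to cited papers, so only the
    # ArticleIdList directly under PubmedData is read.
    return {
        el.get("IdType"): (el.text or "").strip()
        for el in root.findall("./PubmedData/ArticleIdList/ArticleId")
    }


# Shortened stand-in for the record on this page (illustration only).
sample = """<record><TEI><author><affiliation wicri:level="1"/></author></TEI>
<pubmed><PubmedData><ArticleIdList>
<ArticleId IdType="pubmed">21851626</ArticleId>
<ArticleId IdType="doi">10.1186/1751-0473-6-13</ArticleId>
</ArticleIdList>
<ReferenceList><Reference><ArticleIdList>
<ArticleId IdType="pubmed">16556910</ArticleId>
</ArticleIdList></Reference></ReferenceList>
</PubmedData></pubmed></record>"""

if __name__ == "__main__":
    print(extract_ids(sample))
```

Restricting the search path to `PubmedData/ArticleIdList` keeps the identifiers of the cited references out of the result.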

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:21851626
   |texte=   Nephele: genotyping via complete composition vectors and MapReduce.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:21851626" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021