Nephele: genotyping via complete composition vectors and MapReduce

Internal identifier: 000935 (Pmc/Corpus); previous: 000934; next: 000936

Authors: Marc E. Colosimo; Matthew W. Peterson; Scott Mardis; Lynette Hirschman

Source:

Source Code for Biology and Medicine [1751-0473]; 2011.

RBID: PMC:3182884

Abstract

Background

Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.

Results

Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.
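
As a rough illustration of the alignment-free idea described in the Results, the sketch below turns each sequence into a vector of k-mer frequencies and compares vectors with a cosine distance. It is a simplified stand-in and not Nephele's code: the complete composition vector method also corrects the frequencies against a background model and combines several values of k, and the abstract does not say which distance measure Nephele uses. The function names, the choice of k = 3, and the sample sequences are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k):
    """Return the relative frequency of each overlapping k-mer in seq."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    # Only k-mers present in both vectors contribute to the dot product.
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(seqs, k=3):
    """Pairwise distances between sequences; no alignment is needed."""
    vectors = [kmer_vector(s, k) for s in seqs]
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the paper.
    samples = ["ACGTACGTAC", "ACGTACGTTC", "TTTTGGGGCC"]
    for row in distance_matrix(samples, k=3):
        print(["%.3f" % d for d in row])
```

A matrix like this is the kind of input that a clustering step (affinity propagation in Nephele's case) or a neighbour-joining tree builder would consume. Because each vector depends on only one sequence, the vector-building step can be distributed across compute nodes, which fits the MapReduce model mentioned above.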

Conclusions

We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3182884

DOI: 10.1186/1751-0473-6-13
PubMed: 21851626
PubMed Central: 3182884

Links to Exploration step

PMC:3182884

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Nephele: genotyping via complete composition vectors and MapReduce</title>
<author><name sortKey="Colosimo, Marc E" sort="Colosimo, Marc E" uniqKey="Colosimo M" first="Marc E" last="Colosimo">Marc E. Colosimo</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Peterson, Matthew W" sort="Peterson, Matthew W" uniqKey="Peterson M" first="Matthew W" last="Peterson">Matthew W. Peterson</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Mardis, Scott" sort="Mardis, Scott" uniqKey="Mardis S" first="Scott" last="Mardis">Scott Mardis</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Hirschman, Lynette" sort="Hirschman, Lynette" uniqKey="Hirschman L" first="Lynette" last="Hirschman">Lynette Hirschman</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">21851626</idno>
<idno type="pmc">3182884</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3182884</idno>
<idno type="RBID">PMC:3182884</idno>
<idno type="doi">10.1186/1751-0473-6-13</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000935</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000935</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Nephele: genotyping via complete composition vectors and MapReduce</title>
<author><name sortKey="Colosimo, Marc E" sort="Colosimo, Marc E" uniqKey="Colosimo M" first="Marc E" last="Colosimo">Marc E. Colosimo</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Peterson, Matthew W" sort="Peterson, Matthew W" uniqKey="Peterson M" first="Matthew W" last="Peterson">Matthew W. Peterson</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Mardis, Scott" sort="Mardis, Scott" uniqKey="Mardis S" first="Scott" last="Mardis">Scott Mardis</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Hirschman, Lynette" sort="Hirschman, Lynette" uniqKey="Hirschman L" first="Lynette" last="Hirschman">Lynette Hirschman</name>
<affiliation><nlm:aff id="I1">The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Source Code for Biology and Medicine</title>
<idno type="eISSN">1751-0473</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.</p>
</sec>
<sec><title>Results</title>
<p>Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.</p>
</sec>
<sec><title>Conclusions</title>
<p>We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Li, Ks" uniqKey="Li K">KS Li</name>
</author>
<author><name sortKey="Guan, Y" uniqKey="Guan Y">Y Guan</name>
</author>
<author><name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author><name sortKey="Smith, Gj" uniqKey="Smith G">GJ Smith</name>
</author>
<author><name sortKey="Xu, Km" uniqKey="Xu K">KM Xu</name>
</author>
<author><name sortKey="Duan, L" uniqKey="Duan L">L Duan</name>
</author>
<author><name sortKey="Rahardjo, Ap" uniqKey="Rahardjo A">AP Rahardjo</name>
</author>
<author><name sortKey="Puthavathana, P" uniqKey="Puthavathana P">P Puthavathana</name>
</author>
<author><name sortKey="Buranathai, C" uniqKey="Buranathai C">C Buranathai</name>
</author>
<author><name sortKey="Nguyen, Td" uniqKey="Nguyen T">TD Nguyen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Campitelli, L" uniqKey="Campitelli L">L Campitelli</name>
</author>
<author><name sortKey="Di Martino, A" uniqKey="Di Martino A">A Di Martino</name>
</author>
<author><name sortKey="Spagnolo, D" uniqKey="Spagnolo D">D Spagnolo</name>
</author>
<author><name sortKey="Smith, Gj" uniqKey="Smith G">GJ Smith</name>
</author>
<author><name sortKey="Di Trani, L" uniqKey="Di Trani L">L Di Trani</name>
</author>
<author><name sortKey="Facchini, M" uniqKey="Facchini M">M Facchini</name>
</author>
<author><name sortKey="De Marco, Ma" uniqKey="De Marco M">MA De Marco</name>
</author>
<author><name sortKey="Foni, E" uniqKey="Foni E">E Foni</name>
</author>
<author><name sortKey="Chiapponi, C" uniqKey="Chiapponi C">C Chiapponi</name>
</author>
<author><name sortKey="Martin, Am" uniqKey="Martin A">AM Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rambaut, A" uniqKey="Rambaut A">A Rambaut</name>
</author>
<author><name sortKey="Pybus, Og" uniqKey="Pybus O">OG Pybus</name>
</author>
<author><name sortKey="Nelson, Mi" uniqKey="Nelson M">MI Nelson</name>
</author>
<author><name sortKey="Viboud, C" uniqKey="Viboud C">C Viboud</name>
</author>
<author><name sortKey="Taubenberger, Jk" uniqKey="Taubenberger J">JK Taubenberger</name>
</author>
<author><name sortKey="Holmes, Ec" uniqKey="Holmes E">EC Holmes</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="De Groot, As" uniqKey="De Groot A">AS De Groot</name>
</author>
<author><name sortKey="Bosma, A" uniqKey="Bosma A">A Bosma</name>
</author>
<author><name sortKey="Chinai, N" uniqKey="Chinai N">N Chinai</name>
</author>
<author><name sortKey="Frost, J" uniqKey="Frost J">J Frost</name>
</author>
<author><name sortKey="Jesdale, Bm" uniqKey="Jesdale B">BM Jesdale</name>
</author>
<author><name sortKey="Gonzalez, Ma" uniqKey="Gonzalez M">MA Gonzalez</name>
</author>
<author><name sortKey="Martin, W" uniqKey="Martin W">W Martin</name>
</author>
<author><name sortKey="Saint Aubin, C" uniqKey="Saint Aubin C">C Saint-Aubin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yang, Hl" uniqKey="Yang H">HL Yang</name>
</author>
<author><name sortKey="Zhu, Yz" uniqKey="Zhu Y">YZ Zhu</name>
</author>
<author><name sortKey="Qin, Jh" uniqKey="Qin J">JH Qin</name>
</author>
<author><name sortKey="He, P" uniqKey="He P">P He</name>
</author>
<author><name sortKey="Jiang, Xc" uniqKey="Jiang X">XC Jiang</name>
</author>
<author><name sortKey="Zhao, Gp" uniqKey="Zhao G">GP Zhao</name>
</author>
<author><name sortKey="Guo, Xk" uniqKey="Guo X">XK Guo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Macken, C" uniqKey="Macken C">C Macken</name>
</author>
<author><name sortKey="Lu, H" uniqKey="Lu H">H Lu</name>
</author>
<author><name sortKey="Goodman, J" uniqKey="Goodman J">J Goodman</name>
</author>
<author><name sortKey="Boykin, L" uniqKey="Boykin L">L Boykin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cummings, Ca" uniqKey="Cummings C">CA Cummings</name>
</author>
<author><name sortKey="Relman, Da" uniqKey="Relman D">DA Relman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Budowle, B" uniqKey="Budowle B">B Budowle</name>
</author>
<author><name sortKey="Schutzer, Se" uniqKey="Schutzer S">SE Schutzer</name>
</author>
<author><name sortKey="Ascher, Ms" uniqKey="Ascher M">MS Ascher</name>
</author>
<author><name sortKey="Atlas, Rm" uniqKey="Atlas R">RM Atlas</name>
</author>
<author><name sortKey="Burans, Jp" uniqKey="Burans J">JP Burans</name>
</author>
<author><name sortKey="Chakraborty, R" uniqKey="Chakraborty R">R Chakraborty</name>
</author>
<author><name sortKey="Dunn, Jj" uniqKey="Dunn J">JJ Dunn</name>
</author>
<author><name sortKey="Fraser, Cm" uniqKey="Fraser C">CM Fraser</name>
</author>
<author><name sortKey="Franz, Dr" uniqKey="Franz D">DR Franz</name>
</author>
<author><name sortKey="Leighton, Tj" uniqKey="Leighton T">TJ Leighton</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mcewen, Sa" uniqKey="Mcewen S">SA McEwen</name>
</author>
<author><name sortKey="Wilson, Tm" uniqKey="Wilson T">TM Wilson</name>
</author>
<author><name sortKey="Ashford, Da" uniqKey="Ashford D">DA Ashford</name>
</author>
<author><name sortKey="Heegaard, Ed" uniqKey="Heegaard E">ED Heegaard</name>
</author>
<author><name sortKey="Kournikakis, B" uniqKey="Kournikakis B">B Kournikakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wang, D" uniqKey="Wang D">D Wang</name>
</author>
<author><name sortKey="Coscoy, L" uniqKey="Coscoy L">L Coscoy</name>
</author>
<author><name sortKey="Zylberberg, M" uniqKey="Zylberberg M">M Zylberberg</name>
</author>
<author><name sortKey="Avila, Pc" uniqKey="Avila P">PC Avila</name>
</author>
<author><name sortKey="Boushey, Ha" uniqKey="Boushey H">HA Boushey</name>
</author>
<author><name sortKey="Ganem, D" uniqKey="Ganem D">D Ganem</name>
</author>
<author><name sortKey="Derisi, Jl" uniqKey="Derisi J">JL DeRisi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ghindilis, Al" uniqKey="Ghindilis A">AL Ghindilis</name>
</author>
<author><name sortKey="Smith, Mw" uniqKey="Smith M">MW Smith</name>
</author>
<author><name sortKey="Schwarzkopf, Kr" uniqKey="Schwarzkopf K">KR Schwarzkopf</name>
</author>
<author><name sortKey="Roth, Km" uniqKey="Roth K">KM Roth</name>
</author>
<author><name sortKey="Peyvan, K" uniqKey="Peyvan K">K Peyvan</name>
</author>
<author><name sortKey="Munro, Sb" uniqKey="Munro S">SB Munro</name>
</author>
<author><name sortKey="Lodes, Mj" uniqKey="Lodes M">MJ Lodes</name>
</author>
<author><name sortKey="Stover, Ag" uniqKey="Stover A">AG Stover</name>
</author>
<author><name sortKey="Bernards, K" uniqKey="Bernards K">K Bernards</name>
</author>
<author><name sortKey="Dill, K" uniqKey="Dill K">K Dill</name>
</author>
<author><name sortKey="Mcshea, A" uniqKey="Mcshea A">A McShea</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lindh, M" uniqKey="Lindh M">M Lindh</name>
</author>
<author><name sortKey="Andersson, As" uniqKey="Andersson A">AS Andersson</name>
</author>
<author><name sortKey="Gusdal, A" uniqKey="Gusdal A">A Gusdal</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lin, G" uniqKey="Lin G">G Lin</name>
</author>
<author><name sortKey="Cai, Z" uniqKey="Cai Z">Z Cai</name>
</author>
<author><name sortKey="Wu, J" uniqKey="Wu J">J Wu</name>
</author>
<author><name sortKey="Wan, Xf" uniqKey="Wan X">XF Wan</name>
</author>
<author><name sortKey="Xu, L" uniqKey="Xu L">L Xu</name>
</author>
<author><name sortKey="Goebel, R" uniqKey="Goebel R">R Goebel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lu, G" uniqKey="Lu G">G Lu</name>
</author>
<author><name sortKey="Rowley, T" uniqKey="Rowley T">T Rowley</name>
</author>
<author><name sortKey="Garten, R" uniqKey="Garten R">R Garten</name>
</author>
<author><name sortKey="Donis, Ro" uniqKey="Donis R">RO Donis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wan, Xf" uniqKey="Wan X">XF Wan</name>
</author>
<author><name sortKey="Chen, G" uniqKey="Chen G">G Chen</name>
</author>
<author><name sortKey="Luo, F" uniqKey="Luo F">F Luo</name>
</author>
<author><name sortKey="Emch, M" uniqKey="Emch M">M Emch</name>
</author>
<author><name sortKey="Donis, R" uniqKey="Donis R">R Donis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Stuyver, L" uniqKey="Stuyver L">L Stuyver</name>
</author>
<author><name sortKey="De Gendt, S" uniqKey="De Gendt S">S De Gendt</name>
</author>
<author><name sortKey="Van Geyt, C" uniqKey="Van Geyt C">C Van Geyt</name>
</author>
<author><name sortKey="Zoulim, F" uniqKey="Zoulim F">F Zoulim</name>
</author>
<author><name sortKey="Fried, M" uniqKey="Fried M">M Fried</name>
</author>
<author><name sortKey="Schinazi, Rf" uniqKey="Schinazi R">RF Schinazi</name>
</author>
<author><name sortKey="Rossau, R" uniqKey="Rossau R">R Rossau</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Colosimo, M" uniqKey="Colosimo M">M Colosimo</name>
</author>
<author><name sortKey="Hirschman, L" uniqKey="Hirschman L">L Hirschman</name>
</author>
<author><name sortKey="Keybl, M" uniqKey="Keybl M">M Keybl</name>
</author>
<author><name sortKey="Luciano, J" uniqKey="Luciano J">J Luciano</name>
</author>
<author><name sortKey="Mardis, S" uniqKey="Mardis S">S Mardis</name>
</author>
<author><name sortKey="Peterson, M" uniqKey="Peterson M">M Peterson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Thompson, Jd" uniqKey="Thompson J">JD Thompson</name>
</author>
<author><name sortKey="Higgins, Dg" uniqKey="Higgins D">DG Higgins</name>
</author>
<author><name sortKey="Gibson, Tj" uniqKey="Gibson T">TJ Gibson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Notredame, C" uniqKey="Notredame C">C Notredame</name>
</author>
<author><name sortKey="Higgins, Dg" uniqKey="Higgins D">DG Higgins</name>
</author>
<author><name sortKey="Heringa, J" uniqKey="Heringa J">J Heringa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Do, Cb" uniqKey="Do C">CB Do</name>
</author>
<author><name sortKey="Mahabhashyam, Ms" uniqKey="Mahabhashyam M">MS Mahabhashyam</name>
</author>
<author><name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
<author><name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
<author><name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Desantis, Tz" uniqKey="Desantis T">TZ DeSantis</name>
</author>
<author><name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author><name sortKey="Keller, K" uniqKey="Keller K">K Keller</name>
</author>
<author><name sortKey="Brodie, El" uniqKey="Brodie E">EL Brodie</name>
</author>
<author><name sortKey="Larsen, N" uniqKey="Larsen N">N Larsen</name>
</author>
<author><name sortKey="Piceno, Ym" uniqKey="Piceno Y">YM Piceno</name>
</author>
<author><name sortKey="Phan, R" uniqKey="Phan R">R Phan</name>
</author>
<author><name sortKey="Andersen, Gl" uniqKey="Andersen G">GL Andersen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wallace, Im" uniqKey="Wallace I">IM Wallace</name>
</author>
<author><name sortKey="O Sullivan, O" uniqKey="O Sullivan O">O O'Sullivan</name>
</author>
<author><name sortKey="Higgins, Dg" uniqKey="Higgins D">DG Higgins</name>
</author>
<author><name sortKey="Notredame, C" uniqKey="Notredame C">C Notredame</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chu, Kh" uniqKey="Chu K">KH Chu</name>
</author>
<author><name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author><name sortKey="Yu, Zg" uniqKey="Yu Z">ZG Yu</name>
</author>
<author><name sortKey="Anh, V" uniqKey="Anh V">V Anh</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gao, L" uniqKey="Gao L">L Gao</name>
</author>
<author><name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author><name sortKey="Sun, J" uniqKey="Sun J">J Sun</name>
</author>
<author><name sortKey="Hao, B" uniqKey="Hao B">B Hao</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wu, X" uniqKey="Wu X">X Wu</name>
</author>
<author><name sortKey="Wan, X F" uniqKey="Wan X">X-F Wan</name>
</author>
<author><name sortKey="Wu, G" uniqKey="Wu G">G Wu</name>
</author>
<author><name sortKey="Xu, D" uniqKey="Xu D">D Xu</name>
</author>
<author><name sortKey="Lin, G" uniqKey="Lin G">G Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Retief, Jd" uniqKey="Retief J">JD Retief</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wilgenbusch, Jc" uniqKey="Wilgenbusch J">JC Wilgenbusch</name>
</author>
<author><name sortKey="Swofford, D" uniqKey="Swofford D">D Swofford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Giribet, G" uniqKey="Giribet G">G Giribet</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Saitou, N" uniqKey="Saitou N">N Saitou</name>
</author>
<author><name sortKey="Nei, M" uniqKey="Nei M">M Nei</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rost, U" uniqKey="Rost U">U Rost</name>
</author>
<author><name sortKey="Bornberg Bauer, E" uniqKey="Bornberg Bauer E">E Bornberg-Bauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hughes, T" uniqKey="Hughes T">T Hughes</name>
</author>
<author><name sortKey="Hyun, Y" uniqKey="Hyun Y">Y Hyun</name>
</author>
<author><name sortKey="Liberles, Da" uniqKey="Liberles D">DA Liberles</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Janies, D" uniqKey="Janies D">D Janies</name>
</author>
<author><name sortKey="Hill, Aw" uniqKey="Hill A">AW Hill</name>
</author>
<author><name sortKey="Guralnick, R" uniqKey="Guralnick R">R Guralnick</name>
</author>
<author><name sortKey="Habib, F" uniqKey="Habib F">F Habib</name>
</author>
<author><name sortKey="Waltari, E" uniqKey="Waltari E">E Waltari</name>
</author>
<author><name sortKey="Wheeler, Wc" uniqKey="Wheeler W">WC Wheeler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Frey, Bj" uniqKey="Frey B">BJ Frey</name>
</author>
<author><name sortKey="Dueck, D" uniqKey="Dueck D">D Dueck</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author><name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mckenna, A" uniqKey="Mckenna A">A McKenna</name>
</author>
<author><name sortKey="Hanna, M" uniqKey="Hanna M">M Hanna</name>
</author>
<author><name sortKey="Banks, E" uniqKey="Banks E">E Banks</name>
</author>
<author><name sortKey="Sivachenko, A" uniqKey="Sivachenko A">A Sivachenko</name>
</author>
<author><name sortKey="Cibulskis, K" uniqKey="Cibulskis K">K Cibulskis</name>
</author>
<author><name sortKey="Kernytsky, A" uniqKey="Kernytsky A">A Kernytsky</name>
</author>
<author><name sortKey="Garimella, K" uniqKey="Garimella K">K Garimella</name>
</author>
<author><name sortKey="Altshuler, D" uniqKey="Altshuler D">D Altshuler</name>
</author>
<author><name sortKey="Gabriel, S" uniqKey="Gabriel S">S Gabriel</name>
</author>
<author><name sortKey="Daly, M" uniqKey="Daly M">M Daly</name>
</author>
<author><name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Matthews, Sj" uniqKey="Matthews S">SJ Matthews</name>
</author>
<author><name sortKey="Williams, Tl" uniqKey="Williams T">TL Williams</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ranger, C" uniqKey="Ranger C">C Ranger</name>
</author>
<author><name sortKey="Raghuraman, R" uniqKey="Raghuraman R">R Raghuraman</name>
</author>
<author><name sortKey="Penmetsa, A" uniqKey="Penmetsa A">A Penmetsa</name>
</author>
<author><name sortKey="Bradski, G" uniqKey="Bradski G">G Bradski</name>
</author>
<author><name sortKey="Kozyrakis, C" uniqKey="Kozyrakis C">C Kozyrakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gabriel, E" uniqKey="Gabriel E">E Gabriel</name>
</author>
<author><name sortKey="Fagg, Ge" uniqKey="Fagg G">GE Fagg</name>
</author>
<author><name sortKey="Bosilca, G" uniqKey="Bosilca G">G Bosilca</name>
</author>
<author><name sortKey="Angskun, T" uniqKey="Angskun T">T Angskun</name>
</author>
<author><name sortKey="Dongarra, Jj" uniqKey="Dongarra J">JJ Dongarra</name>
</author>
<author><name sortKey="Squyres, Jm" uniqKey="Squyres J">JM Squyres</name>
</author>
<author><name sortKey="Sahay, V" uniqKey="Sahay V">V Sahay</name>
</author>
<author><name sortKey="Kambadur, P" uniqKey="Kambadur P">P Kambadur</name>
</author>
<author><name sortKey="Barrett, B" uniqKey="Barrett B">B Barrett</name>
</author>
<author><name sortKey="Lumsdaine, A" uniqKey="Lumsdaine A">A Lumsdaine</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Holmes, Ec" uniqKey="Holmes E">EC Holmes</name>
</author>
<author><name sortKey="Ghedin, E" uniqKey="Ghedin E">E Ghedin</name>
</author>
<author><name sortKey="Miller, N" uniqKey="Miller N">N Miller</name>
</author>
<author><name sortKey="Taylor, J" uniqKey="Taylor J">J Taylor</name>
</author>
<author><name sortKey="Bao, Y" uniqKey="Bao Y">Y Bao</name>
</author>
<author><name sortKey="St George, K" uniqKey="St George K">K St George</name>
</author>
<author><name sortKey="Grenfell, Bt" uniqKey="Grenfell B">BT Grenfell</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author><name sortKey="Fraser, Cm" uniqKey="Fraser C">CM Fraser</name>
</author>
<author><name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
<author><name sortKey="Taubenberger, Jk" uniqKey="Taubenberger J">JK Taubenberger</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Reddy, Tb" uniqKey="Reddy T">TB Reddy</name>
</author>
<author><name sortKey="Riley, R" uniqKey="Riley R">R Riley</name>
</author>
<author><name sortKey="Wymore, F" uniqKey="Wymore F">F Wymore</name>
</author>
<author><name sortKey="Montgomery, P" uniqKey="Montgomery P">P Montgomery</name>
</author>
<author><name sortKey="Decaprio, D" uniqKey="Decaprio D">D DeCaprio</name>
</author>
<author><name sortKey="Engels, R" uniqKey="Engels R">R Engels</name>
</author>
<author><name sortKey="Gellesch, M" uniqKey="Gellesch M">M Gellesch</name>
</author>
<author><name sortKey="Hubble, J" uniqKey="Hubble J">J Hubble</name>
</author>
<author><name sortKey="Jen, D" uniqKey="Jen D">D Jen</name>
</author>
<author><name sortKey="Jin, H" uniqKey="Jin H">H Jin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wu, X" uniqKey="Wu X">X Wu</name>
</author>
<author><name sortKey="Cai, Z" uniqKey="Cai Z">Z Cai</name>
</author>
<author><name sortKey="Wan, Xf" uniqKey="Wan X">XF Wan</name>
</author>
<author><name sortKey="Hoang, T" uniqKey="Hoang T">T Hoang</name>
</author>
<author><name sortKey="Goebel, R" uniqKey="Goebel R">R Goebel</name>
</author>
<author><name sortKey="Lin, G" uniqKey="Lin G">G Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Brendel, V" uniqKey="Brendel V">V Brendel</name>
</author>
<author><name sortKey="Beckmann, Js" uniqKey="Beckmann J">JS Beckmann</name>
</author>
<author><name sortKey="Trifonov, En" uniqKey="Trifonov E">EN Trifonov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author><name sortKey="Fang, W" uniqKey="Fang W">W Fang</name>
</author>
<author><name sortKey="Ling, L" uniqKey="Ling L">L Ling</name>
</author>
<author><name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author><name sortKey="Xuan, Z" uniqKey="Xuan Z">Z Xuan</name>
</author>
<author><name sortKey="Chen, R" uniqKey="Chen R">R Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bullard, J" uniqKey="Bullard J">J Bullard</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fauci, As" uniqKey="Fauci A">AS Fauci</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Peterson, Mw" uniqKey="Peterson M">MW Peterson</name>
</author>
<author><name sortKey="Colosimo, Me" uniqKey="Colosimo M">ME Colosimo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Drummond, A" uniqKey="Drummond A">A Drummond</name>
</author>
<author><name sortKey="Strimmer, K" uniqKey="Strimmer K">K Strimmer</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="product-review"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Source Code Biol Med</journal-id>
<journal-title-group><journal-title>Source Code for Biology and Medicine</journal-title>
</journal-title-group>
<issn pub-type="epub">1751-0473</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">21851626</article-id>
<article-id pub-id-type="pmc">3182884</article-id>
<article-id pub-id-type="publisher-id">1751-0473-6-13</article-id>
<article-id pub-id-type="doi">10.1186/1751-0473-6-13</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Software Review</subject>
</subj-group>
</article-categories>
<title-group><article-title>Nephele: genotyping via complete composition vectors and MapReduce</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes" id="A1"><name><surname>Colosimo</surname>
<given-names>Marc E</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>mcolosimo@mitre.org</email>
</contrib>
<contrib contrib-type="author" id="A2"><name><surname>Peterson</surname>
<given-names>Matthew W</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>mpeterson@mitre.org</email>
</contrib>
<contrib contrib-type="author" id="A3"><name><surname>Mardis</surname>
<given-names>Scott</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>mardis@mitre.org</email>
</contrib>
<contrib contrib-type="author" id="A4"><name><surname>Hirschman</surname>
<given-names>Lynette</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>lynette@mitre.org</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA</aff>
<pub-date pub-type="collection"><year>2011</year>
</pub-date>
<pub-date pub-type="epub"><day>18</day>
<month>8</month>
<year>2011</year>
</pub-date>
<volume>6</volume>
<fpage>13</fpage>
<lpage>13</lpage>
<history><date date-type="received"><day>5</day>
<month>4</month>
<year>2011</year>
</date>
<date date-type="accepted"><day>18</day>
<month>8</month>
<year>2011</year>
</date>
</history>
<permissions><copyright-statement>Copyright ©2011 Colosimo et al; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2011</copyright-year>
<copyright-holder>Colosimo et al; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.scfbm.org/content/6/1/13"></self-uri>
<abstract><sec><title>Background</title>
<p>Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.</p>
</sec>
<sec><title>Results</title>
<p>Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.</p>
</sec>
<sec><title>Conclusions</title>
<p>We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.</p>
</sec>
</abstract>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>In the post-genomic era, as sequencing becomes ever cheaper and more routine, biological sequence analysis has provided many useful tools for studying and combating infectious disease. These tools, which can include both experimental and computational methods, are important for molecular epidemiological studies [<xref ref-type="bibr" rid="B1">1</xref>
-<xref ref-type="bibr" rid="B3">3</xref>
], vaccine development [<xref ref-type="bibr" rid="B4">4</xref>
-<xref ref-type="bibr" rid="B6">6</xref>
], and microbial forensics [<xref ref-type="bibr" rid="B7">7</xref>
-<xref ref-type="bibr" rid="B9">9</xref>
]. One such method is genotyping, the grouping of samples based on their genetic sequence. This can be done experimentally [<xref ref-type="bibr" rid="B10">10</xref>
-<xref ref-type="bibr" rid="B12">12</xref>
] or computationally, either by identifying genetic signatures (nucleotide substrings which are only found in a single group of sequences) [<xref ref-type="bibr" rid="B13">13</xref>
], or on the basis of genetic distance among the sequences [<xref ref-type="bibr" rid="B14">14</xref>
-<xref ref-type="bibr" rid="B16">16</xref>
]. These methods allow a researcher to split a group of sequences into distinct partitions for further analysis. In a forensics context, genotyping a sequence can yield clues on where the sequence comes from. In surveillance, genotyping can be used to examine the evolutionary footprint of a pathogen, for example, to identify areas where certain vaccines and other countermeasures should be used.</p>
<p>Sequence-based comparison involves three major steps. The first is to choose a set of sequences to study, based on some criteria, such as strain, time period or geographic region. Ideally, this set can be easily extracted from a well-populated reference database, containing not only the sequence data for the samples of interest, such as a particular serotype of <italic>Influenza</italic>
, but also sufficient metadata. For infectious diseases, types of metadata include geospatial and temporal co-ordinates, host information, and pathogenicity. Once the appropriate dataset is chosen, the samples are compared and clustered in sequence space. From here, the metadata associated with the sequences is used to assess the evolutionary landscape of the organism or pathogen [<xref ref-type="bibr" rid="B17">17</xref>
].</p>
<sec><title>Sequence Comparison Methods</title>
<p>Traditionally, the first step in performing sequence comparisons is to generate a multiple sequence alignment (MSA) from the sequences of interest. This is most often done using heuristics found in utilities such as CLUSTAL W [<xref ref-type="bibr" rid="B18">18</xref>
], MUSCLE [<xref ref-type="bibr" rid="B19">19</xref>
], T-COFFEE [<xref ref-type="bibr" rid="B20">20</xref>
], and ProbCons [<xref ref-type="bibr" rid="B21">21</xref>
]. The dynamic programming solution, which finds the mathematically optimal (but not necessarily biologically correct) alignment, quickly becomes impractical with the sample sizes used in any meaningful analysis. A recent review [<xref ref-type="bibr" rid="B22">22</xref>
] examined many of the issues in producing these alignments, most notably the trade-offs between alignment accuracy, time, and computational expense. Many of the most accurate algorithms cannot be used on a large number of sequences, or on very lengthy sequences, and were only recommended for sets of fewer than 100 sequences.</p>
<p>Because the alignment is dependent on each of the sequences from which it is calculated, the alignment must be recomputed whenever a new sequence is added. This becomes problematic for surveillance applications, where new sequences will be added constantly. While this problem has been mitigated to some extent by algorithms such as Near-Alignment Space Termination (NAST) [<xref ref-type="bibr" rid="B23">23</xref>
], this still adds a level of complexity if the dataset is continually growing, as is the case with Influenza and other infectious diseases. Another issue is that different heuristics will yield different alignments, because they are designed only to find an acceptable answer, not the optimal alignment. While methods have been developed to find a "consensus alignment" [<xref ref-type="bibr" rid="B24">24</xref>
] from a set of alignments, this requires a good deal of time and computing power.</p>
<p>The composition vector (CV) [<xref ref-type="bibr" rid="B25">25</xref>
] method has been used to describe DNA/RNA and protein sequences as vectors, using the distance between these vectors as the genetic distance. This method involves using a sliding window to represent each sequence as a vector, where each element of the vector is calculated based on the actual and expected frequency of the k-mer (DNA/protein subsequence of length k) observed in that window. The vector representation allows the distance between two sequences to be calculated with any standard distance metric. The CV method was shown to produce trees which matched established taxonomies, as inferred from the 16S RNA segment by more conventional alignment-based methods [<xref ref-type="bibr" rid="B26">26</xref>
]. The CV method was later expanded into the complete composition vector (CCV) method [<xref ref-type="bibr" rid="B27">27</xref>
], which uses sliding windows over a range of lengths to describe the sequence. Since these methods do not require alignments to be calculated, distances calculated between sequences remain constant, rather than being dependent on the set of sequences being examined, making these methods ideal for the handling of rapidly growing datasets. No molecular model is needed to calculate distances: any metric defined between vectors can be used.</p>
<p>The next step in sequence analysis is the clustering of the sequences. Traditionally, this is done by inferring a phylogenetic tree. Tools for this purpose include PHYLIP [<xref ref-type="bibr" rid="B28">28</xref>
], PAUP* [<xref ref-type="bibr" rid="B29">29</xref>
], and POY [<xref ref-type="bibr" rid="B30">30</xref>
]. This work was initially performed using distance-based methods, such as the UPGMA or neighbour-joining algorithms [<xref ref-type="bibr" rid="B31">31</xref>
], or cladistic methods such as Maximum Parsimony. As computational power increased, methods that inferred trees based on models of evolution were used. These include the Maximum Likelihood technique, as well as Bayesian Inference. While these methods produce phylogenetic trees, which provide a useful visualization, any further analysis and grouping must be performed manually. As the number of sequences to compare increases, this becomes more and more difficult. In fact, there has been much recent research into new methods to visualize phylogenetic trees with large numbers of leaves [<xref ref-type="bibr" rid="B32">32</xref>
,<xref ref-type="bibr" rid="B33">33</xref>
]. In addition, the phylogenetic tree view proves difficult to integrate with the metadata. For example, a recent paper discussing the spread of H5N1 Avian Influenza used Google Earth to draw a phylogenetic tree on top of the globe [<xref ref-type="bibr" rid="B34">34</xref>
]. While this visualization works well for a small number of samples, it is ineffective for larger datasets, due to the "busyness" of the visualization.</p>
</sec>
<sec><title>Computational Genotyping</title>
<p>An alternative to the pure phylogenetic approach is computational genotyping. This involves partitioning the set of sequences into discrete groups, based on some criteria. This can be based on differences between known subtypes, such as tandem repeats or single nucleotide polymorphisms, or by genetic distance. In the case of genotyping based on distance, this becomes a clustering problem. In 2007, Frey and Dueck published a paper on a new clustering algorithm known as affinity propagation clustering [<xref ref-type="bibr" rid="B35">35</xref>
]. In contrast to other clustering algorithms, such as k-means and Expectation Maximization (EM), the affinity propagation algorithm does not require the user to explicitly select a given number of exemplars at the start of clustering. Instead, affinity propagation simultaneously considers all points as potential exemplars, using an initial preference to determine the sensitivity, and therefore the number of clusters. This removes the need for many runs to determine the ideal number of clusters, as well as the dependence on initial conditions seen in other partition clustering algorithms. Furthermore, this algorithm allows the user to set the preference for each data point. This is useful for a scenario where a partial set of representative samples is known, but the data set may contain other exemplars as well. Affinity propagation has been tested on geospatial, text, and gene expression data and showed improvements in both speed and accuracy over other clustering algorithms.</p>
<p>The main advantage of an automated computational genotyping method is that it gives the researcher the ability to combine a measure of sequence similarity (cluster membership) with the metadata. It is this metadata that yields the most information about a sample. A phylogenetic tree will tell what samples are close in sequence space, but any further inference is made using the metadata. By separating the sequences into discrete groups, the researcher is given much more flexibility to visualize the data and associated metadata.</p>
</sec>
<sec><title>MapReduce</title>
<p>MapReduce [<xref ref-type="bibr" rid="B36">36</xref>
] is the software framework developed by Google™ to support parallel distributed execution of their data intensive applications. MapReduce is designed for fault-tolerant computations with extremely large datasets. MapReduce is divided into two major phases called map and reduce, separated by an internal shuffle phase of the intermediate results. Hadoop is an open-source version of MapReduce implemented in Java and sponsored by Amazon™, Yahoo™, and other major vendors. Recently, MapReduce has been used for sequence and phylogenetic applications. For example, CloudBurst uses Hadoop for parallel short read-mapping for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics [<xref ref-type="bibr" rid="B37">37</xref>
]. The Genome Analysis Toolkit uses the MapReduce paradigm for shared memory platforms [<xref ref-type="bibr" rid="B38">38</xref>
]. MrsRF (MapReduce Speeds up RF) is a multi-core, multi-machine algorithm that generates a t × t Robinson-Foulds distance matrix between t trees [<xref ref-type="bibr" rid="B39">39</xref>
] using Phoenix [<xref ref-type="bibr" rid="B40">40</xref>
], a MapReduce implementation for shared memory multi-core platforms, and OpenMPI [<xref ref-type="bibr" rid="B41">41</xref>
]. These uses indicate that MapReduce is a promising tool to help solve the computational challenges with large datasets.</p>
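<p>As an illustration only, the map, shuffle, and reduce phases described above can be imitated in a single Python process. The sketch below counts k-mers across a set of sequences. The function names (map_phase, shuffle_phase, reduce_phase, count_kmers) and the toy sequences are hypothetical; they are not part of Hadoop or of Nephele, and a real Hadoop job would distribute each phase across compute nodes.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def map_phase(seq_id, sequence, k):
    """Map: emit one (k-mer, 1) pair per sliding-window position.

    seq_id is the input key; it is unused here but kept to mirror the
    (key, value) input a mapper receives.
    """
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k], 1


def shuffle_phase(pairs):
    """Shuffle: group all intermediate values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def count_kmers(sequences, k):
    """Run the three phases in-process over a dict of id -> sequence."""
    intermediate = []
    for seq_id, sequence in sequences.items():
        intermediate.extend(map_phase(seq_id, sequence, k))
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Toy input: two short, made-up nucleotide sequences.
counts = count_kmers({"s1": "ACGTAC", "s2": "GTACGT"}, k=3)
```
]]></preformat>
<p>Each call to map_phase depends only on the sequence it is given, which is the property that allows the map step to be spread across machines.</p>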
</sec>
<sec><title>Nephele</title>
<p>In this paper, we describe a scalable complete genotyping system that brings together the complete composition vector and affinity propagation algorithms to produce genotypes from <italic>Influenza </italic>
A sequences. The system has been tested on a variety of <italic>Influenza </italic>
A and <italic>Actinomycetes </italic>
genome data. In addition to providing discrete clusters representing genotypes, we use methods that produce trees that closely match the topologies of trees inferred using traditional phylogenetic methods, in order to provide scientists with a more familiar visualization.</p>
</sec>
</sec>
<sec><title>Implementation</title>
<sec><title>Datasets</title>
<p>The <italic>Influenza </italic>
dataset used to develop our methods was that of Holmes <italic>et al</italic>
. [<xref ref-type="bibr" rid="B42">42</xref>
]. The clades and reassortment events found in these samples were discussed in detail, providing eight sets of sequences (one for each gene studied in the paper) for verification of our methods. This dataset consists of 155 samples, taken from New York State during the 1999-2000, 2001-2002, 2002-2003, and 2003-2004 flu seasons. The complete coding sequences are available, as well as the date and county of collection for these strains. We also used HA segments from H1N1 (1141) and H3N2 (2201) parsed from GenBank's viral division (gbvrl). For testing our implementation, we used 10,270 16S samples from GreenGenes (core_set_aligned.fasta retired on 07 February 2007; <ext-link ext-link-type="uri" xlink:href="http://greengenes.lbl.gov/">http://greengenes.lbl.gov/</ext-link>
).</p>
<p>To test our methods, two additional datasets were identified. A set of 94 sequences representing WHO expert-defined genotypes (<ext-link ext-link-type="uri" xlink:href="http://www.who.int/csr/disease/avian_influenza/guidelines/nomenclature/">http://www.who.int/csr/disease/avian_influenza/guidelines/nomenclature/</ext-link>
) was used to validate our methods, and another dataset representing a 2007 <italic>Influenza </italic>
outbreak in Europe was chosen to demonstrate the utility of the computational genotyping approach for microbial forensic analysis (Additional File <xref ref-type="supplementary-material" rid="S1">1</xref>
).</p>
<p>We also used 27 full length genomes of Actinomycetes bacteria from the Broad Institute along with their computed concatenated protein sequences, downloaded from the Tuberculosis Database (TBDB) [<xref ref-type="bibr" rid="B43">43</xref>
].</p>
</sec>
<sec><title>Complete Composition Vector</title>
<p>The method used is based on that of Wu et al. [<xref ref-type="bibr" rid="B44">44</xref>
]. Each sequence, S, of a given length L, can be broken into L - k + 1 overlapping substrings of length k. For each substring α, the probability of occurrence is calculated as</p>
<p><disp-formula><mml:math id="M1" name="1751-0473-6-13-i1" overflow="scroll"><mml:mrow><mml:mi>p</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mi>f</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow><mml:mi>L</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where f(α) is the number of occurrences of substring α in S. Next, the expected probability, q, is calculated using a Markov model described by Brendel, Beckmann, and Trifonov, which takes into account the probabilities of length-(k-1) and length-(k-2) strings [<xref ref-type="bibr" rid="B45">45</xref>
].</p>
<p><disp-formula><mml:math id="M2" name="1751-0473-6-13-i2" overflow="scroll"><mml:mrow><mml:mi>q</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mi>p</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-op">…</mml:mo>
<mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-op">…</mml:mo>
<mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow><mml:mi>p</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-op">…</mml:mo>
<mml:msub><mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>This is designed to highlight the role of selective mutation, and it was found that phylogenetic trees produced without subtracting the background via the Markov model were not consistent with traditional approaches [<xref ref-type="bibr" rid="B25">25</xref>
].</p>
<p>The composition value, π, for substring α is defined as:</p>
<p><disp-formula><mml:math id="M3" name="1751-0473-6-13-i3" overflow="scroll"><mml:mrow><mml:mi>π</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfenced open="{"><mml:mrow><mml:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd class="array" columnalign="center"><mml:mi>p</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">∕</mml:mo>
<mml:mi>q</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd class="array" columnalign="center"><mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd class="array" columnalign="center"></mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mstyle class="text"><mml:mtext class="textsf" mathvariant="sans-serif"> </mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:mfenced>
<mml:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd class="array" columnalign="center"><mml:mi>q</mml:mi>
<mml:mo class="MathClass-rel">≠</mml:mo>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd class="array" columnalign="center"><mml:mi>q</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo class="MathClass-punc">.</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd class="array" columnalign="center"></mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>The k<sup>th </sup>
composition vector, V<sub>k</sub>
(S), comprises the composition values for all possible substrings of length k. For amino acid sequences, V is of length 20<sup>k</sup>, and for DNA/RNA, V is of length 4<sup>k</sup>. This method has been shown to produce trees which match known taxonomies [<xref ref-type="bibr" rid="B26">26</xref>
].</p>
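<p>The three equations above can be sketched in a few lines of Python. This is a minimal illustration rather than the Nephele implementation (which is written in Java): the function names, the default DNA alphabet, and the ordering of vector elements by itertools.product are assumptions made for the example, and k is assumed to be at least 3.</p>
<preformat><![CDATA[
```python
from collections import Counter
from itertools import product


def kmer_probabilities(sequence, k):
    """p(alpha) = f(alpha) / (L - k + 1) for every k-mer observed in sequence."""
    windows = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(windows))
    return {kmer: n / windows for kmer, n in counts.items()}


def composition_vector(sequence, k, alphabet="ACGT"):
    """Dense k-th composition vector with len(alphabet) ** k elements.

    Elements follow the order of itertools.product(alphabet, repeat=k)
    (an arbitrary choice made for this sketch).
    """
    p_k = kmer_probabilities(sequence, k)
    p_k1 = kmer_probabilities(sequence, k - 1)
    p_k2 = kmer_probabilities(sequence, k - 2)
    vector = []
    for letters in product(alphabet, repeat=k):
        alpha = "".join(letters)
        middle = p_k2.get(alpha[1:-1], 0.0)
        # q(alpha) = p(a1..a_{k-1}) * p(a2..a_k) / p(a2..a_{k-1})
        if middle:
            q = p_k1.get(alpha[:-1], 0.0) * p_k1.get(alpha[1:], 0.0) / middle
        else:
            q = 0.0
        # pi(alpha) = p/q - 1 when q != 0, and 0 otherwise
        vector.append((p_k.get(alpha, 0.0) / q - 1.0) if q else 0.0)
    return vector
```
]]></preformat>
<p>For example, composition_vector("ACGTACGT", 3) returns a list of 4<sup>3</sup> = 64 values, most of which are zero because the corresponding expected probability q is zero.</p>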
<p>In 2004, Wu et al. extended the CV approach into the complete composition vector (CCV) [<xref ref-type="bibr" rid="B27">27</xref>
]. This method combines the composition vector approach with the idea of the complete information set, in order to supplement any information loss from the background subtraction in the CV method [<xref ref-type="bibr" rid="B46">46</xref>
]. The CCV is defined as the sequence of composition vectors from 3 to M, where M is a pre-determined constant.</p>
<p>For all experiments described in this paper, the complete composition vectors were calculated with M = 9. In addition, the revised relative entropy string scoring scheme described by Wu <italic>et al</italic>
. [<xref ref-type="bibr" rid="B44">44</xref>
] was employed to reduce the dimensionality of the vectors. This is calculated as</p>
<p><disp-formula><mml:math id="M4" name="1751-0473-6-13-i4" overflow="scroll"><mml:mrow><mml:mi>R</mml:mi>
<mml:mi>E</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msubsup><mml:mrow><mml:mo mathsize="big"> ∑</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="|" close="|"><mml:mrow><mml:mi>π</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
<mml:mi>l</mml:mi>
<mml:mi>n</mml:mi>
<mml:mfenced open="|" close="|"><mml:mrow><mml:mfrac><mml:mrow><mml:mi>π</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow><mml:mi>Π</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where Π represents the complete composition vector calculated from the concatenation of all n sequences in the dataset. In summary, this method evaluates the information content associated with each possible substring, and the most informative substrings are chosen for inclusion in the analysis. The number of k-mers used for distance calculations was chosen based on the dataset: if the absolute revised relative entropy was below 1.0, the substring was not used for any further calculations.</p>
<p>Once the final set of k-mers is chosen, the vectors are normalized by calculating the Z-score for each k-mer. From these normalized vectors, the distance matrix is then calculated. For each pair of samples, the distance between the normalized complete composition vectors V<sub>i</sub> and V<sub>j</sub> is calculated using cosine distance:</p>
<p><disp-formula><mml:math id="M5" name="1751-0473-6-13-i5" overflow="scroll"><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub><mml:mrow><mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow><mml:mfenced open="|" close="|"><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mfenced open="|" close="|"><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>We also experimented with using the Euclidean distance, calculated as</p>
<p><disp-formula><mml:math id="M6" name="1751-0473-6-13-i6" overflow="scroll"><mml:mrow><mml:msub><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msqrt><mml:mrow><mml:msubsup><mml:mrow><mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup><mml:mrow><mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub><mml:mrow><mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msqrt>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where n is the number of substrings kept after the substring selection (Figure <xref ref-type="fig" rid="F1">1</xref>
). The CCV and distance calculation code was written in Java (1.5+), using custom classes to save space and memory. Experiments were run on an Apple dual quad-core Intel Mac Pro with 8 GB of RAM running Mac OS X 10.6. Additional testing was done under the CentOS 5.5 and Ubuntu 9.04 Linux distributions.</p>
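<p>The normalization and Euclidean distance steps can be sketched as follows. This is an illustrative Python example, not the Java code used in Nephele. It assumes that the Z-score is taken for each k-mer across all samples using the population standard deviation, and that a k-mer with no variation is set to zero; neither choice is specified in the text, and the function names are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def zscore_columns(vectors):
    """Z-score each column (one k-mer) across all samples (rows).

    Assumptions of this sketch: population standard deviation, and
    columns with zero variance are mapped to 0.0.
    """
    n_samples = len(vectors)
    n_features = len(vectors[0])
    normalized = [[0.0] * n_features for _ in range(n_samples)]
    for col in range(n_features):
        column = [row[col] for row in vectors]
        mean = sum(column) / n_samples
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n_samples)
        if std > 0:
            for row in range(n_samples):
                normalized[row][col] = (vectors[row][col] - mean) / std
    return normalized


def euclidean_distance_matrix(vectors):
    """D_ij = sqrt(sum_k (V_i(k) - V_j(k))^2); symmetric, zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            squared = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            matrix[i][j] = matrix[j][i] = math.sqrt(squared)
    return matrix
```
]]></preformat>
<p>Because each entry of the matrix depends only on the two vectors involved, adding a new sample requires computing one new row rather than recomputing the existing distances, provided the normalization is held fixed.</p>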
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p><bold>Comparison of CCV and Maximum Likelihood Trees</bold>
. Patristic distance plots of trees produced by Maximum Likelihood (Y axes) vs. those produced by CCV with cosine or Euclidean distance measures (X axes).</p>
</caption>
<graphic xlink:href="1751-0473-6-13-1"></graphic>
</fig>
</sec>
<sec><title>Affinity Propagation Clustering</title>
<p>The input to the affinity propagation clustering algorithm is a similarity matrix. For Euclidean and Manhattan distances, the similarity is represented by the negative of the distance, while for cosine distances, the similarity is found by subtracting the distance matrix from 1. To determine the optimal preference, the mean silhouette value was used. This value is a measure of how similar a given sample is to others in the same cluster, versus samples found in other clusters. It ranges from 1 (the sample is well-clustered) to -1 (the sample is found in an incorrect cluster) and is calculated as:</p>
<p><disp-formula><mml:math id="M7" name="1751-0473-6-13-i7" overflow="scroll"><mml:mrow><mml:mi>s</mml:mi>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub><mml:mrow><mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow><mml:mstyle class="text"><mml:mtext class="textsf" mathvariant="sans-serif">max</mml:mtext>
</mml:mstyle>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub><mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where a<sub>i </sub>
is the sample's average distance to the other samples in its cluster and b<sub>i </sub>
is the minimum, over all other clusters, of the average distance between the sample and the samples in that cluster. The developers of the affinity propagation algorithm recommend using the minimum similarity between samples as the preference to obtain a low number of clusters, and the median similarity to obtain a moderate number of clusters. We chose the preference that gave the optimal partitioning, as measured by the average silhouette value, from a set of four preferences spanning the range from the minimum to the median similarity. Affinity propagation was performed either with the MATLAB function available at the algorithm developers' website (<ext-link ext-link-type="uri" xlink:href="http://www.psi.toronto.edu/affinitypropagation/">http://www.psi.toronto.edu/affinitypropagation/</ext-link>
), using the default parameters, or with our Java re-implementation, which is part of Nephele.</p>
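<p>The similarity conversion, the candidate preferences and the mean silhouette value described above can be sketched as follows. This is an illustrative sketch only, not the Nephele or MATLAB code: the function names are hypothetical, the even spacing of the four preferences is an assumption (the text says only that they span the minimum and median similarity), and scoring singleton clusters as 0 is a common convention that we assume here.</p>
<preformat>
```python
from statistics import median


def to_similarity(dist, metric):
    """Convert a distance matrix to a similarity matrix.

    Cosine distances become 1 - d; Euclidean and Manhattan
    distances become -d, as described in the text.
    """
    if metric == "cosine":
        return [[1.0 - d for d in row] for row in dist]
    return [[-d for d in row] for row in dist]


def candidate_preferences(sim, count=4):
    """Preferences from the minimum to the median off-diagonal similarity.

    Even spacing is an assumption made for this sketch.
    """
    n = len(sim)
    vals = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    low, mid = min(vals), median(vals)
    step = (mid - low) / (count - 1)
    return [low + k * step for k in range(count)]


def mean_silhouette(dist, labels):
    """Mean of s(i) = (b_i - a_i) / max(a_i, b_i) over all samples."""
    n = len(labels)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i in range(n):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own or len(clusters) == 1:
            # Assumed convention: singletons (and a single cluster) score 0.
            continue
        a_i = sum(dist[i][j] for j in own) / len(own)
        b_i = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a_i, b_i)
        if denom > 0:
            total += (b_i - a_i) / denom
    return total / n
```
</preformat>
<p>In use, one would run affinity propagation once per candidate preference and keep the clustering whose labels give the highest mean silhouette value.</p>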
</sec>
<sec><title>Parallelization with Hadoop</title>
<p>We also implemented most of our CCV code for execution as a series of nine MapReduce jobs using the open-source MapReduce implementation Hadoop (<ext-link ext-link-type="uri" xlink:href="http://hadoop.apache.org/">http://hadoop.apache.org/</ext-link>
). MapReduce is not well suited to neighbour-joining tree construction or affinity propagation. The MapReduce paradigm requires that each map task depend only on the data it is given, whereas both the neighbour-joining and affinity propagation algorithms depend on shared state. However, Nephele provides a Message Passing Interface (MPI) version of Panjo, a neighbour-joining implementation that is able to handle very large trees [<xref ref-type="bibr" rid="B47">47</xref>
]. Our version accepts row-packed rather than column-packed matrices, because row-packed matrices were easier to generate using MapReduce. All MapReduce experiments were run on a Rocks (<ext-link ext-link-type="uri" xlink:href="http://www.rocksclusters.org/">http://www.rocksclusters.org/</ext-link>
) cluster running CentOS 5.4 on Intel Core 2 Quad and Core 2 Duo processors with 8 and 4 GB of memory, respectively, using Java 1.5 and Hadoop 0.20.1.</p>
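<p>To illustrate why k-mer based steps fit the MapReduce paradigm while steps with shared state do not, the following in-process sketch counts k-mers per sequence with a map step and a reduce step. It does not use the Hadoop API and is not one of the nine Nephele jobs; the function names are hypothetical, and plain k-mer counting stands in for the full complete composition vector calculation.</p>
<preformat>
```python
from collections import defaultdict


def map_kmers(record, k):
    """Map step: one (seq_id, sequence) record in, ((seq_id, kmer), 1) pairs out.

    The mapper sees only its own record, which is the independence
    property that MapReduce relies on.
    """
    seq_id, seq = record
    for start in range(len(seq) - k + 1):
        yield (seq_id, seq[start:start + k]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(records, k):
    """Run the map and reduce steps in-process (no Hadoop involved)."""
    emitted = []
    for record in records:
        emitted.extend(map_kmers(record, k))
    return reduce_counts(emitted)
```
</preformat>
<p>Because each call to the mapper touches a single record, the records can be distributed across machines freely. Neighbour-joining, by contrast, must update one shared distance matrix after every join, which is why it does not decompose in this way.</p>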
</sec>
</sec>
<sec><title>Results and Discussion</title>
<sec><title>Genotyping the New York Dataset: Results, Computation Time and Choice of Distance Metric</title>
<p>The dataset used by Holmes and colleagues to study reassortment events throughout New York State was used to build and refine the genotyping methods. This set includes 155 full genomes of H3N2 collected in New York State between 1999 and 2004 as part of the <italic>Influenza </italic>
Genome Sequencing Initiative [<xref ref-type="bibr" rid="B48">48</xref>
], a worldwide sequencing initiative (this project has also sampled from the southern hemisphere, in Australia and New Zealand). Because the complete genome of each sample in this set was sequenced, the dataset provided genes with differing rates of evolution with which to test the pipeline.</p>
<p>Clustering was performed on the eight genes studied in detail in that study (HA, M1, NA, NP, NS1, PA, PB1, PB2) as described in the Implementation section. Cluster counts ranged from 5 (NP, PB2) to 15 (PB1), with clusters ranging in size from 1 (representing an outlier in the dataset) to 36. The results from the affinity propagation clustering matched the clade structure of the trees. The trees for all eight genes, colored by cluster membership, can be seen in Additional File <xref ref-type="supplementary-material" rid="S2">2</xref>
. All phylogenetic trees in this work were produced using TreeViewJ [<xref ref-type="bibr" rid="B49">49</xref>
]. The segmented nature of the <italic>Influenza </italic>
genome adds a level of complexity to the genotyping problem. For each sample, the set of clusters assigned to its eight genes defines a cluster "profile," which represents the composite genotype of the complete genome for that sample. For the 155 samples in the dataset, 32 genotypes were identified. Among these, the groups of samples identified by Holmes and colleagues as being involved in reassortment events were found as distinct genotypes.</p>
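<p>The cluster profile described above can be sketched as a tuple of per-gene cluster identifiers, with samples that share a tuple assigned to the same composite genotype. The function names, sample names and cluster numbers below are hypothetical and serve only as illustration.</p>
<preformat>
```python
from collections import defaultdict

# Fixed gene order, so that profiles are comparable between samples.
GENES = ("HA", "M1", "NA", "NP", "NS1", "PA", "PB1", "PB2")


def cluster_profile(per_gene_clusters):
    """Return the tuple of cluster ids for one sample, in fixed gene order."""
    return tuple(per_gene_clusters[gene] for gene in GENES)


def group_by_profile(samples):
    """Group sample names by their eight-gene cluster profile."""
    genotypes = defaultdict(list)
    for name, per_gene in samples.items():
        genotypes[cluster_profile(per_gene)].append(name)
    return dict(genotypes)
```
</preformat>
<p>Two samples whose profiles differ in only one position, for example in the HA cluster alone, fall into different composite genotypes, which is how a reassortment of a single segment would show up in this representation.</p>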
<p>One of the major advantages of the complete composition vector approach over traditional phylogenetic methods is the speed of analysis. For the individual gene segments in the test dataset (155 sequences), execution times ranged from 1.50 minutes (M1) to 2.25 minutes (NA). In contrast, alignment times using MUSCLE were on the order of 5-10 minutes, and inference of maximum likelihood trees took roughly 20 minutes per gene. If the genes are concatenated to create a full genome sequence, the gains are even larger: trees were produced in a few minutes with the CCV-based approach, rather than hours for traditional alignment-based methods. This is consistent with initial results from composition vector based approaches, which focused on inferring trees for complete prokaryotic genomes [<xref ref-type="bibr" rid="B26">26</xref>
].</p>
<p>To determine the best distance metric to use for clustering, we compared the patristic distances (the distances between two leaves along the branches of a tree) of the phylogenetic trees inferred using the neighbour-joining algorithm on cosine and Euclidean distance matrices with those of the maximum likelihood trees (see Implementation section for details of computation). Patristic distances were calculated using the TreeDistanceMatrix methods from the Phylogenetic Analysis Java Library [<xref ref-type="bibr" rid="B50">50</xref>
]. For each gene and distance measure, the patristic distances of the CCV tree were plotted against those of the maximum likelihood tree. Figure <xref ref-type="fig" rid="F1">1</xref>
 shows the patristic distance plots for HA and M1, which represent rapidly mutating and slowly mutating genes, respectively. The neighbour-joining algorithm applied to cosine distance matrices produces the trees that are most similar to the maximum likelihood trees, while the Euclidean distance metric produces trees with overly large distances near the leaves, as shown in Figure <xref ref-type="fig" rid="F2">2</xref>
. The patristic distances of the cosine distance trees have a linear relationship with those of the maximum likelihood trees, while the trees produced using Euclidean distance show a higher-order relationship. These results indicate that, although Euclidean distance has been used as the distance metric in the majority of previously published work involving composition/complete composition vectors [<xref ref-type="bibr" rid="B15">15</xref>
,<xref ref-type="bibr" rid="B44">44</xref>
], cosine distance provides a better correlation with trees produced by traditional phylogenetic methods.</p>
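<p>As a minimal sketch of the patristic distance used in this comparison, the following code sums branch lengths along the path joining two leaves. It is not the TreeDistanceMatrix implementation; the tree representation (a mapping from each node to its parent and branch length) and the function names are assumptions made for this example. Rooting the tree is only a convenience, since the path length between two leaves does not depend on where the root is placed.</p>
<preformat>
```python
def path_to_root(tree, node):
    """Return {ancestor: distance from node}, walking up the parent links.

    `tree` maps each non-root node to a (parent, branch_length) pair.
    """
    dist = {node: 0.0}
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def patristic_distance(tree, leaf_a, leaf_b):
    """Sum of the branch lengths on the path joining two leaves."""
    up_a = path_to_root(tree, leaf_a)
    up_b = path_to_root(tree, leaf_b)
    # The lowest common ancestor minimises the summed distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)
```
</preformat>
<p>Computing this value for every pair of leaves in two trees built from the same samples gives the paired distances that are plotted against each other in a comparison of this kind.</p>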
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p><bold>Comparison of Distance Metrics Used to Create Trees</bold>
. Phylogenetic trees of the HA gene from the New York State dataset constructed with (a) Euclidean and (b) cosine distances. Note the long leaf-to-leaf distances on tree (a).</p>
</caption>
<graphic xlink:href="1751-0473-6-13-2"></graphic>
</fig>
</sec>
<sec><title>Clustering on H5N1 Standard Nomenclature Dataset for Validation</title>
<p>In 2007, the World Health Organization (WHO), along with the World Organization for Animal Health (OIE) and the Food and Agriculture Organization of the United Nations (FAO), released, in poster form, a standard nomenclature system for the various lineages of <italic>Influenza </italic>
H5N1 found in over 50 countries throughout the world. This nomenclature is intended to replace the nomenclature currently used in publications, where samples are often identified by the location of the earliest sample with the closest genetic similarity (for example, "Fujian-like" or "Qinghai lineage"). Alignments of 904 HA sequences were created, and clades were chosen from the tree based on a set of rules. These clades, developed to define a new standard nomenclature, provided an opportunity to perform a blind test of our genotyping system.</p>
<p>In addition to the complete 904-sample dataset, the authors provided a smaller, 109-sample representative dataset. Of these samples, we were able to find 94 in GenBank, which were therefore available, with metadata, in our database. We ran our genotyping pipeline on these sequences and found 18 clusters, as opposed to the 19 clades found in the nomenclature study. The phylogenetic tree, colored by cluster membership, is shown in Figure <xref ref-type="fig" rid="F3">3</xref>
. To compare the results of our genotyping pipeline with the expert-defined genotypes, we used the Adjusted Rand Index [<xref ref-type="bibr" rid="B48">48</xref>
], which has an expected value of zero for random partitions and a maximum value of 1 for identical partitions. The Adjusted Rand Index for this experiment was 0.833, indicating strong agreement between our results and the clades defined by the WHO/FAO/OIE.</p>
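<p>For readers unfamiliar with the Adjusted Rand Index, a minimal sketch of its standard pair-counting form is given below. It is not the code used to obtain the value reported above; the function name is hypothetical, and the input is assumed to be two equal-length label lists covering at least two samples.</p>
<preformat>
```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two labelings of the same samples.

    Counts pairs of samples placed together in both labelings and
    corrects that count for the agreement expected by chance.
    Requires at least two samples.
    """
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Only happens when both labelings are one cluster or all singletons.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```
</preformat>
<p>The index depends only on which samples are grouped together, so the cluster numbers produced by a pipeline do not need to match the clade names assigned by experts.</p>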
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p><bold>Clustering of Influenza Dataset</bold>
. Phylogenetic tree of the WHO Dataset, colored by cluster membership. The shorter HA1 sequences are boxed in red.</p>
</caption>
<graphic xlink:href="1751-0473-6-13-3"></graphic>
</fig>
<p>We also performed a detailed comparison of the trees produced by the CCV method with those from the study. Members of Clade 2.3.1 fell into two distinct groups on our tree, one of which was quite distant from the rest of the samples. On inspecting the sequences, we found that the members of the distant group were much shorter (~1000 bp) than the rest of the sequences (~1600 bp), indicating that they were most likely HA1 sequences rather than the full HA coding sequence, even though they were labelled as full HA. This highlights a problem with the quality of the data currently in the databases: such inconsistencies can significantly distort the results of sequence analysis methods.</p>
</sec>
<sec><title>Bacterial Genomes</title>
<p>We investigated whether our implementation can be used for larger and more complex genomes. We acquired 27 full-length <italic>Actinomycetes </italic>
genomes from the Tuberculosis Database at the Broad Institute and ran them through our pipeline. We were able to produce trees with the same topology as those generated by the Broad Institute in about 30 minutes, compared to the several hours they required using traditional tools (Brian Weiner, personal communication, from unpublished data). In addition, we built trees from the concatenation of the predicted proteins of the same set of genomes and obtained similar results in both running time and tree topology. However, it should be noted that the branch lengths differ.</p>
</sec>
<sec><title>MapReduce</title>
<p>During the development of the CCV code, we ran into memory bottlenecks that required extensive coding to minimize. In addition, we noted that several of the steps could be parallelized. We examined Hadoop to determine whether we could use it to parallelize our code across commodity hardware in a fault-tolerant way. We were able to code most of our algorithms using Hadoop; the exceptions were the neighbour-joining tree and affinity propagation clustering algorithms. We provide a modified version of Panjo [<xref ref-type="bibr" rid="B47">47</xref>
], a neighbour-joining implementation, that reads the cosine distance matrix produced by our Hadoop jobs, which is in row-major (packed) order. We can also output the matrix in the PHYLIP square format. We were able to generate a neighbour-joined tree for 10,270 16S samples in 106 minutes using our Rocks cluster of 30 machines, a computation we were not able to complete at all using our code on a single machine.</p>
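<p>The two matrix layouts mentioned above can be illustrated with a short sketch. We assume here that "packed" means storing only the strict upper triangle of the symmetric distance matrix, row by row; the exact layout read by the modified Panjo is not specified in this section, and the function names are hypothetical.</p>
<preformat>
```python
def packed_index(i, j, n):
    """Index of entry (i, j), i != j, in a row-major packed upper triangle."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def pack_rows(square):
    """Flatten the strict upper triangle of a symmetric matrix row by row."""
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]


def to_phylip_square(names, packed):
    """Format a packed distance matrix as PHYLIP square text.

    The first line holds the number of taxa; each following line holds
    a name padded to 10 characters and one full row of distances.
    """
    n = len(names)
    lines = ["%5d" % n]
    for i, name in enumerate(names):
        row = [0.0 if i == j else packed[packed_index(i, j, n)] for j in range(n)]
        lines.append(name[:10].ljust(10) + " ".join("%.6f" % d for d in row))
    return "\n".join(lines)
```
</preformat>
<p>Packed storage needs n(n-1)/2 values rather than n squared, which roughly halves the memory required for a matrix of the size used for the 16S dataset.</p>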
</sec>
</sec>
<sec><title>Conclusions</title>
<p>We have described a fast and accurate method for computational genotyping, using both human and avian <italic>Influenza </italic>
as model organisms, along with full-length <italic>Actinomycetes </italic>
genomes and 16S samples. This method uses techniques that are faster than traditional methods for both sequence comparison and clustering. Our method produces genotypes that closely match those produced by expert analysis. In addition to providing discrete genotypes with minimal human intervention, the complete composition vector based method produces trees that correlate highly with those produced by sequence alignment and maximum likelihood methods, giving scientists a familiar visualization of the data in a fraction of the time. Possible uses of these tools include displaying the genotypes and associated metadata on a timeline or map, to show the geospatial and temporal distribution of the pathogen population (Figure <xref ref-type="fig" rid="F4">4</xref>
). Finally, our MapReduce implementation should handle tens of thousands of bacterial-size genomes, as well as the genomes of complex eukaryotic organisms (we have tested this with several <italic>Fusarium sp</italic>
. and obtained similar trees, data not shown), such as those being produced by current and next-generation sequencers, providing a method to analyze these large datasets.</p>
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p><bold>Integration of Metadata With Genotyping Results</bold>
. Clustering of HA genes from the 2007 United Kingdom H5N1 outbreak with other samples isolated around the same time. This shows how researchers can combine genotyping with the metadata. Colors on the map represent the cluster membership. The red box indicates the cluster containing the UK Turkey Sample, along with the two Hungarian samples. Note the dates for the UK and Hungary samples (Light Blue): these dates were only provided at the year level in GenBank, even though more accurate dates can be inferred from other sources.</p>
</caption>
<graphic xlink:href="1751-0473-6-13-4"></graphic>
</fig>
</sec>
<sec><title>Availability</title>
<p>Project name: Nephele</p>
<p>Project home page: <ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/nephele/">http://code.google.com/p/nephele/</ext-link>
</p>
<p>Operating system: Linux, Mac OS X, Unix</p>
<p>Programming language: Java and C</p>
<p>License: <underline>Apache License 2.0</underline>
</p>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><title>Authors' contributions</title>
<p>MEC, MP, and SM wrote source code for CCV. MEC wrote source code for Hadoop CCV. MEC, LH, and MP conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors have read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1"><caption><title>Additional file 1</title>
<p><bold>Representative Standard Nomenclature Dataset of H5N1 Genotypes</bold>
. A set of 94 sequences representing WHO expert-defined genotypes (<ext-link ext-link-type="uri" xlink:href="http://www.who.int/csr/disease/avian_influenza/guidelines/nomenclature/">http://www.who.int/csr/disease/avian_influenza/guidelines/nomenclature/</ext-link>
).</p>
</caption>
<media xlink:href="1751-0473-6-13-S1.XLSX" mimetype="application" mime-subtype="vnd.ms-excel"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2"><caption><title>Additional file 2</title>
<p><bold>Clustering of Eight Genes from Influenza H3N2 Viruses (HA, M1, NA, NP, NS1, PA, PB1, PB2)</bold>
. This dataset consists of 155 samples, taken from New York State during the 1999-2000, 2001-2002, 2002-2003, and 2003-2004 flu seasons.</p>
</caption>
<media xlink:href="1751-0473-6-13-S2.PDF" mimetype="application" mime-subtype="pdf"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>This work was funded through the MITRE Internal Research Program. Approved for Public Release: 10-9999. Distribution Unlimited.</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="journal"><name><surname>Li</surname>
<given-names>KS</given-names>
</name>
<name><surname>Guan</surname>
<given-names>Y</given-names>
</name>
<name><surname>Wang</surname>
<given-names>J</given-names>
</name>
<name><surname>Smith</surname>
<given-names>GJ</given-names>
</name>
<name><surname>Xu</surname>
<given-names>KM</given-names>
</name>
<name><surname>Duan</surname>
<given-names>L</given-names>
</name>
<name><surname>Rahardjo</surname>
<given-names>AP</given-names>
</name>
<name><surname>Puthavathana</surname>
<given-names>P</given-names>
</name>
<name><surname>Buranathai</surname>
<given-names>C</given-names>
</name>
<name><surname>Nguyen</surname>
<given-names>TD</given-names>
</name>
<etal></etal>
<article-title>Genesis of a highly pathogenic and potentially pandemic H5N1 influenza virus in eastern Asia</article-title>
<source>Nature</source>
<year>2004</year>
<volume>430</volume>
<fpage>209</fpage>
<lpage>213</lpage>
<pub-id pub-id-type="doi">10.1038/nature02746</pub-id>
<pub-id pub-id-type="pmid">15241415</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Campitelli</surname>
<given-names>L</given-names>
</name>
<name><surname>Di Martino</surname>
<given-names>A</given-names>
</name>
<name><surname>Spagnolo</surname>
<given-names>D</given-names>
</name>
<name><surname>Smith</surname>
<given-names>GJ</given-names>
</name>
<name><surname>Di Trani</surname>
<given-names>L</given-names>
</name>
<name><surname>Facchini</surname>
<given-names>M</given-names>
</name>
<name><surname>De Marco</surname>
<given-names>MA</given-names>
</name>
<name><surname>Foni</surname>
<given-names>E</given-names>
</name>
<name><surname>Chiapponi</surname>
<given-names>C</given-names>
</name>
<name><surname>Martin</surname>
<given-names>AM</given-names>
</name>
<etal></etal>
<article-title>Molecular analysis of avian H7 influenza viruses circulating in Eurasia in 1999-2005: detection of multiple reassortant virus genotypes</article-title>
<source>J Gen Virol</source>
<year>2008</year>
<volume>89</volume>
<fpage>48</fpage>
<lpage>59</lpage>
<pub-id pub-id-type="doi">10.1099/vir.0.83111-0</pub-id>
<pub-id pub-id-type="pmid">18089728</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><name><surname>Rambaut</surname>
<given-names>A</given-names>
</name>
<name><surname>Pybus</surname>
<given-names>OG</given-names>
</name>
<name><surname>Nelson</surname>
<given-names>MI</given-names>
</name>
<name><surname>Viboud</surname>
<given-names>C</given-names>
</name>
<name><surname>Taubenberger</surname>
<given-names>JK</given-names>
</name>
<name><surname>Holmes</surname>
<given-names>EC</given-names>
</name>
<article-title>The genomic and epidemiological dynamics of human influenza A virus</article-title>
<source>Nature</source>
<year>2008</year>
<volume>453</volume>
<fpage>615</fpage>
<lpage>619</lpage>
<pub-id pub-id-type="doi">10.1038/nature06945</pub-id>
<pub-id pub-id-type="pmid">18418375</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>De Groot</surname>
<given-names>AS</given-names>
</name>
<name><surname>Bosma</surname>
<given-names>A</given-names>
</name>
<name><surname>Chinai</surname>
<given-names>N</given-names>
</name>
<name><surname>Frost</surname>
<given-names>J</given-names>
</name>
<name><surname>Jesdale</surname>
<given-names>BM</given-names>
</name>
<name><surname>Gonzalez</surname>
<given-names>MA</given-names>
</name>
<name><surname>Martin</surname>
<given-names>W</given-names>
</name>
<name><surname>Saint-Aubin</surname>
<given-names>C</given-names>
</name>
<article-title>From genome to vaccine: in silico predictions, ex vivo verification</article-title>
<source>Vaccine</source>
<year>2001</year>
<volume>19</volume>
<fpage>4385</fpage>
<lpage>4395</lpage>
<pub-id pub-id-type="doi">10.1016/S0264-410X(01)00145-1</pub-id>
<pub-id pub-id-type="pmid">11483263</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Yang</surname>
<given-names>HL</given-names>
</name>
<name><surname>Zhu</surname>
<given-names>YZ</given-names>
</name>
<name><surname>Qin</surname>
<given-names>JH</given-names>
</name>
<name><surname>He</surname>
<given-names>P</given-names>
</name>
<name><surname>Jiang</surname>
<given-names>XC</given-names>
</name>
<name><surname>Zhao</surname>
<given-names>GP</given-names>
</name>
<name><surname>Guo</surname>
<given-names>XK</given-names>
</name>
<article-title>In silico and microarray-based genomic approaches to identifying potential vaccine candidates against Leptospira interrogans</article-title>
<source>BMC Genomics</source>
<year>2006</year>
<volume>7</volume>
<fpage>293</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-7-293</pub-id>
<pub-id pub-id-type="pmid">17109759</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="journal"><name><surname>Macken</surname>
<given-names>C</given-names>
</name>
<name><surname>Lu</surname>
<given-names>H</given-names>
</name>
<name><surname>Goodman</surname>
<given-names>J</given-names>
</name>
<name><surname>Boykin</surname>
<given-names>L</given-names>
</name>
<article-title>The value of a database in surveillance and vaccine selection</article-title>
<source>International Congress Series</source>
<year>2001</year>
<volume>1219</volume>
<fpage>103</fpage>
<lpage>106</lpage>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><name><surname>Cummings</surname>
<given-names>CA</given-names>
</name>
<name><surname>Relman</surname>
<given-names>DA</given-names>
</name>
<article-title>Genomics and microbiology. Microbial forensics--"cross-examining pathogens"</article-title>
<source>Science</source>
<year>2002</year>
<volume>296</volume>
<fpage>1976</fpage>
<lpage>1979</lpage>
<pub-id pub-id-type="doi">10.1126/science.1073125</pub-id>
<pub-id pub-id-type="pmid">12004075</pub-id>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="journal"><name><surname>Budowle</surname>
<given-names>B</given-names>
</name>
<name><surname>Schutzer</surname>
<given-names>SE</given-names>
</name>
<name><surname>Ascher</surname>
<given-names>MS</given-names>
</name>
<name><surname>Atlas</surname>
<given-names>RM</given-names>
</name>
<name><surname>Burans</surname>
<given-names>JP</given-names>
</name>
<name><surname>Chakraborty</surname>
<given-names>R</given-names>
</name>
<name><surname>Dunn</surname>
<given-names>JJ</given-names>
</name>
<name><surname>Fraser</surname>
<given-names>CM</given-names>
</name>
<name><surname>Franz</surname>
<given-names>DR</given-names>
</name>
<name><surname>Leighton</surname>
<given-names>TJ</given-names>
</name>
<etal></etal>
<article-title>Toward a system of microbial forensics: from sample collection to interpretation of evidence</article-title>
<source>Appl Environ Microbiol</source>
<year>2005</year>
<volume>71</volume>
<fpage>2209</fpage>
<lpage>2213</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.71.5.2209-2213.2005</pub-id>
<pub-id pub-id-type="pmid">15870301</pub-id>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><name><surname>McEwen</surname>
<given-names>SA</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>TM</given-names>
</name>
<name><surname>Ashford</surname>
<given-names>DA</given-names>
</name>
<name><surname>Heegaard</surname>
<given-names>ED</given-names>
</name>
<name><surname>Kournikakis</surname>
<given-names>B</given-names>
</name>
<article-title>Microbial forensics for natural and intentional incidents of infectious disease involving animals</article-title>
<source>Rev Sci Tech</source>
<year>2006</year>
<volume>25</volume>
<fpage>329</fpage>
<lpage>339</lpage>
<pub-id pub-id-type="pmid">16796058</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="journal"><name><surname>Wang</surname>
<given-names>D</given-names>
</name>
<name><surname>Coscoy</surname>
<given-names>L</given-names>
</name>
<name><surname>Zylberberg</surname>
<given-names>M</given-names>
</name>
<name><surname>Avila</surname>
<given-names>PC</given-names>
</name>
<name><surname>Boushey</surname>
<given-names>HA</given-names>
</name>
<name><surname>Ganem</surname>
<given-names>D</given-names>
</name>
<name><surname>DeRisi</surname>
<given-names>JL</given-names>
</name>
<article-title>Microarray-based detection and genotyping of viral pathogens</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>2002</year>
<volume>99</volume>
<fpage>15687</fpage>
<lpage>15692</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.242579699</pub-id>
<pub-id pub-id-type="pmid">12429852</pub-id>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><name><surname>Ghindilis</surname>
<given-names>AL</given-names>
</name>
<name><surname>Smith</surname>
<given-names>MW</given-names>
</name>
<name><surname>Schwarzkopf</surname>
<given-names>KR</given-names>
</name>
<name><surname>Roth</surname>
<given-names>KM</given-names>
</name>
<name><surname>Peyvan</surname>
<given-names>K</given-names>
</name>
<name><surname>Munro</surname>
<given-names>SB</given-names>
</name>
<name><surname>Lodes</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Stover</surname>
<given-names>AG</given-names>
</name>
<name><surname>Bernards</surname>
<given-names>K</given-names>
</name>
<name><surname>Dill</surname>
<given-names>K</given-names>
</name>
<name><surname>McShea</surname>
<given-names>A</given-names>
</name>
<article-title>CombiMatrix oligonucleotide arrays: genotyping and gene expression assays employing electrochemical detection</article-title>
<source>Biosens Bioelectron</source>
<year>2007</year>
<volume>22</volume>
<fpage>1853</fpage>
<lpage>1860</lpage>
<pub-id pub-id-type="doi">10.1016/j.bios.2006.06.024</pub-id>
<pub-id pub-id-type="pmid">16891109</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="journal"><name><surname>Lindh</surname>
<given-names>M</given-names>
</name>
<name><surname>Andersson</surname>
<given-names>AS</given-names>
</name>
<name><surname>Gusdal</surname>
<given-names>A</given-names>
</name>
<article-title>Genotypes, nt 1858 variants, and geographic origin of hepatitis B virus--large-scale analysis using a new genotyping method</article-title>
<source>J Infect Dis</source>
<year>1997</year>
<volume>175</volume>
<fpage>1285</fpage>
<lpage>1293</lpage>
<pub-id pub-id-type="doi">10.1086/516458</pub-id>
<pub-id pub-id-type="pmid">9180165</pub-id>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="journal"><name><surname>Lin</surname>
<given-names>G</given-names>
</name>
<name><surname>Cai</surname>
<given-names>Z</given-names>
</name>
<name><surname>Wu</surname>
<given-names>J</given-names>
</name>
<name><surname>Wan</surname>
<given-names>XF</given-names>
</name>
<name><surname>Xu</surname>
<given-names>L</given-names>
</name>
<name><surname>Goebel</surname>
<given-names>R</given-names>
</name>
<article-title>Identifying a few foot-and-mouth disease virus signature nucleotide strings for computational genotyping</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>279</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-279</pub-id>
<pub-id pub-id-type="pmid">18554404</pub-id>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><name><surname>Lu</surname>
<given-names>G</given-names>
</name>
<name><surname>Rowley</surname>
<given-names>T</given-names>
</name>
<name><surname>Garten</surname>
<given-names>R</given-names>
</name>
<name><surname>Donis</surname>
<given-names>RO</given-names>
</name>
<article-title>FluGenome: a web tool for genotyping influenza A virus</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<fpage>W275</fpage>
<lpage>W279</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkm365</pub-id>
<pub-id pub-id-type="pmid">17537820</pub-id>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="journal"><name><surname>Wan</surname>
<given-names>XF</given-names>
</name>
<name><surname>Chen</surname>
<given-names>G</given-names>
</name>
<name><surname>Luo</surname>
<given-names>F</given-names>
</name>
<name><surname>Emch</surname>
<given-names>M</given-names>
</name>
<name><surname>Donis</surname>
<given-names>R</given-names>
</name>
<article-title>A quantitative genotype algorithm reflecting H5N1 Avian influenza niches</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>2368</fpage>
<lpage>2375</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm354</pub-id>
<pub-id pub-id-type="pmid">17623701</pub-id>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="journal"><name><surname>Stuyver</surname>
<given-names>L</given-names>
</name>
<name><surname>De Gendt</surname>
<given-names>S</given-names>
</name>
<name><surname>Van Geyt</surname>
<given-names>C</given-names>
</name>
<name><surname>Zoulim</surname>
<given-names>F</given-names>
</name>
<name><surname>Fried</surname>
<given-names>M</given-names>
</name>
<name><surname>Schinazi</surname>
<given-names>RF</given-names>
</name>
<name><surname>Rossau</surname>
<given-names>R</given-names>
</name>
<article-title>A new genotype of hepatitis B virus: complete genome and phylogenetic relatedness</article-title>
<source>J Gen Virol</source>
<year>2000</year>
<volume>81</volume>
<fpage>67</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="pmid">10640543</pub-id>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="book"><name><surname>Colosimo</surname>
<given-names>M</given-names>
</name>
<name><surname>Hirschman</surname>
<given-names>L</given-names>
</name>
<name><surname>Keybl</surname>
<given-names>M</given-names>
</name>
<name><surname>Luciano</surname>
<given-names>J</given-names>
</name>
<name><surname>Mardis</surname>
<given-names>S</given-names>
</name>
<name><surname>Peterson</surname>
<given-names>M</given-names>
</name>
<source>Genomics For Bioforensics: MITRE Sponsored Research Final Report</source>
<year>2008</year>
<publisher-name>Bedford, MA: The MITRE Corporation</publisher-name>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><name><surname>Thompson</surname>
<given-names>JD</given-names>
</name>
<name><surname>Higgins</surname>
<given-names>DG</given-names>
</name>
<name><surname>Gibson</surname>
<given-names>TJ</given-names>
</name>
<article-title>CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice</article-title>
<source>Nucleic Acids Res</source>
<year>1994</year>
<volume>22</volume>
<fpage>4673</fpage>
<lpage>4680</lpage>
<pub-id pub-id-type="doi">10.1093/nar/22.22.4673</pub-id>
<pub-id pub-id-type="pmid">7984417</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="journal"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
<article-title>MUSCLE: a multiple sequence alignment method with reduced time and space complexity</article-title>
<source>BMC Bioinformatics</source>
<year>2004</year>
<volume>5</volume>
<fpage>113</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-5-113</pub-id>
<pub-id pub-id-type="pmid">15318951</pub-id>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><name><surname>Notredame</surname>
<given-names>C</given-names>
</name>
<name><surname>Higgins</surname>
<given-names>DG</given-names>
</name>
<name><surname>Heringa</surname>
<given-names>J</given-names>
</name>
<article-title>T-Coffee: A novel method for fast and accurate multiple sequence alignment</article-title>
<source>J Mol Biol</source>
<year>2000</year>
<volume>302</volume>
<fpage>205</fpage>
<lpage>217</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.2000.4042</pub-id>
<pub-id pub-id-type="pmid">10964570</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="journal"><name><surname>Do</surname>
<given-names>CB</given-names>
</name>
<name><surname>Mahabhashyam</surname>
<given-names>MS</given-names>
</name>
<name><surname>Brudno</surname>
<given-names>M</given-names>
</name>
<name><surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
<article-title>ProbCons: Probabilistic consistency-based multiple sequence alignment</article-title>
<source>Genome Res</source>
<year>2005</year>
<volume>15</volume>
<fpage>330</fpage>
<lpage>340</lpage>
<pub-id pub-id-type="doi">10.1101/gr.2821705</pub-id>
<pub-id pub-id-type="pmid">15687296</pub-id>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="journal"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
<name><surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
<article-title>Multiple sequence alignment</article-title>
<source>Curr Opin Struct Biol</source>
<year>2006</year>
<volume>16</volume>
<fpage>368</fpage>
<lpage>373</lpage>
<pub-id pub-id-type="doi">10.1016/j.sbi.2006.04.004</pub-id>
<pub-id pub-id-type="pmid">16679011</pub-id>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="journal"><name><surname>DeSantis</surname>
<given-names>TZ</given-names>
<suffix>Jr</suffix>
</name>
<name><surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name><surname>Keller</surname>
<given-names>K</given-names>
</name>
<name><surname>Brodie</surname>
<given-names>EL</given-names>
</name>
<name><surname>Larsen</surname>
<given-names>N</given-names>
</name>
<name><surname>Piceno</surname>
<given-names>YM</given-names>
</name>
<name><surname>Phan</surname>
<given-names>R</given-names>
</name>
<name><surname>Andersen</surname>
<given-names>GL</given-names>
</name>
<article-title>NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<fpage>W394</fpage>
<lpage>399</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl244</pub-id>
<pub-id pub-id-type="pmid">16845035</pub-id>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="journal"><name><surname>Wallace</surname>
<given-names>IM</given-names>
</name>
<name><surname>O'Sullivan</surname>
<given-names>O</given-names>
</name>
<name><surname>Higgins</surname>
<given-names>DG</given-names>
</name>
<name><surname>Notredame</surname>
<given-names>C</given-names>
</name>
<article-title>M-Coffee: combining multiple sequence alignment methods with T-Coffee</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<fpage>1692</fpage>
<lpage>1699</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl091</pub-id>
<pub-id pub-id-type="pmid">16556910</pub-id>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><name><surname>Chu</surname>
<given-names>KH</given-names>
</name>
<name><surname>Qi</surname>
<given-names>J</given-names>
</name>
<name><surname>Yu</surname>
<given-names>ZG</given-names>
</name>
<name><surname>Anh</surname>
<given-names>V</given-names>
</name>
<article-title>Origin and phylogeny of chloroplasts revealed by a simple correlation analysis of complete genomes</article-title>
<source>Mol Biol Evol</source>
<year>2004</year>
<volume>21</volume>
<fpage>200</fpage>
<lpage>206</lpage>
<pub-id pub-id-type="pmid">14595102</pub-id>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><name><surname>Gao</surname>
<given-names>L</given-names>
</name>
<name><surname>Qi</surname>
<given-names>J</given-names>
</name>
<name><surname>Sun</surname>
<given-names>J</given-names>
</name>
<name><surname>Hao</surname>
<given-names>B</given-names>
</name>
<article-title>Prokaryote phylogeny meets taxonomy: An exhaustive comparison of composition vector trees with systematic bacteriology</article-title>
<source>Sci China C Life Sci</source>
<year>2007</year>
<volume>50</volume>
<fpage>587</fpage>
<lpage>599</lpage>
<pub-id pub-id-type="doi">10.1007/s11427-007-0084-3</pub-id>
<pub-id pub-id-type="pmid">17879055</pub-id>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="book"><name><surname>Wu</surname>
<given-names>X</given-names>
</name>
<name><surname>Wan</surname>
<given-names>X-F</given-names>
</name>
<name><surname>Wu</surname>
<given-names>G</given-names>
</name>
<name><surname>Xu</surname>
<given-names>D</given-names>
</name>
<name><surname>Lin</surname>
<given-names>G</given-names>
</name>
<article-title>Whole Genome Phylogeny via Complete Composition Vectors</article-title>
<source>Technical Report TR05-06</source>
<year>2005</year>
<publisher-name>Department of Computing Science, University of Alberta</publisher-name>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="journal"><name><surname>Retief</surname>
<given-names>JD</given-names>
</name>
<article-title>Phylogenetic analysis using PHYLIP</article-title>
<source>Methods Mol Biol</source>
<year>2000</year>
<volume>132</volume>
<fpage>243</fpage>
<lpage>258</lpage>
<pub-id pub-id-type="pmid">10547839</pub-id>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><name><surname>Wilgenbusch</surname>
<given-names>JC</given-names>
</name>
<name><surname>Swofford</surname>
<given-names>D</given-names>
</name>
<article-title>Inferring evolutionary trees with PAUP*</article-title>
<source>Curr Protoc Bioinformatics</source>
<year>2003</year>
<volume>6</volume>
<comment>Unit 6.4</comment>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="journal"><name><surname>Giribet</surname>
<given-names>G</given-names>
</name>
<article-title>Exploring the behavior of POY, a program for direct optimization of molecular data</article-title>
<source>Cladistics</source>
<year>2001</year>
<volume>17</volume>
<fpage>S60</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1111/j.1096-0031.2001.tb00105.x</pub-id>
<pub-id pub-id-type="pmid">12240678</pub-id>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="journal"><name><surname>Saitou</surname>
<given-names>N</given-names>
</name>
<name><surname>Nei</surname>
<given-names>M</given-names>
</name>
<article-title>The neighbor-joining method: a new method for reconstructing phylogenetic trees</article-title>
<source>Mol Biol Evol</source>
<year>1987</year>
<volume>4</volume>
<fpage>406</fpage>
<lpage>425</lpage>
<pub-id pub-id-type="pmid">3447015</pub-id>
</mixed-citation>
</ref>
<ref id="B32"><mixed-citation publication-type="journal"><name><surname>Rost</surname>
<given-names>U</given-names>
</name>
<name><surname>Bornberg-Bauer</surname>
<given-names>E</given-names>
</name>
<article-title>TreeWiz: interactive exploration of huge trees</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>109</fpage>
<lpage>114</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.1.109</pub-id>
<pub-id pub-id-type="pmid">11836218</pub-id>
</mixed-citation>
</ref>
<ref id="B33"><mixed-citation publication-type="journal"><name><surname>Hughes</surname>
<given-names>T</given-names>
</name>
<name><surname>Hyun</surname>
<given-names>Y</given-names>
</name>
<name><surname>Liberles</surname>
<given-names>DA</given-names>
</name>
<article-title>Visualising very large phylogenetic trees in three dimensional hyperbolic space</article-title>
<source>BMC Bioinformatics</source>
<year>2004</year>
<volume>5</volume>
<fpage>48</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-5-48</pub-id>
<pub-id pub-id-type="pmid">15117420</pub-id>
</mixed-citation>
</ref>
<ref id="B34"><mixed-citation publication-type="journal"><name><surname>Janies</surname>
<given-names>D</given-names>
</name>
<name><surname>Hill</surname>
<given-names>AW</given-names>
</name>
<name><surname>Guralnick</surname>
<given-names>R</given-names>
</name>
<name><surname>Habib</surname>
<given-names>F</given-names>
</name>
<name><surname>Waltari</surname>
<given-names>E</given-names>
</name>
<name><surname>Wheeler</surname>
<given-names>WC</given-names>
</name>
<article-title>Genomic analysis and geographic visualization of the spread of avian influenza (H5N1)</article-title>
<source>Syst Biol</source>
<year>2007</year>
<volume>56</volume>
<fpage>321</fpage>
<lpage>329</lpage>
<pub-id pub-id-type="doi">10.1080/10635150701266848</pub-id>
<pub-id pub-id-type="pmid">17464886</pub-id>
</mixed-citation>
</ref>
<ref id="B35"><mixed-citation publication-type="journal"><name><surname>Frey</surname>
<given-names>BJ</given-names>
</name>
<name><surname>Dueck</surname>
<given-names>D</given-names>
</name>
<article-title>Clustering by passing messages between data points</article-title>
<source>Science</source>
<year>2007</year>
<volume>315</volume>
<fpage>972</fpage>
<lpage>976</lpage>
<pub-id pub-id-type="doi">10.1126/science.1136800</pub-id>
<pub-id pub-id-type="pmid">17218491</pub-id>
</mixed-citation>
</ref>
<ref id="B36"><mixed-citation publication-type="book"><name><surname>Dean</surname>
<given-names>J</given-names>
</name>
<name><surname>Ghemawat</surname>
<given-names>S</given-names>
</name>
<article-title>MapReduce: simplified data processing on large clusters</article-title>
<source>Proceedings of the 6th conference on Symposium on Operating Systems Design &amp; Implementation - Volume 6</source>
<year>2004</year>
<publisher-name>San Francisco, CA: USENIX Association</publisher-name>
</mixed-citation>
</ref>
<ref id="B37"><mixed-citation publication-type="journal"><name><surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<article-title>CloudBurst: highly sensitive read mapping with MapReduce</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1363</fpage>
<lpage>1369</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp236</pub-id>
<pub-id pub-id-type="pmid">19357099</pub-id>
</mixed-citation>
</ref>
<ref id="B38"><mixed-citation publication-type="journal"><name><surname>McKenna</surname>
<given-names>A</given-names>
</name>
<name><surname>Hanna</surname>
<given-names>M</given-names>
</name>
<name><surname>Banks</surname>
<given-names>E</given-names>
</name>
<name><surname>Sivachenko</surname>
<given-names>A</given-names>
</name>
<name><surname>Cibulskis</surname>
<given-names>K</given-names>
</name>
<name><surname>Kernytsky</surname>
<given-names>A</given-names>
</name>
<name><surname>Garimella</surname>
<given-names>K</given-names>
</name>
<name><surname>Altshuler</surname>
<given-names>D</given-names>
</name>
<name><surname>Gabriel</surname>
<given-names>S</given-names>
</name>
<name><surname>Daly</surname>
<given-names>M</given-names>
</name>
<name><surname>DePristo</surname>
<given-names>MA</given-names>
</name>
<article-title>The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<fpage>1297</fpage>
<lpage>1303</lpage>
<pub-id pub-id-type="doi">10.1101/gr.107524.110</pub-id>
<pub-id pub-id-type="pmid">20644199</pub-id>
</mixed-citation>
</ref>
<ref id="B39"><mixed-citation publication-type="journal"><name><surname>Matthews</surname>
<given-names>SJ</given-names>
</name>
<name><surname>Williams</surname>
<given-names>TL</given-names>
</name>
<article-title>MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 1</issue>
<fpage>S15</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-11-S1-S15</pub-id>
<pub-id pub-id-type="pmid">20122186</pub-id>
</mixed-citation>
</ref>
<ref id="B40"><mixed-citation publication-type="other"><name><surname>Ranger</surname>
<given-names>C</given-names>
</name>
<name><surname>Raghuraman</surname>
<given-names>R</given-names>
</name>
<name><surname>Penmetsa</surname>
<given-names>A</given-names>
</name>
<name><surname>Bradski</surname>
<given-names>G</given-names>
</name>
<name><surname>Kozyrakis</surname>
<given-names>C</given-names>
</name>
<article-title>Evaluating MapReduce for Multi-core and Multiprocessor Systems</article-title>
<source>High Performance Computer Architecture, 2007 HPCA 2007 IEEE 13th International Symposium on</source>
<year>2007</year>
<fpage>13</fpage>
<lpage>24</lpage>
</mixed-citation>
</ref>
<ref id="B41"><mixed-citation publication-type="other"><name><surname>Gabriel</surname>
<given-names>E</given-names>
</name>
<name><surname>Fagg</surname>
<given-names>GE</given-names>
</name>
<name><surname>Bosilca</surname>
<given-names>G</given-names>
</name>
<name><surname>Angskun</surname>
<given-names>T</given-names>
</name>
<name><surname>Dongarra</surname>
<given-names>JJ</given-names>
</name>
<name><surname>Squyres</surname>
<given-names>JM</given-names>
</name>
<name><surname>Sahay</surname>
<given-names>V</given-names>
</name>
<name><surname>Kambadur</surname>
<given-names>P</given-names>
</name>
<name><surname>Barrett</surname>
<given-names>B</given-names>
</name>
<name><surname>Lumsdaine</surname>
<given-names>A</given-names>
</name>
<article-title>Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation</article-title>
<source>Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary</source>
<year>2004</year>
<fpage>97</fpage>
<lpage>104</lpage>
</mixed-citation>
</ref>
<ref id="B42"><mixed-citation publication-type="journal"><name><surname>Holmes</surname>
<given-names>EC</given-names>
</name>
<name><surname>Ghedin</surname>
<given-names>E</given-names>
</name>
<name><surname>Miller</surname>
<given-names>N</given-names>
</name>
<name><surname>Taylor</surname>
<given-names>J</given-names>
</name>
<name><surname>Bao</surname>
<given-names>Y</given-names>
</name>
<name><surname>St George</surname>
<given-names>K</given-names>
</name>
<name><surname>Grenfell</surname>
<given-names>BT</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name><surname>Fraser</surname>
<given-names>CM</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Taubenberger</surname>
<given-names>JK</given-names>
</name>
<article-title>Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses</article-title>
<source>PLoS Biol</source>
<year>2005</year>
<volume>3</volume>
<fpage>e300</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pbio.0030300</pub-id>
<pub-id pub-id-type="pmid">16026181</pub-id>
</mixed-citation>
</ref>
<ref id="B43"><mixed-citation publication-type="journal"><name><surname>Reddy</surname>
<given-names>TB</given-names>
</name>
<name><surname>Riley</surname>
<given-names>R</given-names>
</name>
<name><surname>Wymore</surname>
<given-names>F</given-names>
</name>
<name><surname>Montgomery</surname>
<given-names>P</given-names>
</name>
<name><surname>DeCaprio</surname>
<given-names>D</given-names>
</name>
<name><surname>Engels</surname>
<given-names>R</given-names>
</name>
<name><surname>Gellesch</surname>
<given-names>M</given-names>
</name>
<name><surname>Hubble</surname>
<given-names>J</given-names>
</name>
<name><surname>Jen</surname>
<given-names>D</given-names>
</name>
<name><surname>Jin</surname>
<given-names>H</given-names>
</name>
<etal></etal>
<article-title>TB database: an integrated platform for tuberculosis research</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
<volume>37</volume>
<fpage>D499</fpage>
<lpage>508</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn652</pub-id>
<pub-id pub-id-type="pmid">18835847</pub-id>
</mixed-citation>
</ref>
<ref id="B44"><mixed-citation publication-type="journal"><name><surname>Wu</surname>
<given-names>X</given-names>
</name>
<name><surname>Cai</surname>
<given-names>Z</given-names>
</name>
<name><surname>Wan</surname>
<given-names>XF</given-names>
</name>
<name><surname>Hoang</surname>
<given-names>T</given-names>
</name>
<name><surname>Goebel</surname>
<given-names>R</given-names>
</name>
<name><surname>Lin</surname>
<given-names>G</given-names>
</name>
<article-title>Nucleotide composition string selection in HIV-1 subtyping using whole genomes</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>1744</fpage>
<lpage>1752</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm248</pub-id>
<pub-id pub-id-type="pmid">17495995</pub-id>
</mixed-citation>
</ref>
<ref id="B45"><mixed-citation publication-type="journal"><name><surname>Brendel</surname>
<given-names>V</given-names>
</name>
<name><surname>Beckmann</surname>
<given-names>JS</given-names>
</name>
<name><surname>Trifonov</surname>
<given-names>EN</given-names>
</name>
<article-title>Linguistics of nucleotide sequences: morphology and comparison of vocabularies</article-title>
<source>J Biomol Struct Dyn</source>
<year>1986</year>
<volume>4</volume>
<fpage>11</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="pmid">3078230</pub-id>
</mixed-citation>
</ref>
<ref id="B46"><mixed-citation publication-type="other"><name><surname>Li</surname>
<given-names>M</given-names>
</name>
<name><surname>Fang</surname>
<given-names>W</given-names>
</name>
<name><surname>Ling</surname>
<given-names>L</given-names>
</name>
<name><surname>Wang</surname>
<given-names>J</given-names>
</name>
<name><surname>Xuan</surname>
<given-names>Z</given-names>
</name>
<name><surname>Chen</surname>
<given-names>R</given-names>
</name>
<article-title>Phylogeny based on whole genome as inferred from complete information set analysis</article-title>
<source>Journal of Biological Physics</source>
<year>2002</year>
<fpage>439</fpage>
<lpage>447</lpage>
</mixed-citation>
</ref>
<ref id="B47"><mixed-citation publication-type="book"><name><surname>Bullard</surname>
<given-names>J</given-names>
</name>
<source>panjo: a parallel neighbor joining algorithm</source>
<year>2007</year>
<publisher-name>Berkeley</publisher-name>
</mixed-citation>
</ref>
<ref id="B48"><mixed-citation publication-type="journal"><name><surname>Fauci</surname>
<given-names>AS</given-names>
</name>
<article-title>Race against time</article-title>
<source>Nature</source>
<year>2005</year>
<volume>435</volume>
<fpage>423</fpage>
<lpage>424</lpage>
<pub-id pub-id-type="doi">10.1038/435423a</pub-id>
<pub-id pub-id-type="pmid">15917781</pub-id>
</mixed-citation>
</ref>
<ref id="B49"><mixed-citation publication-type="journal"><name><surname>Peterson</surname>
<given-names>MW</given-names>
</name>
<name><surname>Colosimo</surname>
<given-names>ME</given-names>
</name>
<article-title>TreeViewJ: an application for viewing and analyzing phylogenetic trees</article-title>
<source>Source Code Biol Med</source>
<year>2007</year>
<volume>2</volume>
<fpage>7</fpage>
<pub-id pub-id-type="doi">10.1186/1751-0473-2-7</pub-id>
<pub-id pub-id-type="pmid">17974028</pub-id>
</mixed-citation>
</ref>
<ref id="B50"><mixed-citation publication-type="journal"><name><surname>Drummond</surname>
<given-names>A</given-names>
</name>
<name><surname>Strimmer</surname>
<given-names>K</given-names>
</name>
<article-title>PAL: an object-oriented programming library for molecular evolution and phylogenetics</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>662</fpage>
<lpage>663</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.7.662</pub-id>
<pub-id pub-id-type="pmid">11448888</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000935 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000935 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3182884
   |texte=   Nephele: genotyping via complete composition vectors and MapReduce
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:21851626" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021
