MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000D029 (Pmc/Corpus); previous: 000D028; next: 000D030


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier</title>
<author>
<name sortKey="Hu, Xiao" sort="Hu, Xiao" uniqKey="Hu X" first="Xiao" last="Hu">Xiao Hu</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Friedberg, Iddo" sort="Friedberg, Iddo" uniqKey="Friedberg I" first="Iddo" last="Friedberg">Iddo Friedberg</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31648300</idno>
<idno type="pmc">6812468</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6812468</idno>
<idno type="RBID">PMC:6812468</idno>
<idno type="doi">10.1093/gigascience/giz118</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000D02</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000D02</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier</title>
<author>
<name sortKey="Hu, Xiao" sort="Hu, Xiao" uniqKey="Hu X" first="Xiao" last="Hu">Xiao Hu</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Friedberg, Iddo" sort="Friedberg, Iddo" uniqKey="Friedberg I" first="Iddo" last="Friedberg">Iddo Friedberg</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">GigaScience</title>
<idno type="eISSN">2047-217X</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="abs1">
<title>Background</title>
<p>Gene homology type classification is required for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. Consequently, a large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic data sets, these tools require high memory and CPU usage, typically available only in computational clusters.</p>
</sec>
<sec id="abs2">
<title>Findings</title>
<p>Here we present a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. SwiftOrtho uses long
<italic>k</italic>
-mers to speed up homology search, while using a reduced amino acid alphabet and spaced seeds to compensate for the loss of sensitivity due to long
<italic>k</italic>
-mers. In addition, it uses an affinity propagation algorithm to reduce the memory usage when clustering large-scale orthology relationships into orthologous groups. In our tests, SwiftOrtho was the only tool that completed orthology analysis of proteins from 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology data sets, we also show that SwiftOrtho has a high accuracy.</p>
</sec>
<sec id="abs3">
<title>Conclusions</title>
<p>SwiftOrtho enables accurate comparative genomic analyses of thousands of genomes using low-memory computers. SwiftOrtho is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/Rinoahu/SwiftOrtho">https://github.com/Rinoahu/SwiftOrtho</ext-link>
</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fitch, Wm" uniqKey="Fitch W">WM Fitch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Overbeek, R" uniqKey="Overbeek R">R Overbeek</name>
</author>
<author>
<name sortKey="Fonstein, M" uniqKey="Fonstein M">M Fonstein</name>
</author>
<author>
<name sortKey="D Ouza, M" uniqKey="D Ouza M">M D’souza</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rivera, Mc" uniqKey="Rivera M">MC Rivera</name>
</author>
<author>
<name sortKey="Jain, R" uniqKey="Jain R">R Jain</name>
</author>
<author>
<name sortKey="Moore, Je" uniqKey="Moore J">JE Moore</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Remm, M" uniqKey="Remm M">M Remm</name>
</author>
<author>
<name sortKey="Storm, Ce" uniqKey="Storm C">CE Storm</name>
</author>
<author>
<name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="O Rien, Kp" uniqKey="O Rien K">KP O’Brien</name>
</author>
<author>
<name sortKey="Remm, M" uniqKey="Remm M">M Remm</name>
</author>
<author>
<name sortKey="Sonnhammer, Ell" uniqKey="Sonnhammer E">ELL Sonnhammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gabald N, T" uniqKey="Gabald N T">T Gabaldón</name>
</author>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goodman, M" uniqKey="Goodman M">M Goodman</name>
</author>
<author>
<name sortKey="Czelusniak, J" uniqKey="Czelusniak J">J Czelusniak</name>
</author>
<author>
<name sortKey="Moore, Gw" uniqKey="Moore G">GW Moore</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kristensen, Dm" uniqKey="Kristensen D">DM Kristensen</name>
</author>
<author>
<name sortKey="Wolf, Yi" uniqKey="Wolf Y">YI Wolf</name>
</author>
<author>
<name sortKey="Mushegian, Ar" uniqKey="Mushegian A">AR Mushegian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gabald N, T" uniqKey="Gabald N T">T Gabaldón</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hulsen, T" uniqKey="Hulsen T">T Hulsen</name>
</author>
<author>
<name sortKey="Huynen, Ma" uniqKey="Huynen M">MA Huynen</name>
</author>
<author>
<name sortKey="De Vlieg, J" uniqKey="De Vlieg J">J de Vlieg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuzniar, A" uniqKey="Kuzniar A">A Kuzniar</name>
</author>
<author>
<name sortKey="Van Ham, Rchj" uniqKey="Van Ham R">RCHJ van Ham</name>
</author>
<author>
<name sortKey="Pongor, S" uniqKey="Pongor S">S Pongor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Trachana, K" uniqKey="Trachana K">K Trachana</name>
</author>
<author>
<name sortKey="Larsson, Ta" uniqKey="Larsson T">TA Larsson</name>
</author>
<author>
<name sortKey="Powell, S" uniqKey="Powell S">S Powell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ward, N" uniqKey="Ward N">N Ward</name>
</author>
<author>
<name sortKey="Moreno Hagelsieb, G" uniqKey="Moreno Hagelsieb G">G Moreno-Hagelsieb</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tatusov, Rl" uniqKey="Tatusov R">RL Tatusov</name>
</author>
<author>
<name sortKey="Galperin, My" uniqKey="Galperin M">MY Galperin</name>
</author>
<author>
<name sortKey="Natale, Da" uniqKey="Natale D">DA Natale</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roth, Acj" uniqKey="Roth A">ACJ Roth</name>
</author>
<author>
<name sortKey="Gonnet, Gh" uniqKey="Gonnet G">GH Gonnet</name>
</author>
<author>
<name sortKey="Dessimoz, C" uniqKey="Dessimoz C">C Dessimoz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altenhoff, Am" uniqKey="Altenhoff A">AM Altenhoff</name>
</author>
<author>
<name sortKey="Glover, Nm" uniqKey="Glover N">NM Glover</name>
</author>
<author>
<name sortKey="Train, Cm" uniqKey="Train C">CM Train</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alexeyenko, A" uniqKey="Alexeyenko A">A Alexeyenko</name>
</author>
<author>
<name sortKey="Tamas, I" uniqKey="Tamas I">I Tamas</name>
</author>
<author>
<name sortKey="Liu, G" uniqKey="Liu G">G Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Kisman, D" uniqKey="Kisman D">D Kisman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fischer, S" uniqKey="Fischer S">S Fischer</name>
</author>
<author>
<name sortKey="Brunk, Bp" uniqKey="Brunk B">BP Brunk</name>
</author>
<author>
<name sortKey="Chen, F" uniqKey="Chen F">F Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S van Dongen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnhammer, Ell" uniqKey="Sonnhammer E">ELL Sonnhammer</name>
</author>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cannon, Sb" uniqKey="Cannon S">SB Cannon</name>
</author>
<author>
<name sortKey="Young, Nd" uniqKey="Young N">ND Young</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cutts, T" uniqKey="Cutts T">T Cutts</name>
</author>
<author>
<name sortKey="Down, T" uniqKey="Down T">T Down</name>
</author>
<author>
<name sortKey="Dyer, Sc" uniqKey="Dyer S">SC Dyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goodstadt, L" uniqKey="Goodstadt L">L Goodstadt</name>
</author>
<author>
<name sortKey="Ponting, Cp" uniqKey="Ponting C">CP Ponting</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vilella, Aj" uniqKey="Vilella A">AJ Vilella</name>
</author>
<author>
<name sortKey="Severin, J" uniqKey="Severin J">J Severin</name>
</author>
<author>
<name sortKey="Ureta Vidal, A" uniqKey="Ureta Vidal A">A Ureta-Vidal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, F" uniqKey="Chen F">F Chen</name>
</author>
<author>
<name sortKey="Mackey, Aj" uniqKey="Mackey A">AJ Mackey</name>
</author>
<author>
<name sortKey="Vermunt, Jk" uniqKey="Vermunt J">JK Vermunt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altenhoff, Am" uniqKey="Altenhoff A">AM Altenhoff</name>
</author>
<author>
<name sortKey="Dessimoz, C" uniqKey="Dessimoz C">C Dessimoz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnhammer, Ell" uniqKey="Sonnhammer E">ELL Sonnhammer</name>
</author>
<author>
<name sortKey="Ostlund, G" uniqKey="Ostlund G">G Östlund</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cosentino, S" uniqKey="Cosentino S">S Cosentino</name>
</author>
<author>
<name sortKey="Iwasaki, W" uniqKey="Iwasaki W">W Iwasaki</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lechner, M" uniqKey="Lechner M">M Lechner</name>
</author>
<author>
<name sortKey="Findei, S" uniqKey="Findei S">S Findeiß</name>
</author>
<author>
<name sortKey="Steiner, L" uniqKey="Steiner L">L Steiner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altenhoff, Am" uniqKey="Altenhoff A">AM Altenhoff</name>
</author>
<author>
<name sortKey="Boeckmann, B" uniqKey="Boeckmann B">B Boeckmann</name>
</author>
<author>
<name sortKey="Capella Gutierrez, S" uniqKey="Capella Gutierrez S">S Capella-Gutierrez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Curwen, V" uniqKey="Curwen V">V Curwen</name>
</author>
<author>
<name sortKey="Eyras, E" uniqKey="Eyras E">E Eyras</name>
</author>
<author>
<name sortKey="Andrews, Td" uniqKey="Andrews T">TD Andrews</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, Da" uniqKey="Benson D">DA Benson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Camacho, C" uniqKey="Camacho C">C Camacho</name>
</author>
<author>
<name sortKey="Coulouris, G" uniqKey="Coulouris G">G Coulouris</name>
</author>
<author>
<name sortKey="Avagyan, V" uniqKey="Avagyan V">V Avagyan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brohee, S" uniqKey="Brohee S">S Brohée</name>
</author>
<author>
<name sortKey="Van Helden, J" uniqKey="Van Helden J">J van Helden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kielbasa, Sm" uniqKey="Kielbasa S">SM Kiełbasa</name>
</author>
<author>
<name sortKey="Wan, R" uniqKey="Wan R">R Wan</name>
</author>
<author>
<name sortKey="Sato, K" uniqKey="Sato K">K Sato</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buchfink, B" uniqKey="Buchfink B">B Buchfink</name>
</author>
<author>
<name sortKey="Xie, C" uniqKey="Xie C">C Xie</name>
</author>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medlar, A" uniqKey="Medlar A">A Medlar</name>
</author>
<author>
<name sortKey="Holm, L" uniqKey="Holm L">L Holm</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rigo, A" uniqKey="Rigo A">A Rigo</name>
</author>
<author>
<name sortKey="Pedroni, S" uniqKey="Pedroni S">S Pedroni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bratlie, Ms" uniqKey="Bratlie M">MS Bratlie</name>
</author>
<author>
<name sortKey="Johansen, J" uniqKey="Johansen J">J Johansen</name>
</author>
<author>
<name sortKey="Sherman, Bt" uniqKey="Sherman B">BT Sherman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Katju, V" uniqKey="Katju V">V Katju</name>
</author>
<author>
<name sortKey="Bergthorsson, U" uniqKey="Bergthorsson U">U Bergthorsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, Wr" uniqKey="Pearson W">WR Pearson</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shiryev, Sa" uniqKey="Shiryev S">SA Shiryev</name>
</author>
<author>
<name sortKey="Papadopoulos, Js" uniqKey="Papadopoulos J">JS Papadopoulos</name>
</author>
<author>
<name sortKey="Sch Ffer, Aa" uniqKey="Sch Ffer A">AA Schäffer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Tromp, J" uniqKey="Tromp J">J Tromp</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Ilie, S" uniqKey="Ilie S">S Ilie</name>
</author>
<author>
<name sortKey="Khoshraftar, S" uniqKey="Khoshraftar S">S Khoshraftar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, Tf" uniqKey="Smith T">TF Smith</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chao, Km" uniqKey="Chao K">KM Chao</name>
</author>
<author>
<name sortKey="Pearson, Wr" uniqKey="Pearson W">WR Pearson</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Landes, C" uniqKey="Landes C">C Landès</name>
</author>
<author>
<name sortKey="Risler, Jl" uniqKey="Risler J">JL Risler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Murphy, Lr" uniqKey="Murphy L">LR Murphy</name>
</author>
<author>
<name sortKey="Wallqvist, A" uniqKey="Wallqvist A">A Wallqvist</name>
</author>
<author>
<name sortKey="Levy, Rm" uniqKey="Levy R">RM Levy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peterson, El" uniqKey="Peterson E">EL Peterson</name>
</author>
<author>
<name sortKey="Kondev, J" uniqKey="Kondev J">J Kondev</name>
</author>
<author>
<name sortKey="Theriot, Ja" uniqKey="Theriot J">JA Theriot</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, Y" uniqKey="Ye Y">Y Ye</name>
</author>
<author>
<name sortKey="Choi, Jh" uniqKey="Choi J">JH Choi</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gibbons, Tr" uniqKey="Gibbons T">TR Gibbons</name>
</author>
<author>
<name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author>
<name sortKey="Cooper, Ed" uniqKey="Cooper E">ED Cooper</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enright, Aj" uniqKey="Enright A">AJ Enright</name>
</author>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S Van Dongen</name>
</author>
<author>
<name sortKey="Ouzounis, Ca" uniqKey="Ouzounis C">CA Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Emms, Dm" uniqKey="Emms D">DM Emms</name>
</author>
<author>
<name sortKey="Kelly, S" uniqKey="Kelly S">S Kelly</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davis, Jj" uniqKey="Davis J">JJ Davis</name>
</author>
<author>
<name sortKey="Gerdes, S" uniqKey="Gerdes S">S Gerdes</name>
</author>
<author>
<name sortKey="Olsen, Gj" uniqKey="Olsen G">GJ Olsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Frey, Bj" uniqKey="Frey B">BJ Frey</name>
</author>
<author>
<name sortKey="Dueck, D" uniqKey="Dueck D">D Dueck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lam, Sk" uniqKey="Lam S">SK Lam</name>
</author>
<author>
<name sortKey="Pitrou, A" uniqKey="Pitrou A">A Pitrou</name>
</author>
<author>
<name sortKey="Seibert, S" uniqKey="Seibert S">S Seibert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
<author>
<name sortKey="Friedberg, I" uniqKey="Friedberg I">I Friedberg</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Gigascience</journal-id>
<journal-id journal-id-type="iso-abbrev">Gigascience</journal-id>
<journal-id journal-id-type="publisher-id">gigascience</journal-id>
<journal-title-group>
<journal-title>GigaScience</journal-title>
</journal-title-group>
<issn pub-type="epub">2047-217X</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">31648300</article-id>
<article-id pub-id-type="pmc">6812468</article-id>
<article-id pub-id-type="doi">10.1093/gigascience/giz118</article-id>
<article-id pub-id-type="publisher-id">giz118</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Technical Note</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Hu</surname>
<given-names>Xiao</given-names>
</name>
<xref ref-type="aff" rid="aff1"></xref>
<xref ref-type="author-notes" rid="afn1"></xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-1789-8000</contrib-id>
<name>
<surname>Friedberg</surname>
<given-names>Iddo</given-names>
</name>
<pmc-comment>idoerg@iastate.edu</pmc-comment>
<xref ref-type="aff" rid="aff1"></xref>
<xref ref-type="corresp" rid="cor1"></xref>
</contrib>
</contrib-group>
<aff id="aff1">
<institution>Department of Veterinary Microbiology and Preventive Medicine, 2118 Veterinary Medicine, College of Veterinary Medicine, Iowa State University</institution>
, Ames, IA, 50011,
<country country="US">USA</country>
</aff>
<author-notes>
<corresp id="cor1">Correspondence address. Iddo Friedberg, Department of Veterinary Microbiology and Preventive Medicine, College of Veterinary Medicine, Iowa State University, Ames, IA, USA. E-mail:
<email>idoerg@iastate.edu</email>
</corresp>
<fn id="afn1">
<p>Present address: Gianforte School of Computing, 357 Barnard Hall Montana State University, Bozeman, MT, 59717 USA</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<month>10</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2019-10-24">
<day>24</day>
<month>10</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>24</day>
<month>10</month>
<year>2019</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>8</volume>
<issue>10</issue>
<elocation-id>giz118</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>2</month>
<year>2019</year>
</date>
<date date-type="rev-recd">
<day>07</day>
<month>6</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>9</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2019. Published by Oxford University Press.</copyright-statement>
<copyright-year>2019</copyright-year>
<license license-type="cc-by" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="giz118.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<sec id="abs1">
<title>Background</title>
<p>Gene homology type classification is required for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. Consequently, a large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic data sets, these tools require high memory and CPU usage, typically available only in computational clusters.</p>
</sec>
<sec id="abs2">
<title>Findings</title>
<p>Here we present a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. SwiftOrtho uses long
<italic>k</italic>
-mers to speed up homology search, while using a reduced amino acid alphabet and spaced seeds to compensate for the loss of sensitivity due to long
<italic>k</italic>
-mers. In addition, it uses an affinity propagation algorithm to reduce the memory usage when clustering large-scale orthology relationships into orthologous groups. In our tests, SwiftOrtho was the only tool that completed orthology analysis of proteins from 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology data sets, we also show that SwiftOrtho has a high accuracy.</p>
</sec>
<sec id="abs3">
<title>Conclusions</title>
<p>SwiftOrtho enables accurate comparative genomic analyses of thousands of genomes using low-memory computers. SwiftOrtho is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/Rinoahu/SwiftOrtho">https://github.com/Rinoahu/SwiftOrtho</ext-link>
</p>
</sec>
</abstract>
<kwd-group kwd-group-type="keywords">
<kwd>orthology analysis</kwd>
<kwd>homology search</kwd>
<kwd>orthology inference</kwd>
<kwd>clustering</kwd>
<kwd>orthologs</kwd>
<kwd>paralogs</kwd>
</kwd-group>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">National Science Foundation</named-content>
<named-content content-type="funder-identifier">10.13039/100000001</named-content>
</funding-source>
<award-id>ABI 1458359</award-id>
</award-group>
</funding-group>
<counts>
<page-count count="12"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Background</title>
<p>Gene homology type classification consists of identifying paralogs and orthologs across species. Orthologs are genes that evolved from a common ancestral gene following speciation, while paralogs are genes that are homologous owing to duplication. Paralogs can be further classified into in-paralogs, which evolved via gene duplication after the speciation event, and out-paralogs, which evolved via gene duplication before the speciation event [
<xref rid="bib1" ref-type="bibr">1</xref>
]. Classifying orthologs and paralogs across species is an important problem because the evolutionary history of genes has implications for our understanding of gene function and evolution.</p>
<p>While the proper inference of homology type involves tracing gene history using phylogenetic trees [
<xref rid="bib2" ref-type="bibr">2</xref>
], several proxy methods have been developed over the years. The most common method to infer orthologs by proxy is reciprocal best hits (RBH) [
<xref rid="bib3" ref-type="bibr">3</xref>
,
<xref rid="bib4" ref-type="bibr">4</xref>
]. Briefly, RBH states the following: when 2 proteins that are encoded by 2 genes, each in a different genome, find each other as the best-scoring match among all homologs, they are considered to be orthologs [
<xref rid="bib3" ref-type="bibr">3</xref>
,
<xref rid="bib4" ref-type="bibr">4</xref>
].</p>
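<p>The RBH rule described above can be illustrated with a minimal Python sketch. This is not SwiftOrtho's implementation: the score layout (a dictionary keyed by pairs of (genome, name) tuples), the function names, and the toy bit scores are all assumptions made for illustration.</p>
<preformat><![CDATA[
```python
def best_hits(scores):
    """Map each (gene, other_genome) pair to the gene's best-scoring hit there.

    `scores` maps (query, target) -> bit score, where each gene is a
    (genome, name) tuple. This layout is an assumption made for the sketch.
    """
    best = {}
    for (query, target), score in scores.items():
        if query[0] == target[0]:
            continue  # RBH only compares genes from different genomes
        key = (query, target[0])
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return best


def reciprocal_best_hits(scores):
    """Return the set of cross-genome gene pairs that are each other's best hit."""
    best = best_hits(scores)
    pairs = set()
    for (query, _target_genome), (target, _score) in best.items():
        back = best.get((target, query[0]))
        if back is not None and back[0] == query:
            pairs.add(frozenset((query, target)))
    return pairs


# Toy all-vs-all scores between genomes "A" and "B" (invented numbers).
A1, A2, B1, B2 = ("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", "b2")
toy_scores = {
    (A1, B1): 200.0, (B1, A1): 200.0,
    (A1, B2): 120.0, (B2, A1): 120.0,
    (A2, B1): 150.0, (B1, A2): 150.0,
    (A2, B2): 90.0, (B2, A2): 90.0,
}
rbh = reciprocal_best_hits(toy_scores)
```
]]></preformat>
<p>In this toy input only the pair a1 and b1 is reciprocal: a2 also finds b1 as its best hit, but b1 prefers a1. Ties are resolved by keeping the first hit seen, which is a simplification of this sketch.</p>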
<p>InParanoid extends the RBH orthology relationship to include both orthologs and in-paralogs. Specifically, InParanoid uses RBH to identify orthologs between 2 species. Additional genes in either species are then classified as in-paralogs if they are more similar to that species' ortholog than to any gene in the other species [
<xref rid="bib5" ref-type="bibr">5–7</xref>
]. The concept of orthologous pairs between 2 species can be extended to an "ortholog group," which is a set of genes that are hypothesized to have descended from a common ancestor [
<xref rid="bib7" ref-type="bibr">7</xref>
]. Several methods have been developed to identify ortholog groups across multiple species, typically classified as either tree-based or graph-based methods. Tree-based methods construct a gene tree from an alignment of homologous sequences in different species and infer orthology relationships by reconciling the gene tree with its corresponding species tree [
<xref rid="bib2" ref-type="bibr">2</xref>
,
<xref rid="bib8" ref-type="bibr">8</xref>
,
<xref rid="bib9" ref-type="bibr">9</xref>
] and can infer a correct orthology relationship if the correct gene tree and species tree are provided [
<xref rid="bib10" ref-type="bibr">10</xref>
]. The chief limiting factor of tree-based methods is the accuracy of the given gene tree and species tree. Erroneous trees lead to incorrect ortholog and in-paralog assignments [
<xref rid="bib9" ref-type="bibr">9–11</xref>
]. Tree-based methods are also computationally expensive, which limits the ability to apply them to a large number of species [
<xref rid="bib10" ref-type="bibr">10</xref>
,
<xref rid="bib12" ref-type="bibr">12–14</xref>
]. Graph-based methods infer orthologs and in-paralogs from homologs and then use different strategies to cluster them into orthologous groups [
<xref rid="bib9" ref-type="bibr">9</xref>
,
<xref rid="bib12" ref-type="bibr">12</xref>
,
<xref rid="bib13" ref-type="bibr">13</xref>
] (Fig. 
<xref ref-type="fig" rid="fig1">1</xref>
). The Clusters of Orthologous Groups (COG) database detects triangles of RBHs in 3 different species and merges the triangles with a common side [
<xref rid="bib15" ref-type="bibr">15</xref>
]. Orthologous Matrix (OMA) clusters RBHs in orthologous groups by finding maximum weight cliques from the similarity graph [
<xref rid="bib16" ref-type="bibr">16</xref>
,
<xref rid="bib17" ref-type="bibr">17</xref>
]. MultiParanoid is an extension of InParanoid, which uses InParanoid to detect triangle orthologs and in-paralogs in 3 different species as seeds and then merges the seeds into larger groups [
<xref rid="bib18" ref-type="bibr">18</xref>
]. OrthoMCL also uses InParanoid to detect orthologs, co-orthologs, and in-paralogs between 2 species [
<xref rid="bib19" ref-type="bibr">19</xref>
,
<xref rid="bib20" ref-type="bibr">20</xref>
] and then uses Markov clustering (MCL) [
<xref rid="bib21" ref-type="bibr">21</xref>
] to cluster these relationships into orthologous groups, where the co-orthologs are ≥2 genes in 1 species that are orthologous to ≥1 genes in another species due to a gene duplication event [
<xref rid="bib1" ref-type="bibr">1</xref>
,
<xref rid="bib22" ref-type="bibr">22</xref>
].</p>
<fig id="fig1" orientation="portrait" position="float">
<label>Figure 1</label>
<caption>
<p>Flow chart of SwiftOrtho. SwiftOrtho is a graph-based method that consists of 3 major steps. (i) All-vs-all homology search: a seed-and-extension method is used to perform a homology search. (ii) Orthology inference: nodes are gene names; edges are similarity scores between pairs of genes. 1. A
<sub>1</sub>
-B
<sub>1</sub>
are putative orthologs identified by RBH; 2. A
<sub>1</sub>
-A
<sub>2</sub>
and B
<sub>1</sub>
-B
<sub>2</sub>
are putative in-paralogs because the bit scores of these pairs are greater than A
<sub>1</sub>
-B
<sub>1</sub>
; 3. A
<sub>2</sub>
-B
<sub>1</sub>
and A
<sub>2</sub>
-B
<sub>2</sub>
are putative co-orthologs because these pairs are not orthologs but A
<sub>1</sub>
-B
<sub>1</sub>
are orthologs and A
<sub>1</sub>
-A
<sub>2</sub>
, B
<sub>1</sub>
-B
<sub>2</sub>
are in-paralogs. (iii) Cluster analysis: Markov clustering or affinity propagation algorithm is used to cluster orthology relationships.</p>
</caption>
<graphic xlink:href="giz118fig1"></graphic>
</fig>
<p>Finally, there are hybrid methods that combine both graph-based and tree-based methods [
<xref rid="bib12" ref-type="bibr">12</xref>
,
<xref rid="bib23" ref-type="bibr">23–26</xref>
]. Typically, hybrid methods first perform all-vs-all sequence alignment, then construct gene families by sequence similarity or conserved gene neighborhood. EnsEMBL first uses RBH to find the gene families, then constructs a phylogenetic gene tree for each gene family [
<xref rid="bib24" ref-type="bibr">24</xref>
]. Finally, each gene tree is reconciled with the species tree to infer paralogs and orthologs.</p>
<p>In theory, graph-based methods are less accurate than tree-based methods because the former identify orthologs and in-paralogs using proxy methods rather than directly inferring homology type from gene and species evolutionary history. However, graph-based methods have been found to be comparably accurate to tree-based methods [
<xref rid="bib10" ref-type="bibr">10</xref>
,
<xref rid="bib11" ref-type="bibr">11</xref>
,
<xref rid="bib27" ref-type="bibr">27</xref>
]. Moreover, a comparison of several methods found that tree-based methods performed even worse than graph-based methods on large data sets [
<xref rid="bib11" ref-type="bibr">11</xref>
].</p>
<p>One study compared several common methods, including simple RBH, graph-based, tree-based, and hybrid methods, and found that the graph-based methods InParanoid and OrthoMCL exhibit the best balance of sensitivity and specificity [
<xref rid="bib28" ref-type="bibr">28</xref>
]. Several studies have also shown that graph-based methods find a better trade-off between specificity and sensitivity than tree-based methods [
<xref rid="bib11" ref-type="bibr">11</xref>
,
<xref rid="bib28" ref-type="bibr">28</xref>
,
<xref rid="bib29" ref-type="bibr">29</xref>
]. For these reasons, graph-based methods are generally preferred for analyzing large-scale data sets. OrthoMCL and InParanoid have been applied to analyze hundreds of genomes; at the same time, they require considerable computational resources that may not be readily available [
<xref rid="bib20" ref-type="bibr">20</xref>
,
<xref rid="bib30" ref-type="bibr">30</xref>
]. More recently, several graph-based tools, such as SonicParanoid, OMA, and ProteinOrtho [
<xref rid="bib17" ref-type="bibr">17</xref>
,
<xref rid="bib31" ref-type="bibr">31</xref>
,
<xref rid="bib32" ref-type="bibr">32</xref>
], have been developed to speed up orthology analysis on large-scale data sets. However, these tools also tend to require high-performance computers with large memory to analyze large-scale data.</p>
<p>Here we present SwiftOrtho, a fast method for orthology classification that makes minimal use of computational resources, especially memory. SwiftOrtho uses a seed-and-extension method to speed up homology search, a binary search method and RBH rule to infer orthologs and in-paralogs, and the affinity propagation algorithm to reduce memory usage in cluster analysis. We compare SwiftOrtho with several existing graph-based tools using the gold standard data set Orthobench [
<xref rid="bib13" ref-type="bibr">13</xref>
], and the Quest for Orthologs service [
<xref rid="bib33" ref-type="bibr">33</xref>
]. Using both benchmarks, we show that SwiftOrtho provides high accuracy with lower CPU and memory usage than other graph-based methods. SwiftOrtho is the only tool that completed an orthology analysis of 1,760 bacterial genomes on a very low-memory computer. With the growing number of genomes, especially microbial genomes, we see SwiftOrtho as a tool of choice for fast and accurate ortholog classification that requires only the modest computational resources found in conventional desktop or laptop computers.</p>
</sec>
<sec id="sec2">
<title>Application of SwiftOrtho</title>
<sec id="sec2-1">
<title>Data sets</title>
<p>We applied SwiftOrtho to 3 data sets to evaluate its predictive quality and performance:
<list list-type="order">
<list-item>
<p>The Euk set was used to evaluate the quality of predicted orthologous groups. This set contains 420,415 protein sequences from 12 eukaryotic species, including
<italic>Caenorhabditis elegans, Drosophila melanogaster, Ciona intestinalis, Danio rerio, Tetraodon nigroviridis, Gallus gallus, Monodelphis domestica, Mus musculus, Rattus norvegicus, Canis familiaris, Pan troglodytes</italic>
, and
<italic>Homo sapiens</italic>
. The protein sequences for these genes were downloaded from EMBL v65 [
<xref rid="bib34" ref-type="bibr">34</xref>
].</p>
</list-item>
<list-item>
<p>The QfO 2011 set was used to evaluate the quality of predicted orthology relationships. This set was the reference proteome data set (2011) of The Quest for Orthologs [
<xref rid="bib33" ref-type="bibr">33</xref>
], which contains 754,149 protein sequences of 66 species.</p>
</list-item>
<list-item>
<p>The large Bac set was used to evaluate performance, including CPU time, real time, and RAM usage. This set includes 5,950,817 protein sequences from 1,760 bacterial species. The protein sequences were downloaded from GenBank [
<xref rid="bib35" ref-type="bibr">35</xref>
]. For a full list see [
<xref rid="bib64" ref-type="bibr">64</xref>
], file: readme.txt.</p>
</list-item>
</list>
</p>
<p>We also compared SwiftOrtho with several existing orthology analysis tools for predictive quality and performance. The methods compared were OrthoMCL (v2.0), FastOrtho, OrthAgogue, and OrthoFinder.</p>
</sec>
<sec id="sec2-2">
<title>Orthology analysis pipeline</title>
<p>The pipeline for all the tools follows the standard steps of graph-based orthology prediction: (i) all-vs-all homology search, (ii) orthology inference, and (iii) cluster analysis.</p>
<sec id="sec2-2-1">
<title>Homology search</title>
<p>SwiftOrtho used its built-in module to perform all-vs-all homology search. For all 3 sets, the E-value was set to 10
<sup>−5</sup>
. The amino acid alphabet was set to the regular 20 amino acids for the 3 sets. The spaced seed parameter was set to "1011111,11111" for Euk, "11111111" for QfO 2011, and "111111" for Bac.</p>
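<p>To make the spaced seed parameter concrete: in a pattern such as "1011111", a 1 marks a position that must match and a 0 marks a position that is ignored when seeds are compared. The sketch below is a generic illustration of that idea; the function name and the example sequences are hypothetical, and SwiftOrtho's actual seed handling may differ.</p>
<preformat>
```python
def spaced_kmers(sequence, pattern):
    """Yield (offset, masked k-mer) for every window of len(pattern).

    Positions where the pattern holds "1" are kept; "0" positions are
    skipped, so 2 windows that differ only there give the same seed.
    """
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    last_start = len(sequence) - len(pattern)
    for start in range(last_start + 1):
        window = sequence[start:start + len(pattern)]
        yield start, "".join(window[i] for i in keep)


# The 2 toy sequences differ only at their second residue (K versus R).
seeds_a = dict(spaced_kmers("MKVLAAGIV", "1011111"))
seeds_b = dict(spaced_kmers("MRVLAAGIV", "1011111"))
print(seeds_a[0], seeds_b[0])
```
</preformat>
<p>Because the second pattern position is a 0, the first window of both sequences yields the same seed even though the underlying residues differ, which is how a spaced seed tolerates a mismatch at that position.</p>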
<p>OrthoMCL, FastOrtho, OrthAgogue, and OrthoFinder use BLASTP (v2.2.27+) [
<xref rid="bib36" ref-type="bibr">36</xref>
] to perform all-vs-all homology search. The first 3 tools require the user to do this manually. To compare the methods, the -e (e-value), -v (number of database sequences to show one-line descriptions), and -b (number of database sequences to show alignments) parameters of BLASTP were set to 10
<sup>−5</sup>
, 1,000,000, and 1,000,000, respectively, for OrthoMCL, FastOrtho, and OrthAgogue. OrthoFinder calls BLASTP itself, with the e-value set to 10
<sup>−3</sup>
.</p>
</sec>
<sec id="sec2-2-2">
<title>Orthology inference</title>
<p>SwiftOrtho, OrthoMCL, FastOrtho, OrthAgogue, and OrthoFinder were applied to perform orthology inference on the homologs. The first 4 tools are able to identify (co-)orthologs and in-paralogs, and the coverage (fraction of aligned regions) was set to 50%, while other parameters were set to their default values (see
<xref ref-type="supplementary-material" rid="sup14">Supplementary Materials</xref>
section 4.2 for full details).</p>
<p>FastOrtho does not report (co-)orthologs and in-paralogs directly. However, the relevant information is stored in an intermediate file, from which we extracted it. OrthoFinder does not report orthology relationships.</p>
</sec>
<sec id="sec2-2-3">
<title>Cluster analysis</title>
<p>All the tools in this study use MCL [
<xref rid="bib21" ref-type="bibr">21</xref>
] for clustering. To control the granularity of the clustering, MCL performs an inflation operation set by the
<italic>-I</italic>
option [
<xref rid="bib21" ref-type="bibr">21</xref>
,
<xref rid="bib37" ref-type="bibr">37</xref>
]. In this study,
<italic>-I</italic>
was set to 1.5. To take advantage of multiprocessor capabilities, we set the thread number of MCL to 12. SwiftOrtho has an alternative clustering algorithm (Affinity Propagation Cluster [APC]), which we have also applied to Euk and Bac.</p>
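<p>The effect of the inflation parameter can be illustrated on a single column of a stochastic matrix: each entry is raised to the power given by -I and the column is rescaled to sum to 1, which strengthens strong edges and weakens weak ones. The snippet below is a minimal sketch of that one operation with made-up numbers; it is not a full MCL implementation, and the function name is ours.</p>
<preformat>
```python
def inflate(column, inflation):
    """Raise each transition probability to `inflation`, then renormalize."""
    powered = [value ** inflation for value in column]
    total = sum(powered)
    return [value / total for value in powered]


column = [0.6, 0.3, 0.1]         # hypothetical transition probabilities
mild = inflate(column, 1.5)      # inflation value used in this study
sharp = inflate(column, 4.0)     # larger values give finer-grained clusters
print([round(v, 3) for v in mild], [round(v, 3) for v in sharp])
```
</preformat>
<p>With a larger inflation value the dominant entry absorbs more of the probability mass, which is why higher -I settings tend to split the graph into smaller clusters.</p>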
</sec>
</sec>
<sec id="sec2-3">
<title>Evaluation of prediction quality</title>
<sec id="sec2-3-1">
<title>Evaluation of predicted orthologous groups</title>
<p>The OrthoBench set was used to evaluate the quality of predicted orthologous groups in Euk. This set contains 70 manually curated orthologous groups of the 12 species from Euk and has been used as a high-quality gold standard benchmark set for orthologous group prediction [
<xref rid="bib13" ref-type="bibr">13</xref>
]. We used OrthoBench v2 (
<xref ref-type="supplementary-material" rid="sup14">Supplementary Table S1</xref>
). For each manually curated group in OrthoBench v2, we identified the best match among the predicted orthologous groups, where the best match is the predicted group that shares the largest number of genes with the curated group. The method used to calculate precision and recall is shown in
<xref ref-type="supplementary-material" rid="sup14">Supplementary Figure S1</xref>
.</p>
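<p>The best-match evaluation can be sketched as follows. The exact formulas are those of Supplementary Figure S1; this sketch assumes the usual set-based definitions (precision is the shared genes divided by the size of the predicted group, recall is the shared genes divided by the size of the curated group), and all gene identifiers and the function name are hypothetical.</p>
<preformat>
```python
def best_match_scores(curated, predicted_groups):
    """Score 1 curated group against its best-matching predicted group.

    The best match is the predicted group sharing the most genes with the
    curated group. The precision and recall formulas are the usual set
    definitions and are an assumption of this sketch.
    """
    curated = set(curated)
    best = max(predicted_groups,
               key=lambda group: len(curated.intersection(group)))
    shared = len(curated.intersection(best))
    precision = shared / len(best) if best else 0.0
    recall = shared / len(curated) if curated else 0.0
    return precision, recall


curated_group = {"g1", "g2", "g3", "g4"}
predicted = [{"g1", "g2", "g3", "x1"}, {"g4"}, {"y1", "y2"}]
print(best_match_scores(curated_group, predicted))
```
</preformat>
<p>Here the first predicted group is the best match: it recovers 3 of the 4 curated genes and contains 1 gene that does not belong, so both precision and recall are 0.75.</p>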
</sec>
<sec id="sec2-3-2">
<title>Evaluation of predicted orthology relationships</title>
<p>The Quest for Orthologs web-based service (QfO) was used to evaluate the quality of the orthology relationships predicted from the QfO 2011 set [
<xref rid="bib33" ref-type="bibr">33</xref>
]. The QfO service evaluates the predictive quality by performing 4 phylogeny-based tests, Species Tree Discordance Benchmark, Generalized Species Tree Discordance Benchmark, Agreement with Reference Gene Phylogenies: SwissTree, and Agreement with Reference Gene Phylogenies: TreeFam-A, and 2 function-based tests, Gene Ontology conservation test and Enzyme Classification conservation test [
<xref rid="bib33" ref-type="bibr">33</xref>
].</p>
<p>We also applied 2 more orthology prediction tools, SonicParanoid [
<xref rid="bib31" ref-type="bibr">31</xref>
] and InParanoid (v4.1) [
<xref rid="bib5" ref-type="bibr">5</xref>
], to the QfO 2011 set and used their results as controls, because InParanoid has the best performance among the results from the QfO service website and SonicParanoid is a fast implementation of InParanoid. The pairwise orthology relationships were extracted from the predicted orthologous groups of all the tools, including SonicParanoid and InParanoid, and then submitted to the QfO web service for further evaluation.</p>
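<p>Extracting pairwise relationships from orthologous groups can be sketched as below. The sketch assumes that every pair of genes from different species within a group counts as an orthology relationship and that same-species pairs are skipped; the text does not spell out this rule, and the gene identifiers and function name are invented.</p>
<preformat>
```python
from itertools import combinations


def ortholog_pairs(groups, species_of):
    """Expand orthologous groups into cross-species gene pairs."""
    pairs = set()
    for group in groups:
        # Sorting makes each unordered pair appear in 1 canonical order.
        for gene_a, gene_b in combinations(sorted(group), 2):
            if species_of[gene_a] != species_of[gene_b]:
                pairs.add((gene_a, gene_b))
    return pairs


species_of = {"h1": "human", "h2": "human", "m1": "mouse", "f1": "fly"}
groups = [["h1", "h2", "m1"], ["f1"]]
print(sorted(ortholog_pairs(groups, species_of)))
```
</preformat>
<p>The human pair h1 and h2 is skipped because both genes come from the same species, and the singleton group contributes no pairs.</p>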
</sec>
</sec>
<sec id="sec2-4">
<title>Hardware</title>
<p>Unless specified otherwise, all tests were run on the Condo cluster of Iowa State University with Intel Xeon E5-2640 v3 at 2.60 GHz, 128 GB RAM, 28 TB free disk. The Linux command "time -v" was used to track CPU and peak memory usage.</p>
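<p>For readers who want comparable measurements from within a script, the following sketch records CPU time and peak resident memory of a child process with Python's standard resource module. It is an alternative to the "time -v" command used in this study, not the procedure the authors followed; it works on Unix-like systems only, and the command being measured is a placeholder.</p>
<preformat>
```python
import resource
import subprocess
import sys


def run_and_measure(command):
    """Run a command, then return (CPU seconds, peak RSS) of child processes.

    The values are cumulative over all children waited for so far. Peak RSS
    is in kilobytes on Linux; other platforms may use other units.
    """
    subprocess.run(command, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_utime + usage.ru_stime, usage.ru_maxrss


# Placeholder workload: a short Python one-liner run as a child process.
cpu_seconds, peak_rss = run_and_measure(
    [sys.executable, "-c", "sum(i * i for i in range(10 ** 6))"]
)
print(cpu_seconds, peak_rss)
```
</preformat>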
</sec>
</sec>
<sec id="sec3">
<title>Findings</title>
<p>We compared the orthology analysis performance of SwiftOrtho, OrthoMCL, FastOrtho, OrthAgogue, and OrthoFinder using Euk, QfO 2011, and Bac. The orthology analysis consisted of homology search, orthology inference, and cluster analysis.</p>
<sec id="sec3-1">
<title>Orthology analysis on Euk</title>
<p>The results of orthology analysis on Euk are summarized in Table 
<xref rid="tbl1" ref-type="table">1</xref>
and are elaborated upon below.</p>
<table-wrap id="tbl1" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<p>Comparative orthology analysis on the Euk set</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th colspan="2" align="center" rowspan="1">SwiftOrtho</th>
<th align="left" rowspan="1" colspan="1">OrthoMCL</th>
<th align="center" rowspan="1" colspan="1">FastOrtho</th>
<th align="center" rowspan="1" colspan="1">OrthAgogue</th>
<th align="center" rowspan="1" colspan="1">OrthoFinder</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Homology search</td>
<td rowspan="1" colspan="1">Method</td>
<td colspan="2" align="center" rowspan="1">SwiftOrtho built-in</td>
<td colspan="4" align="left" rowspan="1">BLASTP</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Hits</td>
<td colspan="2" align="center" rowspan="1">162,695,330</td>
<td colspan="3" align="left" rowspan="1">947,203,546</td>
<td rowspan="1" colspan="1">654,792,861</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Unique Hits</td>
<td colspan="2" align="center" rowspan="1">162,695,330</td>
<td colspan="3" align="left" rowspan="1">297,107,872</td>
<td rowspan="1" colspan="1">266,104,611</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Orthology</td>
<td rowspan="1" colspan="1">(Co-)orthologs</td>
<td colspan="2" align="center" rowspan="1">1,422,920</td>
<td rowspan="1" colspan="1">8,279,424</td>
<td rowspan="1" colspan="1">3,297,613</td>
<td rowspan="1" colspan="1">1,265,553</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">inference</td>
<td rowspan="1" colspan="1">In-paralogs</td>
<td colspan="2" align="center" rowspan="1">631,033</td>
<td rowspan="1" colspan="1">2,517,166</td>
<td rowspan="1" colspan="1">2,546,296</td>
<td rowspan="1" colspan="1">759,989</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Clustering</td>
<td rowspan="1" colspan="1">Algorithm</td>
<td rowspan="1" colspan="1">MCL</td>
<td rowspan="1" colspan="1">APC</td>
<td colspan="4" align="left" rowspan="1">MCL</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Orthologous Groups</td>
<td rowspan="1" colspan="1">44,551</td>
<td rowspan="1" colspan="1">38,748</td>
<td rowspan="1" colspan="1">36,901</td>
<td rowspan="1" colspan="1">40,943</td>
<td rowspan="1" colspan="1">51,297</td>
<td rowspan="1" colspan="1">19,904</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="req-157013751608885820">
<p>APC: Affinity Propagation Cluster; MCL: Markov clustering; N/A: not available.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<sec id="sec3-1-1">
<title>Homology search</title>
<p>The homology search results show that BLASTP detected the largest number of homologs, 947,203,546. SwiftOrtho found 54.76% of the unique homologs detected by BLASTP (162,695,330 of 297,107,872; Table 1) but was 38.7 times faster than BLASTP. SwiftOrtho used longer
<italic>k</italic>
-mers, which reduced both specific and non-specific seed extension. The longer
<italic>k</italic>
-mers cause seed-and-extension methods to ignore sequences with low similarity. According to the RBH rule, orthologs should have higher similarity than non-orthologs, so the smaller number of homologs found by SwiftOrtho does not significantly affect the subsequent orthology inference.</p>
<p>We compared RBHs inferred from homologs detected by BLASTP and SwiftOrtho, and the numbers of RBHs for BLASTP and SwiftOrtho were 899,473 and 957,387, respectively. Identical RBHs were 767,884 (85.37% of BLASTP). These results show that although SwiftOrtho found fewer homologs than BLASTP, it did not significantly reduce the number of RBHs. The following results in Fig. 
<xref ref-type="fig" rid="fig3">3</xref>
also show that there is no significant difference between SwiftOrtho and BLASTP in predicting orthologous groups. Homology searches against a large number of protein sequences are a major bottleneck in bioinformatics pipelines. For that reason, many tools have been developed to speed up this process including, among others, BLAT, Usearch, LAST, DIAMOND, and Topaz [
<xref rid="bib38" ref-type="bibr">38–42</xref>
]. All these tools use longer
<italic>k</italic>
-mers than BLASTP to speed up performance. We also compared SwiftOrtho with them in speed and sensitivity (
<xref ref-type="supplementary-material" rid="sup14">Supplementary Table S9</xref>
). Because BLASTP is widely considered the gold standard for comparing protein sequences, we use its results as the benchmark to evaluate the sensitivity of other homology search tools. We found Usearch and LAST to be the fastest; however, they only found 0.88% and 2.97% of BLASTP's hits, respectively. Topaz and BLAT used the most CPU time but found only 33.48% and 28.34% of the BLASTP hits, respectively. SwiftOrtho and DIAMOND (more sensitive mode) had the highest sensitivity and found 52.72% and 58.30% of the BLASTP hits in a moderate amount of time, respectively. These results show that SwiftOrtho delivers a good trade-off between speed and sensitivity.</p>
</sec>
<sec id="sec3-1-2">
<title>Orthology inference</title>
<p>OrthoMCL and FastOrtho found more orthology relationships than SwiftOrtho and OrthAgogue. This is because OrthoMCL and FastOrtho use the negative logarithm of the e-value as the edge-weighting metric. The BLASTP program rounds e-values <10
<sup>−180</sup>
to 0. Consequently, for homologs with an e-value <10
<sup>−180</sup>
, OrthoMCL and FastOrtho treat them all as RBHs, which overestimates the number of orthologs. An example showing the OrthoMCL and FastOrtho overestimation can be found in Table S4.</p>
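<p>The tie created by rounding can be reproduced in a few lines. The sketch below mimics a report that rounds e-values below 10<sup>−180</sup> to 0 and then applies a negative log10 weight; the e-values, the gene labels, and the cap used when the e-value is 0 are assumptions made for illustration and are not taken from OrthoMCL or FastOrtho.</p>
<preformat>
```python
import math

ROUNDING_LIMIT = 1e-180  # below this, BLASTP reports 0 (per the text above)


def reported_evalue(evalue):
    """Mimic a report that rounds very small e-values down to 0."""
    return evalue if evalue >= ROUNDING_LIMIT else 0.0


def weight(evalue, cap=181.0):
    """-log10(e-value); the cap used for 0 is an assumption of this sketch."""
    return cap if evalue == 0.0 else -math.log10(evalue)


true_evalues = {"b1": 1e-250, "b2": 1e-190}   # hypothetical hits of gene a1
rounded = {gene: weight(reported_evalue(e)) for gene, e in true_evalues.items()}
exact = {gene: weight(e) for gene, e in true_evalues.items()}
print(rounded, exact)
```
</preformat>
<p>After rounding, both hits receive the same weight, so neither can be singled out as the best hit; with the unrounded double-precision values, b1 is clearly the stronger hit.</p>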
</sec>
<sec id="sec3-1-3">
<title>Use of computational resources</title>
<p>OrthoMCL v2.0 used the most CPU time and real time because of the required input/output (I/O) operations. The RAM usage of OrthoMCL was 3.45 GB, while the generated intermediate file occupied >19 TB of disk space. OrthAgogue was the most efficient in real time, because of its ability to exploit a multi-core processor. However, the RAM usage of OrthAgogue was >100 GB, which exceeds that of common workstations and many servers. The orthology inference module of FastOrtho was the most memory efficient among all the tools and was also fast. SwiftOrtho was the most CPU time efficient, although its real time was twice that of OrthAgogue. Because the orthology inference module of SwiftOrtho is written in pure Python, we retested it with the PyPy interpreter, an alternative implementation of Python [
<xref rid="bib43" ref-type="bibr">43</xref>
]. When running with PyPy the real run time of SwiftOrtho was close to that of OrthAgogue (Table S5).</p>
</sec>
<sec id="sec3-1-4">
<title>Cluster analysis</title>
<p>OrthoFinder identified the smallest number of orthologous groups. Other tools identified many more orthologous groups than OrthoFinder, ranging from 36,901 to 51,297. The APC algorithm found fewer clusters than the MCL algorithm.</p>
</sec>
<sec id="sec3-1-5">
<title>Evaluation of predicted orthologous groups</title>
<p>The quality of predicted orthologous groups is shown in Fig. 
<xref ref-type="fig" rid="fig2">2</xref>
. OrthoFinder had the best recall, while SwiftOrtho and OrthAgogue had the highest precision but lower recall than the other tools. SwiftOrtho and OrthAgogue use a more stringent standard to perform orthology inference, a strategy that often increases precision but decreases recall [
<xref rid="bib11" ref-type="bibr">11</xref>
,
<xref rid="bib28" ref-type="bibr">28</xref>
,
<xref rid="bib29" ref-type="bibr">29</xref>
].</p>
<fig id="fig2" orientation="portrait" position="float">
<label>Figure 2</label>
<caption>
<p>Evaluation of predicted orthologous groups. Evaluation of different tools on OrthoBench database. SO+MCL: SwiftOrtho with MCL; SO+APC: SwiftOrtho with Affinity Propagation Clustering; OM: OrthoMCL v2; FO: FastOrtho; OA: OrthAgogue; OF: OrthoFinder.</p>
</caption>
<graphic xlink:href="giz118fig2"></graphic>
</fig>
<p>Because SwiftOrtho’s built-in homology search module has a lower recall than BLASTP, it may reduce the recall of orthologous groups. To test this, we made 2 substitutions: we replaced SwiftOrtho’s homology module with BLASTP for SwiftOrtho, and we replaced BLASTP with SwiftOrtho’s homology module for OrthoMCL, FastOrtho, OrthAgogue, and OrthoFinder. We then reran the orthology analysis on Euk. The results show that for most tools, replacing BLASTP with SwiftOrtho’s built-in homology search module does not significantly reduce the recall (Fig. 
<xref ref-type="fig" rid="fig3">3</xref>
). The difference in recall between using SwiftOrtho’s homology search and using BLASTP is <4% except for OrthoMCL and FastOrtho. The recall for OrthoMCL and FastOrtho decreased by 8% and 7%, respectively. The most likely reason is that the E-value of SwiftOrtho’s homology search module is more precise than that of BLASTP, which reduces the false RBHs as mentioned above. These results show that SwiftOrtho’s homology search module is a reliable and fast alternative to BLASTP.</p>
<fig id="fig3" orientation="portrait" position="float">
<label>Figure 3</label>
<caption>
<p>Comparing BLASTP and SwiftOrtho’s homology search module on the quality of orthologous group prediction. BLASTP and SwiftOrtho’s search module performed an all-vs-all search on the Euk set, respectively. Then, all the orthology prediction tools were used for orthology inference. Finally, the predicted orthology relationships were clustered into orthologous groups by MCL algorithm.</p>
</caption>
<graphic xlink:href="giz118fig3"></graphic>
</fig>
<p>To test the differences exhibited by the clustering component of SwiftOrtho, we ran SwiftOrtho with MCL and APC on the same data. The results (Fig. 
<xref ref-type="fig" rid="fig4">4</xref>
) show that the performance of APC is close to that of MCL. APC improves the recall of most tools (Fig. 
<xref ref-type="fig" rid="fig4">4</xref>
). These results show that APC has a performance similar to that of the MCL algorithm and is a reliable alternative to MCL.</p>
<fig id="fig4" orientation="portrait" position="float">
<label>Figure 4</label>
<caption>
<p>Markov clustering (MCL) versus Affinity Propagation Clustering (APC). Both algorithms were applied to cluster the orthology relationships of the Euk set inferred by different orthology prediction tools into orthologous groups. Because OrthoFinder does not report orthology relationships, the affinity propagation cannot be applied to its results.</p>
</caption>
<graphic xlink:href="giz118fig4"></graphic>
</fig>
</sec>
</sec>
<sec id="sec3-2">
<title>Orthology analysis on QfO 2011</title>
<p>The results of the orthology analysis on QfO 2011 are presented in Table 
<xref rid="tbl2" ref-type="table">2</xref>
and elaborated below.</p>
<table-wrap id="tbl2" orientation="portrait" position="float">
<label>Table 2.</label>
<caption>
<p>Comparative orthology analysis on the Quest for Orthologs reference proteome 2011 data set.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th colspan="2" rowspan="1"></th>
<th align="left" rowspan="1" colspan="1">SwiftOrtho</th>
<th align="center" rowspan="1" colspan="1">OrthoMCL</th>
<th align="center" rowspan="1" colspan="1">FastOrtho</th>
<th align="center" rowspan="1" colspan="1">OrthAgogue</th>
<th align="center" rowspan="1" colspan="1">OrthoFinder</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Homology search</td>
<td rowspan="1" colspan="1">Method</td>
<td rowspan="1" colspan="1">SO built-in</td>
<td colspan="4" rowspan="1">BLASTP</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Hits</td>
<td rowspan="1" colspan="1">183,883,417</td>
<td colspan="3" rowspan="1">642,372,369</td>
<td rowspan="1" colspan="1">935,579,809</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Unique Hits</td>
<td rowspan="1" colspan="1">183,883,417</td>
<td colspan="3" rowspan="1">317,333,885</td>
<td rowspan="1" colspan="1">462,876,579</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Orthology</td>
<td rowspan="1" colspan="1">(Co-)orthologs</td>
<td rowspan="1" colspan="1">2,209,243</td>
<td rowspan="1" colspan="1">3,743,779</td>
<td rowspan="1" colspan="1">2,588,851</td>
<td rowspan="1" colspan="1">2,716,128</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">inference</td>
<td rowspan="1" colspan="1">In-paralogs</td>
<td rowspan="1" colspan="1">6,929,058</td>
<td rowspan="1" colspan="1">11,427,118</td>
<td rowspan="1" colspan="1">13,649,582</td>
<td rowspan="1" colspan="1">13,694,208</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Clustering</td>
<td rowspan="1" colspan="1">Algorithm</td>
<td colspan="4" align="left" rowspan="1">MCL</td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Orthologous groups</td>
<td rowspan="1" colspan="1">60,418</td>
<td rowspan="1" colspan="1">50,970</td>
<td rowspan="1" colspan="1">55,530</td>
<td rowspan="1" colspan="1">50,203</td>
<td rowspan="1" colspan="1">166,217</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="req-157020236516730570">
<p>MCL: Markov clustering; APC: Affinity Propagation Cluster; N/A: not available.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<sec id="sec3-2-1">
<title>Homology search</title>
<p>SwiftOrtho found 183,883,417 unique hits while BLASTP found 462,876,579 unique hits. However, SwiftOrtho was ∼163 times faster than BLASTP.</p>
</sec>
<sec id="sec3-2-2">
<title>Orthology inference</title>
<p>OrthoMCL found many more orthologs and co-orthologs than the other tools. SwiftOrtho found fewer in-paralogs than other available tools. The CPU time of SwiftOrtho was the least of all tools. When the PyPy interpreter was used, the real time of SwiftOrtho was also close to that of the fastest one, OrthAgogue (
<xref ref-type="supplementary-material" rid="sup14">Supplementary Table S6</xref>
).</p>
</sec>
<sec id="sec3-2-3">
<title>Cluster analysis</title>
<p>Overall, the clustering numbers of SwiftOrtho, OrthoMCL, FastOrtho, and OrthAgogue were similar. However, the number of clusters found by OrthoFinder was 3 times that of other tools, and the next evaluation also shows that OrthoFinder performed poorly on QfO 2011.</p>
</sec>
<sec id="sec3-2-4">
<title>Evaluation of predicted ortholog relationships</title>
<p>The evaluation shows that the performance of SwiftOrtho was close to that of InParanoid (Fig. 
<xref ref-type="fig" rid="fig5">5</xref>
). In some tests (Fig. 
<xref ref-type="fig" rid="fig5">5D</xref>
<xref ref-type="fig" rid="fig5">E</xref>
), SwiftOrtho outperformed InParanoid. SwiftOrtho had the best performance in the Generalized Species Tree Discordance Benchmark and Agreement with Reference Gene Phylogenies: TreeFam-A tests. In the Species Tree Discordance Benchmark, SwiftOrtho had the minimum Robinson-Foulds distance. In the Enzyme Classification (EC) conservation test, SwiftOrtho had the maximum Schlicker similarity. These 2 metrics reflect the accuracy of the algorithm, and the results show that SwiftOrtho has an overall higher accuracy than the other tools. At the same time, the recall of SwiftOrtho was lower in some of the QfO tests, the main reason being that SwiftOrtho uses a stringent metric system to identify orthology relationships.</p>
<fig id="fig5" orientation="portrait" position="float">
<label>Figure 5</label>
<caption>
<p>Benchmarking in Quest for Orthologs. (A) Species Tree Discordance Benchmark. InParanoid had the minimum average Robinson-Foulds distance. SwiftOrtho’s average RF distance was close to that of InParanoid. The prediction inferred by OrthoFinder was not available for this test. (B) Generalized Species Tree Discordance Benchmark. InParanoid had the minimum average Robinson-Foulds distance. The prediction inferred by OrthoFinder was not available for this test. (C) Agreement with the Reference Gene Phylogenies of SwissTree. SwiftOrtho had the highest positive prediction value rate (recall). InParanoid had the highest true-positive rate (precision). (D) Agreement with Reference Gene Phylogenies of TreeFam-A. SonicParanoid had the highest positive prediction value rate (recall); however, its true-positive rate (precision) was close to zero. SwiftOrtho had the second highest recall and precision. (E) Gene Ontology conservation test. OrthoMCL had the highest average Schlicker similarity. (F) Enzyme Classification conservation test. SwiftOrtho had the highest average Schlicker similarity. OrthoMCL detected the most orthology relationships and had the highest recall.</p>
</caption>
<graphic xlink:href="giz118fig5"></graphic>
</fig>
</sec>
</sec>
<sec id="sec3-3">
<title>Orthology analysis on Bac</title>
<p>The results of orthology analysis on Bac are summarized in Table 
<xref rid="tbl3" ref-type="table">3</xref>
.</p>
<table-wrap id="tbl3" orientation="portrait" position="float">
<label>Table 3.</label>
<caption>
<p>Comparative orthology analysis on the Bac set</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th colspan="2" align="center" rowspan="1">SwiftOrtho</th>
<th align="center" rowspan="1" colspan="1">OrthoMCL</th>
<th align="center" rowspan="1" colspan="1">FastOrtho</th>
<th align="center" rowspan="1" colspan="1">OrthAgogue</th>
<th align="center" rowspan="1" colspan="1">OrthoFinder</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Homology search</td>
<td rowspan="1" colspan="1">Method</td>
<td colspan="5" align="left" rowspan="1">SO built-in</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Hits</td>
<td colspan="5" align="left" rowspan="1">8,478,732,753</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Unique Hits</td>
<td colspan="5" align="left" rowspan="1">8,478,732,753</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Orthology</td>
<td rowspan="1" colspan="1">(Co-)orthologs</td>
<td align="center" rowspan="1" colspan="1">876,766,940</td>
<td colspan="2" align="center" rowspan="1">N/A</td>
<td rowspan="1" colspan="1">950,683,849</td>
<td rowspan="1" colspan="1">N/A</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">inference</td>
<td rowspan="1" colspan="1">In-paralogs</td>
<td align="center" rowspan="1" colspan="1">622,292</td>
<td colspan="2" align="center" rowspan="1">N/A</td>
<td rowspan="1" colspan="1">663,052</td>
<td rowspan="1" colspan="1">N/A</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Clustering</td>
<td rowspan="1" colspan="1">Algorithm</td>
<td rowspan="1" colspan="1">MCL</td>
<td rowspan="1" colspan="1">APC</td>
<td align="center" rowspan="1" colspan="1">MCL</td>
<td colspan="3" align="left" rowspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Orthologous groups</td>
<td rowspan="1" colspan="1">240,162</td>
<td rowspan="1" colspan="1">167,355</td>
<td rowspan="1" colspan="1">N/A</td>
<td rowspan="1" colspan="1">242,816</td>
<td rowspan="1" colspan="1">N/A</td>
<td rowspan="1" colspan="1">N/A</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="req-157020286369430570">
<p>MCL: Markov clustering; APC: Affinity Propagation Cluster; N/A: not available.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<sec id="sec3-3-1">
<title>Homology search</title>
<p>SwiftOrtho detected 8,966,131,536 homologs in the Bac set within 1,247 CPU hours.</p>
<p>Because it takes a long time to perform all-vs-all BLASTP search on the full Bac, we randomly selected 1,000 protein sequences from Bac and used them to search against the full Bac set. It took BLASTP 5.1 CPU hours to find the homologs of these 1,000 protein sequences. We infer that the estimated CPU time of BLASTP on the full Bac set should be ∼30,000 CPU hours. SwiftOrtho was almost 25 times faster than BLASTP on Bac.</p>
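<p>The extrapolation above is a simple proportion, reproduced here with the figures given in the text. It assumes that BLASTP time scales linearly with the number of query sequences; the variable names are ours.</p>
<preformat>
```python
total_sequences = 5_950_817      # proteins in the Bac set
sample_size = 1_000              # randomly selected query sequences
sample_cpu_hours = 5.1           # BLASTP time for the sample
swiftortho_cpu_hours = 1_247     # SwiftOrtho all-vs-all time

# Scale the sampled run up to the full set, then compare with SwiftOrtho.
blastp_estimate = sample_cpu_hours * total_sequences / sample_size
speedup = blastp_estimate / swiftortho_cpu_hours
print(round(blastp_estimate), round(speedup, 1))
```
</preformat>
<p>The estimate comes to roughly 30,300 CPU hours, which corresponds to a speedup of about 24 over BLASTP.</p>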
</sec>
<sec id="sec3-3-2">
<title>Orthology inference</title>
<p>SwiftOrtho, OrthoMCL, FastOrtho, and OrthAgogue were used to infer (co-)orthologs and in-paralogs from the homologs detected by the homology search module of SwiftOrtho in the Bac set. We did not test OrthoFinder because OrthoFinder does not accept a single file of homologs as input. For the 1,760 proteomes in Bac, OrthoFinder needs to perform 3,097,600 pairwise species-by-species comparisons, which will generate the same number of files. Then, OrthoFinder performs the orthology inference on these 3,097,600 files. Even at 1 minute per file, it will take an estimated 6 CPU years to process all the files.</p>
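<p>The OrthoFinder estimate follows from the number of ordered species pairs and the assumed processing time per file, as the short calculation below shows. The 1 minute per file is the assumption stated in the text.</p>
<preformat>
```python
proteomes = 1_760
comparisons = proteomes ** 2              # ordered species-by-species pairs
minutes_per_file = 1                      # assumption stated in the text
cpu_years = comparisons * minutes_per_file / (60 * 24 * 365)
print(comparisons, round(cpu_years, 1))
```
</preformat>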
<p>Due to memory limitations, only SwiftOrtho and FastOrtho finished the orthology inference on Bac. The results are provided in Table 
<xref rid="tbl3" ref-type="table">3</xref>
. The numbers of (co-)orthologs and in-paralogs inferred by SwiftOrtho and FastOrtho were similar. The number of common orthology relationships between SwiftOrtho and FastOrtho was 861,619,519 (98.2% of SwiftOrtho and 90.6% of FastOrtho). In contrast to the Euk results, SwiftOrtho and FastOrtho had a similar predictive quality on Bac. There are 3 possible explanations for these results. The first is that Euk contains many protein isoforms that cause FastOrtho to overestimate the number of orthologs and in-paralogs. The second is that the gene duplication rate in bacteria is lower than that in eukaryotes [
<xref rid="bib44" ref-type="bibr">44</xref>
,
<xref rid="bib45" ref-type="bibr">45</xref>
]. For Bac, each gene in 1 species has only a small number of homologs in other species, which makes FastOrtho unlikely to overestimate the number of RBHs. The third is that SwiftOrtho uses double-precision floating-point to store the e-value, which extends the smallest e-value that can be represented from 10
<sup>−180</sup>
to 10
<sup>−308</sup>
. This improvement also reduces the possibility that FastOrtho may report false RBHs.</p>
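<p>The lower bound of roughly 10^-308 corresponds to the smallest positive normal value of an IEEE 754 double. The Python sketch below only illustrates that range; it makes no claim about how any of the tools parse e-values, and the variable names are illustrative.</p>
<preformat>
```python
import sys

# Smallest positive normal double (IEEE 754 binary64), about 2.2e-308.
print(sys.float_info.min)  # prints: 2.2250738585072014e-308

tiny_evalue = float("1e-200")  # smaller than 1e-180, still a non-zero double
underflowed = float("1e-400")  # beyond the double range, rounds to 0.0
print(tiny_evalue > 0.0, underflowed == 0.0)  # prints: True True
```
</preformat>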
<p>Computational resource use varied: of the programs tested, only SwiftOrtho and FastOrtho finished the orthology inference step. OrthAgogue did not finish the tests owing to insufficient RAM; OrthoMCL aborted after running out of disk space because it needed >18 TB. The peak RAM usage of SwiftOrtho and FastOrtho was 90.6 and 99.5 GB, respectively. When we used the PyPy interpreter, the peak RAM usage of SwiftOrtho was reduced to 72.1 GB. FastOrtho was ∼1.52 times faster than SwiftOrtho when SwiftOrtho ran in the CPython interpreter. When using the PyPy interpreter, SwiftOrtho ran 1.58 times faster than FastOrtho. The memory usage and CPU time are reported in Table S7.</p>
</sec>
<sec id="sec3-3-3">
<title>Cluster analysis</title>
<p>SwiftOrtho and FastOrtho produced similar numbers of clusters. We compared the APC algorithm and the MCL algorithm, and APC found fewer clusters than MCL. APC used much less memory and less CPU time than MCL. However, owing to the lack of support for multi-threading and a large number of I/O operations, the real run time of APC is longer than that of MCL.</p>
</sec>
<sec id="sec3-3-4">
<title>Tests on a low-memory system</title>
<p>Because SwiftOrtho is designed to process large-scale data on low-memory computers, we used it to analyze Bac on a range of computers with different specifications.</p>
<p>The results show that the memory usage of SwiftOrtho is flexible and adapts to the size of the computer’s memory. In the tests, SwiftOrtho finished an orthology analysis of the Bac set on a computer with only 4 GB RAM in a reasonable time (Table S8).</p>
</sec>
</sec>
<sec id="sec3-4">
<title>Comparison with other orthology analysis pipelines</title>
<p>SonicParanoid, OMA, and Proteinortho are also graph-based methods and have been optimized for large-scale data sets [
<xref rid="bib17" ref-type="bibr">17</xref>
,
<xref rid="bib31" ref-type="bibr">31</xref>
,
<xref rid="bib32" ref-type="bibr">32</xref>
]. We compared SwiftOrtho with these tools in both speed and memory usage. The results are presented in Table S10. OMA seems to be the slowest because it uses the Smith-Waterman algorithm to perform all-vs-all alignment. In our tests, OMA took 0.84 CPU hours to align 2 species (4,064 and 4,140 genes) of the Bac set. For the Bac set, OMA needs to perform 3,097,600 species-by-species alignments and the total time will be >2 million CPU hours. SonicParanoid worked well on the Euk and QfO 2011 sets. Compared with SwiftOrtho, SonicParanoid ran faster and required less RAM on small data sets. However, it exited abnormally when applied to the large Bac set. Proteinortho also worked well on the Euk and QfO 2011 sets. When applied to the Bac set, Proteinortho needed to perform 1,547,920 species-by-species proteome alignments. It took Proteinortho 186.5 CPU hours, using DIAMOND, to complete 23,331 (1.5%) alignments; we therefore estimate that Proteinortho will take ∼12,355 CPU hours to finish a full homology search. Because LAST is much faster than DIAMOND, we reran Proteinortho on the Bac set, using LAST for homology search. The CPU time for LAST on the Bac set was 2,368 hours. Although the previous results (
<xref ref-type="supplementary-material" rid="sup14">Supplementary Table S9</xref>
) show that LAST is ∼20 times faster than SwiftOrtho, LAST required much more CPU time than SwiftOrtho in the all-vs-all homology search step. We think this is because the species-by-species alignment approach requires >1.5 million I/O operations, which significantly reduces the speed. The CPU utilization of orthology inference and clustering of Proteinortho was very low (<10%) when applied to the Bac set, which led to an exceptionally long real run time (>150 hours). This is because Proteinortho occupied ∼85% of physical memory when applied to large-scale data, which resulted in frequent data exchange between RAM and swap space and greatly reduced the speed. In sum, these results show that SwiftOrtho is a top performer on large-scale data.</p>
</sec>
</sec>
<sec sec-type="discussion" id="sec4">
<title>Discussion</title>
<p>We present SwiftOrtho, a new high-performance graph-based homology classification tool. Unlike most tools that only perform orthology inference, SwiftOrtho integrates all the modules necessary for a full orthology analysis, including homology search, orthology inference, and cluster analysis. SwiftOrtho is designed to analyze large-scale genomic data on a normal desktop computer in a reasonable time. In our tests, SwiftOrtho’s homology search module was nearly 30 times faster than BLASTP. The orthology inference module of SwiftOrtho was nearly 500 times faster than OrthoMCL when applied to Euk. When applied to the large-scale data set, Bac, SwiftOrtho was the only program that finished the orthology inference test on a workstation with 32 GB RAM. The cluster module of SwiftOrtho using APC can handle data that are much larger than the available RAM. In our test, APC had comparable recall and accuracy but required considerably less memory than MCL. It should be noted that APC improved the
<italic>F</italic>
<sub>1</sub>
-measure score by increasing recall in most cases. With the help of these optimized modules, SwiftOrtho has successfully finished an orthology analysis of proteins from 1,760 bacterial genomes on a machine with only 4 GB RAM, which makes SwiftOrtho usable for large-scale analyses by researchers who may not have access to expensive computational resources. SwiftOrtho is not only fast but also accurate, as shown in the results produced when running on orthobench and QfO [
<xref rid="bib13" ref-type="bibr">13</xref>
,
<xref rid="bib33" ref-type="bibr">33</xref>
].</p>
</sec>
<sec id="sec5">
<title>Potential Implications</title>
<p>In summary, SwiftOrtho is a fast and accurate orthology prediction tool that can analyze a large number of sequences with minimal computational resource use. The installation and configuration of SwiftOrtho are simple and do not require the user to have any experience in database configuration. It is easy to use because the only input required by SwiftOrtho is a FASTA format file of protein sequences with taxonomy information in the header line. SwiftOrtho can be integrated into various common pipelines where fast orthology classification is required, such as pan-genome analysis, large-scale phylogenetic tree construction, and other multi-genome analyses. It is specifically suited for microbial community analyses, where a large number of sequences and species are involved.</p>
</sec>
<sec sec-type="methods" id="sec6">
<title>Methods</title>
<sec id="sec6-1">
<title>Algorithms</title>
<p>Here we outline the homology search, orthology inference, and clustering as implemented in SwiftOrtho.</p>
<sec id="sec6-1-1">
<title>Homology search</title>
<p>SwiftOrtho uses a seed-and-extension algorithm to find homologous gene pairs [
<xref rid="bib46" ref-type="bibr">46</xref>
,
<xref rid="bib47" ref-type="bibr">47</xref>
]. At the seed phase, SwiftOrtho finds candidate target sequences that share common
<italic>k</italic>
-mers with the query sequence.
<italic>k</italic>
-mer size is an important factor that affects search sensitivity and speed [
<xref rid="bib38" ref-type="bibr">38</xref>
,
<xref rid="bib48" ref-type="bibr">48</xref>
]. SwiftOrtho therefore uses long (≥6)
<italic>k</italic>
-mers to accelerate search speed. At the same time,
<italic>k</italic>
-mer length is negatively correlated with sensitivity [
<xref rid="bib38" ref-type="bibr">38</xref>
]. To compensate for the loss of sensitivity caused by increasing the
<italic>k</italic>
-mer size, SwiftOrtho uses 2 approaches: non-consecutive
<italic>k</italic>
-mers and reduced amino acid alphabets. Non-consecutive
<italic>k</italic>
-mer seeds (known as spaced seeds) were introduced in PatternHunter [
<xref rid="bib19" ref-type="bibr">19</xref>
,
<xref rid="bib49" ref-type="bibr">49</xref>
]. The main difference between consecutive seeds and spaced seeds is that the latter allow mismatches in alignment. For example, the spaced seed 101101 allows mismatches at positions 2 and 5. The total number of matched positions in a spaced seed is known as the weight, so the weight of this seed is 4. A consecutive seed can be considered as a special case of spaced seed in which its weight equals its length. Spaced seeds often provide a better sensitivity than consecutive seeds [
<xref rid="bib49" ref-type="bibr">49</xref>
,
<xref rid="bib50" ref-type="bibr">50</xref>
]. Several tools such as PatternHunter, Usearch, LAST, and DIAMOND [
<xref rid="bib19" ref-type="bibr">19</xref>
,
<xref rid="bib39" ref-type="bibr">39–41</xref>
,
<xref rid="bib49" ref-type="bibr">49</xref>
] have used spaced seeds to increase sensitivity. PatternHunter and Usearch allow users to define custom spaced seeds. The default spaced seed patterns of SwiftOrtho are 1110100010001011 and 11010110111 (two spaced seeds, each with a weight of 8), but users can define their own spaced seeds. Seed patterns were optimized using SpEED [
<xref rid="bib50" ref-type="bibr">50</xref>
] and manual inspection. The choice of the spaced seeds and default alphabet is elaborated upon in the
<xref ref-type="supplementary-material" rid="sup14">Supplementary Materials</xref>
sections 2.1 and 3. At the extension phase, SwiftOrtho uses a variation of the Smith-Waterman algorithm [
<xref rid="bib51" ref-type="bibr">51</xref>
], the
<italic>k</italic>
-banded Smith-Waterman or
<italic>k</italic>
-SWAT, which only allows for
<italic>k</italic>
gaps [
<xref rid="bib52" ref-type="bibr">52</xref>
].
<italic>k</italic>
-SWAT fills a band of cells along the main diagonal of the similarity score matrix (Figure 
<xref ref-type="fig" rid="fig6">6B</xref>
), and the complexity of
<italic>k</italic>
-SWAT is reduced to
<italic>O</italic>
[
<italic>k</italic>
· min(
<italic>n, m</italic>
)], where
<italic>k</italic>
is the maximum allowed number of gaps.</p>
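<p>The spaced-seed idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SwiftOrtho's code: the function names are hypothetical, and seeds are extracted from plain strings without any index structure.</p>
<preformat>
```python
def seed_weight(pattern):
    """Number of matched ('1') positions in a spaced seed pattern."""
    return pattern.count("1")


def spaced_kmers(sequence, pattern):
    """Return (start, seed) pairs, keeping only the '1' positions of the pattern."""
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    return [
        (start, "".join(sequence[start + i] for i in keep))
        for start in range(len(sequence) - span + 1)
    ]


# The example seed from the text ignores positions 2 and 5 (1-based).
print(seed_weight("101101"))             # prints: 4
print(spaced_kmers("MKTAYI", "101101"))  # prints: [(0, 'MTAI')]
print(spaced_kmers("MRTAFI", "101101"))  # prints: [(0, 'MTAI')]
```
</preformat>
<p>The two example peptides differ only at the ignored positions, so they yield the same seed and would be paired at the seed phase, whereas a consecutive 6-mer would miss the match.</p>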
<fig id="fig6" orientation="portrait" position="float">
<label>Figure 6</label>
<caption>
<p>Comparing standard Smith-Waterman with banded Smith-Waterman. A. Similarity score matrix for standard Smith-Waterman. The standard Smith-Waterman algorithm needs to calculate all the entries. B. Similarity score matrix for banded Smith-Waterman. The banded Smith-Waterman algorithm only needs to calculate the entries on and near the diagonal.</p>
</caption>
<graphic xlink:href="giz118fig6"></graphic>
</fig>
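<p>A minimal sketch of a banded Smith-Waterman score computation is given below. It is not SwiftOrtho's k-SWAT implementation: the linear gap penalty and the match and mismatch scores are assumptions chosen for readability, and only the best local score is returned, without traceback.</p>
<preformat>
```python
def banded_smith_waterman(a, b, k, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells within k of the main diagonal."""
    best = 0
    prev = {}  # column index -> score, for the previous row
    # Rows beyond len(b) + k have no cells inside the band.
    for i in range(1, min(len(a), len(b) + k) + 1):
        cur = {}
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,
                prev.get(j - 1, 0) + sub,  # diagonal move
                prev.get(j, 0) + gap,      # gap in b (cells outside the band count as 0)
                cur.get(j - 1, 0) + gap,   # gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


print(banded_smith_waterman("ACGTACGT", "ACGACGT", k=1))  # prints: 12
print(banded_smith_waterman("ACGTACGT", "ACGACGT", k=0))  # prints: 6
```
</preformat>
<p>With k = 1 the band is wide enough to absorb the single deletion, whereas k = 0 restricts the search to the main diagonal and recovers only the ungapped prefix.</p>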
<p>Reduced alphabets are used to represent protein sequences using an alternative alphabet that combines several amino acids into a single representative letter, based on common physico-chemical traits [
<xref rid="bib53" ref-type="bibr">53–55</xref>
]. Compared with the original alphabet of 20 amino acids, reduced alphabets usually improve sensitivity [
<xref rid="bib56" ref-type="bibr">56</xref>
,
<xref rid="bib57" ref-type="bibr">57</xref>
]. At the same time, reduced alphabets also introduce less specific seeds than the original alphabet, reducing the search speed.</p>
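<p>The effect of a reduced alphabet can be illustrated with a small Python sketch. The grouping below is a hypothetical example chosen for illustration; it is not SwiftOrtho's default alphabet, which is described in the Supplementary Materials.</p>
<preformat>
```python
# Hypothetical grouping of the 20 amino acids by broad physico-chemical class.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "G", "P"]

# Map every amino acid to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(sequence):
    """Rewrite a protein sequence in the reduced alphabet ('X' for unknown letters)."""
    return "".join(REDUCE.get(aa, "X") for aa in sequence)


print(reduce_sequence("MKVLD"))  # prints: AKAAD
print(reduce_sequence("MRILE"))  # prints: AKAAD
```
</preformat>
<p>The two peptides share only 2 of 5 residues in the original alphabet but become identical after reduction, which is why reduced alphabets find more seed matches and also more non-specific ones.</p>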
</sec>
<sec id="sec6-1-2">
<title>Orthology inference</title>
<p>The orthology inference step in Fig. 
<xref ref-type="fig" rid="fig1">1</xref>
shows the algorithm to infer orthologs and in-paralogs from homologs: gene A
<sub>1</sub>
in genome A and B
<sub>1</sub>
in genome B are considered to be orthologs according to the RBH rule. If the bit score between gene A
<sub>1</sub>
and A
<sub>2</sub>
in genome A is higher than that between A
<sub>1</sub>
and all its orthologs in other genomes, A
<sub>1</sub>
and A
<sub>2</sub>
are considered in-paralogs in genome A. If A
<sub>1</sub>
in genome A and B
<sub>1</sub>
in genome B are orthologs, in-paralogs of A
<sub>1</sub>
and B
<sub>1</sub>
are co-orthologs. Because orthology inference requires many queries, it is better to store the data in a way that facilitates fast querying. First, SwiftOrtho sorts the data and stores it in the file system. Then, it uses binary search to query the sorted data, dramatically reducing memory usage when compared with a relational database management system or a hash table. With the help of this query system, SwiftOrtho can process data that are much larger than the computer memory.</p>
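<p>The RBH rule and the sort-then-binary-search query strategy can be sketched together as follows. This is a simplified illustration, not SwiftOrtho's implementation: the hits are kept in an in-memory list rather than in a sorted file, gene identifiers are assumed to have the hypothetical form "genome|gene", and in-paralog and co-ortholog detection is omitted.</p>
<preformat>
```python
from bisect import bisect_left, bisect_right


def genome_of(gene):
    return gene.split("|")[0]


def hits_of(sorted_hits, queries, gene):
    """Binary-search the sorted hit list for all hits whose query is `gene`."""
    return sorted_hits[bisect_left(queries, gene):bisect_right(queries, gene)]


def best_hits(sorted_hits, queries, gene):
    """Best-scoring target of `gene` in every other genome."""
    best = {}
    for _, target, score in hits_of(sorted_hits, queries, gene):
        g = genome_of(target)
        if g != genome_of(gene) and score > best.get(g, (None, float("-inf")))[1]:
            best[g] = (target, score)
    return best


def reciprocal_best_hits(hits):
    """Return the set of gene pairs that are each other's best hit."""
    sorted_hits = sorted(hits)
    queries = [hit[0] for hit in sorted_hits]
    orthologs = set()
    for gene in sorted(set(queries)):
        for target, _ in best_hits(sorted_hits, queries, gene).values():
            back = best_hits(sorted_hits, queries, target).get(genome_of(gene))
            if back is not None and back[0] == gene:
                orthologs.add(tuple(sorted((gene, target))))
    return orthologs


example_hits = [
    ("A|1", "B|1", 100.0), ("B|1", "A|1", 100.0),
    ("A|1", "B|2", 50.0), ("B|2", "A|1", 50.0),
    ("A|2", "B|1", 60.0), ("B|1", "A|2", 60.0),
]
print(reciprocal_best_hits(example_hits))  # prints: {('A|1', 'B|1')}
```
</preformat>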
<p>The inferred relationships are treated as the edges of a graph. Each edge is assigned a weight for cluster analysis, where using appropriate edge-weighting metrics can improve the accuracy of cluster analysis. Gibbons et al. [
<xref rid="bib58" ref-type="bibr">58</xref>
] compared the performance of several BLAST-based edge-weighting metrics and found that the bit score had the best performance. Therefore, SwiftOrtho uses the normalized bit score as the edge-weighting metric. The normalization step takes the same approach as OrthoMCL [
<xref rid="bib20" ref-type="bibr">20</xref>
]. For orthologs or co-orthologs, the weight of (co-)ortholog (Fig. 
<xref ref-type="fig" rid="fig1">1</xref>
) A
<sub>1</sub>
in genome A and B
<sub>1</sub>
in genome B is divided by the average edge-weight of all the (co-)orthologs between genome A and genome B. For in-paralogs, SwiftOrtho identifies a subset S of all in-paralogs in genome A, with each in-paralog A
<sub>
<italic>x</italic>
</sub>
-A
<sub>
<italic>y</italic>
</sub>
in subset S, A
<sub>
<italic>x</italic>
</sub>
or A
<sub>
<italic>y</italic>
</sub>
having ≥1 ortholog in another genome. The weight of each in-paralog in genome A is divided by the mean edge-weight of subset S in genome A [
<xref rid="bib20" ref-type="bibr">20</xref>
].</p>
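<p>The normalization of (co-)ortholog edges can be sketched as below. The sketch covers only the between-genome case; the in-paralog case, which divides by the mean weight of subset S, is omitted. The "genome|gene" identifier format and the function names are assumptions made for illustration.</p>
<preformat>
```python
from collections import defaultdict


def genome_of(gene_id):
    return gene_id.split("|")[0]


def normalize_ortholog_weights(edges):
    """Divide each edge weight by the mean weight of its genome pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b, score in edges:
        pair = tuple(sorted((genome_of(a), genome_of(b))))
        totals[pair] += score
        counts[pair] += 1
    normalized = []
    for a, b, score in edges:
        pair = tuple(sorted((genome_of(a), genome_of(b))))
        normalized.append((a, b, score / (totals[pair] / counts[pair])))
    return normalized


example_edges = [("A|1", "B|1", 100.0), ("A|2", "B|2", 50.0), ("A|1", "C|1", 80.0)]
for edge in normalize_ortholog_weights(example_edges):
    print(edge)
```
</preformat>
<p>In this example the mean weight between genomes A and B is 75, so the two A-B edges are rescaled to about 1.33 and 0.67, while the single A-C edge becomes 1.0.</p>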
</sec>
<sec id="sec6-1-3">
<title>Clustering orthology relationships into orthologous groups</title>
<p>SwiftOrtho provides 2 methods to cluster orthology relationships into orthologous groups. One is the Markov cluster algorithm (MCL), an unsupervised clustering algorithm based on simulation of flow in graphs [
<xref rid="bib21" ref-type="bibr">21</xref>
]. MCL is fast and robust on small networks and has been used by several graph-based tools [
<xref rid="bib19" ref-type="bibr">19</xref>
,
<xref rid="bib59" ref-type="bibr">59–61</xref>
]. However, MCL may run out of memory when applied to a large-scale network. To reduce memory usage, we cluster each individual connected component instead of the whole network because there is no flow among components [
<xref rid="bib21" ref-type="bibr">21</xref>
]. For large and dense networks a single connected component could still be too large to be loaded into memory.</p>
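<p>Splitting the network into connected components before clustering can be sketched with a union-find structure, as below. This is an assumption about one reasonable way to do it, not a description of SwiftOrtho's code; the MCL step that would then run on each component is not shown.</p>
<preformat>
```python
def connected_components(edges):
    """Group the nodes of an undirected graph into connected components."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


example_graph = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(sorted(c) for c in connected_components(example_graph)))
# prints: [['a', 'b', 'c'], ['d', 'e']]
```
</preformat>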
<p>For large networks, SwiftOrtho uses an APC algorithm [
<xref rid="bib62" ref-type="bibr">62</xref>
]. The APC algorithm finds a set of centers in a network, where the centers are the actual data points and are called “exemplars.” To find exemplars, APC needs to maintain 2 matrices: the responsibility matrix
<italic>R</italic>
and the availability matrix
<italic>A</italic>
. The element
<italic>R
<sub>i, k</sub>
</italic>
in
<italic>R</italic>
reflects how well suited node
<italic>k</italic>
is to serve as the exemplar for node
<italic>i</italic>
while the element
<italic>A
<sub>i, k</sub>
</italic>
in
<italic>A</italic>
reflects how appropriate node
<italic>i</italic>
is to choose node
<italic>k</italic>
as its exemplar [
<xref rid="bib62" ref-type="bibr">62</xref>
]. APC uses Equation (
<xref ref-type="disp-formula" rid="update49119_equ1">1</xref>
) to update
<italic>R</italic>
and Equation (
<xref ref-type="disp-formula" rid="update49119_update49119_update49119_equ2">2</xref>
) to update
<italic>A</italic>
, where
<italic>i, k, i</italic>
′,
<italic>k</italic>
′ denote the node number and
<italic>S
<inline-formula>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$_{i,k^{\prime }}$\end{document}</tex-math>
</inline-formula>
</italic>
denotes the similarity between node
<italic>i</italic>
and node
<italic>k</italic>
′.
<disp-formula id="update49119_equ1">
<label>(1)</label>
<tex-math id="M2">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$$\begin{equation*} R_{i,k} = S_{i,k} - \mathrm{max}_{k^{\prime } \ne k} \lbrace A_{i, k^{\prime }} + S_{i, k^{\prime }}\rbrace, \end{equation*}$$\end{document}</tex-math>
</disp-formula>
<disp-formula id="update49119_update49119_update49119_equ2">
<label>(2)</label>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$$\begin{equation*} A_{i, k} = \left\lbrace \begin{array}{@{}l@{\quad }l@{}}\mathrm{min} \lbrace 0, R_{k, k} + \sum _{i^{\prime }\not\in \lbrace i, k\rbrace } \mathrm{max} \lbrace 0, R_{i^{\prime }, k} \rbrace \rbrace , & \text{if}\ i \ne k \\ \sum _{i^{\prime }\ne k} \mathrm{max} \lbrace 0, R_{i^{\prime }, k} \rbrace , & \text{if}\ i = k \end{array}\right. \end{equation*}$$\end{document}</tex-math>
</disp-formula>
</p>
<p>The node
<italic>k</italic>
that maximizes
<italic>A
<sub>i, k</sub>
</italic>
+
<italic>R
<sub>i, k</sub>
</italic>
is the exemplar of node
<italic>i</italic>
, and each node
<italic>i</italic>
is assigned to its nearest exemplar. APC can update each element of matrix
<italic>R</italic>
and
<italic>A</italic>
one by one, so it is unnecessary to keep the whole matrix of
<italic>R</italic>
and
<italic>A</italic>
in memory. Generally, the time complexity of APC is O(
<italic>N</italic>
<sup>2</sup>
·
<italic>T</italic>
), where
<italic>N</italic>
is the number of nodes and
<italic>T</italic>
is the number of iterations [
<xref rid="bib62" ref-type="bibr">62</xref>
]. In this case, the time complexity is
<italic>O</italic>
(
<italic>E</italic>
·
<italic>T</italic>
), where
<italic>E</italic>
stands for edges, which is the number of orthology relationships, and
<italic>T</italic>
is the number of iterations. We implemented APC in Python, using Numba [
<xref rid="bib63" ref-type="bibr">63</xref>
] to accelerate the numeric-intensive calculation parts.</p>
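<p>Equations (1) and (2) can be sketched as a dense, pure-Python loop. This is an illustration only and not SwiftOrtho's implementation, which works on the sparse list of orthology edges and is accelerated with Numba. The damping factor, the iteration count, the function name, and the toy similarity matrix are assumptions that do not appear in the text, and the sketch expects at least 2 nodes.</p>
<preformat>
```python
def affinity_propagation(S, iterations=200, damping=0.5):
    """Return, for each node i, the index of its exemplar (dense sketch)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        # Equation (1): R[i][k] = S[i][k] - max over k' != k of (A[i][k'] + S[i][k'])
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Equation (2): availabilities from the positive responsibilities
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]  # sum over i' != k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # The exemplar of node i is the k that maximizes A[i][k] + R[i][k].
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


# Toy data: two well-separated groups of points on a line.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
similarity = [[-(x - y) ** 2 for y in points] for x in points]
for i in range(len(points)):
    similarity[i][i] = -1.0  # self-similarity ("preference"), an assumed value
labels = affinity_propagation(similarity)
print(labels)
```
</preformat>
<p>Each inner update touches one element of R or A at a time, which is the property that lets an implementation stream the matrices instead of holding them in memory.</p>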
</sec>
</sec>
</sec>
<sec id="sec7">
<title>Availability of Source Code and Requirements</title>
<p>Project Name: SwiftOrtho</p>
<p>Project Home Page:
<ext-link ext-link-type="uri" xlink:href="https://github.com/Rinoahu/SwiftOrtho">https://github.com/Rinoahu/SwiftOrtho</ext-link>
</p>
<p>Operating System(s): SwiftOrtho was tested on GNU/Linux distribution Ubuntu 16.04 64-bit, but we expect SwiftOrtho to work on most *nix systems</p>
<p>Programming Language: Python</p>
<p>Other Requirements: Python 2.7, Python 3.7, PyPy2.7 v7.0 or higher</p>
<p>License: GPLv3</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://scicrunch.org/browse/resources/SCR_017122">RRID:SCR_017122</ext-link>
</p>
</sec>
<sec sec-type="materials" id="sec8">
<title>Availability of Supporting Data and Materials</title>
<p>The data sets supporting the results of this article are available in the GigaDB repository [
<xref rid="bib64" ref-type="bibr">64</xref>
].</p>
</sec>
<sec sec-type="supplementary-material" id="sec9">
<title>Additional Files</title>
<p>Supplementary Material S1. Further details on the methodology.</p>
<supplementary-material content-type="local-data" id="sup1">
<label>giz118_GIGA-D-19-00043_Original_Submission</label>
<media xlink:href="giz118_giga-d-19-00043_original_submission.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup2">
<label>giz118_GIGA-D-19-00043_Revision_1</label>
<media xlink:href="giz118_giga-d-19-00043_revision_1.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup3">
<label>giz118_GIGA-D-19-00043_Revision_2</label>
<media xlink:href="giz118_giga-d-19-00043_revision_2.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup4">
<label>giz118_GIGA-D-19-00043_Revision_3</label>
<media xlink:href="giz118_giga-d-19-00043_revision_3.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup5">
<label>giz118_Response_to_Reviewer_Comments_Original_Submission</label>
<media xlink:href="giz118_response_to_reviewer_comments_original_submission.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup6">
<label>giz118_Response_to_Reviewer_Comments_Revision_1</label>
<media xlink:href="giz118_response_to_reviewer_comments_revision_1.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup7">
<label>giz118_Response_to_Reviewer_Comments_Revision_2</label>
<media xlink:href="giz118_response_to_reviewer_comments_revision_2.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup8">
<label>giz118_Reviewer_1_Report_Original_Submission</label>
<caption>
<p>Marnix Medema -- 3/14/2019 Reviewed</p>
</caption>
<media xlink:href="giz118_reviewer_1_report_original_submission.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup9">
<label>giz118_Reviewer_1_Report_Revision_1</label>
<caption>
<p>Marnix Medema -- 6/27/2019 Reviewed</p>
</caption>
<media xlink:href="giz118_reviewer_1_report_revision_1.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup10">
<label>giz118_Reviewer_2_Report_Original_Submission</label>
<caption>
<p>Marcus Lechner, Ph.D. -- 3/22/2019 Reviewed</p>
</caption>
<media xlink:href="giz118_reviewer_2_report_original_submission.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup11">
<label>giz118_Reviewer_2_Report_Revision_1</label>
<caption>
<p>Marcus Lechner, Ph.D. -- 7/10/2019 Reviewed</p>
</caption>
<media xlink:href="giz118_reviewer_2_report_revision_1.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup12">
<label>giz118_Reviewer_3_Report_Original_Submission</label>
<caption>
<p>Robert Davey -- 3/29/2019 Reviewed</p>
</caption>
<media xlink:href="giz118_reviewer_3_report_original_submission.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup13">
<label>giz118_Reviewer_3_Report_Revision_1</label>
<caption>
<p>Robert Davey -- 7/16/2019 Reviewed</p>
</caption>
<media xlink:href="giz118_reviewer_3_report_revision_1.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="sup14">
<label>giz118_Supplemental_File</label>
<media xlink:href="giz118_supplemental_file.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
<sec id="sec10">
<title>Abbreviations</title>
<p>APC: Affinity Propagation Clustering; BLAST: Basic Local Alignment Search Tool; BLAT: BLAST-Like Alignment Tool; COG: Clusters of Orthologous Groups; CPU: central processing unit; I/O: input/output; MCL: Markov clustering; RBH: reciprocal best hit; OMA: Orthologous Matrix; QfO: Quest for Orthologs; RAM: random access memory.</p>
</sec>
<sec id="sec13">
<title>Competing Interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="sec14">
<title>Funding</title>
<p>This study has been funded, in part, by National Science Foundation award ABI 1458359. The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.</p>
</sec>
<sec id="sec17">
<title>Author information</title>
<p>I.F. is an associate professor at the Department of Veterinary Microbiology and Preventive Medicine at Iowa State University. He is also the chair of the Interdepartmental Bioinformatics and Computational Biology graduate program. X.H. was a postdoctoral associate at Iowa State University at the time of this work and is currently a postdoctoral associate at the Gianforte School of Computing, Montana State University.</p>
</sec>
<sec id="sec15">
<title>Authors’ Contributions</title>
<p>Both authors conceived the study. X.H. wrote the software and performed the analysis. Both authors wrote the manuscript.</p>
</sec>
</body>
<back>
<ack id="ack1">
<title>ACKNOWLEDGEMENTS</title>
<p>The authors acknowledge fruitful discussions with all members of the Friedberg Laboratory.</p>
</ack>
<ref-list id="ref1">
<title>References</title>
<ref id="bib1">
<label>1.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<article-title>Orthologs, paralogs, and evolutionary genomics</article-title>
.
<source>Annu Rev Genet</source>
.
<year>2005</year>
;
<volume>39</volume>
:
<fpage>309</fpage>
<lpage>38</lpage>
.
<pub-id pub-id-type="pmid">16285863</pub-id>
</mixed-citation>
</ref>
<ref id="bib2">
<label>2.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Fitch</surname>
<given-names>WM</given-names>
</name>
</person-group>
<article-title>Distinguishing homologous from analogous proteins</article-title>
.
<source>Syst Zool</source>
.
<year>1970</year>
;
<volume>19</volume>
(
<issue>2</issue>
):
<fpage>99</fpage>
.
<pub-id pub-id-type="pmid">5449325</pub-id>
</mixed-citation>
</ref>
<ref id="bib3">
<label>3.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Overbeek</surname>
<given-names>R</given-names>
</name>
,
<name name-style="western">
<surname>Fonstein</surname>
<given-names>M</given-names>
</name>
,
<name name-style="western">
<surname>D’souza</surname>
<given-names>M</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>The use of gene clusters to infer functional coupling</article-title>
.
<source>Proc Natl Acad Sci U S A</source>
.
<year>1999</year>
;
<volume>96</volume>
:
<fpage>2896</fpage>
<lpage>901</lpage>
.</mixed-citation>
</ref>
<ref id="bib4">
<label>4.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Rivera</surname>
<given-names>MC</given-names>
</name>
,
<name name-style="western">
<surname>Jain</surname>
<given-names>R</given-names>
</name>
,
<name name-style="western">
<surname>Moore</surname>
<given-names>JE</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Genomic evidence for two functionally distinct gene classes</article-title>
.
<source>Proc Natl Acad Sci U S A</source>
.
<year>1998</year>
;
<volume>95</volume>
:
<fpage>6239</fpage>
<lpage>44</lpage>
.</mixed-citation>
</ref>
<ref id="bib5">
<label>5.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Remm</surname>
<given-names>M</given-names>
</name>
,
<name name-style="western">
<surname>Storm</surname>
<given-names>CE</given-names>
</name>
,
<name name-style="western">
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
</person-group>
<article-title>Automatic clustering of orthologs and in-paralogs from pairwise species comparisons</article-title>
.
<source>J Mol Biol</source>
.
<year>2001</year>
;
<volume>314</volume>
(
<issue>5</issue>
):
<fpage>1041</fpage>
<lpage>52</lpage>
.
<pub-id pub-id-type="pmid">11743721</pub-id>
</mixed-citation>
</ref>
<ref id="bib6">
<label>6.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>O’Brien</surname>
<given-names>KP</given-names>
</name>
,
<name name-style="western">
<surname>Remm</surname>
<given-names>M</given-names>
</name>
,
<name name-style="western">
<surname>Sonnhammer</surname>
<given-names>ELL</given-names>
</name>
</person-group>
<article-title>Inparanoid: a comprehensive database of eukaryotic orthologs</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2005</year>
;
<volume>33</volume>
(
<issue>Database issue</issue>
):
<fpage>D476</fpage>
<lpage>80</lpage>
.
<pub-id pub-id-type="pmid">15608241</pub-id>
</mixed-citation>
</ref>
<ref id="bib7">
<label>7.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Gabaldón</surname>
<given-names>T</given-names>
</name>
,
<name name-style="western">
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<source>Nat Rev Genet</source>
.
<year>2013</year>
;
<volume>14</volume>
(
<issue>5</issue>
):
<fpage>360</fpage>
<lpage>6</lpage>
.
<pub-id pub-id-type="pmid">23552219</pub-id>
</mixed-citation>
</ref>
<ref id="bib8">
<label>8.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Goodman</surname>
<given-names>M</given-names>
</name>
,
<name name-style="western">
<surname>Czelusniak</surname>
<given-names>J</given-names>
</name>
,
<name name-style="western">
<surname>Moore</surname>
<given-names>GW</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences</article-title>
.
<source>Syst Biol</source>
.
<year>1979</year>
;
<volume>28</volume>
(
<issue>2</issue>
):
<fpage>132</fpage>
<lpage>63</lpage>
.</mixed-citation>
</ref>
<ref id="bib9">
<label>9.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kristensen</surname>
<given-names>DM</given-names>
</name>
,
<name name-style="western">
<surname>Wolf</surname>
<given-names>YI</given-names>
</name>
,
<name name-style="western">
<surname>Mushegian</surname>
<given-names>AR</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Computational methods for gene orthology inference</article-title>
.
<source>Brief Bioinform</source>
.
<year>2011</year>
;
<volume>12</volume>
(
<issue>5</issue>
):
<fpage>379</fpage>
<lpage>91</lpage>
.
<pub-id pub-id-type="pmid">21690100</pub-id>
</mixed-citation>
</ref>
<ref id="bib10">
<label>10.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Gabaldón</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Large-scale assignment of orthology: back to phylogenetics?</article-title>
.
<source>Genome Biol</source>
.
<year>2008</year>
;
<volume>9</volume>
(
<issue>10</issue>
):
<fpage>235</fpage>
.
<pub-id pub-id-type="pmid">18983710</pub-id>
</mixed-citation>
</ref>
<ref id="bib11">
<label>11.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hulsen</surname>
<given-names>T</given-names>
</name>
,
<name name-style="western">
<surname>Huynen</surname>
<given-names>MA</given-names>
</name>
,
<name name-style="western">
<surname>de Vlieg</surname>
<given-names>J</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Benchmarking ortholog identification methods using functional genomics data</article-title>
.
<source>Genome Biol</source>
.
<year>2006</year>
;
<volume>7</volume>
(
<issue>4</issue>
):
<fpage>R31</fpage>
.
<pub-id pub-id-type="pmid">16613613</pub-id>
</mixed-citation>
</ref>
<ref id="bib12">
<label>12.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kuzniar</surname>
<given-names>A</given-names>
</name>
,
<name name-style="western">
<surname>van Ham</surname>
<given-names>RCHJ</given-names>
</name>
,
<name name-style="western">
<surname>Pongor</surname>
<given-names>S</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>The quest for orthologs: finding the corresponding gene across genomes</article-title>
.
<source>Trends Genet</source>
.
<year>2008</year>
;
<volume>24</volume>
(
<issue>11</issue>
):
<fpage>539</fpage>
<lpage>51</lpage>
.
<pub-id pub-id-type="pmid">18819722</pub-id>
</mixed-citation>
</ref>
<ref id="bib13">
<label>13.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Trachana</surname>
<given-names>K</given-names>
</name>
,
<name name-style="western">
<surname>Larsson</surname>
<given-names>TA</given-names>
</name>
,
<name name-style="western">
<surname>Powell</surname>
<given-names>S</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Orthology prediction methods: a quality assessment using curated protein families</article-title>
.
<source>Bioessays</source>
.
<year>2011</year>
;
<volume>33</volume>
(
<issue>10</issue>
):
<fpage>769</fpage>
<lpage>80</lpage>
.
<pub-id pub-id-type="pmid">21853451</pub-id>
</mixed-citation>
</ref>
<ref id="bib14">
<label>14.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ward</surname>
<given-names>N</given-names>
</name>
,
<name name-style="western">
<surname>Moreno-Hagelsieb</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: how much do we miss?</article-title>
.
<source>PLoS One</source>
.
<year>2014</year>
;
<volume>9</volume>
(
<issue>7</issue>
):
<fpage>e101850</fpage>
.
<pub-id pub-id-type="pmid">25013894</pub-id>
</mixed-citation>
</ref>
<ref id="bib15">
<label>15.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
,
<name name-style="western">
<surname>Galperin</surname>
<given-names>MY</given-names>
</name>
,
<name name-style="western">
<surname>Natale</surname>
<given-names>DA</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>The COG database: a tool for genome-scale analysis of protein functions and evolution</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2000</year>
;
<volume>28</volume>
(
<issue>1</issue>
):
<fpage>33</fpage>
<lpage>6</lpage>
.
<pub-id pub-id-type="pmid">10592175</pub-id>
</mixed-citation>
</ref>
<ref id="bib16">
<label>16.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Roth</surname>
<given-names>ACJ</given-names>
</name>
,
<name name-style="western">
<surname>Gonnet</surname>
<given-names>GH</given-names>
</name>
,
<name name-style="western">
<surname>Dessimoz</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Algorithm of OMA for large-scale orthology inference</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2008</year>
;
<volume>9</volume>
(
<issue>1</issue>
):
<fpage>518</fpage>
.
<pub-id pub-id-type="pmid">19055798</pub-id>
</mixed-citation>
</ref>
<ref id="bib17">
<label>17.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Altenhoff</surname>
<given-names>AM</given-names>
</name>
,
<name name-style="western">
<surname>Glover</surname>
<given-names>NM</given-names>
</name>
,
<name name-style="western">
<surname>Train</surname>
<given-names>CM</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2018</year>
;
<volume>46</volume>
(
<issue>D1</issue>
):
<fpage>D477</fpage>
<lpage>85</lpage>
.
<pub-id pub-id-type="pmid">29106550</pub-id>
</mixed-citation>
</ref>
<ref id="bib18">
<label>18.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Alexeyenko</surname>
<given-names>A</given-names>
</name>
,
<name name-style="western">
<surname>Tamas</surname>
<given-names>I</given-names>
</name>
,
<name name-style="western">
<surname>Liu</surname>
<given-names>G</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Automatic clustering of orthologs and inparalogs shared by multiple proteomes</article-title>
.
<source>Bioinformatics</source>
.
<year>2006</year>
;
<volume>22</volume>
(
<issue>14</issue>
):
<fpage>e9</fpage>
<lpage>15</lpage>
.
<pub-id pub-id-type="pmid">16873526</pub-id>
</mixed-citation>
</ref>
<ref id="bib19">
<label>19.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Li</surname>
<given-names>M</given-names>
</name>
,
<name name-style="western">
<surname>Ma</surname>
<given-names>B</given-names>
</name>
,
<name name-style="western">
<surname>Kisman</surname>
<given-names>D</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>PatternHunter II: highly sensitive and fast homology search</article-title>
.
<source>Genome Inform</source>
.
<year>2003</year>
;
<volume>14</volume>
(
<issue>3</issue>
):
<fpage>164</fpage>
<lpage>75</lpage>
.
<pub-id pub-id-type="pmid">15706531</pub-id>
</mixed-citation>
</ref>
<ref id="bib20">
<label>20.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Fischer</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>Brunk</surname>
<given-names>BP</given-names>
</name>
,
<name name-style="western">
<surname>Chen</surname>
<given-names>F</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups</article-title>
.
<source>Curr Protoc Bioinformatics</source>
.
<year>2011</year>
;
<volume>35</volume>
(
<issue>1</issue>
):
<fpage>6.12.1</fpage>
<lpage>19</lpage>
.</mixed-citation>
</ref>
<ref id="bib21">
<label>21.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>van Dongen</surname>
<given-names>S</given-names>
</name>
</person-group>
Graph clustering by flow simulation.
<year>2000</year>
<comment>Ph.D. Thesis, University of Utrecht</comment>
.</mixed-citation>
</ref>
<ref id="bib22">
<label>22.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Sonnhammer</surname>
<given-names>ELL</given-names>
</name>
,
<name name-style="western">
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<article-title>Orthology, paralogy and proposed classification for paralog subtypes</article-title>
.
<source>Trends Genet</source>
.
<year>2002</year>
;
<volume>18</volume>
(
<issue>12</issue>
):
<fpage>619</fpage>
<lpage>20</lpage>
.
<pub-id pub-id-type="pmid">12446146</pub-id>
</mixed-citation>
</ref>
<ref id="bib23">
<label>23.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Cannon</surname>
<given-names>SB</given-names>
</name>
,
<name name-style="western">
<surname>Young</surname>
<given-names>ND</given-names>
</name>
</person-group>
<article-title>OrthoParaMap: distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2003</year>
;
<volume>4</volume>
:
<fpage>35</fpage>
.
<pub-id pub-id-type="pmid">12952558</pub-id>
</mixed-citation>
</ref>
<ref id="bib24">
<label>24.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Cutts</surname>
<given-names>T</given-names>
</name>
,
<name name-style="western">
<surname>Down</surname>
<given-names>T</given-names>
</name>
,
<name name-style="western">
<surname>Dyer</surname>
<given-names>SC</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Ensembl 2007</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2007</year>
;
<volume>35</volume>
(
<issue>Database issue</issue>
):
<fpage>D610</fpage>
<lpage>7</lpage>
.
<pub-id pub-id-type="pmid">17148474</pub-id>
</mixed-citation>
</ref>
<ref id="bib25">
<label>25.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
,
<name name-style="western">
<surname>Li</surname>
<given-names>H</given-names>
</name>
,
<name name-style="western">
<surname>Chen</surname>
<given-names>Z</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>TreeFam: 2008 update</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2008</year>
;
<volume>36</volume>
(
<issue>Database issue</issue>
):
<fpage>D735</fpage>
<lpage>40</lpage>
.
<pub-id pub-id-type="pmid">18056084</pub-id>
</mixed-citation>
</ref>
<ref id="bib26">
<label>26.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Goodstadt</surname>
<given-names>L</given-names>
</name>
,
<name name-style="western">
<surname>Ponting</surname>
<given-names>CP</given-names>
</name>
</person-group>
<article-title>Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human</article-title>
.
<source>PLoS Comput Biol</source>
.
<year>2006</year>
;
<volume>2</volume>
(
<issue>9</issue>
):
<fpage>e133</fpage>
.
<pub-id pub-id-type="pmid">17009864</pub-id>
</mixed-citation>
</ref>
<ref id="bib27">
<label>27.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Vilella</surname>
<given-names>AJ</given-names>
</name>
,
<name name-style="western">
<surname>Severin</surname>
<given-names>J</given-names>
</name>
,
<name name-style="western">
<surname>Ureta-Vidal</surname>
<given-names>A</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates</article-title>
.
<source>Genome Res</source>
.
<year>2009</year>
;
<volume>19</volume>
(
<issue>2</issue>
):
<fpage>327</fpage>
<lpage>35</lpage>
.
<pub-id pub-id-type="pmid">19029536</pub-id>
</mixed-citation>
</ref>
<ref id="bib28">
<label>28.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chen</surname>
<given-names>F</given-names>
</name>
,
<name name-style="western">
<surname>Mackey</surname>
<given-names>AJ</given-names>
</name>
,
<name name-style="western">
<surname>Vermunt</surname>
<given-names>JK</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Assessing performance of orthology detection strategies applied to eukaryotic genomes</article-title>
.
<source>PLoS One</source>
.
<year>2007</year>
;
<volume>2</volume>
(
<issue>4</issue>
):
<fpage>e383</fpage>
.
<pub-id pub-id-type="pmid">17440619</pub-id>
</mixed-citation>
</ref>
<ref id="bib29">
<label>29.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Altenhoff</surname>
<given-names>AM</given-names>
</name>
,
<name name-style="western">
<surname>Dessimoz</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Phylogenetic and functional assessment of orthologs inference projects and methods</article-title>
.
<source>PLoS Comput Biol</source>
.
<year>2009</year>
;
<volume>5</volume>
(
<issue>1</issue>
):
<fpage>e1000262</fpage>
.
<pub-id pub-id-type="pmid">19148271</pub-id>
</mixed-citation>
</ref>
<ref id="bib30">
<label>30.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Sonnhammer</surname>
<given-names>ELL</given-names>
</name>
,
<name name-style="western">
<surname>Östlund</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2015</year>
;
<volume>43</volume>
(
<issue>Database issue</issue>
):
<fpage>D234</fpage>
<lpage>9</lpage>
.
<pub-id pub-id-type="pmid">25429972</pub-id>
</mixed-citation>
</ref>
<ref id="bib31">
<label>31.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Cosentino</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>Iwasaki</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>SonicParanoid: fast, accurate and easy orthology inference</article-title>
.
<source>Bioinformatics</source>
.
<year>2019</year>
;
<volume>35</volume>
(
<issue>1</issue>
):
<fpage>149</fpage>
<lpage>51</lpage>
.
<pub-id pub-id-type="pmid">30032301</pub-id>
</mixed-citation>
</ref>
<ref id="bib32">
<label>32.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Lechner</surname>
<given-names>M</given-names>
</name>
,
<name name-style="western">
<surname>Findeiß</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>Steiner</surname>
<given-names>L</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Proteinortho: detection of (co-)orthologs in large-scale analysis</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2011</year>
;
<volume>12</volume>
:
<fpage>124</fpage>
.
<pub-id pub-id-type="pmid">21526987</pub-id>
</mixed-citation>
</ref>
<ref id="bib33">
<label>33.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Altenhoff</surname>
<given-names>AM</given-names>
</name>
,
<name name-style="western">
<surname>Boeckmann</surname>
<given-names>B</given-names>
</name>
,
<name name-style="western">
<surname>Capella-Gutierrez</surname>
<given-names>S</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Standardized benchmarking in the quest for orthologs</article-title>
.
<source>Nat Methods</source>
.
<year>2016</year>
;
<volume>13</volume>
(
<issue>5</issue>
):
<fpage>425</fpage>
<lpage>30</lpage>
.
<pub-id pub-id-type="pmid">27043882</pub-id>
</mixed-citation>
</ref>
<ref id="bib34">
<label>34.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Curwen</surname>
<given-names>V</given-names>
</name>
,
<name name-style="western">
<surname>Eyras</surname>
<given-names>E</given-names>
</name>
,
<name name-style="western">
<surname>Andrews</surname>
<given-names>TD</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>The Ensembl automatic gene annotation system</article-title>
.
<source>Genome Res</source>
.
<year>2004</year>
;
<volume>14</volume>
(
<issue>5</issue>
):
<fpage>942</fpage>
<lpage>50</lpage>
.
<pub-id pub-id-type="pmid">15123590</pub-id>
</mixed-citation>
</ref>
<ref id="bib35">
<label>35.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Benson</surname>
<given-names>DA</given-names>
</name>
</person-group>
<article-title>GenBank</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2000</year>
;
<volume>28</volume>
(
<issue>1</issue>
):
<fpage>15</fpage>
<lpage>8</lpage>
.
<pub-id pub-id-type="pmid">10592170</pub-id>
</mixed-citation>
</ref>
<ref id="bib36">
<label>36.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Camacho</surname>
<given-names>C</given-names>
</name>
,
<name name-style="western">
<surname>Coulouris</surname>
<given-names>G</given-names>
</name>
,
<name name-style="western">
<surname>Avagyan</surname>
<given-names>V</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>BLAST+: architecture and applications</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2009</year>
;
<volume>10</volume>
:
<fpage>421</fpage>
.
<pub-id pub-id-type="pmid">20003500</pub-id>
</mixed-citation>
</ref>
<ref id="bib37">
<label>37.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Brohée</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>van Helden</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Evaluation of clustering algorithms for protein-protein interaction networks</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2006</year>
;
<volume>7</volume>
:
<fpage>488</fpage>
.
<pub-id pub-id-type="pmid">17087821</pub-id>
</mixed-citation>
</ref>
<ref id="bib38">
<label>38.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>BLAT – The BLAST-Like Alignment Tool</article-title>
.
<source>Genome Res</source>
.
<year>2002</year>
;
<volume>12</volume>
:
<fpage>656</fpage>
<lpage>64</lpage>
.
<pub-id pub-id-type="pmid">11932250</pub-id>
</mixed-citation>
</ref>
<ref id="bib39">
<label>39.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>Search and clustering orders of magnitude faster than BLAST</article-title>
.
<source>Bioinformatics</source>
.
<year>2010</year>
;
<volume>26</volume>
(
<issue>19</issue>
):
<fpage>2460</fpage>
<lpage>1</lpage>
.
<pub-id pub-id-type="pmid">20709691</pub-id>
</mixed-citation>
</ref>
<ref id="bib40">
<label>40.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kiełbasa</surname>
<given-names>SM</given-names>
</name>
,
<name name-style="western">
<surname>Wan</surname>
<given-names>R</given-names>
</name>
,
<name name-style="western">
<surname>Sato</surname>
<given-names>K</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Adaptive seeds tame genomic sequence comparison</article-title>
.
<source>Genome Res</source>
.
<year>2011</year>
;
<volume>21</volume>
(
<issue>3</issue>
):
<fpage>487</fpage>
<lpage>93</lpage>
.
<pub-id pub-id-type="pmid">21209072</pub-id>
</mixed-citation>
</ref>
<ref id="bib41">
<label>41.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Buchfink</surname>
<given-names>B</given-names>
</name>
,
<name name-style="western">
<surname>Xie</surname>
<given-names>C</given-names>
</name>
,
<name name-style="western">
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
</person-group>
<article-title>Fast and sensitive protein alignment using DIAMOND</article-title>
.
<source>Nat Methods</source>
.
<year>2015</year>
;
<volume>12</volume>
(
<issue>1</issue>
):
<fpage>59</fpage>
<lpage>60</lpage>
.
<pub-id pub-id-type="pmid">25402007</pub-id>
</mixed-citation>
</ref>
<ref id="bib42">
<label>42.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Medlar</surname>
<given-names>A</given-names>
</name>
,
<name name-style="western">
<surname>Holm</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>TOPAZ: asymmetric suffix array neighbourhood search for massive protein databases</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2018</year>
;
<volume>19</volume>
(
<issue>1</issue>
):
<fpage>278</fpage>
.
<pub-id pub-id-type="pmid">30064374</pub-id>
</mixed-citation>
</ref>
<ref id="bib43">
<label>43.</label>
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Rigo</surname>
<given-names>A</given-names>
</name>
,
<name name-style="western">
<surname>Pedroni</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>PyPy’s approach to virtual machine construction</article-title>
. In:
<source>Proceedings of OOPSLA '06 Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, Portland, OR</source>
.
<publisher-loc>New York, NY</publisher-loc>
:
<publisher-name>ACM</publisher-name>
;
<year>2006</year>
:
<fpage>944</fpage>
<lpage>53</lpage>
.</mixed-citation>
</ref>
<ref id="bib44">
<label>44.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Bratlie</surname>
<given-names>MS</given-names>
</name>
,
<name name-style="western">
<surname>Johansen</surname>
<given-names>J</given-names>
</name>
,
<name name-style="western">
<surname>Sherman</surname>
<given-names>BT</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Gene duplications in prokaryotes can be associated with environmental adaptation</article-title>
.
<source>BMC Genomics</source>
.
<year>2010</year>
;
<volume>11</volume>
:
<fpage>588</fpage>
.
<pub-id pub-id-type="pmid">20961426</pub-id>
</mixed-citation>
</ref>
<ref id="bib45">
<label>45.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Katju</surname>
<given-names>V</given-names>
</name>
,
<name name-style="western">
<surname>Bergthorsson</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Copy-number changes in evolution: rates, fitness effects and adaptive significance</article-title>
.
<source>Front Genet</source>
.
<year>2013</year>
;
<volume>4</volume>
:
<fpage>273</fpage>
.
<pub-id pub-id-type="pmid">24368910</pub-id>
</mixed-citation>
</ref>
<ref id="bib46">
<label>46.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
,
<name name-style="western">
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Improved tools for biological sequence comparison</article-title>
.
<source>Proc Natl Acad Sci U S A</source>
.
<year>1988</year>
;
<volume>85</volume>
(
<issue>8</issue>
):
<fpage>2444</fpage>
<lpage>8</lpage>
.
<pub-id pub-id-type="pmid">3162770</pub-id>
</mixed-citation>
</ref>
<ref id="bib47">
<label>47.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
,
<name name-style="western">
<surname>Gish</surname>
<given-names>W</given-names>
</name>
,
<name name-style="western">
<surname>Miller</surname>
<given-names>W</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Basic local alignment search tool</article-title>
.
<source>J Mol Biol</source>
.
<year>1990</year>
;
<volume>215</volume>
(
<issue>3</issue>
):
<fpage>403</fpage>
<lpage>10</lpage>
.
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="bib48">
<label>48.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Shiryev</surname>
<given-names>SA</given-names>
</name>
,
<name name-style="western">
<surname>Papadopoulos</surname>
<given-names>JS</given-names>
</name>
,
<name name-style="western">
<surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Improved BLAST searches using longer words for protein seeding</article-title>
.
<source>Bioinformatics</source>
.
<year>2007</year>
;
<volume>23</volume>
(
<issue>21</issue>
):
<fpage>2949</fpage>
<lpage>51</lpage>
.
<pub-id pub-id-type="pmid">17921491</pub-id>
</mixed-citation>
</ref>
<ref id="bib49">
<label>49.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ma</surname>
<given-names>B</given-names>
</name>
,
<name name-style="western">
<surname>Tromp</surname>
<given-names>J</given-names>
</name>
,
<name name-style="western">
<surname>Li</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>PatternHunter: faster and more sensitive homology search</article-title>
.
<source>Bioinformatics</source>
.
<year>2002</year>
;
<volume>18</volume>
(
<issue>3</issue>
):
<fpage>440</fpage>
<lpage>5</lpage>
.
<pub-id pub-id-type="pmid">11934743</pub-id>
</mixed-citation>
</ref>
<ref id="bib50">
<label>50.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
,
<name name-style="western">
<surname>Ilie</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>Khoshraftar</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Seeds for effective oligonucleotide design</article-title>
.
<source>BMC Genomics</source>
.
<year>2011</year>
;
<volume>12</volume>
(
<issue>1</issue>
):
<fpage>280</fpage>
.
<pub-id pub-id-type="pmid">21627845</pub-id>
</mixed-citation>
</ref>
<ref id="bib51">
<label>51.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Smith</surname>
<given-names>TF</given-names>
</name>
,
<name name-style="western">
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Identification of common molecular subsequences</article-title>
.
<source>J Mol Biol</source>
.
<year>1981</year>
;
<volume>147</volume>
(
<issue>1</issue>
):
<fpage>195</fpage>
<lpage>7</lpage>
.
<pub-id pub-id-type="pmid">7265238</pub-id>
</mixed-citation>
</ref>
<ref id="bib52">
<label>52.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chao</surname>
<given-names>KM</given-names>
</name>
,
<name name-style="western">
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
,
<name name-style="western">
<surname>Miller</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Aligning two sequences within a specified diagonal band</article-title>
.
<source>Bioinformatics</source>
.
<year>1992</year>
;
<volume>8</volume>
(
<issue>5</issue>
):
<fpage>481</fpage>
<lpage>7</lpage>
.</mixed-citation>
</ref>
<ref id="bib53">
<label>53.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Landès</surname>
<given-names>C</given-names>
</name>
,
<name name-style="western">
<surname>Risler</surname>
<given-names>JL</given-names>
</name>
</person-group>
<article-title>Fast databank searching with a reduced amino-acid alphabet</article-title>
.
<source>Comput Appl Biosci</source>
.
<year>1994</year>
;
<volume>10</volume>
(
<issue>4</issue>
):
<fpage>453</fpage>
<lpage>4</lpage>
.
<pub-id pub-id-type="pmid">7804879</pub-id>
</mixed-citation>
</ref>
<ref id="bib54">
<label>54.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Murphy</surname>
<given-names>LR</given-names>
</name>
,
<name name-style="western">
<surname>Wallqvist</surname>
<given-names>A</given-names>
</name>
,
<name name-style="western">
<surname>Levy</surname>
<given-names>RM</given-names>
</name>
</person-group>
<article-title>Simplified amino acid alphabets for protein fold recognition and implications for folding</article-title>
.
<source>Protein Eng Des Sel</source>
.
<year>2000</year>
;
<volume>13</volume>
(
<issue>3</issue>
):
<fpage>149</fpage>
<lpage>52</lpage>
.</mixed-citation>
</ref>
<ref id="bib55">
<label>55.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Peterson</surname>
<given-names>EL</given-names>
</name>
,
<name name-style="western">
<surname>Kondev</surname>
<given-names>J</given-names>
</name>
,
<name name-style="western">
<surname>Theriot</surname>
<given-names>JA</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment</article-title>
.
<source>Bioinformatics</source>
.
<year>2009</year>
;
<volume>25</volume>
(
<issue>11</issue>
):
<fpage>1356</fpage>
<lpage>62</lpage>
.
<pub-id pub-id-type="pmid">19351620</pub-id>
</mixed-citation>
</ref>
<ref id="bib56">
<label>56.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>Local homology recognition and distance measures in linear time using compressed amino acid alphabets</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2004</year>
;
<volume>32</volume>
(
<issue>1</issue>
):
<fpage>380</fpage>
<lpage>5</lpage>
.
<pub-id pub-id-type="pmid">14729922</pub-id>
</mixed-citation>
</ref>
<ref id="bib57">
<label>57.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ye</surname>
<given-names>Y</given-names>
</name>
,
<name name-style="western">
<surname>Choi</surname>
<given-names>JH</given-names>
</name>
,
<name name-style="western">
<surname>Tang</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>RAPSearch: a fast protein similarity search tool for short reads</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2011</year>
;
<volume>12</volume>
(
<issue>1</issue>
):
<fpage>159</fpage>
.
<pub-id pub-id-type="pmid">21575167</pub-id>
</mixed-citation>
</ref>
<ref id="bib58">
<label>58.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Gibbons</surname>
<given-names>TR</given-names>
</name>
,
<name name-style="western">
<surname>Mount</surname>
<given-names>SM</given-names>
</name>
,
<name name-style="western">
<surname>Cooper</surname>
<given-names>ED</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2015</year>
;
<volume>16</volume>
:
<fpage>218</fpage>
.
<pub-id pub-id-type="pmid">26160651</pub-id>
</mixed-citation>
</ref>
<ref id="bib59">
<label>59.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Enright</surname>
<given-names>AJ</given-names>
</name>
,
<name name-style="western">
<surname>Van Dongen</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
</person-group>
<article-title>An efficient algorithm for large-scale detection of protein families</article-title>
.
<source>Nucleic Acids Res</source>
.
<year>2002</year>
;
<volume>30</volume>
(
<issue>7</issue>
):
<fpage>1575</fpage>
–
<lpage>84</lpage>
.
<pub-id pub-id-type="pmid">11917018</pub-id>
</mixed-citation>
</ref>
<ref id="bib60">
<label>60.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Emms</surname>
<given-names>DM</given-names>
</name>
,
<name name-style="western">
<surname>Kelly</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy</article-title>
.
<source>Genome Biol</source>
.
<year>2015</year>
;
<volume>16</volume>
(
<issue>1</issue>
):
<fpage>157</fpage>
.
<pub-id pub-id-type="pmid">26243257</pub-id>
</mixed-citation>
</ref>
<ref id="bib61">
<label>61.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Davis</surname>
<given-names>JJ</given-names>
</name>
,
<name name-style="western">
<surname>Gerdes</surname>
<given-names>S</given-names>
</name>
,
<name name-style="western">
<surname>Olsen</surname>
<given-names>GJ</given-names>
</name>
,
<etal>et al</etal>
.</person-group>
<article-title>PATtyFams: protein families for the microbial genomes in the PATRIC database</article-title>
.
<source>Front Microbiol</source>
.
<year>2016</year>
;
<volume>7</volume>
:
<fpage>118</fpage>
.
<pub-id pub-id-type="pmid">26903996</pub-id>
</mixed-citation>
</ref>
<ref id="bib62">
<label>62.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Frey</surname>
<given-names>BJ</given-names>
</name>
,
<name name-style="western">
<surname>Dueck</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Clustering by passing messages between data points</article-title>
.
<source>Science</source>
.
<year>2007</year>
;
<volume>315</volume>
(
<issue>5814</issue>
):
<fpage>972</fpage>
–
<lpage>6</lpage>
.
<pub-id pub-id-type="pmid">17218491</pub-id>
</mixed-citation>
</ref>
<ref id="bib63">
<label>63.</label>
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Lam</surname>
<given-names>SK</given-names>
</name>
,
<name name-style="western">
<surname>Pitrou</surname>
<given-names>A</given-names>
</name>
,
<name name-style="western">
<surname>Seibert</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Numba: a LLVM-based Python JIT compiler</article-title>
. In:
<source>Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, Austin, TX</source>
.
<publisher-loc>New York, NY</publisher-loc>
:
<publisher-name>ACM</publisher-name>
;
<year>2015</year>
, doi:
<pub-id pub-id-type="doi">10.1145/2833157.2833162</pub-id>
.</mixed-citation>
</ref>
<ref id="bib64">
<label>64.</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hu</surname>
<given-names>X</given-names>
</name>
,
<name name-style="western">
<surname>Friedberg</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Supporting data for “SwiftOrtho: a fast, memory-efficient, multiple genome orthology classifier”</article-title>
.
<source>GigaScience Database</source>
.
<year>2019</year>
<pub-id pub-id-type="doi">10.5524/100633</pub-id>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D029 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000D029 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021