Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000268 ( Pmc/Corpus ); précédent : 0002679; suivant : 0002690 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs</title>
<author>
<name sortKey="Li, Dinghua" sort="Li, Dinghua" uniqKey="Li D" first="Dinghua" last="Li">Dinghua Li</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huang, Yukun" sort="Huang, Yukun" uniqKey="Huang Y" first="Yukun" last="Huang">Yukun Huang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leung, Chi Ming" sort="Leung, Chi Ming" uniqKey="Leung C" first="Chi-Ming" last="Leung">Chi-Ming Leung</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">L3 Bioinformatics Limited, Western District, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Luo, Ruibang" sort="Luo, Ruibang" uniqKey="Luo R" first="Ruibang" last="Luo">Ruibang Luo</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">L3 Bioinformatics Limited, Western District, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ting, Hing Fung" sort="Ting, Hing Fung" uniqKey="Ting H" first="Hing-Fung" last="Ting">Hing-Fung Ting</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lam, Tak Wah" sort="Lam, Tak Wah" uniqKey="Lam T" first="Tak-Wah" last="Lam">Tak-Wah Lam</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">L3 Bioinformatics Limited, Western District, Hong Kong</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29072142</idno>
<idno type="pmc">5657035</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657035</idno>
<idno type="RBID">PMC:5657035</idno>
<idno type="doi">10.1186/s12859-017-1825-3</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000268</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000268</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs</title>
<author>
<name sortKey="Li, Dinghua" sort="Li, Dinghua" uniqKey="Li D" first="Dinghua" last="Li">Dinghua Li</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huang, Yukun" sort="Huang, Yukun" uniqKey="Huang Y" first="Yukun" last="Huang">Yukun Huang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leung, Chi Ming" sort="Leung, Chi Ming" uniqKey="Leung C" first="Chi-Ming" last="Leung">Chi-Ming Leung</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">L3 Bioinformatics Limited, Western District, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Luo, Ruibang" sort="Luo, Ruibang" uniqKey="Luo R" first="Ruibang" last="Luo">Ruibang Luo</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">L3 Bioinformatics Limited, Western District, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ting, Hing Fung" sort="Ting, Hing Fung" uniqKey="Ting H" first="Hing-Fung" last="Ting">Hing-Fung Ting</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lam, Tak Wah" sort="Lam, Tak Wah" uniqKey="Lam T" first="Tak-Wah" last="Lam">Tak-Wah Lam</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">L3 Bioinformatics Limited, Western District, Hong Kong</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using the trained Hidden Markov Model (HMM) to guide the traversal of
<italic>de Bruijn</italic>
graph gives obvious advantage over other assembly methods. Xander, as a pilot study, indeed has a lot of room for improvement. Apart from its slow speed, Xander uses only 1 
<italic>k</italic>
-mer size for graph construction and whatever choice of
<italic>k</italic>
will compromise either sensitivity or accuracy. Xander uses a Bloom-filter representation of
<italic>de Bruijn</italic>
graph to achieve a lower memory footprint. Bloom filters bring in false positives, and it is not clear how this would impact the quality of assembly. Xander does not keep track of the multiplicity of
<italic>k</italic>
-mers, which would have been an effective way to differentiate between erroneous
<italic>k</italic>
-mers and correct
<italic>k</italic>
-mers.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">In this paper, we present a new gene-targeted assembler MegaGTA, which attempts to improve Xander in different aspects. Quality-wise, it utilizes iterative
<italic>de Bruijn</italic>
graphs to take full advantage of multiple
<italic>k</italic>
-mer sizes to make the best of both sensitivity and accuracy. Computation-wise, it employs succinct
<italic>de Bruijn</italic>
graphs (SdBG) to achieve low memory footprint and high speed (the latter is benefited from a highly efficient parallel algorithm for constructing SdBG). Unlike Bloom filters, an SdBG is an exact representation of a
<italic>de Bruijn</italic>
graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of
<italic>k</italic>
-mers for building better HMM model.</p>
<p id="Par3">We have compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset, and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327Gbp), MegaGTA produced 9.7–19.3% more contigs than Xander, and these contigs were assigned to 10–25% more gene references. In our experiments, MegaGTA, depending on the number of
<italic>k</italic>
-mers used, is two to ten times faster than Xander.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par4">MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/HKU-BAL/megagta">https://github.com/HKU-BAL/megagta</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nagarajan, N" uniqKey="Nagarajan N">N Nagarajan</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, C" uniqKey="Miller C">C Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yuan, C" uniqKey="Yuan C">C Yuan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Sun, Y" uniqKey="Sun Y">Y Sun</name>
</author>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eddy, Sr" uniqKey="Eddy S">SR Eddy</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A Bankevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, Bh" uniqKey="Bloom B">BH Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hart, Pe" uniqKey="Hart P">PE Hart</name>
</author>
<author>
<name sortKey="Nilsson, Nj" uniqKey="Nilsson N">NJ Nilsson</name>
</author>
<author>
<name sortKey="Raphael, B" uniqKey="Raphael B">B Raphael</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Rizk, G" uniqKey="Rizk G">G Rizk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bolger, Am" uniqKey="Bolger A">AM Bolger</name>
</author>
<author>
<name sortKey="Lohse, M" uniqKey="Lohse M">M Lohse</name>
</author>
<author>
<name sortKey="Usadel, B" uniqKey="Usadel B">B Usadel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rho, M" uniqKey="Rho M">M Rho</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Ye, Y" uniqKey="Ye Y">Y Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29072142</article-id>
<article-id pub-id-type="pmc">5657035</article-id>
<article-id pub-id-type="publisher-id">1825</article-id>
<article-id pub-id-type="doi">10.1186/s12859-017-1825-3</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Software</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Dinghua</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Huang</surname>
<given-names>Yukun</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Leung</surname>
<given-names>Chi-Ming</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Luo</surname>
<given-names>Ruibang</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ting</surname>
<given-names>Hing-Fung</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Lam</surname>
<given-names>Tak-Wah</given-names>
</name>
<address>
<email>twlam@cs.hku.hk</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121742757</institution-id>
<institution-id institution-id-type="GRID">grid.194645.b</institution-id>
<institution>Department of Computer Science,</institution>
<institution>University of Hong Kong,</institution>
</institution-wrap>
Pokfulam, Hong Kong</aff>
<aff id="Aff2">
<label>2</label>
L3 Bioinformatics Limited, Western District, Hong Kong</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>16</day>
<month>10</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>16</day>
<month>10</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>18</volume>
<issue>Suppl 12</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</issue-sponsor>
<elocation-id>408</elocation-id>
<permissions>
<copyright-statement>© The Author(s). 2017</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using the trained Hidden Markov Model (HMM) to guide the traversal of
<italic>de Bruijn</italic>
graph gives obvious advantage over other assembly methods. Xander, as a pilot study, indeed has a lot of room for improvement. Apart from its slow speed, Xander uses only 1 
<italic>k</italic>
-mer size for graph construction and whatever choice of
<italic>k</italic>
will compromise either sensitivity or accuracy. Xander uses a Bloom-filter representation of
<italic>de Bruijn</italic>
graph to achieve a lower memory footprint. Bloom filters bring in false positives, and it is not clear how this would impact the quality of assembly. Xander does not keep track of the multiplicity of
<italic>k</italic>
-mers, which would have been an effective way to differentiate between erroneous
<italic>k</italic>
-mers and correct
<italic>k</italic>
-mers.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">In this paper, we present a new gene-targeted assembler MegaGTA, which attempts to improve Xander in different aspects. Quality-wise, it utilizes iterative
<italic>de Bruijn</italic>
graphs to take full advantage of multiple
<italic>k</italic>
-mer sizes to make the best of both sensitivity and accuracy. Computation-wise, it employs succinct
<italic>de Bruijn</italic>
graphs (SdBG) to achieve low memory footprint and high speed (the latter is benefited from a highly efficient parallel algorithm for constructing SdBG). Unlike Bloom filters, an SdBG is an exact representation of a
<italic>de Bruijn</italic>
graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of
<italic>k</italic>
-mers for building better HMM model.</p>
<p id="Par3">We have compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset, and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327Gbp), MegaGTA produced 9.7–19.3% more contigs than Xander, and these contigs were assigned to 10–25% more gene references. In our experiments, MegaGTA, depending on the number of
<italic>k</italic>
-mers used, is two to ten times faster than Xander.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par4">MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/HKU-BAL/megagta">https://github.com/HKU-BAL/megagta</ext-link>
.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Metagenomics</kwd>
<kwd>Assembly</kwd>
<kwd>De Bruijn graph</kwd>
<kwd>Targeted gene</kwd>
</kwd-group>
<conference>
<conf-name>12th International Symposium on Bioinformatics Research and Applications (ISBRA 2016)</conf-name>
<conf-acronym>ISBRA 2016</conf-acronym>
<conf-loc>Minsk, Belarus</conf-loc>
<conf-date>5-8 June 2016</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par15">Next generation sequencing has greatly promoted the study of metagenomics in recent years. These studies often involve de novo assembling of millions to billions of reads into contigs for gene annotation. This has triggered the study of advanced algorithms to significantly enhance the computational efficiency for metagenome assembly [
<xref ref-type="bibr" rid="CR1">1</xref>
<xref ref-type="bibr" rid="CR3">3</xref>
]. On the other hand, due to the prevalence of uneven coverage and cross-genome repeats [
<xref ref-type="bibr" rid="CR4">4</xref>
], it is common to get fragmented gene sequences. To overcome these drawbacks, several gene-targeted assembly methods, including EMIRGE [
<xref ref-type="bibr" rid="CR5">5</xref>
], REAGO [
<xref ref-type="bibr" rid="CR6">6</xref>
], SAT-Assembler [
<xref ref-type="bibr" rid="CR7">7</xref>
], and Xander [
<xref ref-type="bibr" rid="CR8">8</xref>
], have been published. Unlike the first three methods which attempt to sort out the reads that might have originated from targeted genes before assembly, Xander constructs a
<italic>de Bruijn</italic>
graph (DBG) whose nodes are the
<italic>k</italic>
-mers of all reads and searches the genes by a guided traversal through the
<italic>k</italic>
-mers on-the-fly. The assembly of a specific gene is guided by a profile Hidden Markov Model (HMM) [
<xref ref-type="bibr" rid="CR9">9</xref>
], which is built using the results of multiple sequence alignment of the underlying gene family. Starting from a node, Xander makes decisions at the branches in graph and outputs a unique path that results in the highest probability of the HMM. This overcomes the problem of de novo assembly that intrinsically stops at branches. More specifically, the Xander algorithm is operated on a combined graph of DBG and HMM. Its workflow is shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
and will be explained in detail in the next section.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>The workflow of Xander (left) and MegaGTA (right). Their differences are highlighted in bold</p>
</caption>
<graphic xlink:href="12859_2017_1825_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p id="Par16">Although Xander produces longer and higher-quality gene specific contigs than previous methods, there is still a lot of room for improvement. The followings are three improvements we considered in this paper:
<list list-type="simple">
<list-item>
<label>A.</label>
<p id="Par17">
<bold>Use multiple</bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer sizes.</bold>
Xander uses a fixed
<italic>k</italic>
to build a
<italic>de Bruijn</italic>
graph of
<italic>k</italic>
-mers. This leads to a classic dilemma, in which a large
<italic>k</italic>
results in gaps among low-coverage genomic regions, and genes coming from these regions are unlikely to be assembled [
<xref ref-type="bibr" rid="CR10">10</xref>
]; and a small
<italic>k</italic>
may collapse short repetitive regions and result in excessive branches in the
<italic>de Bruijn</italic>
graph. Though HMM-guided assembly targets to resolve a repeat by choosing a best path that “suggested” by HMM, it is not impossible for two parts of different genes be combined into a chimeric contig via a repeat. In this regard, a small
<italic>k</italic>
tends to produce more misassemblies. Iterative
<italic>de Bruijn</italic>
graph [
<xref ref-type="bibr" rid="CR10">10</xref>
], which leverages
<italic>k</italic>
-mers with multiple sizes, has showed its advantages in several de novo assemblers [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
<xref ref-type="bibr" rid="CR13">13</xref>
]. Benchmarks hereinafter show that HMM-guided assembly can benefit from iterative
<italic>de Bruijn</italic>
graphs.</p>
</list-item>
<list-item>
<label>B.</label>
<p id="Par18">
<bold>Filter erroneous</bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mers in de Bruijn graph.</bold>
<italic>k</italic>
-mers that appear only once in a given set of reads are error prone. In order to achieve a higher sensitivity of low coverage genes, Xander opted not to filter
<italic>k</italic>
-mers with low multiplicity. Instead, Xander relies on the HMM to avoid erroneous
<italic>k</italic>
-mers. But it still results in many contigs with either structural errors or incorrect bases, especially when a small size of
<italic>k</italic>
-mer is used. To avoid this defect, we penalize
<italic>k</italic>
-mers that occur only once during the HMM-guided searching (equivalent to set a prior erroneous probability to these
<italic>k</italic>
-mers).</p>
</list-item>
<list-item>
<label>C.</label>
<p id="Par19">
<bold>Avoid false positive</bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mers caused by probabilistic data structure</bold>
. To achieve better memory efficiency, Xander represents de Bruijn graph using a Bloom filter [
<xref ref-type="bibr" rid="CR14">14</xref>
], a probabilistic data structure that contains a certain rate of false positive (but free of false negative) members. In our solution, we replace Bloom filter with succinct
<italic>de Bruijn</italic>
graph [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR15">15</xref>
], a state-of-the-art memory-efficient data structure free from both false positives and false negatives. To study the improvement, we identified and benchmarked the Xander’s misassembled contigs due to
<italic>k</italic>
-mers, by querying the exact representation of a
<italic>de Bruijn</italic>
graph.</p>
</list-item>
</list>
</p>
<p id="Par20">The three aforementioned improvements have been implemented and integrated into a new gene-targeted assembler called MegaGTA (workflow and the differences to Xander also shown in Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
). We demonstrate the effectiveness of MegaGTA with two datasets, first on an HMP-defined mock community, and second on a large rhizosphere soil metagenomic sample [
<xref ref-type="bibr" rid="CR8">8</xref>
]. With iterative de Bruijn graphs, MegaGTA achieved higher sensitivity and accuracy than Xander (even with well-tuned parameters). More interestingly, MegaGTA, even with more calculations due to multiple
<italic>k</italic>
-mer sizes, was still about two to four times faster than Xander when tested on a 24-core server, while using a similar amount of peak memory. If one runs Xander repeatedly with different
<italic>k</italic>
-mer sizes in a way similar to MegaGTA, then the relative speed-up of MegaGTA is even more significant.</p>
<p id="Par21">With respect to the rhizosphere soil dataset, we found that 0.02, 0.39 and 10.52% of contigs generated by Xander contain false positive
<italic>k</italic>
-mers, when using a Bloom filter size of 256GB, 128GB and 64GB, respectively. Apparently Bloom filter causes accuracy issues when memory is constrained. Succinct
<italic>de Bruijn</italic>
graph overcomes the inexactness of Bloom filter, while allows faster graph construction.</p>
</sec>
<sec id="Sec2">
<title>Implementations</title>
<sec id="Sec3">
<title>HMM-guided assembly on combined weighted assembly graphs</title>
<p id="Par22">A combined weighted assembly graph (CAG) is the key structure of the HMM-guided algorithm exploited by Xander, as well as our implementation MegaGTA. A CAG is a combination of a
<italic>de Bruijn</italic>
graph and a profile HMM.</p>
<p id="Par23">
<italic>De Bruijn</italic>
graphs (DBG) are used in most short-read assemblers. In the context of genome assembly, each node in a
<italic>de Bruijn</italic>
graph is a
<italic>k</italic>
-mer, a length-
<italic>k</italic>
string in nucleotide or peptide alphabet. If the (
<italic>k</italic>
-1)-long suffix of a node
<italic>u</italic>
is the same as the (
<italic>k</italic>
-1)-long prefix of a node
<italic>v</italic>
, there is a directed edge from
<italic>u</italic>
to
<italic>v</italic>
. The
<italic>k</italic>
-mers typically comes from a set of unassembled reads.</p>
<p id="Par24">A profile HMM [
<xref ref-type="bibr" rid="CR7">7</xref>
] is a directed graph that represents a set of aligned sequences. A node of HMM corresponds to a column (or position) of the alignment, and there are three kinds of states, namely match, insertion and deletion for each node. Each edge is associated with a transition probability (
<bold>
<italic>P</italic>
</bold>
<sub>
<italic>transition</italic>
</sub>
), modelling the likelihood of the transition from a position with a certain state to another with the same or another state. For nodes of match states only, emission probability (
<bold>
<italic>P</italic>
</bold>
<sub>
<italic>emission</italic>
</sub>
) is a property denotes how likely a base (nucleotide or peptide) would appear at that position.</p>
<p id="Par25">Let
<italic>V(G)</italic>
and
<italic>E(G)</italic>
be the vertex set and edge set of a graph
<italic>G</italic>
, respectively. Conceptually, given a
<italic>de Bruijn</italic>
graph
<italic>D</italic>
, and an HMM
<italic>H</italic>
, the vertices of the CAG
<italic>C</italic>
of
<italic>D</italic>
and
<italic>H</italic>
is the Cartesian product of
<italic>V(D)</italic>
and
<italic>V(H)</italic>
. For a vertex
<italic>w</italic>
of
<italic>C,</italic>
we denote
<italic>w</italic>
 . 
<italic>u</italic>
 ∈ 
<italic>V</italic>
(
<italic>D</italic>
) and
<italic>w</italic>
 . 
<italic>v</italic>
 ∈ 
<italic>V</italic>
(
<italic>H</italic>
) the
<italic>de Bruijn</italic>
graph component and HMM component of
<italic>w</italic>
, respectively. An edge (
<italic>w</italic>
, 
<italic>w</italic>
<sup></sup>
) exists in
<italic>E(C)</italic>
if and only if it satisfies one of the following conditions:
<list list-type="bullet">
<list-item>
<p id="Par26">(
<italic>w</italic>
 . 
<italic>u</italic>
, 
<italic>w</italic>
' . 
<italic>u</italic>
) ∈ 
<italic>E</italic>
(
<italic>D</italic>
) and (
<italic>w</italic>
 . 
<italic>v</italic>
, 
<italic>w</italic>
' . 
<italic>v</italic>
) ∈ 
<italic>E</italic>
(
<italic>D</italic>
), and
<italic>w</italic>
' . 
<italic>v</italic>
is a match or insertion state of HMM;</p>
</list-item>
<list-item>
<p id="Par27">
<italic>w</italic>
 . 
<italic>u</italic>
 = 
<italic>w</italic>
' . 
<italic>u</italic>
and (
<italic>w</italic>
 . 
<italic>v</italic>
, 
<italic>w</italic>
' . 
<italic>v</italic>
) ∈ 
<italic>E</italic>
(
<italic>D</italic>
), and
<italic>w</italic>
<sup>'</sup>
 . 
<italic>v</italic>
is a deletion state of HMM.</p>
</list-item>
</list>
</p>
<p id="Par28">Every edge in a CAG is assigned a weight:
<list list-type="bullet">
<list-item>
<p id="Par29">
<italic>weight</italic>
(
<italic>w</italic>
, 
<italic>w</italic>
') = log[
<italic>P</italic>
<sub>
<italic>transition</italic>
</sub>
(
<italic>w</italic>
 . 
<italic>v</italic>
, 
<italic>w</italic>
' . 
<italic>v</italic>
)] + log[
<italic>P</italic>
<sub>
<italic>emission</italic>
</sub>
(
<italic>v</italic>
<sub>
<italic>j</italic>
</sub>
, 
<italic>c</italic>
)], where
<italic>c</italic>
is the last character
<italic>w</italic>
' . 
<italic>u</italic>
, if
<italic>w</italic>
' . 
<italic>v</italic>
is a match state;</p>
</list-item>
<list-item>
<p id="Par30">
<italic>weight</italic>
(
<italic>w</italic>
, 
<italic>w</italic>
') = log[
<italic>P</italic>
<sub>
<italic>transition</italic>
</sub>
(
<italic>w</italic>
 . 
<italic>v</italic>
, 
<italic>w</italic>
' . 
<italic>v</italic>
)], if
<italic>w</italic>
' . 
<italic>v</italic>
is an insertion or deletion state.</p>
</list-item>
</list>
</p>
<p id="Par31">Given a
<italic>de Bruijn</italic>
graph
<italic>D</italic>
and an HMM
<italic>H</italic>
, the algorithm of Xander searches for a path from a starting vertex to a terminating vertex, with the highest sum of edge weights on the CAG of
<italic>D</italic>
and
<italic>H</italic>
using the A* algorithm [
<xref ref-type="bibr" rid="CR16">16</xref>
]. A starting vertex is identified by a
<italic>k</italic>
-mer appearing in the reads and exactly matching a set of aligned reference sequences. Such a
<italic>k</italic>
-mer and the HMM state implied by the matched position in the reference sequences form a starting vertex of the CAG. A terminating vertex here means a vertex in the CAG whose HMM component is an ending vertex of the HMM. A reverse search guided by a reverse HMM
<italic>H′</italic>
is also needed, and the two sequences spelled from the
<italic>de Bruijn</italic>
graph component of the two best paths are merged to create a contig.</p>
</sec>
<sec id="Sec4">
<title>Adding low coverage penalty to a CAG</title>
<p id="Par32">In MegaGTA, we introduce a penalty for the vertices with low multiplicity, i.e.
<italic>k</italic>
-mers that appear only once in the set of reads (multiplicity = 1). More precisely, for an edge (
<italic>w</italic>
, 
<italic>w</italic>
<sup></sup>
) of a CAG, if
<italic>w</italic>
<sup></sup>
 . 
<italic>u</italic>
appears only once in the set of reads, the weight of
<italic>weight</italic>
(
<italic>w</italic>
, 
<italic>w</italic>
') becomes:
<list list-type="bullet">
<list-item>
<p id="Par33">
<italic>weight</italic>
(
<italic>w</italic>
, 
<italic>w</italic>
') = log[
<italic>P</italic>
<sub>
<italic>transition</italic>
</sub>
(
<italic>w</italic>
 . 
<italic>v</italic>
, 
<italic>w</italic>
' . 
<italic>v</italic>
)] + log[
<italic>P</italic>
<sub>
<italic>emission</italic>
</sub>
(
<italic>v</italic>
<sub>
<italic>j</italic>
</sub>
, 
<italic>c</italic>
)] + log(
<italic>α</italic>
), where
<italic>c</italic>
is the last character
<italic>w</italic>
<sup></sup>
 . 
<italic>u</italic>
, or</p>
</list-item>
<list-item>
<p id="Par34">
<italic>weight</italic>
(
<italic>w</italic>
, 
<italic>w</italic>
<sup></sup>
) = log[
<italic>P</italic>
<sub>
<italic>transition</italic>
</sub>
(
<italic>w</italic>
 . 
<italic>v</italic>
, 
<italic>w</italic>
' . 
<italic>v</italic>
)] + log(
<italic>α</italic>
), if
<italic>w</italic>
' . 
<italic>v</italic>
is an insertion state,</p>
</list-item>
</list>
where
<italic>α</italic>
is a user-defined threshold (0.5 by default in MegaGTA, which means that we assume the prior probability of a k-mer with multiplicity = 1 being an erroneous
<italic>k</italic>
-mer is 0.5). Intuitively, we reduce the probability of entering a CAG vertex with count-1 
<italic>k</italic>
-mer by a factor
<italic>α</italic>
. This leads the search onto high-coverage paths that are more likely to be correct.</p>
</sec>
<sec id="Sec5">
<title>Iterative
<italic>de Bruijn</italic>
graphs</title>
<p id="Par35">The selection of
<italic>k</italic>
-mer size affects the character of a
<italic>de Bruijn</italic>
graph, and further affects the result of an HMM-guided assembly. Basically, a large
<italic>k</italic>
makes the graph more fragmented, especially for low-coverage regions, due to the absence of overlapping
<italic>k</italic>
-mers. A fragmented graph is less sensitive for both de novo assembly and gene-targeted assembly. A small
<italic>k</italic>
makes the graph collapse at short repeats, and though ideally an HMM could resolve the repeats, it is still possible for different genes to be incorrectly fused via a repetitive segment. With a less stringent overlapping requirement, a small
<italic>k</italic>
also results in a graph with more simple bubbles or complex grids. This expands the number of paths to be examined by the HMM-guided graph traversal and makes it likely to result in a path with more mismatches.</p>
<p id="Par36">To benefit from both small and large
<italic>k</italic>
-mer sizes, we adopt the HMM-guided algorithm on iterative
<italic>de Bruijn</italic>
graphs. Let DBG(
<italic>R</italic>
, 
<italic>k</italic>
) be the
<italic>de Bruijn</italic>
graph whose
<italic>k</italic>
-mer size is
<italic>k</italic>
and constructed form a set of reads
<italic>R</italic>
. Given a set of
<italic>n</italic>
integers
<italic>k</italic>
<sub>1</sub>
 < 
<italic>k</italic>
<sub>2</sub>
 < … < 
<italic>k</italic>
<sub>
<italic>n</italic>
</sub>
and a set of reads
<italic>R</italic>
, an iterative
<italic>de Bruijn</italic>
graph of
<italic>R</italic>
up to 
<italic>k</italic>
<sub>
<italic>i</italic>
</sub>
G(
<italic>R</italic>
, 
<italic>k</italic>
<sub>
<italic>i</italic>
</sub>
) is defined recursively as follows:
<disp-formula id="Equa">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \mathrm{G}\left(R,{k}_1\right)=\mathrm{DBG}\left(R,{k}_1\right); $$\end{document}</tex-math>
<mml:math id="M2" display="block">
<mml:mi mathvariant="normal">G</mml:mi>
<mml:mfenced close=")" open="(" separators=",">
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>DBG</mml:mi>
<mml:mfenced close=")" open="(" separators=",">
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mfenced>
<mml:mo>;</mml:mo>
</mml:math>
<graphic xlink:href="12859_2017_1825_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
<disp-formula id="Equb">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \mathrm{C}\left({k}_i\right)=\mathrm{the}\ \mathrm{set}\ \mathrm{of}\ de\ novo\ \mathrm{assembled}\ \mathrm{contigs}\ \mathrm{from}\ \mathrm{G}\left(R,{k}_i\right); $$\end{document}</tex-math>
<mml:math id="M4" display="block">
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mfenced close=")" open="(">
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mtext>the</mml:mtext>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mi>set</mml:mi>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext>of</mml:mtext>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mi mathvariant="italic">de</mml:mi>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext mathvariant="italic">novo</mml:mtext>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext>assembled contigs from</mml:mtext>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mi mathvariant="normal">G</mml:mi>
<mml:mfenced close=")" open="(" separators=",">
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mfenced>
<mml:mo>;</mml:mo>
</mml:math>
<graphic xlink:href="12859_2017_1825_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
<list list-type="bullet">
<list-item>
<p id="Par37">G(
<italic>R</italic>
, 
<italic>k</italic>
<sub>
<italic>i</italic>
 + 1</sub>
) = DBG(
<italic>R</italic>
 ∪ C(
<italic>k</italic>
<sub>
<italic>i</italic>
</sub>
), 
<italic>k</italic>
<sub>
<italic>i</italic>
 + 1</sub>
).</p>
</list-item>
</list>
</p>
<p id="Par38">For simplicity, we call G(
<italic>R</italic>
) = G(
<italic>R</italic>
, 
<italic>k</italic>
<sub>
<italic>n</italic>
</sub>
) the iterative
<italic>de Bruijn</italic>
graph of
<italic>R</italic>
. Intuitively, for each
<italic>i</italic>
, some
<italic>k</italic>
<sub>
<italic>i</italic>
 + 1</sub>
-mers absent from
<italic>R</italic>
(due to insufficient read coverage) could be de novo assembled from G(
<italic>R</italic>
, 
<italic>k</italic>
<sub>
<italic>i</italic>
</sub>
). The intermediate contigs C(
<italic>k</italic>
<sub>
<italic>i</italic>
</sub>
) are dependent on the de novo assembly algorithm used. In our implementation, tip removal and bubble merging [
<xref ref-type="bibr" rid="CR17">17</xref>
] are done prior to output the sequences of maximum paths without branches as contigs. The HMM-guided algorithm is applied on G(
<italic>R</italic>
) to search for targeted gene sequences.</p>
<p id="Par39">In gene-targeted assembly, it is possible to replace intermediate contigs C(
<italic>k</italic>
<sub>
<italic>i</italic>
</sub>
) with a set of HMM-guided assembled contigs. However, we only applied traditional assembly graph pruning tactics due to the reason that HMM-guided contigs at a smaller
<italic>k</italic>
-mer size are error-prone in practice; the errors would be accumulated into the final graph G(
<italic>R</italic>
). Traditional de novo assembly graph pruning methods, such as tips removal, bubbles merging, et cetera are empirically more accurate.</p>
</sec>
<sec id="Sec6">
<title>Succinct
<italic>de Bruijn</italic>
graphs</title>
<p id="Par40">In our implementation, we represent a
<italic>de Bruijn</italic>
graph with a compressed data structure, namely Succinct
<italic>de Bruijn</italic>
Graph (SdBG) [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR15">15</xref>
]. Unlike Xander that uses a Bloom filter to represent a
<italic>de Bruijn</italic>
graph, which may incur false positive
<italic>k</italic>
-mers, we choose an SdBG for the following reasons: First, it is not only memory-efficient but also an exact representation of a
<italic>de Bruijn</italic>
graph. Second, there is a highly parallelized algorithm to construct an SdBG rapidly, which is essential since we need to build multiple intermediate graphs until the final iterative
<italic>de Bruijn</italic>
graph is obtained. Third, de novo assembly, as required by building an iterative
<italic>de Bruijn</italic>
graph, is an uneasy job with a Bloom filter. The inexactness of bloom filter could be solved by marking false
<italic>k</italic>
-mers in a Bloom filter with extra memory, but by using a disk-based algorithm [
<xref ref-type="bibr" rid="CR18">18</xref>
], which is time-consuming. In contrast, it is easy to do in-memory, multi-threaded de novo assembly with an SdBG.</p>
</sec>
</sec>
<sec id="Sec7">
<title>Results and discussion</title>
<p id="Par41">We conducted five experiments using two metagenomic NGS datasets. The first three experiments were carried out on a mock metagenomic community dataset with known reference gene sequences. Thus, we can evaluate the sensitivity and accuracy of assembling results by MegaGTA and Xander directly. We evaluated the trade-off between sensitivity and accuracy in
<italic>k</italic>
-mer size selection (Section 3.1), the effectiveness of low-coverage penalty strategy on accuracy (Section 3.2), and the improvements in sensitivity and accuracy brought by iterative
<italic>de Brujin</italic>
graphs (Section 3.3). The other two experiments were conducted using a real, large and complex soil metagenomic sample. We showed that the MegaGTA not only assembled more contigs as well as genes than Xander on real dataset, but also achieve a higher speed, which is essential to large metagenomic samples (Section 3.4). The false positive effect of Bloom filters was also evaluated using this dataset (Section 3.5). All experiments were run on a server equipped with 24 2.6GHz Intel CPU cores and 1 T DDR3 RAM. Both MegaGTA and Xander were configured with 24 threads though Xander only supported multi-threading for its starting
<italic>k</italic>
-mer finding component.</p>
<sec id="Sec8">
<title>Trade-off in
<italic>k</italic>
-mer size selection</title>
<p id="Par42">We ran MegaGTA (using single
<italic>k</italic>
) and Xander on an HMP-defined mock community dataset (SRR172902 and SRR172903), that contains 22 known microorganisms, to observe the differences in gene sensitivity and accuracy with different
<italic>k</italic>
-mer sizes. We assembled the rplB genes from the sequencing data, using the protein/nucleotide reference sequences and the HMM used in the Xander paper. The reads were firstly preprocessed by Trimmomatic [
<xref ref-type="bibr" rid="CR19">19</xref>
] to remove adaptor sequences and trim low quality bases (at a quality score of 2 to both ends of a read). Raw contigs of length at least 450 nucleotides (or 150 amino acid) were passed through a series of post-processing steps as suggested by Xander, including clustering at 99% amino acid identity and choosing the longest one as a representative. Then UCHIME [
<xref ref-type="bibr" rid="CR20">20</xref>
] was used for chimeric removal against a set of reference DNA sequences. The contigs after the post-processing steps were then aligned to the know rplB gene sequences for analysis.</p>
<p id="Par43">Among the 22 microorganisms in the mock community, only 20 known rplB gene sequences could be downloaded from the NCBI as of Jan 2016. We evaluated the sensitivity (number of genes recovered and their gene fractions), accuracy (number of misassemblies and mismatches), and duplication ratio for each assembly using metaQUAST [
<xref ref-type="bibr" rid="CR21">21</xref>
] on these 20 known gene sequences.</p>
<p id="Par44">We chose three
<italic>k</italic>
-mer sizes, 30, 36 and 45, for both MegaGTA and Xander. The evaluation results given by metaQUAST are shown in Table 
<xref rid="Tab1" ref-type="table">1</xref>
. In total, 10 rplB genes were recovered by either MegaGTA or Xander. Both MegaGTA and Xander recovered more genes with a
<italic>k</italic>
-mer size of 30. Both reported fewer misassemblies, partially unaligned contigs and mismatches as the
<italic>k</italic>
-mer size went larger. The duplication ratio was also higher with a smaller
<italic>k</italic>
-mer size, except for the assembly of Xander with
<italic>k</italic>
 = 36. By checking the 36-mers or 45-mers in the contigs assembled by
<italic>k</italic>
 = 30, we found that some of them were unrecovered because they were missing in low coverage regions. It is interesting that some genes (for example, the rplB gene of
<italic>Staphylococcus epidermidis</italic>
) were only assembled by
<italic>k</italic>
 = 45. When looking at the contigs before chimeric removal, we found that the missing genes were actually “assembled”, but contained too many mismatches and hence been removed by UCHIME. This again indicates that a small
<italic>k</italic>
-mer size is prone to produce chimeric or erroneous contigs. Regarding time efficiency, using small
<italic>k</italic>
required more time than large
<italic>k</italic>
, due to excessive number of branches in the correponding
<italic>de Bruijn</italic>
graphs. MegaGTA was 8.8 to 14.9 times faster than Xander depending on the
<italic>k</italic>
-mer size used.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Assembly statistics of different
<italic>k</italic>
-mer sizes</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th colspan="3">MegaGTA</th>
<th colspan="3">Xander</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<italic>k</italic>
-mer size</td>
<td>30</td>
<td>36</td>
<td>45</td>
<td>30</td>
<td>36</td>
<td>45</td>
</tr>
<tr>
<td># of contigs</td>
<td>16</td>
<td>7</td>
<td>4</td>
<td>14</td>
<td>7</td>
<td>4</td>
</tr>
<tr>
<td># of gene recoverd</td>
<td>9</td>
<td>5</td>
<td>4</td>
<td>8</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>duplication ratio</td>
<td>1.82</td>
<td>1.46</td>
<td>1.00</td>
<td>1.75</td>
<td>1.82</td>
<td>1.00</td>
</tr>
<tr>
<td># misassembled contigs</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td># partially unaligned contigs</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td># mismatches per 100kbp</td>
<td>148</td>
<td>150</td>
<td>96</td>
<td>534</td>
<td>278</td>
<td>64</td>
</tr>
<tr>
<td>Wall time (second)</td>
<td>101</td>
<td>73</td>
<td>65</td>
<td>1264</td>
<td>1090</td>
<td>573</td>
</tr>
<tr>
<td colspan="7">The gene fraction of each recovered rplB genes (%)</td>
</tr>
<tr>
<td>
<italic>Acinetobacter baumannii</italic>
</td>
<td>84.8</td>
<td></td>
<td></td>
<td>84.8</td>
<td></td>
<td></td>
</tr>
<tr>
<td>
<italic>Bacteroides vulgatus</italic>
</td>
<td>82.5</td>
<td>82.5</td>
<td></td>
<td>82.5</td>
<td>82.5</td>
<td></td>
</tr>
<tr>
<td>
<italic>Deinococcus radiodurans</italic>
</td>
<td>99.6</td>
<td>99.6</td>
<td>81.5</td>
<td>99.6</td>
<td>99.6</td>
<td>81.5</td>
</tr>
<tr>
<td>
<italic>Escherichia coli</italic>
</td>
<td>81.4</td>
<td></td>
<td></td>
<td>81.4</td>
<td></td>
<td></td>
</tr>
<tr>
<td>
<italic>Propionibacterium acnes</italic>
</td>
<td>78.1</td>
<td></td>
<td></td>
<td>78.1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>
<italic>Rhodobacter sphaeroides</italic>
</td>
<td>98.2</td>
<td>64.3</td>
<td></td>
<td>98.2</td>
<td>64.3</td>
<td></td>
</tr>
<tr>
<td>
<italic>Staphylococcus aureus</italic>
</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
</tr>
<tr>
<td>
<italic>Staphylococcus epidermidis</italic>
</td>
<td></td>
<td></td>
<td>99.6</td>
<td></td>
<td></td>
<td>99.6</td>
</tr>
<tr>
<td>
<italic>Streptococcus mutans</italic>
</td>
<td>55.0</td>
<td>55.0</td>
<td>93.2</td>
<td></td>
<td></td>
<td>93.9</td>
</tr>
<tr>
<td>
<italic>Streptococcus pneumoniae</italic>
</td>
<td>62.2</td>
<td></td>
<td></td>
<td>62.2</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par45">In conclusion, small
<italic>k</italic>
-mer size is more sensitive, but tends to yield erroneous contigs. Large
<italic>k</italic>
-mer size could assemble genes accurately, at the expense of losing low coverage genes.</p>
</sec>
<sec id="Sec9">
<title>Low coverage penalty improves the accuracy of gene-targeted assembly</title>
<p id="Par46">It is shown in Table
<xref rid="Tab1" ref-type="table">1</xref>
that MegaGTA and Xander generated slightly different results for each
<italic>k</italic>
-mer size. This is attribute to the low coverage penalty of MegaGTA. We picked
<italic>k</italic>
 = 36 as an example to evaluate its effectiveness. In addition, we evaluated how much the low coverage penalty could substitute the chimeric removal using UCHIME.</p>
<p id="Par47">As shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
, without low coverage penalty, MegaGTA had almost the same results (after UCHIME) as Xander (as shown in Table
<xref rid="Tab1" ref-type="table">1</xref>
). With low coverage penalty enabled, the number of mismatches decreased significantly. The effectiveness of the penalty was more salient before chimeric removal. It also reduced the number of partially unaligned contigs.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Assembly result with or without low coverage penalty</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2"></th>
<th colspan="2">Before UCHIME</th>
<th colspan="2">After UCHIME</th>
</tr>
<tr>
<th>with penalty</th>
<th>without penalty</th>
<th>with penalty</th>
<th>without penalty</th>
</tr>
</thead>
<tbody>
<tr>
<td># of gene contigs</td>
<td>13</td>
<td>14</td>
<td>7</td>
<td>7</td>
</tr>
<tr>
<td># of gene recoverd</td>
<td>6</td>
<td>6</td>
<td>5</td>
<td>4</td>
</tr>
<tr>
<td># misassembled contigs</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td># partially unaligned contigs</td>
<td>2</td>
<td>3</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td># mismatches per 100kbp</td>
<td>543.9</td>
<td>997.1</td>
<td>149.9</td>
<td>208.8</td>
</tr>
<tr>
<td colspan="5">The gene fraction of each recovered rplB genes (%)</td>
</tr>
<tr>
<td>
<italic>Bacteroides vulgatus</italic>
</td>
<td>82.5</td>
<td>82.5</td>
<td>82.5</td>
<td>82.5</td>
</tr>
<tr>
<td>
<italic>Deinococcus radiodurans</italic>
</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
</tr>
<tr>
<td>
<italic>Rhodobacter_sphaeroides</italic>
</td>
<td>64.3</td>
<td>64.3</td>
<td>64.3</td>
<td>64.3</td>
</tr>
<tr>
<td>
<italic>Staphylococcus_aureus</italic>
</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
<td>99.6</td>
</tr>
<tr>
<td>
<italic>Staphylococcus_epidermidis</italic>
</td>
<td>99.6</td>
<td>99.6</td>
<td></td>
<td></td>
</tr>
<tr>
<td>
<italic>Streptococcus_mutans</italic>
</td>
<td>84.3</td>
<td>84.3</td>
<td>55.0</td>
<td></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par48">With low coverage penalty, MegaGTA produced one extra contig corresponding to
<italic>Streptococcus mutans</italic>
, and it covered 55% of the gene after chimeric removal. By manual inspection, we found that the contig was also assembled by MegaGTA without the penalty, but had been merged into a longer contig (which has three more mismatches than that of low coverage penalty) at 99% amino acid clustering similarity and then removed by UCHIME. Thus, although UCHIME can remove erroneous contigs, the accuracy of the raw contigs, which may affect the clustering result, is still important. In this regard, the low coverage penalty is really helpful.</p>
</sec>
<sec id="Sec10">
<title>Iterative
<italic>de Bruijn</italic>
graph outperforms merging contigs of individual
<italic>k</italic>
-mers</title>
<p id="Par49">We evaluated the effectiveness of iterative
<italic>de Bruijn</italic>
graph approach of MegaGTA on the HMP-defined dataset. We iterate the
<italic>de Bruijn</italic>
graph on 3 
<italic>k</italic>
-mer sizes 30, 36 and 45. In order to conduct a fair comparison with Xander that ran with a single
<italic>k</italic>
only, we manually combined the raw contigs outputted by Xander with the 3 
<italic>k</italic>
-mer sizes, to maximize its sensitivity. The same post-processing procedures as Xander were applied to the combined contigs.</p>
<p id="Par50">Evaluation results given by metaQUAST are presented in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. MegaGTA achieved the same or higher fraction for every microorganism, and much fewer misassemblies and mismatches. By combining the contigs of different
<italic>k</italic>
-mer sizes, Xander gained higher gene sensitivity, but many misassembled or erroneous contigs assembled by
<italic>k</italic>
 = 30 were also included. Moreover, the duplication ratio went higher after the combination, and the running time of Xander was ten times greater than that of MegaGTA.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Assembly results of MegaGTA (using iterative
<italic>de Bruijn</italic>
graph) and Xander (merging contigs of three
<italic>k</italic>
-mer sizes)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th>MegaGTA (iterates on
<italic>k</italic>
 = 30,36,45)</th>
<th>Xander (Union of
<italic>k</italic>
 = 30,36,45)</th>
</tr>
</thead>
<tbody>
<tr>
<td># of gene contigs</td>
<td>10</td>
<td>19</td>
</tr>
<tr>
<td># of genes recovered</td>
<td>10</td>
<td>10</td>
</tr>
<tr>
<td>duplication ratio</td>
<td>1</td>
<td>1.79</td>
</tr>
<tr>
<td># misassembled contigs</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td># partially unaligned contigs</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td># mismatches per 100kbp</td>
<td>13.52</td>
<td>453.05</td>
</tr>
<tr>
<td>Time (second)</td>
<td>277</td>
<td>2927</td>
</tr>
<tr>
<td colspan="3">The gene fraction of each recovered rplB genes (%)</td>
</tr>
<tr>
<td>
<italic>Acinetobacter_baumannii</italic>
</td>
<td>98.77</td>
<td>84.77</td>
</tr>
<tr>
<td>
<italic>Bacteroides_vulgatus</italic>
</td>
<td>82.48</td>
<td>82.48</td>
</tr>
<tr>
<td>
<italic>Deinococcus_radiodurans</italic>
</td>
<td>99.64</td>
<td>99.64</td>
</tr>
<tr>
<td>
<italic>Escherichia_coli</italic>
</td>
<td>81.39</td>
<td>81.39</td>
</tr>
<tr>
<td>
<italic>Propionibacterium_acnes</italic>
</td>
<td>78.14</td>
<td>78.14</td>
</tr>
<tr>
<td>
<italic>Rhodobacter_sphaeroides</italic>
</td>
<td>98.21</td>
<td>98.21</td>
</tr>
<tr>
<td>
<italic>Staphylococcus_aureus</italic>
</td>
<td>99.64</td>
<td>99.64</td>
</tr>
<tr>
<td>
<italic>Staphylococcus_epidermidis</italic>
</td>
<td>99.64</td>
<td>99.64</td>
</tr>
<tr>
<td>
<italic>Streptococcus_mutans</italic>
</td>
<td>99.29</td>
<td>99.29</td>
</tr>
<tr>
<td>
<italic>Streptococcus_pneumoniae</italic>
</td>
<td>63.31</td>
<td>62.23</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par51">In
<italic>Streptococcus mutans</italic>
, although metaQUAST reported a fraction of 99.29% for its rplB gene assembled by Xander, we found that the gene (840 bp) was covered by two shorter contigs of length 783 bp and 468 bp, respectively. In contrast, MegaGTA assembled one contig of length 828 bp that covered the almost full length of the gene. Therefore, for some genes, a small
<italic>k</italic>
or a large
<italic>k</italic>
alone could only assemble part of their sequence accurately. However, a longer path that correctly encodes the target gene and is detectable by the HMM-guided search could exist in an iterative graph. Arguably, an iterative
<italic>de Bruijn</italic>
graph provides a better solution to assemble such genes.</p>
</sec>
<sec id="Sec11">
<title>MegaGTA achieves higher sensitivity on real dataset</title>
<p id="Par52">To test the performance of MegaGTA on real dataset, we compare the performance of MegaGTA with the gene-targeted assembler Xander and the de novo metagenome assembler MEGAHIT [
<xref ref-type="bibr" rid="CR3">3</xref>
] (v1.0.5) on assembling one phylogenetic marker gene (rplB) and two functional marker genes (nifH and nirK) from a corn rhizosphere soil metagenomic sample [
<xref ref-type="bibr" rid="CR8">8</xref>
]. The reads were trimmed at the first bases with the quality score of 2. 327Gbp remained after the quality trimming.</p>
<p id="Par53">We ran MegaGTA in its default iterative mode (
<italic>k</italic>
 = 30, 36 and 45), and ran Xander with
<italic>k</italic>
 = 45 and a Bloom filter size of 128G (allocating 200GB JAVA virtual machine memory) as suggested in its paper. We also tried to run Xander with
<italic>k</italic>
 = 30 and 784GB JAVA virtual machine memory, but the process did not finish after 2 weeks.</p>
<p id="Par54">We ran MEGAHIT with “meta-large” preset and used FragGeneScan [
<xref ref-type="bibr" rid="CR22">22</xref>
] (v1.30) to predict genes from the assembled contigs. HMMER [
<xref ref-type="bibr" rid="CR23">23</xref>
] was then applied (v3.1b2) to identify the genes of rplB, nifH and nirK. Only gene sequence with bit-score > = 50 against the profile-HMMs were retained as gene contigs assembled by MEGAHIT.</p>
<p id="Par55">All raw gene contigs constructed by the above assemblers with length longer than 450 bp were clustered at 99% amino acid identity, and chimeras were removed using UCHIME against a set of reference sequences. We also lowered the clustering identity threshold to 95%, as this value was also used in [
<xref ref-type="bibr" rid="CR8">8</xref>
] for analysis. Similar to the experiments described in [
<xref ref-type="bibr" rid="CR8">8</xref>
], we used Framebot [
<xref ref-type="bibr" rid="CR24">24</xref>
] to find the closest matches to a set of reference sequences. We found that all contigs of MegaGTA and Xander were matched by Framebot, while quite a few MEGAHIT contigs were not (1.3, 10 and 66.6% of rplB, nifH and nirK, respectively). These unmatched contigs were discarded.</p>
<p id="Par56">Table 
<xref rid="Tab4" ref-type="table">4</xref>
summarizes the assembly result. At a threshold of 99% amino acid clustering identity, MegaGTA assembled 6.5–16.6% more contigs than Xander in total length, and this number became 9.7–19.3% at 95% clustering identity. MegaGTA also matched 7.7–10% and 10.9–25% more genes by Framebot at these two clustering thresholds, while retained a similar level of median identity. This indicates that MegaGTA, by using iterative
<italic>de Bruijn</italic>
graph, achieves higher sensitivity for the real dataset, and the improvement is more significant at lower clustering identity threshold (indicating higher taxonomy level).
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Performance of MegaGTA, Xander and MEGAHIT on the rhizosphere soil metagenomic sample</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th colspan="2">MegaGTA</th>
<th colspan="2">Xander</th>
<th colspan="2">MEGAHIT</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">Gene rplB</td>
</tr>
<tr>
<td> Cluster Identity</td>
<td>99%</td>
<td>95%</td>
<td>99%</td>
<td>95%</td>
<td>99%</td>
<td>95%</td>
</tr>
<tr>
<td> # of gene contigs aligned by Framebot</td>
<td>17,668</td>
<td>5079</td>
<td>15,933</td>
<td>4237</td>
<td>578</td>
<td>465</td>
</tr>
<tr>
<td> Total length (bp)</td>
<td>13.9 M</td>
<td>3.96 M</td>
<td>12.5 M</td>
<td>3.32 M</td>
<td>378 k</td>
<td>311 k</td>
</tr>
<tr>
<td> Median length (bp)</td>
<td>822</td>
<td>822</td>
<td>822</td>
<td>822</td>
<td>639</td>
<td>660</td>
</tr>
<tr>
<td> # of matched reference genes</td>
<td>491</td>
<td>427</td>
<td>456</td>
<td>385</td>
<td>208</td>
<td>193</td>
</tr>
<tr>
<td> Median % aa identity</td>
<td>76.73</td>
<td>76.00</td>
<td>77.46</td>
<td>76.90</td>
<td>77.50</td>
<td>77.07</td>
</tr>
<tr>
<td colspan="7">Gene nifH</td>
</tr>
<tr>
<td> Cluster Identity</td>
<td>99%</td>
<td>95%</td>
<td>99%</td>
<td>95%</td>
<td>99%</td>
<td>95%</td>
</tr>
<tr>
<td> # of gene contigs aligned by Framebot</td>
<td>33</td>
<td>11</td>
<td>31</td>
<td>10</td>
<td>9</td>
<td>5</td>
</tr>
<tr>
<td> Total length (bp)</td>
<td>27.8 k</td>
<td>9225</td>
<td>25.3 k</td>
<td>8412</td>
<td>7368</td>
<td>4464</td>
</tr>
<tr>
<td> Median length (bp)</td>
<td>888</td>
<td>888</td>
<td>882</td>
<td>883.5</td>
<td>930</td>
<td>930</td>
</tr>
<tr>
<td> # of matched reference genes</td>
<td>13</td>
<td>10</td>
<td>12</td>
<td>8</td>
<td>8</td>
<td>5</td>
</tr>
<tr>
<td> Median % aa identity</td>
<td>91.55</td>
<td>90.54</td>
<td>92.96</td>
<td>91.19</td>
<td>85.14</td>
<td>83.99</td>
</tr>
<tr>
<td colspan="7">Gene nirK</td>
</tr>
<tr>
<td> Cluster Identity</td>
<td>99%</td>
<td>95%</td>
<td>99%</td>
<td>95%</td>
<td>99%</td>
<td>95%</td>
</tr>
<tr>
<td> # of gene contigs aligned by Framebot</td>
<td>1336</td>
<td>392</td>
<td>1242</td>
<td>336</td>
<td>203</td>
<td>179</td>
</tr>
<tr>
<td> Total length (bp)</td>
<td>1.09 M</td>
<td>321 k</td>
<td>1.02 M</td>
<td>277 k</td>
<td>170 k</td>
<td>153 k</td>
</tr>
<tr>
<td> Median length (bp)</td>
<td>687</td>
<td>787.5</td>
<td>690</td>
<td>748.5</td>
<td>735</td>
<td>750</td>
</tr>
<tr>
<td> # of matched reference genes</td>
<td>55</td>
<td>53</td>
<td>50</td>
<td>47</td>
<td>71</td>
<td>66</td>
</tr>
<tr>
<td> Median % aa identity</td>
<td>89.29</td>
<td>86.61</td>
<td>89.06</td>
<td>87.30</td>
<td>66.41</td>
<td>65.97</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par57">It is not surprising that gene-targeted assembler MegaGTA and Xander assembled much more gene sequences than de novo assembler MEGAHIT. Moreover, the median aa identity of nifH and nirK contigs assembled by MegaGTA and Xander were significantly higher than MEGAHIT’s. For nirK, although MEGAHIT’s contig matched more reference genes, MegaGTA and Xander found more genes with high identity (Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
). All nirK contigs of MegaGTA were matched with >55% identity by Framebot to 55 and 53 nirK genes at 99 and 95% clustering identity respectively. With the same cutoff, only 73.4 and 73.2% of MEGAHIT’s contigs were matched by Framebot against 52 and 50 nirK genes respectively.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Number of matched nirk reference genes (clustered at 99% identity) v.s. minimum aa identity reported by Framebot</p>
</caption>
<graphic xlink:href="12859_2017_1825_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p id="Par58">By using a similar amount of RAM, MegaGTA was twice as fast as Xander, even though it had to build multiple
<italic>de Bruijn</italic>
graphs. The speed-up ratio is consistent with the experiment on the mock community (see Table
<xref rid="Tab1" ref-type="table">1</xref>
and Table
<xref rid="Tab2" ref-type="table">2</xref>
). MegaGTA was highly parallelized and got the whole gene assembly process done in 4.4 days, which is reasonably fast enough for such a large and complex dataset. MEGAHIT is faster than Xander, but a bit slower than MegaGTA.</p>
</sec>
<sec id="Sec12">
<title>False positive effects of bloom filters</title>
<p id="Par59">To evaluate how false
<italic>k</italic>
-mers in Bloom filters affect the assembly, we queried the rplB contigs of the rhizosphere metagenomic dataset assembled by Xander, with Bloom filter sizes of 256GB, 128GB and 64GB, respectively.</p>
<p id="Par60">As shown in Table 
<xref rid="Tab5" ref-type="table">5</xref>
, the number of contigs containing false
<italic>k</italic>
-mers increased as the Bloom filter size decreased, but was still acceptably small with a Bloom filter size of 256GB or 128GB (only 0.02 and 0.39%). Recall that MegaGTA, based on succinct
<italic>de Bruijn</italic>
graphs, required 242GB to assemble this dataset; Bloom filters, when using a similar amount of memory, performed well. However, when the Bloom filter size decreased to 64GB, more than 10% of the contigs contained false
<italic>k</italic>
-mers, and even worst, most of the false
<italic>k</italic>
-mers located amid a contig. Note that an internal false
<italic>k</italic>
-mer may lead to a chimeric contig. Therefore, one should be careful with the size of the Bloom filter.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Number of contigs with false
<italic>k</italic>
-mers v.s. different Bloom filter sizes</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Bloom filter size (GB)</th>
<th>256</th>
<th>128</th>
<th>64</th>
</tr>
</thead>
<tbody>
<tr>
<td># contigs</td>
<td>15,929</td>
<td>15,933</td>
<td>16,107</td>
</tr>
<tr>
<td># contigs with false
<italic>k</italic>
-mers</td>
<td>3</td>
<td>62</td>
<td>1694</td>
</tr>
<tr>
<td># contigs with internal false
<italic>k</italic>
-mers</td>
<td>1</td>
<td>46</td>
<td>1523</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</sec>
<sec id="Sec13">
<title>Conclusion</title>
<p id="Par61">By utilizing information of known genes, gene-targeted assembly is a higher resolution manner to assemble and annotated genes of interests. Our work is an improvement of Xander, a gene-targeted assembler that combines
<italic>de Bruijn</italic>
graphs and profile-HMMs. We observed that a single
<italic>k</italic>
-mer size has a trade-off in HMM-guided metagenomic assembly: a small
<italic>k</italic>
tends to produce more erroneous contigs, and a large
<italic>k</italic>
leads to missing low-coverage genes. We applied iterative
<italic>de Bruijn</italic>
graph approach to tackle this challenge. This idea, along with low coverage penalty and succinct
<italic>de Bruijn</italic>
graph representation, have been implemented in a new gene-targeted assembler, MegaGTA.</p>
<p id="Par62">We used MegaGTA and Xander to assemble rplB gene from the HMP mock community sequencing data, and found MegaGTA demonstrated higher sensitivity and accuracy than Xander. MegaGTA scales up easily to assemble very large and complex metagenomic dataset in an acceptable amount of time. It had been used to assemble a much larger and more challenging metagenomic sequencing dataset of rhizosphere corn soil, and produced more rplB, nifH and nirK genes than both Xander and MEGAHIT. As a side note, the advantage of MegaGTA is more substantial at higher taxonomy level.</p>
<p id="Par63">It is of practical interests whether Bloom filters, the probabilistic data structure used by Xander, will introduce many false contigs. We confirmed that when being configured to use a substantial amount of memory, Bloom filters result in a low proportion of false contigs, and are suitable for HMM-guided assembly.</p>
<p id="Par64">MegaGTA still has a lot of room for improvement. A more versatile
<italic>de Bruijn</italic>
graph, for example, annotated with read threading and paired-end information, could possibly be used to design a better HMM-guided algorithm. How to construct and make use of a more versatile graph is an interesting future direction.</p>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>Bp</term>
<def>
<p id="Par5">Base pair</p>
</def>
</def-item>
<def-item>
<term>DBG</term>
<def>
<p id="Par6">de Bruijn graph</p>
</def>
</def-item>
<def-item>
<term>Gbp</term>
<def>
<p id="Par7">Giga base pairs</p>
</def>
</def-item>
<def-item>
<term>HMM</term>
<def>
<p id="Par8">Hidden Markov model</p>
</def>
</def-item>
<def-item>
<term>Kbp</term>
<def>
<p id="Par9">Thousand base pairs</p>
</def>
</def-item>
<def-item>
<term>
<italic>k</italic>
-mer</term>
<def>
<p id="Par10">Length-
<italic>k</italic>
substring</p>
</def>
</def-item>
<def-item>
<term>nifH</term>
<def>
<p id="Par11">Nitrogenase reductase gene</p>
</def>
</def-item>
<def-item>
<term>nirK</term>
<def>
<p id="Par12">Nitrite reductase gene</p>
</def>
</def-item>
<def-item>
<term>rplB</term>
<def>
<p id="Par13">Ribosomal protein L2 gene</p>
</def>
</def-item>
<def-item>
<term>SdBG</term>
<def>
<p id="Par14">Succinct de Bruijn graph</p>
</def>
</def-item>
</def-list>
</glossary>
<ack>
<title>Acknowledgements</title>
<p>A 2-page abstract of this article has been published in Lecture notes in computer science: Bioinformatics research and applications. We thank Tewei Luo for improving the readability of this article.</p>
<sec id="FPar1">
<title>Funding</title>
<p id="Par65">This research was partially supported by ITF Grant ITS/155/15FP. The publication costs for this work were funded by the same grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
</sec>
<sec id="FPar2">
<title>Availability of data and materials</title>
<p id="Par66">The source code of MegaGTA is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/HKU-BAL/megagta">https://github.com/HKU-BAL/megagta</ext-link>
under the GPLv3 License.</p>
</sec>
<sec id="FPar7">
<title>About this supplement</title>
<p id="Par71">This article has been published as part of BMC Bioinformatics Volume 18 Supplement 12, 2017: Selected articles from the 12th International Symposium on Bioinformatics Research and Applications (ISBRA-16): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-12.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>TL and DL initiated this project. DL, YH, CL, TL and HT contributed to the algorithm design. DL and YH implemented MegaGTA and carried out the experiments. DL, TL, CL and RL wrote the manuscript. All authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="FPar3">
<title>Ethics approval and consent to participate</title>
<p id="Par67">Not applicable.</p>
</sec>
<sec id="FPar4">
<title>Consent for publication</title>
<p id="Par68">Not applicable.</p>
</sec>
<sec id="FPar5">
<title>Competing interests</title>
<p id="Par69">None declared.</p>
</sec>
<sec id="FPar6">
<title>Publisher’s Note</title>
<p id="Par70">Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<mixed-citation publication-type="other">Brown, C.T., et al.,
<italic>A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data.</italic>
arXiv:1203.4802, 2012.</mixed-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pell</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Scaling metagenome sequence assembly with probabilistic de Bruijn graphs</article-title>
<source>Proc Natl Acad Sci</source>
<year>2012</year>
<volume>109</volume>
<issue>33</issue>
<fpage>13272</fpage>
<lpage>13277</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1121464109</pub-id>
<pub-id pub-id-type="pmid">22847406</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<mixed-citation publication-type="other">Li, D., et al.,
<italic>MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.</italic>
Bioinformatics, 2015.</mixed-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nagarajan</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Sequence assembly demystified</article-title>
<source>Nat Rev Genet</source>
<year>2013</year>
<volume>14</volume>
<issue>3</issue>
<fpage>157</fpage>
<lpage>167</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3367</pub-id>
<pub-id pub-id-type="pmid">23358380</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miller</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data</article-title>
<source>Genome Biol</source>
<year>2011</year>
<volume>12</volume>
<issue>5</issue>
<fpage>R44</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2011-12-5-r44</pub-id>
<pub-id pub-id-type="pmid">21595876</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Reconstructing 16S rRNA genes in metagenomic data</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>12</issue>
<fpage>i35</fpage>
<lpage>i43</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv231</pub-id>
<pub-id pub-id-type="pmid">26072503</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
</person-group>
<article-title>A scalable and accurate targeted gene assembly tool (SAT-assembler) for next-generation sequencing data</article-title>
<source>PLoS Comput Biol</source>
<year>2014</year>
<volume>10</volume>
<issue>8</issue>
<fpage>e1003737</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1003737</pub-id>
<pub-id pub-id-type="pmid">25122209</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Xander: employing a novel method for efficient gene-targeted metagenomic assembly</article-title>
<source>Microbiome</source>
<year>2015</year>
<volume>3</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1186/s40168-015-0093-6</pub-id>
<pub-id pub-id-type="pmid">25621171</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eddy</surname>
<given-names>SR</given-names>
</name>
</person-group>
<article-title>What is a hidden Markov model?</article-title>
<source>Nat Biotechnol</source>
<year>2004</year>
<volume>22</volume>
<issue>10</issue>
<fpage>1315</fpage>
<lpage>1316</lpage>
<pub-id pub-id-type="doi">10.1038/nbt1004-1315</pub-id>
<pub-id pub-id-type="pmid">15470472</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<mixed-citation publication-type="other">Peng Y, et al.
<italic>IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler</italic>
, in
<italic>Research in Computational Molecular Biology</italic>
, B. In: Berger: Springer Berlin Heidelberg; 2010. p. 426–40.</mixed-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bankevich</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
<source>J Comput Biol</source>
<year>2012</year>
<volume>19</volume>
<issue>5</issue>
<fpage>455</fpage>
<lpage>477</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0021</pub-id>
<pub-id pub-id-type="pmid">22506599</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler</article-title>
<source>GigaScience</source>
<year>2012</year>
<volume>1</volume>
<issue>1</issue>
<fpage>18</fpage>
<pub-id pub-id-type="doi">10.1186/2047-217X-1-18</pub-id>
<pub-id pub-id-type="pmid">23587118</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>11</issue>
<fpage>1420</fpage>
<lpage>1428</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts174</pub-id>
<pub-id pub-id-type="pmid">22495754</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bloom</surname>
<given-names>BH</given-names>
</name>
</person-group>
<article-title>Space/time trade-offs in hash coding with allowable errors</article-title>
<source>Commun ACM</source>
<year>1970</year>
<volume>13</volume>
<issue>7</issue>
<fpage>422</fpage>
<lpage>426</lpage>
<pub-id pub-id-type="doi">10.1145/362686.362692</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<mixed-citation publication-type="other">Bowe, A., et al.,
<italic>Succinct de Bruijn Graphs</italic>
, in
<italic>Algorithms in Bioinformatics</italic>
, B. Raphael and J. Tang, Editors. 2012, Springer Berlin Heidelberg. p. 225-235.</mixed-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hart</surname>
<given-names>PE</given-names>
</name>
<name>
<surname>Nilsson</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Raphael</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>A formal basis for the heuristic determination of minimum cost paths</article-title>
<source>IEEE Trans Syst Man Cybern</source>
<year>1968</year>
<volume>4</volume>
<issue>2</issue>
<fpage>100</fpage>
<lpage>107</lpage>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>5</issue>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rizk</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Space-efficient and exact de Bruijn graph representation based on a bloom filter</article-title>
<source>Algorithms Mol Biol</source>
<year>2013</year>
<volume>8</volume>
<issue>1</issue>
<fpage>22</fpage>
<pub-id pub-id-type="doi">10.1186/1748-7188-8-22</pub-id>
<pub-id pub-id-type="pmid">24040893</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bolger</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Lohse</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Usadel</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Trimmomatic: a flexible trimmer for Illumina sequence data</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>15</issue>
<fpage>2114</fpage>
<lpage>2120</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu170</pub-id>
<pub-id pub-id-type="pmid">24695404</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>UCHIME improves sensitivity and speed of chimera detection</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>16</issue>
<fpage>2194</fpage>
<lpage>2200</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr381</pub-id>
<pub-id pub-id-type="pmid">21700674</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<mixed-citation publication-type="other">Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2016; 32(7):1088–90.</mixed-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rho</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>FragGeneScan: predicting genes in short and error-prone reads</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
<volume>38</volume>
<issue>20</issue>
<fpage>e191</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkq747</pub-id>
<pub-id pub-id-type="pmid">20805240</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<mixed-citation publication-type="other">Mistry J, et al. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41(12):e121.</mixed-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ecological patterns of nifH genes in four terrestrial climatic zones explored with targeted Metagenomics using FrameBot, a new informatics tool</article-title>
<source>MBio</source>
<year>2013</year>
<volume>4</volume>
<issue>5</issue>
<fpage>e00592</fpage>
<lpage>e00513</lpage>
<pub-id pub-id-type="doi">10.1128/mBio.00592-13</pub-id>
<pub-id pub-id-type="pmid">24045641</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000268  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000268  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021