MERS exploration server





The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Norgal: extraction and de novo assembly of mitochondrial DNA from whole-genome sequencing data</title>
<author>
<name sortKey="Al Nakeeb, Kosai" sort="Al Nakeeb, Kosai" uniqKey="Al Nakeeb K" first="Kosai" last="Al-Nakeeb">Kosai Al-Nakeeb</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Petersen, Thomas Nordahl" sort="Petersen, Thomas Nordahl" uniqKey="Petersen T" first="Thomas Nordahl" last="Petersen">Thomas Nordahl Petersen</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sicheritz Ponten, Thomas" sort="Sicheritz Ponten, Thomas" uniqKey="Sicheritz Ponten T" first="Thomas" last="Sicheritz-Pontén">Thomas Sicheritz-Pontén</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29162031</idno>
<idno type="pmc">5699183</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5699183</idno>
<idno type="RBID">PMC:5699183</idno>
<idno type="doi">10.1186/s12859-017-1927-y</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000270</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000270</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Norgal: extraction and de novo assembly of mitochondrial DNA from whole-genome sequencing data</title>
<author>
<name sortKey="Al Nakeeb, Kosai" sort="Al Nakeeb, Kosai" uniqKey="Al Nakeeb K" first="Kosai" last="Al-Nakeeb">Kosai Al-Nakeeb</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Petersen, Thomas Nordahl" sort="Petersen, Thomas Nordahl" uniqKey="Petersen T" first="Thomas Nordahl" last="Petersen">Thomas Nordahl Petersen</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sicheritz Ponten, Thomas" sort="Sicheritz Ponten, Thomas" uniqKey="Sicheritz Ponten T" first="Thomas" last="Sicheritz-Pontén">Thomas Sicheritz-Pontén</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Whole-genome sequencing (WGS) projects provide short read nucleotide sequences from nuclear and possibly organelle DNA depending on the source of origin. Mitochondrial DNA is present in animals and fungi, while plants contain DNA from both mitochondria and chloroplasts. Current techniques for separating organelle reads from nuclear reads in WGS data require full reference or partial seed sequences for assembling.</p>
</sec>
<sec>
<title>Results</title>
<p>Norgal (de Novo ORGAneLle extractor) avoids this requirement by identifying a high frequency subset of k-mers that are predominantly of mitochondrial origin and performing a de novo assembly on a subset of reads that contains these k-mers. The method was applied to WGS data from a panda, brown algae seaweed, butterfly and filamentous fungus. We were able to extract full circular mitochondrial genomes and obtained sequence identities to the reference sequences in the range from 98.5 to 99.5%. We also assembled the chloroplasts of grape vines and cucumbers using Norgal together with seed-based de novo assemblers.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>Norgal is a pipeline that can extract and assemble full or partial mitochondrial and chloroplast genomes from WGS short reads without prior knowledge. The program is available at:
<ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/kosaidtu/norgal">https://bitbucket.org/kosaidtu/norgal</ext-link>
.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-017-1927-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Bruggen, Efjv" uniqKey="Bruggen E">EFJV Bruggen</name>
</author>
<author>
<name sortKey="Borst, P" uniqKey="Borst P">P Borst</name>
</author>
<author>
<name sortKey="Ruttenberg, Gjcm" uniqKey="Ruttenberg G">GJCM Ruttenberg</name>
</author>
<author>
<name sortKey="Gruber, M" uniqKey="Gruber M">M Gruber</name>
</author>
<author>
<name sortKey="Kroon, Am" uniqKey="Kroon A">AM Kroon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hahn, C" uniqKey="Hahn C">C Hahn</name>
</author>
<author>
<name sortKey="Bachmann, L" uniqKey="Bachmann L">L Bachmann</name>
</author>
<author>
<name sortKey="Chevreux, B" uniqKey="Chevreux B">B Chevreux</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dierckxsens, N" uniqKey="Dierckxsens N">N Dierckxsens</name>
</author>
<author>
<name sortKey="Mardulyn, P" uniqKey="Mardulyn P">P Mardulyn</name>
</author>
<author>
<name sortKey="Smits, G" uniqKey="Smits G">G Smits</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robin, Ed" uniqKey="Robin E">ED Robin</name>
</author>
<author>
<name sortKey="Wong, R" uniqKey="Wong R">R Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haddad, Nj" uniqKey="Haddad N">NJ Haddad</name>
</author>
<author>
<name sortKey="Al Nakeeb, K" uniqKey="Al Nakeeb K">K Al-Nakeeb</name>
</author>
<author>
<name sortKey="Petersen, B" uniqKey="Petersen B">B Petersen</name>
</author>
<author>
<name sortKey="Dalen, L" uniqKey="Dalen L">L Dalén</name>
</author>
<author>
<name sortKey="Blom, N" uniqKey="Blom N">N Blom</name>
</author>
<author>
<name sortKey="Sicheritz Ponten, T" uniqKey="Sicheritz Ponten T">T Sicheritz-Pontén</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schubert, M" uniqKey="Schubert M">M Schubert</name>
</author>
<author>
<name sortKey="Lindgreen, S" uniqKey="Lindgreen S">S Lindgreen</name>
</author>
<author>
<name sortKey="Orlando, L" uniqKey="Orlando L">L Orlando</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, D" uniqKey="Li D">D Li</name>
</author>
<author>
<name sortKey="Liu, Cm" uniqKey="Liu C">CM Liu</name>
</author>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
<author>
<name sortKey="Lam, Tw" uniqKey="Lam T">TW Lam</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Camacho, C" uniqKey="Camacho C">C Camacho</name>
</author>
<author>
<name sortKey="Coulouris, G" uniqKey="Coulouris G">G Coulouris</name>
</author>
<author>
<name sortKey="Avagyan, V" uniqKey="Avagyan V">V Avagyan</name>
</author>
<author>
<name sortKey="Ma, N" uniqKey="Ma N">N Ma</name>
</author>
<author>
<name sortKey="Papadopoulos, J" uniqKey="Papadopoulos J">J Papadopoulos</name>
</author>
<author>
<name sortKey="Bealer, K" uniqKey="Bealer K">K Bealer</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Aquadro, Cf" uniqKey="Aquadro C">CF Aquadro</name>
</author>
<author>
<name sortKey="Greenberg, Bd" uniqKey="Greenberg B">BD Greenberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ward, Bl" uniqKey="Ward B">BL Ward</name>
</author>
<author>
<name sortKey="Anderson, Rs" uniqKey="Anderson R">RS Anderson</name>
</author>
<author>
<name sortKey="Bendich, Aj" uniqKey="Bendich A">AJ Bendich</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lopez, Jv" uniqKey="Lopez J">JV Lopez</name>
</author>
<author>
<name sortKey="Yuhki, N" uniqKey="Yuhki N">N Yuhki</name>
</author>
<author>
<name sortKey="Masuda, R" uniqKey="Masuda R">R Masuda</name>
</author>
<author>
<name sortKey="Modi, W" uniqKey="Modi W">W Modi</name>
</author>
<author>
<name sortKey="O Rien, Sj" uniqKey="O Rien S">SJ O’Brien</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29162031</article-id>
<article-id pub-id-type="pmc">5699183</article-id>
<article-id pub-id-type="publisher-id">1927</article-id>
<article-id pub-id-type="doi">10.1186/s12859-017-1927-y</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Software</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Norgal: extraction and de novo assembly of mitochondrial DNA from whole-genome sequencing data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0003-3432-3628</contrib-id>
<name>
<surname>Al-Nakeeb</surname>
<given-names>Kosai</given-names>
</name>
<address>
<email>kosai@bioinformatics.dtu.dk</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Petersen</surname>
<given-names>Thomas Nordahl</given-names>
</name>
<address>
<email>tnp@bioinformatics.dtu.dk</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sicheritz-Pontén</surname>
<given-names>Thomas</given-names>
</name>
<address>
<email>thomas@bioinformatics.dtu.dk</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2181 8870</institution-id>
<institution-id institution-id-type="GRID">grid.5170.3</institution-id>
<institution>Department of Bio and Health Informatics,</institution>
<institution>Technical University of Denmark,</institution>
</institution-wrap>
Kemitorvet, Building 208, Kgs Lyngby, 2800 Denmark</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>21</day>
<month>11</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>21</day>
<month>11</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>18</volume>
<elocation-id>510</elocation-id>
<history>
<date date-type="received">
<day>22</day>
<month>5</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>6</day>
<month>11</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2017</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Whole-genome sequencing (WGS) projects provide short read nucleotide sequences from nuclear and possibly organelle DNA depending on the source of origin. Mitochondrial DNA is present in animals and fungi, while plants contain DNA from both mitochondria and chloroplasts. Current techniques for separating organelle reads from nuclear reads in WGS data require full reference or partial seed sequences for assembling.</p>
</sec>
<sec>
<title>Results</title>
<p>Norgal (de Novo ORGAneLle extractor) avoids this requirement by identifying a high frequency subset of k-mers that are predominantly of mitochondrial origin and performing a de novo assembly on a subset of reads that contains these k-mers. The method was applied to WGS data from a panda, brown algae seaweed, butterfly and filamentous fungus. We were able to extract full circular mitochondrial genomes and obtained sequence identities to the reference sequences in the range from 98.5 to 99.5%. We also assembled the chloroplasts of grape vines and cucumbers using Norgal together with seed-based de novo assemblers.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>Norgal is a pipeline that can extract and assemble full or partial mitochondrial and chloroplast genomes from WGS short reads without prior knowledge. The program is available at:
<ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/kosaidtu/norgal">https://bitbucket.org/kosaidtu/norgal</ext-link>
.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-017-1927-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Mitochondrial dna</kwd>
<kwd>K-mer</kwd>
<kwd>Next-generation sequencing</kwd>
<kwd>De novo assembly</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>Certain organelles such as mitochondria have their own distinct genomes. The mitochondrial genome - the mitogenome - differs significantly from eukaryotic nuclear genomes e.g. by typically being circular and smaller in size [
<xref ref-type="bibr" rid="CR1">1</xref>
]. The mitogenome can be sequenced experimentally by isolating the mitochondria, amplifying the mitochondrial DNA (mtDNA) with PCR using primers from mtDNA of closely related organisms, and sequencing the PCR products. High-throughput whole-genome sequencing (WGS) data, by contrast, typically contain mitochondrial DNA in addition to nuclear DNA, and obtaining them does not require isolating the mitochondria beforehand. This makes WGS data a valuable resource for extracting and assembling mitogenomes, one that can potentially replace targeted sequencing.</p>
<p>Current methods to extract mtDNA from WGS data require a short seed sequence to initiate assembly [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
]. However, for unknown organisms whose mitogenomes differ significantly from the currently known mitogenomes, this can be inconvenient and challenging. To avoid this problem, we developed a reference-independent method based on k-mer frequencies that takes advantage of mitochondrial genomes being 10-100 times more abundant in a cell than the nuclear genome [
<xref ref-type="bibr" rid="CR4">4</xref>
].</p>
<p>This means that in sequencing experiments the mitogenome will have a higher read depth compared to the nuclear genome and this difference in the read depth levels can be used to separate the reads into two groups; those of nuclear and those of mitochondrial origin.</p>
<p>The separation of the two types of reads is done by counting occurrences of subsequences of length k in the reads - k-mers - and classifying reads that have k-mers that are found more times than the nuclear read depth as being of non-nuclear origin. These non-nuclear reads with k-mers above the nuclear read depth threshold may come from the mitochondria and plastids or from certain regions in the nuclear genome such as repeats, NUMT’s etc. The predominantly mitochondrial reads can then be de novo assembled into non-nuclear sequences where it is reasonable to assume that the longest contig in this assembly would be from mitochondria or plastids as the longer nuclear genome would not be assembled. Norgal is our implementation of this assembly method and provides annotation and evaluation of the final sequence. In the case where an assembly is partial or fragmented, the user can use this sequence as a reference for one of the current reference-based extraction tools. Recently, the mitochondrial genome of the Oriental hornet (Vespa orientalis) was published using a Norgal assembly [
<xref ref-type="bibr" rid="CR5">5</xref>
].</p>
</sec>
<sec id="Sec2">
<title>Implementation</title>
<p>Norgal uses raw short NGS reads from WGS data as input and outputs either a full or partial mitogenome. Norgal is written in Python 3 but is backwards compatible with Python 2.7, and it requires Java and the Python library matplotlib for plotting. It relies on a range of bundled software for the different steps in the pipeline. Figure 
<xref rid="Fig1" ref-type="fig">1</xref>
shows the workflow of Norgal which has the following steps:
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Workflow of Norgal. This diagram shows how Norgal separates mitochondrial reads from nuclear reads and assembles the mitochondrial reads into a partial or complete mitogenome</p>
</caption>
<graphic xlink:href="12859_2017_1927_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>
<list list-type="order">
<list-item>
<p>Trim and remove adapters from NGS reads using
<italic>AdapterRemoval</italic>
[
<xref ref-type="bibr" rid="CR6">6</xref>
] and perform a de novo assembly using
<italic>MEGAHIT</italic>
[
<xref ref-type="bibr" rid="CR7">7</xref>
].</p>
</list-item>
<list-item>
<p>Map the reads back to the longest assembled sequence using
<italic>bwa mem</italic>
[
<xref ref-type="bibr" rid="CR8">8</xref>
] and calculate the read depths for each position in order to determine the nuclear depth threshold (ND threshold).</p>
</list-item>
<list-item>
<p>Count k-mers of size 31 in all reads and keep only the subset of reads that contain at least one 31-mer whose frequency is greater than the ND threshold. This is done using the program
<italic>BBTools</italic>
[
<xref ref-type="bibr" rid="CR9">9</xref>
].</p>
</list-item>
<list-item>
<p>Perform a de novo assembly using
<italic>idba_ud</italic>
[
<xref ref-type="bibr" rid="CR10">10</xref>
] with the reads containing the frequent k-mers and extract either the longest contig or, optionally, the longest contig with a predicted cytochrome c oxidase subunit 1 (COI) gene.</p>
</list-item>
<list-item>
<p>Examine circularity of the longest contig, determine read depth, identify potential mitochondrial and chloroplast contigs, and output plots comparing depths between this contig and the longest contig from the assembly in step (1).</p>
</list-item>
</list>
</p>
<p>These steps are explained in more detail in the following sections.</p>
<sec id="Sec3">
<title>Pre-processing reads</title>
<p>Raw reads may contain non-biological DNA sequences from the sequencing process, such as adapter and primer sequences. If these have not been removed beforehand, Norgal removes adapters and trims NGS reads using
<italic>AdapterRemoval</italic>
with --minlength 30 and default settings.</p>
</sec>
<sec id="Sec4">
<title>Estimating nuclear read depth threshold</title>
<p>If no reference sequence from the nuclear genome is provided, an initial de novo assembly is performed using the program
<italic>MEGAHIT</italic>
with default settings and the k-mer range: 21, 49, 77 and 105. Norgal assumes that the longest assembled sequence (contig) is nuclear. The reads are then mapped back to the longest assembled contig using
<italic>bwa mem</italic>
with default settings. If the longest assembled contig is longer than 100,000 base pairs, only the first 100,000 base pairs are used as it should be enough to determine the depth. The read depths of the mapped reads to this contig are used to determine the nuclear depth threshold (ND threshold) which is defined as the mean of all non-zero read depths from the 25
<sup>
<italic>th</italic>
</sup>
to the 75
<sup>
<italic>th</italic>
</sup>
percentile range, multiplied by five:
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \text{ND threshold} = 5\cdot \frac{\sum_{i=25^{th} percentile}^{75^{th} percentile}d_{i}}{n} $$ \end{document}</tex-math>
<mml:math id="M2">
<mml:mtext>ND threshold</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>·</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">th</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mtext mathvariant="italic">percentile</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">th</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mtext mathvariant="italic">percentile</mml:mtext>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2017_1927_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Here,
<italic>d</italic>
<sub>
<italic>i</italic>
</sub>
is the read depth at index
<italic>i</italic>
in a sorted array of non-zero read depths from the longest assembled contig and
<italic>n</italic>
is the number of non-zero read depths in the percentile range. If all read depths are non-zero,
<italic>n</italic>
is half of the length of the contig.</p>
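<p>Equation (1) can be illustrated with a minimal Python sketch. It is not taken from the Norgal source: the function name, the integer-division convention for locating the 25th and 75th percentile indices, and the fallback for very short inputs are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
def nd_threshold(depths, factor=5):
    """Interquartile mean of the non-zero read depths, times `factor` (Eq. 1).

    `depths` holds one read depth per contig position. The percentile
    indices use integer division, which is an assumed convention.
    """
    nonzero = sorted(d for d in depths if d > 0)
    if not nonzero:
        raise ValueError("no non-zero read depths")
    lo = len(nonzero) // 4        # index at the 25th percentile
    hi = (3 * len(nonzero)) // 4  # exclusive end at the 75th percentile
    middle = nonzero[lo:hi] or nonzero  # fall back for very short inputs
    return factor * sum(middle) / len(middle)


if __name__ == "__main__":
    # Two uncovered positions and one repeat-like spike do not move the result.
    print(nd_threshold([0, 0, 10, 10, 10, 10, 10, 10, 10, 1000]))
```
]]></preformat>
<p>Restricting the mean to the interquartile range keeps uncovered positions and collapsed repeats from shifting the estimate of the nuclear depth.</p>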
<p>Mitochondrial copy numbers have previously been determined to be 10 to 100 times higher than the nuclear copy number, so the mitochondrial read depth is expected to exceed the nuclear read depth by a similar factor [
<xref ref-type="bibr" rid="CR4">4</xref>
]. Norgal uses the multiplication factor 5 in Eq. (
<xref rid="Equ1" ref-type="">1</xref>
) as it lies between the nuclear depth (a factor of 1) and the lowest mitochondrial copy number reported in the literature (a factor of 10). The threshold can also be set manually by the user and should be slightly higher than the nuclear read depth.</p>
</sec>
<sec id="Sec5">
<title>Binning reads based on k-mer occurrences</title>
<p>There is a direct correlation between genome depth and k-mer counts (also called k-mer depths) [
<xref ref-type="bibr" rid="CR11">11</xref>
]:
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ N = M\cdot \frac{L}{L-k+1},\text{where k \(&lt;\) L+1} $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mi>N</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>M</mml:mi>
<mml:mo>·</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mtext>where k</mml:mtext>
<mml:mspace width="1em"></mml:mspace>
<mml:mo>&lt;</mml:mo>
<mml:mtext>L+1</mml:mtext>
</mml:math>
<graphic xlink:href="12859_2017_1927_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>N</italic>
is the genome depth,
<italic>M</italic>
is the k-mer depth,
<italic>L</italic>
is the read length and
<italic>k</italic>
is the k-mer size.</p>
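<p>As a worked illustration of Eq. (2), the following Python sketch converts between k-mer depth and genome depth. The function names and the default k of 31 (the k-mer size Norgal uses for counting) are chosen for illustration only.</p>
<preformat><![CDATA[
```python
def genome_depth(kmer_depth, read_length, k=31):
    """Eq. (2): N = M * L / (L - k + 1), valid for k < L + 1."""
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return kmer_depth * read_length / (read_length - k + 1)


def kmer_depth(genome_depth_value, read_length, k=31):
    """Inverse of Eq. (2): M = N * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return genome_depth_value * (read_length - k + 1) / read_length


if __name__ == "__main__":
    # With 100 bp reads and k = 31, each read contributes 70 k-mers.
    print(genome_depth(70, 100))  # k-mer depth 70 -> genome depth
    print(kmer_depth(100, 100))   # genome depth 100 -> k-mer depth
```
]]></preformat>
<p>Because a read of length L contains only L - k + 1 k-mers, the k-mer depth is always at most the genome depth, and the gap widens as k approaches L.</p>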
<p>While it may not be feasible to determine the depth over each read directly, it is much less computationally intensive to determine which k-mers are present in each read and how often these k-mers occur in the total read pool, and then to translate this into read depth. This can be done because the number of times a k-mer is found in the total read pool corresponds to the k-mer depth,
<italic>M</italic>
, in the above Eq. (
<xref rid="Equ2" ref-type="">2</xref>
). Since the k-mer size,
<italic>k</italic>
, is known beforehand and the read length,
<italic>L</italic>
, can be determined easily, it is straightforward to calculate the genomic depth,
<italic>N</italic>
, of the region from which the read originated if
<italic>M</italic>
is known. However, depending on the k-mer size, it is reasonable to assume that k-mers are not unique to the genomic region they are found in, and thus the calculated genomic depth may be overestimated. Binning reads based on the estimated read depths using this equation may therefore result in
<italic>false positive</italic>
mitochondrial reads, i.e. reads from the nuclear genome binned as mitochondrial reads. This may lead to a number of small nuclear contigs in the mitochondrial assembly.</p>
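<p>The binning idea can be sketched in a few lines of Python. This is a simplified illustration, not Norgal's implementation: it counts k-mers exactly in an in-memory dictionary, ignores reverse complements, and compares raw k-mer counts directly against the threshold. The function names are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k in the read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def bin_reads(reads, k, threshold):
    """Keep reads holding at least one k-mer seen more than `threshold` times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [
        read for read in reads
        if any(counts[km] > threshold for km in kmers(read, k))
    ]


if __name__ == "__main__":
    reads = ["AAAAC", "AAAAG", "AAAAT", "CGTCG"]
    # "AAAA" occurs three times in the pool, so the first three reads are kept.
    print(bin_reads(reads, k=4, threshold=2))
```
]]></preformat>
<p>A read is kept as soon as a single k-mer exceeds the threshold, which is why a nuclear read that shares one repeated k-mer can end up in the mitochondrial bin.</p>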
<p>When the k-mer counts in the read pool have been calculated, the reads that come from genomic regions with depths above the ND threshold can be identified and extracted using the above Eq. (
<xref rid="Equ2" ref-type="">2</xref>
). The counting and binning can be done by the program
<italic>BBTools</italic>
. As the number of k-mers in a read pool can be very large and may not fit into computer memory,
<italic>BBTools</italic>
instead stores the k-mers in a probabilistic data structure called a Count-Min Sketch (CMS) invented in 2004 [
<xref ref-type="bibr" rid="CR12">12</xref>
] which is based on a set of bit-arrays and hash-functions.
<italic>BBTools</italic>
’s implementation of CMS can keep track of k-mers and their counts, but may overestimate some k-mer depths because of possible hash collisions, which as mentioned before may lead to small nuclear contigs in the assembly.</p>
<p>In Norgal’s usage scenario it is acceptable to retain some reads with non-frequent k-mers (nuclear reads, i.e. false positives), as these will only result in small contigs. On the other hand, it is not acceptable to discard reads with frequent k-mers (mitochondrial reads, i.e. false negatives), as this may lead to a partial mitochondrial assembly. This makes a CMS well suited to the problem, because it can only err by overestimating k-mer counts. Consequently, no read with a read depth above the threshold is discarded.</p>
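<p>The one-sided error described above can be seen in a minimal Count-Min Sketch written in Python. This is an illustrative sketch and not the BBTools implementation: the class name, the table dimensions and the use of salted BLAKE2b hashes are assumptions.</p>
<preformat><![CDATA[
```python
import hashlib


class CountMinSketch:
    """Approximate counter that can overestimate but never underestimate."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth  # number of hash rows; must stay below 256 here
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row selects one counter in that row.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.rows[row][col] += 1

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum over the
        # rows is an upper bound on the true count.
        return min(self.rows[row][col] for row, col in self._cells(item))


if __name__ == "__main__":
    true_counts = {"ACGT": 5, "CGTA": 2, "GTAC": 1, "TACG": 7}
    sketch = CountMinSketch(width=2, depth=3)  # tiny table to force collisions
    for kmer, n in true_counts.items():
        for _ in range(n):
            sketch.add(kmer)
    for kmer, n in true_counts.items():
        print(kmer, n, sketch.estimate(kmer))
```
]]></preformat>
<p>With a deliberately tiny table the estimates are inflated by collisions, yet every estimate remains at least as large as the true count, so a frequent k-mer is never reported as infrequent.</p>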
</sec>
<sec id="Sec6">
<title>Assembly with high-frequency k-mers</title>
<p>The binned reads with high-frequency k-mers are used for an assembly with
<italic>idba_ud</italic>
with default settings, which performs multiple assemblies with the k-mer sizes 20, 40, 60, 80 and 100. This second assembly only contains contigs with a read depth of at least the ND threshold.</p>
</sec>
<sec id="Sec7">
<title>Annotation and validation</title>
<p>The contigs are sorted by length and, by default, the longest contig is extracted. Another option is to select the longest contig that has the best hits to full RefSeq mitochondrial or plastid genomes. The extracted contig is tested for circularity by comparing the ends of the contig and finding overlaps. Any overlapping base pairs are cut and the final sequence is reported as a potential mtDNA candidate. The reads are mapped back to this potential mtDNA sequence, and Norgal outputs a graph with these read depths alongside the read depths of a section of the nuclear DNA (the longest contig from the first assembly) spanning the same length as the mtDNA candidate. This graph with the two sets of read depths may be used to validate the mtDNA candidate: if the depths over the mtDNA candidate are around 10-100 times higher than the depths over the nuclear region, this strengthens the evidence that the candidate is from the mitogenome.</p>
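<p>The circularity test can be illustrated with a short Python sketch. It assumes an exact match between the two contig ends and a hypothetical minimum overlap length. How Norgal itself scores the overlap is not specified here, and real overlaps may contain mismatches.</p>
<preformat><![CDATA[
```python
def trim_circular_overlap(contig, min_overlap=10):
    """Return (sequence, is_circular).

    If the start of the contig is repeated at its end, the contig is
    treated as circular and the duplicated end is cut off.
    """
    # Try the longest candidate overlap first.
    for size in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:size] == contig[-size:]:
            return contig[:-size], True
    return contig, False


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTAACCGGTT"
    # A circular genome assembled linearly repeats its start at the end.
    contig = genome + genome[:6]
    print(trim_circular_overlap(contig, min_overlap=5))
    print(trim_circular_overlap(genome, min_overlap=5))
```
]]></preformat>
<p>Requiring a minimum overlap avoids calling a contig circular because of a chance match of a few bases at its ends.</p>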
<p>Norgal searches the full assembly for both complete mitochondrial and plastid genomes using BLAST [
<xref ref-type="bibr" rid="CR13">13</xref>
,
<xref ref-type="bibr" rid="CR14">14</xref>
] with default values and reports the best 10 hits sorted by bit-score.</p>
</sec>
</sec>
<sec id="Sec8">
<title>Results and discussion</title>
<p>Twenty WGS datasets were downloaded from the Short Read Archive (SRA) (
<ext-link ext-link-type="uri" xlink:href="http://ncbi.nlm.nih.gov/sra">ncbi.nlm.nih.gov/sra</ext-link>
). The results of Norgal on these datasets can be seen in the Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S4. Norgal extracted and assembled the full circular mitogenome in 10 of the 20 cases and only partially assembled the mitogenomes (and chloroplasts) in the rest, with coverage ranging from 1 to 49%.</p>
<p>Table 
<xref rid="Tab1" ref-type="table">1</xref>
shows the reports that Norgal outputs for a subset of the datasets. It shows that the longest contig is usually the mitochondrial or plastid genome.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Norgal BLAST output for a subset of the datasets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Organism</th>
<th align="left">Type</th>
<th align="left">Scaffold:Scaffold-length</th>
<th align="left">Identity (%)</th>
<th align="left">Align. length (bp)</th>
<th align="left">Ref. length (bp)</th>
<th align="left">E-value</th>
<th align="left">Bit-score</th>
<th align="left">Best-hit reference</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">A. melanoleuca</td>
<td align="left">m</td>
<td align="left">scaffold_0:16876</td>
<td align="left">99.54</td>
<td align="left">16181</td>
<td align="left">16805</td>
<td align="left">0</td>
<td align="left">29438</td>
<td align="justify">Ailuropoda melanoleuca mitochondrion</td>
</tr>
<tr>
<td align="left">S. japonica</td>
<td align="left">m</td>
<td align="left">scaffold_0:37756</td>
<td align="left">100</td>
<td align="left">35932</td>
<td align="left">37654</td>
<td align="left">0</td>
<td align="left">66354</td>
<td align="justify">Saccharina sp. ye-C12 mitochondrion</td>
</tr>
<tr>
<td align="left">P. glaucus</td>
<td align="left">m</td>
<td align="left">scaffold_0:15378</td>
<td align="left">100</td>
<td align="left">7814</td>
<td align="left">15306</td>
<td align="left">0</td>
<td align="left">14430</td>
<td align="justify">Papilio glaucus mitochondrion</td>
</tr>
<tr>
<td align="left">A. niger</td>
<td align="left">m</td>
<td align="left">scaffold_0:31289</td>
<td align="left">99.12</td>
<td align="left">9284</td>
<td align="left">31103</td>
<td align="left">0</td>
<td align="left">16661</td>
<td align="justify">Aspergillus niger mitochondrion</td>
</tr>
<tr>
<td align="left">P. papatasi</td>
<td align="left">m</td>
<td align="left">scaffold_0:15338</td>
<td align="left">99.54</td>
<td align="left">14927</td>
<td align="left">15557</td>
<td align="left">0</td>
<td align="left">27180</td>
<td align="justify">Phlebotomus papatasi mitochondrion</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Note that the best hit for each organism is always scaffold_0, which is also the longest scaffold in the assembly. A full table of the 10 best hits for each organism can be found in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S1</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>The assembled mitogenomes were generally highly similar to the reference sequences, though rearrangements of shorter sequences, especially in the hypervariable regions of the control regions [
<xref ref-type="bibr" rid="CR15">15</xref>
], were occasionally observed.</p>
<sec id="Sec9">
<title>Comparison with current methods</title>
<p>Norgal was benchmarked against two other tools, MITObim and NOVOPlasty, which both require at least a seed sequence to initiate an assembly. To our knowledge, there is no current tool that can assemble mitogenomes completely independently of reference or seed sequences. Both MITObim and NOVOPlasty can use relatively small sequences as a seed, such as a single gene sequence from the target mitogenome or from a more distantly related organism. In comparison, Norgal requires no seed or reference sequence and relies solely on differential k-mer frequencies in the reads, which it detects automatically, to assemble the mitogenome de novo. Table 
<xref rid="Tab2" ref-type="table">2</xref>
shows the performance of the three tools on a subset of the tested datasets spanning different eukaryotic organism groups. The benchmark was run on a computer cluster node with 4 CPUs and 120 GB of memory. The accuracy was comparable among all three methods and they all produced full circular mitochondrial genomes that covered the reference sequence entirely.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Benchmarking of Norgal and comparison with MITObim and NOVOPlasty</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="2">Norgal</th>
<th align="left" colspan="2">MITObim v1.9</th>
<th align="left" colspan="2">NOVOPlasty v2.6.2</th>
</tr>
<tr>
<th align="left">Organism</th>
<th align="left">Identity to reference sequence</th>
<th align="left">Input</th>
<th align="left">Identity to reference sequence</th>
<th align="left">Input</th>
<th align="left">Identity to reference sequence</th>
<th align="left">Input</th>
</tr>
</thead>
<tbody>
<tr>
<td align="justify">
<italic>A. melanoleuca</italic>
(Giant Panda)</td>
<td align="justify">
<italic>99.5</italic>
%</td>
<td align="left">Raw reads</td>
<td align="justify">98.8%</td>
<td align="justify">Trimmed and interleaved reads, reference mitogenome (NC_009492.1)</td>
<td align="justify">99.1%</td>
<td align="justify">Raw reads, insert size, read length, reference COI sequence (DQ093081.1)</td>
</tr>
<tr>
<td align="justify">
<italic>S. japonica</italic>
(Japanese Seaweed)</td>
<td align="justify">
<italic>99.8</italic>
%</td>
<td align="left">Raw reads</td>
<td align="justify">99.0%</td>
<td align="justify">Trimmed and interleaved reads, reference mitogenome (NC_013476.1)</td>
<td align="justify">
<italic>99.8</italic>
%</td>
<td align="justify">Raw reads, mitogenome size range, insert size, read length, reference COI sequence (JN873222.1)</td>
</tr>
<tr>
<td align="justify">
<italic>P. glaucus</italic>
(Swallowtail butterfly)</td>
<td align="justify">99.8%</td>
<td align="left">Raw reads</td>
<td align="justify">98.5%</td>
<td align="justify">Trimmed and interleaved reads, reference mitogenome (KR822739.1)</td>
<td align="justify">
<italic>100.0</italic>
%</td>
<td align="justify">Raw reads, insert size, read length, reference COI sequence (KT286455.1)</td>
</tr>
<tr>
<td align="justify">
<italic>A. niger</italic>
</td>
<td align="justify">98.7%</td>
<td align="left">Raw reads</td>
<td align="justify">97.8%</td>
<td align="justify">Trimmed and interleaved reads, reference mitogenome (NC_007445.1)</td>
<td align="justify">
<italic>98.9</italic>
%</td>
<td align="justify">Raw reads, mitogenome size range, insert size, read length, reference COI sequence (EF180096.1)</td>
</tr>
<tr>
<td align="justify">
<italic>P. papatasi</italic>
(Sand fly)</td>
<td align="justify">98.5%</td>
<td align="left">Raw reads</td>
<td align="justify">99.0%</td>
<td align="justify">Trimmed and interleaved reads, reference mitogenome (NC_028042.1)</td>
<td align="justify">
<italic>99.9</italic>
%</td>
<td align="justify">Raw reads, insert size, read length, reference COI sequence (KU659597.1)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The full results of the benchmark can be seen in the Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S3</p>
<p>The reference sequences were determined by mapping the reads to the NCBI references and correcting the nucleotide differences</p>
<p>The highest identity scores are italicized</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>The peak memory usage was 38-48 GB for Norgal, 1-13 GB for MITObim and 33-53 GB for NOVOPlasty.</p>
<p>In terms of runtime, Norgal was the slowest, taking nine hours on average to assemble the mitogenome. MITObim took three hours on average, while NOVOPlasty took only half an hour. These runtimes exclude the time spent preparing the input data for the programs. Norgal is slower because of the initial full assembly and mapping that determine the nuclear depth. This part consists of multiple assemblies of the whole read pool with a range of different k-mer sizes. If a subsequence of the nuclear genome or the depth of coverage is given to Norgal, the runtime decreases significantly.</p>
<p>Regarding ease of use, all programs run on the command line. Norgal requires the path to the raw reads and a name for the output folder. MITObim can run in several modes, including a 2-step mode where an initial assembly with the program MIRA is used as input. The mode used in this comparison requires only trimmed and interleaved reads as input as well as the seed sequence. NOVOPlasty takes a single configuration file as input, in which the different input parameters, such as the path to a reference or seed sequence, can be set.</p>
<p>In short, unlike MITObim and NOVOPlasty, Norgal does not require a reference or seed sequence, while still achieving similar accuracy. However, both MITObim and NOVOPlasty are significantly faster and use fewer resources.</p>
</sec>
<sec id="Sec10">
<title>Extraction of plastid DNA using a 2-step procedure</title>
<p>Plants have long mitogenomes compared to e.g. vertebrates [
<xref ref-type="bibr" rid="CR16">16</xref>
] and additionally have chloroplast genomes, which are present in high copy numbers [
<xref ref-type="bibr" rid="CR17">17</xref>
]. An assembly of reads with highly frequent k-mers would most likely contain fragmented chloroplast and mitochondrial contigs. Norgal saves the assembly made from the reads with highly frequent k-mers in addition to the extracted mitogenome candidate and a report with the best BLAST hits. Contigs from this assembly can be used as the input seed sequence for current plastid assembly programs such as MITObim and NOVOPlasty. This can be relevant in projects involving a large number of diverse and unknown organisms. In this scenario, Norgal’s output can be used to automatically select relevant seeds for a further assembly.</p>
<p>This approach was tested by running NOVOPlasty v1.1 on the longest contigs of a fragmented Norgal assembly of the grape plant. Using the second-longest contig as a seed resulted in the full chloroplast genome with an identity of 98% to the reference sequence and a combined runtime of 12 h (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S2).</p>
<p>The approach was also tested on a cucumber sample. Cucumbers have large mitogenomes that are split into three separate chromosomes. Norgal output a series of contigs from the chloroplasts and mitochondria. The chloroplast contig was used as a seed sequence for NOVOPlasty and resulted in the full cucumber chloroplast genome with 100% identity to the reference chloroplast.</p>
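<p>Picking seed candidates from the saved assembly, as in the two examples above, can be sketched as follows. The sketch assumes that the assembly is available as a plain FASTA file; the function names are hypothetical and the code is not part of Norgal.</p>

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_contigs(records, n=10):
    """Return the n longest (header, sequence) records, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]
```

<p>Each returned record could then be written to its own FASTA file and passed as the seed to a seed-extending assembler.</p>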
<p>For users interested in completely unknown chloroplast or other organelle genomes for which there are no known sequences, the following approach is suggested:
<list list-type="order">
<list-item>
<p>Extract contigs of interest from the Norgal assembly, such as the ten longest contigs or the contigs with hits from the BLAST search</p>
</list-item>
<list-item>
<p>Run MITObim, NOVOPlasty or another assembler that can extend seed sequences on each of the extracted contigs</p>
</list-item>
<list-item>
<p>Validate the output by:
<list list-type="alpha-lower">
<list-item>
<p>mapping reads back to the contigs and comparing the depths to the nuclear depth</p>
</list-item>
<list-item>
<p>checking for circularity in the contigs</p>
</list-item>
<list-item>
<p>annotating the contigs with relevant features, e.g. mitochondrial genes</p>
</list-item>
</list>
</p>
</list-item>
</list>
</p>
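<p>The depth comparison in the validation step can be sketched as follows. The sketch assumes that per-base read depths have already been extracted from the read mappings as lists of numbers. The use of the median and the default cut-off of 10, taken from the lower end of the 10-100 times range mentioned in the methods, are illustrative choices and not Norgal's own criteria.</p>

```python
from statistics import median


def depth_ratio(candidate_depths, nuclear_depths):
    """Ratio of the median read depth over the candidate to that over the nuclear region."""
    nuclear = median(nuclear_depths)
    if nuclear == 0:
        raise ValueError("nuclear depth is zero; ratio is undefined")
    return median(candidate_depths) / nuclear


def looks_like_organelle(candidate_depths, nuclear_depths, min_ratio=10.0):
    """True if the candidate is covered at least min_ratio times deeper than the nuclear region."""
    return depth_ratio(candidate_depths, nuclear_depths) >= min_ratio
```

<p>A ratio close to 1 would suggest that the contig is nuclear, while a ratio well above the cut-off supports an organelle origin. The result is best read together with the circularity and annotation checks.</p>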
</sec>
<sec id="Sec11">
<title>Assembly complications</title>
<p>As Norgal is based on differences in k-mer frequencies, it is not suited for metagenomic datasets or datasets where the reads are evenly distributed across the mitogenome and nuclear genome (for example, organisms with low copy numbers of mitochondria or samples with many PCR duplicates). This might result in fragmented assemblies, as seen in the grape and cucumber cases, where the longest assembled scaffolds were partial sequences of the mitochondria or chloroplast. This also means that Norgal in general requires a high depth of coverage in order to accurately separate the reads.</p>
<p>The nuclear genome can have sequences of mitochondrial origin (NUMTs) which are not part of the mitogenome [
<xref ref-type="bibr" rid="CR18">18</xref>
]. As Norgal counts k-mers in reads it may include reads from those NUMT regions, as reads that come from these regions may share k-mers with reads from similar regions in the mitogenome. They will consequently not be discarded before assembly and may be incorporated in the final assembled mitogenome sequence. This is undesirable and a BLAST search with some of the assembled mitogenomes against the nuclear genomes did suggest that they had incorporated some NUMT sequences.</p>
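<p>Why NUMT reads can survive a k-mer-based filter can be illustrated with a toy example. This is a minimal sketch and not Norgal's implementation: exact in-memory counting, the use of the median k-mer count per read and all function names are assumptions made for illustration.</p>

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def high_depth_reads(reads, k, threshold):
    """Keep the reads whose median k-mer count is above threshold."""
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        read_counts = [counts[kmer] for kmer in kmers(read, k)]
        # Reads shorter than k have no k-mers and are dropped.
        if read_counts and median(read_counts) > threshold:
            kept.append(read)
    return kept
```

<p>If a read from a NUMT region has the same sequence as the corresponding mitochondrial reads, all of its k-mers inherit the high mitochondrial counts, so the filter keeps it together with the true mitochondrial reads. An unrelated nuclear read with only low-count k-mers is discarded.</p>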
<p>As de novo assemblers based on De Bruijn graphs can theoretically struggle with repeat regions that span the insert size of read libraries [
<xref ref-type="bibr" rid="CR10">10</xref>
], such a case may lead to fragmented assemblies when using paired end reads with short insert sizes.</p>
<p>Irregular and complex mitochondria (e.g. cucumber mitochondrial genomes that are split into multiple chromosomes, one of which is very long) may further complicate assembly. Some organisms have fewer mitochondria in their cells than expected from the literature. This would require setting the depth cut-off manually instead of using the ND threshold.</p>
</sec>
</sec>
<sec id="Sec12" sec-type="conclusion">
<title>Conclusion</title>
<p>Norgal is a tool for extracting mitochondrial DNA from WGS data, especially in situations where reference sequences are unavailable. Plastid genomes were assembled using a proposed 2-step procedure that uses Norgal output as a seed for existing plastid assemblers. Norgal’s success with the 2-step procedure shows that Norgal is optimal in scenarios where the mitochondrial genome is completely unknown and cannot be assembled from any known reference or seed sequences. This tool contributes to the field of discovering and assembling novel mitochondrial sequences from WGS data.</p>
</sec>
<sec id="Sec13">
<title>Availability and requirements</title>
<p>The datasets analysed during the current study are available in the NCBI SRA repository,
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra">https://www.ncbi.nlm.nih.gov/sra</ext-link>
under the following accession numbers: SRR1801279, SRR2089773, SRR2089774, SRR2089775, SRR1707287, SRR543219, SRR1997462, SRR2015301, SRR899957, SRR1291041, SRR958464, SRR504904, SRR942310, SRR1993099, ERR1437502, ERR771129, SRR2984940, SRR494422, SRR494432, and SRR2043182.</p>
<p>
<bold>Project name:</bold>
Norgal</p>
<p>
<bold>Project home page:</bold>
<ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/kosaidtu/norgal">https://bitbucket.org/kosaidtu/norgal</ext-link>
</p>
<p>
<bold>Archived version:</bold>
<ext-link ext-link-type="uri" xlink:href="https://github.com/kosaidtu/norgal/releases/download/v1.0/norgal.tar">https://github.com/kosaidtu/norgal/releases/download/v1.0/norgal.tar</ext-link>
</p>
<p>
<bold>Operating system(s):</bold>
Linux</p>
<p>
<bold>Programming language:</bold>
Python3</p>
<p>
<bold>Other requirements:</bold>
bash, java, matplotlib (python3 package)</p>
<p>
<bold>License:</bold>
MIT License (BBTools is copyrighted to The Regents of the University of California, through Lawrence Berkeley National Laboratory).</p>
<p>
<bold>Any restrictions to use by non-academics:</bold>
MIT License</p>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec14">
<title>Additional file</title>
<p>
<media position="anchor" xlink:href="12859_2017_1927_MOESM1_ESM.docx" id="MOESM1">
<label>Additional file 1</label>
<caption>
<p>A .docx document with full results and detailed benchmarking of Norgal against MITObim and NOVOPlasty. Section S1: Full Norgal output of a subset of the test data. Section S2: Extraction of chloroplast from Vitis vinifera (Grape vine). Section S3: Benchmarking against other methods. Section S4: Mitochondrial test data sets. (DOCX 1485 kb)</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>BLAST</term>
<def>
<p>Basic local alignment search tool</p>
</def>
</def-item>
<def-item>
<term>bp</term>
<def>
<p>Base pairs</p>
</def>
</def-item>
<def-item>
<term>DNA</term>
<def>
<p>Deoxyribonucleic acid</p>
</def>
</def-item>
<def-item>
<term>k-mer</term>
<def>
<p>DNA subsequence of length k</p>
</def>
</def-item>
<def-item>
<term>mitogenome</term>
<def>
<p>Mitochondrial genome</p>
</def>
</def-item>
<def-item>
<term>mtDNA</term>
<def>
<p>Mitochondrial DNA</p>
</def>
</def-item>
<def-item>
<term>ND threshold</term>
<def>
<p>Nuclear depth threshold</p>
</def>
</def-item>
<def-item>
<term>NGS</term>
<def>
<p>Next-generation sequencing</p>
</def>
</def-item>
<def-item>
<term>NUMTs</term>
<def>
<p>Nuclear mitochondrial DNA segment</p>
</def>
</def-item>
<def-item>
<term>PCR</term>
<def>
<p>Polymerase chain reaction</p>
</def>
</def-item>
<def-item>
<term>WGS</term>
<def>
<p>Whole-genome sequencing</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn>
<p>
<bold>Electronic supplementary material</bold>
</p>
<p>The online version of this article (doi:10.1186/s12859-017-1927-y) contains supplementary material, which is available to authorized users.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>We thank the editor and reviewers.</p>
<sec id="d29e1248">
<title>Funding</title>
<p>Not applicable.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>KA, TNP and TSP conceived of the study. KA designed, implemented and tested the pipeline. TNP and TSP contributed ideas to the design of the pipeline. KA wrote the manuscript. TNP and TSP edited the manuscript. All authors have read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="d29e1259">
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec id="d29e1264">
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec id="d29e1269">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="d29e1274">
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bruggen</surname>
<given-names>EFJV</given-names>
</name>
<name>
<surname>Borst</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ruttenberg</surname>
<given-names>GJCM</given-names>
</name>
<name>
<surname>Gruber</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kroon</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>Circular mitochondrial DNA</article-title>
<source>Biochim Biophys Acta (BBA) - Nucleic Acids Protein Synth</source>
<year>1966</year>
<volume>119</volume>
<issue>2</issue>
<fpage>437</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1016/0005-2787(66)90210-3</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hahn</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Bachmann</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chevreux</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Reconstructing mitochondrial genomes directly from genomic next-generation sequencing reads—a baiting and iterative mapping approach</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
<volume>41</volume>
<issue>13</issue>
<fpage>129</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkt371</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dierckxsens</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Mardulyn</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Smits</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>NOVOPlasty: de novo assembly of organelle genomes from whole genome data</article-title>
<source>Nucleic Acids Res</source>
<year>2016</year>
<volume>9</volume>
<issue>4</issue>
<fpage>955</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkw955</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robin</surname>
<given-names>ED</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Mitochondrial DNA molecules and virtual number of mitochondria per cell in mammalian cells</article-title>
<source>J Cell Physiol</source>
<year>1988</year>
<volume>136</volume>
<issue>3</issue>
<fpage>507</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1002/jcp.1041360316</pub-id>
<pub-id pub-id-type="pmid">3170646</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haddad</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Al-Nakeeb</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Petersen</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Dalén</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Blom</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sicheritz-Pontén</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Complete mitochondrial genome of the Oriental Hornet, Vespa orientalis F. (Hymenoptera: Vespidae)</article-title>
<source>Mitochondrial DNA B</source>
<year>2017</year>
<volume>2</volume>
<issue>1</issue>
<fpage>139</fpage>
<lpage>40</lpage>
<pub-id pub-id-type="doi">10.1080/23802359.2017.1292480</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schubert</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lindgreen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Orlando</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>AdapterRemoval v2: rapid adapter trimming, identification, and read merging</article-title>
<source>BMC Res Notes</source>
<year>2016</year>
<volume>9</volume>
<issue>1</issue>
<fpage>88</fpage>
<pub-id pub-id-type="doi">10.1186/s13104-016-1900-2</pub-id>
<pub-id pub-id-type="pmid">26868221</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Lam</surname>
<given-names>TW</given-names>
</name>
</person-group>
<article-title>Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>10</issue>
<fpage>1674</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv033</pub-id>
<pub-id pub-id-type="pmid">25609793</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Fast and accurate long-read alignment with Burrows–Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>5</issue>
<fpage>589</fpage>
<lpage>95</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp698</pub-id>
<pub-id pub-id-type="pmid">20080505</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<mixed-citation publication-type="other">Bushnell B. BBMap Short Read Aligner.
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/bbmap">http://sourceforge.net/projects/bbmap</ext-link>
. Accessed 3 Nov 2017.</mixed-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FYL</given-names>
</name>
</person-group>
<article-title>IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>11</issue>
<fpage>1420</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts174</pub-id>
<pub-id pub-id-type="pmid">22495754</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol</source>
<year>2010</year>
<volume>11</volume>
<issue>11</issue>
<fpage>116</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2010-11-11-r116</pub-id>
<pub-id pub-id-type="pmid">20423531</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<mixed-citation publication-type="other">Cormode G, Muthukrishnan S. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Berlin, Heidelberg: Springer; 2004, pp. 29–38.
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1007/978-3-540-24698-5_7">http://dx.doi.org/10.1007/978-3-540-24698-5_7</ext-link>
.</mixed-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Camacho</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Coulouris</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Avagyan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Papadopoulos</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bealer</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
</person-group>
<article-title>BLAST+: architecture and applications</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>1</issue>
<fpage>421</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-421</pub-id>
<pub-id pub-id-type="pmid">20003500</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>10</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aquadro</surname>
<given-names>CF</given-names>
</name>
<name>
<surname>Greenberg</surname>
<given-names>BD</given-names>
</name>
</person-group>
<article-title>Human Mitochondrial DNA Variation and Evolution: Analysis of Nucleotide Sequences from Seven Individuals</article-title>
<source>Genetics</source>
<year>1983</year>
<volume>103</volume>
<issue>2</issue>
<fpage>287</fpage>
<lpage>312</lpage>
<pub-id pub-id-type="pmid">6299878</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ward</surname>
<given-names>BL</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Bendich</surname>
<given-names>AJ</given-names>
</name>
</person-group>
<article-title>The mitochondrial genome is large and variable in a family of plants (Cucurbitaceae)</article-title>
<source>Cell</source>
<year>1981</year>
<volume>25</volume>
<issue>3</issue>
<fpage>793</fpage>
<lpage>803</lpage>
<pub-id pub-id-type="doi">10.1016/0092-8674(81)90187-2</pub-id>
<pub-id pub-id-type="pmid">6269758</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<mixed-citation publication-type="other">Heldt HW, Piechulla B. 20 - A plant cell has three different genomes BT - Plant Biochemistry (Fourth Edition). San Diego: Academic Press; 2011, pp. 487–526. doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1016/B978-0-12-384986-1.00020-X">10.1016/B978-0-12-384986-1.00020-X</ext-link>
.
<ext-link ext-link-type="uri" xlink:href="http://www.sciencedirect.com/science/article/pii/B978012384986100020X">http://www.sciencedirect.com/science/article/pii/B978012384986100020X</ext-link>
.</mixed-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lopez</surname>
<given-names>JV</given-names>
</name>
<name>
<surname>Yuhki</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Masuda</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Modi</surname>
<given-names>W</given-names>
</name>
<name>
<surname>O’Brien</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<article-title>Numt, a recent transfer and tandem amplification of mitochondrial DNA to the nuclear genome of the domestic cat</article-title>
<source>J Mol Evol</source>
<year>1994</year>
<volume>39</volume>
<issue>2</issue>
<fpage>174</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="pmid">7932781</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0002700 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0002700 | SxmlIndent | more
