Covid exploration server (26 March)


Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes

Internal identifier: 000657 (Pmc/Corpus); previous: 000656; next: 000658

Authors: Mohammed Sahli; Tetsuo Shibuya

Source :

RBID : PMC:3441218

Abstract

Background

Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe that creating specific assemblers, for solving specific cases, will be much more fruitful than creating general assemblers.

Findings

In this paper, we present Arapan-S, a whole-genome assembly program dedicated to handling small genomes. In many cases it produces a single contig (along with its reverse complement). When a genome consists of several segments, the implemented algorithm can detect all of them, as we demonstrate for Influenza Virus A. The Arapan-S program is based on the de Bruijn graph. We have implemented a fast method that reconstructs the original sequence and discards erroneous k-mers. The method explores the graph by using neither the shortest nor the longest path, but rather a specific and reliable path based on the coverage level or k-mers’ lengths. Arapan-S uses short reads, and it was tested on raw data downloaded from the NCBI Trace Archive.

Conclusions

Our findings show that the accuracy of the assembly was very high; the result was checked against the European Bioinformatics Institute (EBI) database using the NCBI BLAST Sequence Similarity Search. The identity and the genome coverage were more than 99%. We also compared the efficiency of Arapan-S with that of other well-known assemblers. In dealing with small genomes, the accuracy of Arapan-S is significantly higher than that of other assemblers. The assembly process is very fast and requires only a few seconds.

Arapan-S is available for free to the public. The binary files for Arapan-S are available through http://sourceforge.net/projects/dnascissor/files/.


Url:
DOI: 10.1186/1756-0500-5-243
PubMed: 22591859
PubMed Central: 3441218

Links to Exploration step

PMC:3441218

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes</title>
<author>
<name sortKey="Sahli, Mohammed" sort="Sahli, Mohammed" uniqKey="Sahli M" first="Mohammed" last="Sahli">Mohammed Sahli</name>
<affiliation>
<nlm:aff id="I1">Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shibuya, Tetsuo" sort="Shibuya, Tetsuo" uniqKey="Shibuya T" first="Tetsuo" last="Shibuya">Tetsuo Shibuya</name>
<affiliation>
<nlm:aff id="I2">Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22591859</idno>
<idno type="pmc">3441218</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3441218</idno>
<idno type="RBID">PMC:3441218</idno>
<idno type="doi">10.1186/1756-0500-5-243</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000657</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000657</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes</title>
<author>
<name sortKey="Sahli, Mohammed" sort="Sahli, Mohammed" uniqKey="Sahli M" first="Mohammed" last="Sahli">Mohammed Sahli</name>
<affiliation>
<nlm:aff id="I1">Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shibuya, Tetsuo" sort="Shibuya, Tetsuo" uniqKey="Shibuya T" first="Tetsuo" last="Shibuya">Tetsuo Shibuya</name>
<affiliation>
<nlm:aff id="I2">Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Research Notes</title>
<idno type="eISSN">1756-0500</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe that creating specific assemblers, for solving specific cases, will be much more fruitful than creating general assemblers.</p>
</sec>
<sec>
<title>Findings</title>
<p>In this paper, we present Arapan-S, a whole-genome assembly program dedicated to handling small genomes. In many cases it produces a single contig (along with its reverse complement). When a genome consists of several segments, the implemented algorithm can detect all of them, as we demonstrate for
<italic>Influenza Virus A</italic>
. The Arapan-S program is based on the de Bruijn graph. We have implemented a fast method that reconstructs the original sequence and discards erroneous
<italic>k</italic>
-mers. The method explores the graph by using neither the shortest nor the longest path, but rather a specific and reliable path based on the coverage level or
<italic>k</italic>
-mers’ lengths. Arapan-S uses short reads, and it was tested on raw data downloaded from the NCBI Trace Archive.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our findings show that the accuracy of the assembly was very high; the result was checked against the European Bioinformatics Institute (EBI) database using the NCBI BLAST Sequence Similarity Search. The identity and the genome coverage were more than 99%. We also compared the efficiency of Arapan-S with that of other well-known assemblers. In dealing with small genomes, the accuracy of Arapan-S is significantly higher than that of other assemblers. The assembly process is very fast and requires only a few seconds.</p>
<p>Arapan-S is available for free to the public. The binary files for Arapan-S are available through
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/dnascissor/files/">http://sourceforge.net/projects/dnascissor/files/</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="White, O" uniqKey="White O">O White</name>
</author>
<author>
<name sortKey="Adams, Md" uniqKey="Adams M">MD Adams</name>
</author>
<author>
<name sortKey="Kerlavage, Ar" uniqKey="Kerlavage A">AR Kerlavage</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Madan, A" uniqKey="Madan A">A Madan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
<author>
<name sortKey="Yang, Sp" uniqKey="Yang S">SP Yang</name>
</author>
<author>
<name sortKey="Hillier, L" uniqKey="Hillier L">L Hillier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chevreux, B" uniqKey="Chevreux B">B Chevreux</name>
</author>
<author>
<name sortKey="Pfisterer, T" uniqKey="Pfisterer T">T Pfisterer</name>
</author>
<author>
<name sortKey="Drescher, B" uniqKey="Drescher B">B Drescher</name>
</author>
<author>
<name sortKey="Driesel, Aj" uniqKey="Driesel A">AJ Driesel</name>
</author>
<author>
<name sortKey="Muller, Weg" uniqKey="Muller W">WEG Müller</name>
</author>
<author>
<name sortKey="Wetter, T" uniqKey="Wetter T">T Wetter</name>
</author>
<author>
<name sortKey="Suhai, S" uniqKey="Suhai S">S Suhai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warren, Rl" uniqKey="Warren R">RL Warren</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
<author>
<name sortKey="Holt, Ra" uniqKey="Holt R">RA Holt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, Mj" uniqKey="Chaisson M">MJ Chaisson</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Mcewen, Gk" uniqKey="Mcewen G">GK McEwen</name>
</author>
<author>
<name sortKey="Margulies, Eh" uniqKey="Margulies E">EH Margulies</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Butler, J" uniqKey="Butler J">J Butler</name>
</author>
<author>
<name sortKey="Maccallum, I" uniqKey="Maccallum I">I MacCallum</name>
</author>
<author>
<name sortKey="Kleber, M" uniqKey="Kleber M">M Kleber</name>
</author>
<author>
<name sortKey="Shlyakhter, Ia" uniqKey="Shlyakhter I">IA Shlyakhter</name>
</author>
<author>
<name sortKey="Belmonte, Mk" uniqKey="Belmonte M">MK Belmonte</name>
</author>
<author>
<name sortKey="Lander, Es" uniqKey="Lander E">ES Lander</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C Nusbaum</name>
</author>
<author>
<name sortKey="Jaffe, Db" uniqKey="Jaffe D">DB Jaffe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Maccallum, I" uniqKey="Maccallum I">I MacCallum</name>
</author>
<author>
<name sortKey="Przybylski, D" uniqKey="Przybylski D">D Przybylski</name>
</author>
<author>
<name sortKey="Gnerre, S" uniqKey="Gnerre S">S Gnerre</name>
</author>
<author>
<name sortKey="Burton, J" uniqKey="Burton J">J Burton</name>
</author>
<author>
<name sortKey="Shlyakhter, I" uniqKey="Shlyakhter I">I Shlyakhter</name>
</author>
<author>
<name sortKey="Gnirke, A" uniqKey="Gnirke A">A Gnirke</name>
</author>
<author>
<name sortKey="Malek, J" uniqKey="Malek J">J Malek</name>
</author>
<author>
<name sortKey="Mckernan, K" uniqKey="Mckernan K">K McKernan</name>
</author>
<author>
<name sortKey="Ranade, S" uniqKey="Ranade S">S Ranade</name>
</author>
<author>
<name sortKey="Shea, Tp" uniqKey="Shea T">TP Shea</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K Wong</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Qian, W" uniqKey="Qian W">W Qian</name>
</author>
<author>
<name sortKey="Fang, X" uniqKey="Fang X">X Fang</name>
</author>
<author>
<name sortKey="Shi, Z" uniqKey="Shi Z">Z Shi</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S Li</name>
</author>
<author>
<name sortKey="Shan, G" uniqKey="Shan G">G Shan</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K Kristiansen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bryant, Dw" uniqKey="Bryant D">DW Bryant</name>
</author>
<author>
<name sortKey="Wong, Wk" uniqKey="Wong W">WK Wong</name>
</author>
<author>
<name sortKey="Mockler, Tc" uniqKey="Mockler T">TC Mockler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sommer, Dd" uniqKey="Sommer D">DD Sommer</name>
</author>
<author>
<name sortKey="Dlecher, Al" uniqKey="Dlecher A">AL Delcher</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Res Notes</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Res Notes</journal-id>
<journal-title-group>
<journal-title>BMC Research Notes</journal-title>
</journal-title-group>
<issn pub-type="epub">1756-0500</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22591859</article-id>
<article-id pub-id-type="pmc">3441218</article-id>
<article-id pub-id-type="publisher-id">1756-0500-5-243</article-id>
<article-id pub-id-type="doi">10.1186/1756-0500-5-243</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Technical Note</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Sahli</surname>
<given-names>Mohammed</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>mohammed@hgc.jp</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Shibuya</surname>
<given-names>Tetsuo</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>tshibuya@hgc.jp</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan</aff>
<aff id="I2">
<label>2</label>
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>16</day>
<month>5</month>
<year>2012</year>
</pub-date>
<volume>5</volume>
<fpage>243</fpage>
<lpage>243</lpage>
<history>
<date date-type="received">
<day>12</day>
<month>10</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>5</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2012 Sahli and Shibuya; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Sahli and Shibuya; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1756-0500/5/243"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe that creating specific assemblers, for solving specific cases, will be much more fruitful than creating general assemblers.</p>
</sec>
<sec>
<title>Findings</title>
<p>In this paper, we present Arapan-S, a whole-genome assembly program dedicated to handling small genomes. In many cases it produces a single contig (along with its reverse complement). When a genome consists of several segments, the implemented algorithm can detect all of them, as we demonstrate for
<italic>Influenza Virus A</italic>
. The Arapan-S program is based on the de Bruijn graph. We have implemented a fast method that reconstructs the original sequence and discards erroneous
<italic>k</italic>
-mers. The method explores the graph by using neither the shortest nor the longest path, but rather a specific and reliable path based on the coverage level or
<italic>k</italic>
-mers’ lengths. Arapan-S uses short reads, and it was tested on raw data downloaded from the NCBI Trace Archive.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our findings show that the accuracy of the assembly was very high; the result was checked against the European Bioinformatics Institute (EBI) database using the NCBI BLAST Sequence Similarity Search. The identity and the genome coverage were more than 99%. We also compared the efficiency of Arapan-S with that of other well-known assemblers. In dealing with small genomes, the accuracy of Arapan-S is significantly higher than that of other assemblers. The assembly process is very fast and requires only a few seconds.</p>
<p>Arapan-S is available for free to the public. The binary files for Arapan-S are available through
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/dnascissor/files/">http://sourceforge.net/projects/dnascissor/files/</ext-link>
.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Sequencing technologies provide thousands of sets of genomic reads (sometimes called fragments or segments), each set taken from a specific genome. Bringing these reads together to reconstruct the original sequence (the genome) is commonly known as the (whole-) genome assembly problem. This problem has been studied extensively, and many assemblers and assembly models have been proposed. Most models are based on either the overlap graph or the de Bruijn graph. The overlap graph is a graph whose nodes represent the genomic reads, while its edges correspond to the overlaps between these reads. It was the basis of the first assemblers, such as TIGR [
<xref ref-type="bibr" rid="B1">1</xref>
], CAP3 [
<xref ref-type="bibr" rid="B2">2</xref>
], PCAP [
<xref ref-type="bibr" rid="B3">3</xref>
], the string graph of Myers [
<xref ref-type="bibr" rid="B4">4</xref>
] and MIRA [
<xref ref-type="bibr" rid="B5">5</xref>
]. The second category of assemblers is based on the de Bruijn graph, in which the nodes represent the substrings (
<italic>k</italic>
-mers) of the genomic reads (which are all of the same length), while the edges correspond to the overlaps between these substrings. The de Bruijn graph has become the standard basis of the so-called “de novo” assemblers. Assemblers based on this approach include the Euler assembler [
<xref ref-type="bibr" rid="B6">6</xref>
], SSAKE [
<xref ref-type="bibr" rid="B7">7</xref>
], EULER-SR [
<xref ref-type="bibr" rid="B8">8</xref>
], Velvet [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
], ALLPATHS [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
], ABySS [
<xref ref-type="bibr" rid="B13">13</xref>
], and SOAPdenovo [
<xref ref-type="bibr" rid="B14">14</xref>
]. Although the assemblers share the same graph structure, they use different (but sometimes similar) algorithms to walk through the graph. To our knowledge, there is no proof that the shortest or the longest path, or the Hamiltonian or Eulerian paths will represent the genome in its natural form; therefore, we developed an algorithm that selects only the reliable nodes in the de Bruijn graph in order to reconstruct the original sequence of small genomes or long contigs when the graph is sparse.</p>
<p>Because of the diversity of genomes, a general assembler that tries to solve all cases will not be as effective or as fast as a specific assembler that focuses on particular cases. For instance, ploidy can be a serious problem when dealing with plant genomes, in which tetraploidy is common. For very small genomes, we believe that assembly accuracy can be improved by an assembler devoted to them, which is why we created Arapan-S. As a result, the Arapan-S assembler was able to reconstruct one highly accurate supercontig in most cases. To check the accuracy of Arapan-S, we performed a BLAST sequence similarity search against the EBI (European Bioinformatics Institute) database, which includes the complete genomes of our dataset. This analysis showed that the Arapan-S assemblies were more than 99% accurate. We also compared Arapan-S with other well-known assemblers in the assembly of viral genomes.</p>
</sec>
<sec>
<title>Findings</title>
<sec>
<title>Arapan-S parameters</title>
<p>Arapan-S was written in C/C++ with the Qt framework on a 64-bit Linux machine and was also compiled on Windows. The input data must represent each
<italic>k</italic>
-mer (i.e. de Bruijn sequence), along with its frequency in the same line, separated by a whitespace character. Note that all frequency values of generated
<italic>k</italic>
-mers are based on the coverage level of the dataset. In other words, we have used such frequency values instead of the coverage value. A tool called
<italic>kmerBuilder</italic>
, which is one of several assembly pipelines included in the Arapan software package, can generate
<italic>k</italic>
-mer files for Arapan-S (i.e. the dataset must be prepared independently from our assembler). The project acronym (Arapan) represents our primary goal to produce a software system that includes a set of open-source tools dedicated to solving and analyzing the whole genome assembly problem.</p>
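<p>The input layout described above can be sketched as follows. This is a minimal Python illustration, assuming one k-mer and its integer frequency per line, separated by a single space; the function names are hypothetical, and the actual output of kmerBuilder is not reproduced here.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def format_kmer_lines(counts):
    """One 'KMER FREQUENCY' line per k-mer (assumed layout)."""
    return [f"{kmer} {freq}" for kmer, freq in sorted(counts.items())]


def parse_kmer_lines(lines):
    """Read the assumed layout back into a {kmer: frequency} table."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        kmer, freq = line.split()
        table[kmer] = int(freq)
    return table


lines = format_kmer_lines(count_kmers(["ACGTA", "CGTAC"], k=4))
table = parse_kmer_lines(lines)
```
]]></preformat>
<p>In this toy case the two reads share the 4-mer CGTA, so its frequency is 2 while the other two 4-mers have frequency 1.</p>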
<p>The Arapan-S assembler is very sensitive to the length
<italic>k</italic>
 of the short reads. Because of its architecture, our tool always tries to find one supercontig along with its reverse complement. Nevertheless, if
<italic>k</italic>
is very short, Arapan-S will encounter some difficulties in constructing the original sequence. Also, if
<italic>k</italic>
 is very long, the result of the assembly will not be significant. There is always a trade-off between specificity and sensitivity when choosing the length of
<italic>k</italic>
. In our experiments, the most appropriate values of
<italic>k</italic>
 satisfy 20 ≤ 
<italic>k</italic>
 ≤ 35.</p>
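<p>One way to see why a very short k causes difficulties is to count the k-mers that have more than one possible successor, since each of these is an ambiguous branch in the graph. The sketch below is only an illustration on a hypothetical toy sequence; the function names are ours and it is not part of Arapan-S.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def successor_sets(sequence, k):
    """Map each k-mer to the set of k-mers that directly follow it."""
    followers = defaultdict(set)
    for i in range(len(sequence) - k):
        followers[sequence[i:i + k]].add(sequence[i + 1:i + 1 + k])
    return followers


def branching_nodes(sequence, k):
    """Count k-mers that have more than one distinct successor."""
    return sum(1 for nxt in successor_sets(sequence, k).values() if len(nxt) > 1)


toy = "ATGCATGGATGT"  # hypothetical sequence in which "ATG" occurs three times
for k in (3, 5):
    print(k, branching_nodes(toy, k))
```
]]></preformat>
<p>With k = 3 the repeated 3-mer ATG has three different successors, so there is one branching node; with k = 5 every 5-mer is unique and no branching remains.</p>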
<p>Arapan-S has only one parameter, which is the merging function: the frequency function or the
<italic>k</italic>
-mer length function. The graphical user interface of Arapan-S represents this parameter by a check-box. During the experiments, it was preferable to choose the frequency function, since it usually leads to a more accurate result. We have considered the frequency function to be the only objective function in our experiments.</p>
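<p>The idea of letting frequency decide between competing branches can be illustrated with a generic greedy walk. This is not the Arapan-S algorithm, whose details are not given in this section; it is a hedged sketch in which the function names, the toy table and the rule "take the unvisited successor with the highest frequency" are all assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
def successors(kmer, table):
    """Return the k-mers in the table that overlap kmer by k - 1 bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in table]


def greedy_walk(start, table):
    """Extend start one base at a time, always taking the unvisited
    successor with the highest frequency; stop at a dead end."""
    contig = start
    current = start
    visited = {start}
    while True:
        candidates = [s for s in successors(current, table) if s not in visited]
        if not candidates:
            return contig
        current = max(candidates, key=lambda s: table[s])
        visited.add(current)
        contig += current[-1]


# Toy table: "CGTT" has a low frequency, as a sequencing error might produce.
table = {"ACGT": 9, "CGTA": 8, "CGTT": 1, "GTAC": 9, "TACC": 7}
contig = greedy_walk("ACGT", table)
```
]]></preformat>
<p>Starting from ACGT, the walk prefers CGTA (frequency 8) over CGTT (frequency 1) and ends with the contig ACGTACC, so the low-frequency k-mer is never used.</p>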
</sec>
<sec>
<title>BLAST similarity search</title>
<p>We downloaded some real datasets from the NCBI Trace Archive (
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/">ftp://ftp.ncbi.nih.gov/pub/TraceDB/</ext-link>
). The data were cleaned and prepared by a trimming tool (
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/dnascissor/files/DNA%20Scissor/">http://sourceforge.net/projects/dnascissor/files/DNA%20Scissor/</ext-link>
). A minimum quality value cut-off of 20 (i.e. the accuracy of the base call was 99%) was set for most of the genomes, and the low-quality end regions were trimmed at the 5′-end and 3′-end of every read. The short reads (
<italic>k</italic>
-mers) were generated by the same trimming tool for each set of reads. The Arapan-S assembler was very fast, used little memory and provided us with one supercontig along with its reverse complement in many cases. To check the accuracy of our assembler, we searched the EBI database for the obtained supercontigs (the complete genome) using the NCBI BLAST Similarity Search. The input data are given in Table
<xref ref-type="table" rid="T1">1</xref>
, while Table
<xref ref-type="table" rid="T2">2</xref>
, Table
<xref ref-type="table" rid="T3">3</xref>
, Table
<xref ref-type="table" rid="T4">4</xref>
and Table
<xref ref-type="table" rid="T5">5</xref>
show the results.</p>
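<p>The quality cut-off can be made concrete with the standard Phred relation, in which a score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 gives 99% accuracy. The sketch below computes that value and trims low-quality ends; the trimming rule (drop bases from each end until one reaches the cut-off) and the function names are assumptions for illustration, not the documented behaviour of the trimming tool.</p>
<preformat preformat-type="code"><![CDATA[
```python
def phred_accuracy(q):
    """Base-call accuracy implied by Phred score q: 1 - 10^(-q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def trim_ends(bases, quals, cutoff=20):
    """Drop bases from both ends until a base with quality >= cutoff is met."""
    start = 0
    end = len(bases)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return bases[start:end], quals[start:end]


# Hypothetical read with poor quality at both ends.
trimmed, kept = trim_ends("NACGTN", [5, 12, 30, 35, 28, 8])
```
]]></preformat>
<p>For this hypothetical read, the two leading bases and the last base fall below the cut-off, leaving CGT with qualities 30, 35 and 28.</p>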
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>The input data include eight virus genomes</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="center">
<bold>Species</bold>
</th>
<th align="center">
<bold>Accession number</bold>
</th>
<th align="right">
<bold>Number of reads</bold>
</th>
<th align="right">
<bold>Read average length (bp)</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Bovine Respiratory Coronavirus AH187
<hr></hr>
</td>
<td align="left" valign="bottom">FJ938065.1
<hr></hr>
</td>
<td align="right" valign="bottom">635
<hr></hr>
</td>
<td align="right" valign="bottom">995
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Calf-giraffe Coronavirus US/OH3/2006
<hr></hr>
</td>
<td align="left" valign="bottom">EF424624.1
<hr></hr>
</td>
<td align="right" valign="bottom">548
<hr></hr>
</td>
<td align="right" valign="bottom">935
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Waterbuck Coronavirus US/OH-WD358-TC/1994
<hr></hr>
</td>
<td align="left" valign="bottom">FJ425184.1
<hr></hr>
</td>
<td align="right" valign="bottom">576
<hr></hr>
</td>
<td align="right" valign="bottom">984
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">White-tailed Deer Coronavirus US/OH-WD470/1994
<hr></hr>
</td>
<td align="left" valign="bottom">FJ425187.1
<hr></hr>
</td>
<td align="right" valign="bottom">503
<hr></hr>
</td>
<td align="right" valign="bottom">918
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Antelope coronavirus US/OH1/2003
<hr></hr>
</td>
<td align="left" valign="bottom">EF424621.1
<hr></hr>
</td>
<td align="right" valign="bottom">616
<hr></hr>
</td>
<td align="right" valign="bottom">991
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Influenza A Virus (A/Memphis/1/71(H3N2))
<hr></hr>
</td>
<td align="left" valign="bottom">From CY006211.1
<break></break>
To CY006218.1
<hr></hr>
</td>
<td align="right" valign="bottom">132
<hr></hr>
</td>
<td align="right" valign="bottom">570
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Influenza A Virus (A/Swine/Colorado/1/77/(H3N2))
<hr></hr>
</td>
<td align="left" valign="bottom">Q288Y7 (EBI)
<hr></hr>
</td>
<td align="right" valign="bottom">159
<hr></hr>
</td>
<td align="right" valign="bottom">596
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Influenza A Virus (A/Weiss/43/(H1N1))</td>
<td align="left">From CY009452.1
<break></break>
To CY009459.1</td>
<td align="right">168</td>
<td align="right">519</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>We considered eight viruses. The genome of each Influenza A Virus consists of eight segments while the others have only one long segment. The datasets represent Sanger reads. The raw data were downloaded from NCBI Trace Archive (
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/">ftp://ftp.ncbi.nih.gov/pub/TraceDB/</ext-link>
).</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Alignment results against the EBI database (BLAST Similarity Search) for seven virus genomes</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="right">
<bold>Species</bold>
</th>
<th align="right">
<bold>Total length</bold>
<break></break>
<bold>(bp)</bold>
</th>
<th align="right">
<bold>Genome length</bold>
<break></break>
<bold>(EBI)</bold>
</th>
<th align="right">
<bold>Alignment score</bold>
</th>
<th align="right">
<bold>Identities</bold>
</th>
<th align="right">
<bold>Expect value</bold>
<break></break>
<bold>
<italic>E</italic>
</bold>
<bold>()</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Bovine Respiratory Coronavirus AH187
<hr></hr>
</td>
<td align="right" valign="bottom">30936
<hr></hr>
</td>
<td align="right" valign="bottom">30969
<hr></hr>
</td>
<td align="right" valign="bottom">30875
<hr></hr>
</td>
<td align="right" valign="bottom">99.803%
<hr></hr>
</td>
<td align="right" valign="bottom">0.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Calf-giraffe Coronavirus US/OH3/2006
<hr></hr>
</td>
<td align="right" valign="bottom">30831
<hr></hr>
</td>
<td align="right" valign="bottom">30979
<hr></hr>
</td>
<td align="right" valign="bottom">30762
<hr></hr>
</td>
<td align="right" valign="bottom">99.776%
<hr></hr>
</td>
<td align="right" valign="bottom">0.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Waterbuck Coronavirus US/OH-WD358-TC/1994
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30934
<hr></hr>
</td>
<td align="right" valign="bottom">99.803%
<hr></hr>
</td>
<td align="right" valign="bottom">0.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">White-tailed Deer Coronavirus US/OH-WD470/1994
<hr></hr>
</td>
<td align="right" valign="bottom">31018
<hr></hr>
</td>
<td align="right" valign="bottom">31020
<hr></hr>
</td>
<td align="right" valign="bottom">30957
<hr></hr>
</td>
<td align="right" valign="bottom">99.803%
<hr></hr>
</td>
<td align="right" valign="bottom">0.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Influenza A Virus (A/Memphis/1/71(H3N2))
<hr></hr>
</td>
<td align="right" valign="bottom">12598
<hr></hr>
</td>
<td align="right" valign="bottom">13397
<hr></hr>
</td>
<td align="right" valign="bottom">12503
<hr></hr>
</td>
<td align="right" valign="bottom">99.246%
<hr></hr>
</td>
<td align="right" valign="bottom">0.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Influenza A Virus (A/Swine/Colorado/1/77/(H3N2))
<hr></hr>
</td>
<td align="right" valign="bottom">13019
<hr></hr>
</td>
<td align="right" valign="bottom">13304
<hr></hr>
</td>
<td align="right" valign="bottom">12969
<hr></hr>
</td>
<td align="right" valign="bottom">99.616%
<hr></hr>
</td>
<td align="right" valign="bottom">0.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Influenza A Virus (A/Weiss/43/(H1N1))</td>
<td align="right">13300</td>
<td align="right">13371</td>
<td align="right">13208</td>
<td align="right">99.308%</td>
<td align="right">0.0</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The Total length is the length of the obtained result, while Genome length (EBI) is the genome’s supposed length according to the EBI database. The values of Identities were calculated by dividing Alignment scores by the corresponding Total lengths. The Expect-value was calculated by EBI’s NCBI BLAST Similarity Search engine.</p>
</table-wrap-foot>
</table-wrap>
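<p>The identity calculation described in the footnote of Table 2 can be reproduced with a minimal Python sketch. The function name <italic>identity_percent</italic> is hypothetical and is not part of Arapan-S; the input values in the usage line are copied from the first row of the table.</p>

```python
def identity_percent(alignment_score, total_length):
    """Return the alignment score as a percentage of the assembly's total length."""
    if total_length == 0:
        raise ValueError("total_length must be non-zero")
    return 100.0 * alignment_score / total_length


# Bovine Respiratory Coronavirus AH187 (values from Table 2)
print(f"{identity_percent(30875, 30936):.3f}%")  # 99.803%
```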
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption>
<p>Comparison of Arapan-S with the ABySS, SSAKE, Velvet, QSRA, Minimus and Mira assemblers on four benchmark virus genomes</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="center">
<bold>Species</bold>
</th>
<th align="left">
<bold>Assembler</bold>
</th>
<th align="right">
<bold>Contigs ≥ 800 bp</bold>
</th>
<th align="right">
<bold>Total</bold>
<break></break>
<bold>length</bold>
</th>
<th align="right">
<bold>Mean size (bp)</bold>
</th>
<th align="right">
<bold>N50 (bp)</bold>
</th>
<th align="right">
<bold>Largest contig (bp)</bold>
</th>
<th align="right">
<bold>Genome coverage (%)</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">
<bold>Bovine Respiratory Coronavirus AH187</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30937
<hr></hr>
</td>
<td align="right" valign="bottom">30937
<hr></hr>
</td>
<td align="right" valign="bottom">30937
<hr></hr>
</td>
<td align="right" valign="bottom">30937
<hr></hr>
</td>
<td align="right" valign="bottom">99.90
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30924
<hr></hr>
</td>
<td align="right" valign="bottom">30924.00
<hr></hr>
</td>
<td align="right" valign="bottom">30924
<hr></hr>
</td>
<td align="right" valign="bottom">30924
<hr></hr>
</td>
<td align="right" valign="bottom">99.85
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">9
<hr></hr>
</td>
<td align="right" valign="bottom">27428
<hr></hr>
</td>
<td align="right" valign="bottom">3047.56
<hr></hr>
</td>
<td align="right" valign="bottom">3447
<hr></hr>
</td>
<td align="right" valign="bottom">9868
<hr></hr>
</td>
<td align="right" valign="bottom">88.57
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">3
<hr></hr>
</td>
<td align="right" valign="bottom">30951
<hr></hr>
</td>
<td align="right" valign="bottom">10317.00
<hr></hr>
</td>
<td align="right" valign="bottom">25461
<hr></hr>
</td>
<td align="right" valign="bottom">25461
<hr></hr>
</td>
<td align="right" valign="bottom">99.94
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">29617
<hr></hr>
</td>
<td align="right" valign="bottom">3702.125
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">11695
<hr></hr>
</td>
<td align="right" valign="bottom">95.63
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">31026
<hr></hr>
</td>
<td align="right" valign="bottom">31026
<hr></hr>
</td>
<td align="right" valign="bottom">31026
<hr></hr>
</td>
<td align="right" valign="bottom">31026
<hr></hr>
</td>
<td align="right" valign="bottom">100.18
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Mira</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">28803
<hr></hr>
</td>
<td align="right" valign="bottom">3600.37
<hr></hr>
</td>
<td align="right" valign="bottom">3192
<hr></hr>
</td>
<td align="right" valign="bottom">12305
<hr></hr>
</td>
<td align="right" valign="bottom">93.51
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Calf-giraffe Coronavirus</bold>
<break></break>
<bold>US/OH3/2006</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30836
<hr></hr>
</td>
<td align="right" valign="bottom">30836
<hr></hr>
</td>
<td align="right" valign="bottom">30836
<hr></hr>
</td>
<td align="right" valign="bottom">30836
<hr></hr>
</td>
<td align="right" valign="bottom">99.53
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">2
<hr></hr>
</td>
<td align="right" valign="bottom">30652
<hr></hr>
</td>
<td align="right" valign="bottom">15326.00
<hr></hr>
</td>
<td align="right" valign="bottom">18956
<hr></hr>
</td>
<td align="right" valign="bottom">18956
<hr></hr>
</td>
<td align="right" valign="bottom">98.94
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">11
<hr></hr>
</td>
<td align="right" valign="bottom">17005
<hr></hr>
</td>
<td align="right" valign="bottom">1545.91
<hr></hr>
</td>
<td align="right" valign="bottom">892
<hr></hr>
</td>
<td align="right" valign="bottom">2683
<hr></hr>
</td>
<td align="right" valign="bottom">54.89
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">3
<hr></hr>
</td>
<td align="right" valign="bottom">30951
<hr></hr>
</td>
<td align="right" valign="bottom">10317.00
<hr></hr>
</td>
<td align="right" valign="bottom">25461
<hr></hr>
</td>
<td align="right" valign="bottom">25461
<hr></hr>
</td>
<td align="right" valign="bottom">99.91
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">2
<hr></hr>
</td>
<td align="right" valign="bottom">2107
<hr></hr>
</td>
<td align="right" valign="bottom">1053.5
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">1173
<hr></hr>
</td>
<td align="right" valign="bottom">6.80
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30979
<hr></hr>
</td>
<td align="right" valign="bottom">30979
<hr></hr>
</td>
<td align="right" valign="bottom">30979
<hr></hr>
</td>
<td align="right" valign="bottom">30979
<hr></hr>
</td>
<td align="right" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Mira</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">5
<hr></hr>
</td>
<td align="right" valign="bottom">33850
<hr></hr>
</td>
<td align="right" valign="bottom">6770
<hr></hr>
</td>
<td align="right" valign="bottom">20763
<hr></hr>
</td>
<td align="right" valign="bottom">20763
<hr></hr>
</td>
<td align="right" valign="bottom">109.28
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Waterbuck Coronavirus US/OH-WD358-TC/1994</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30995.00
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30944
<hr></hr>
</td>
<td align="right" valign="bottom">30944.00
<hr></hr>
</td>
<td align="right" valign="bottom">30944
<hr></hr>
</td>
<td align="right" valign="bottom">30944
<hr></hr>
</td>
<td align="right" valign="bottom">99.86
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">13
<hr></hr>
</td>
<td align="right" valign="bottom">21780
<hr></hr>
</td>
<td align="right" valign="bottom">1675.38
<hr></hr>
</td>
<td align="right" valign="bottom">1063
<hr></hr>
</td>
<td align="right" valign="bottom">5343
<hr></hr>
</td>
<td align="right" valign="bottom">70.27
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">12505
<hr></hr>
</td>
<td align="right" valign="bottom">1563.12
<hr></hr>
</td>
<td align="right" valign="bottom">967
<hr></hr>
</td>
<td align="right" valign="bottom">2162
<hr></hr>
</td>
<td align="right" valign="bottom">40.34
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">5
<hr></hr>
</td>
<td align="right" valign="bottom">4638
<hr></hr>
</td>
<td align="right" valign="bottom">927.6
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">1174
<hr></hr>
</td>
<td align="right" valign="bottom">14.96
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">30995
<hr></hr>
</td>
<td align="right" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Mira</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">6
<hr></hr>
</td>
<td align="right" valign="bottom">34011
<hr></hr>
</td>
<td align="right" valign="bottom">5668.5
<hr></hr>
</td>
<td align="right" valign="bottom">10510
<hr></hr>
</td>
<td align="right" valign="bottom">10983
<hr></hr>
</td>
<td align="right" valign="bottom">109.73
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>White-tailed Deer Coronavirus US/OH-WD470/1994</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">31018
<hr></hr>
</td>
<td align="right" valign="bottom">31018.00
<hr></hr>
</td>
<td align="right" valign="bottom">31018
<hr></hr>
</td>
<td align="right" valign="bottom">31018
<hr></hr>
</td>
<td align="right" valign="bottom">99.99
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">2
<hr></hr>
</td>
<td align="right" valign="bottom">30943
<hr></hr>
</td>
<td align="right" valign="bottom">15471.50
<hr></hr>
</td>
<td align="right" valign="bottom">21535
<hr></hr>
</td>
<td align="right" valign="bottom">21535
<hr></hr>
</td>
<td align="right" valign="bottom">99.75
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">5
<hr></hr>
</td>
<td align="right" valign="bottom">13925
<hr></hr>
</td>
<td align="right" valign="bottom">2785.00
<hr></hr>
</td>
<td align="right" valign="bottom">956
<hr></hr>
</td>
<td align="right" valign="bottom">6100
<hr></hr>
</td>
<td align="right" valign="bottom">44.89
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">10
<hr></hr>
</td>
<td align="right" valign="bottom">17800
<hr></hr>
</td>
<td align="right" valign="bottom">1780.00
<hr></hr>
</td>
<td align="right" valign="bottom">1090
<hr></hr>
</td>
<td align="right" valign="bottom">3430
<hr></hr>
</td>
<td align="right" valign="bottom">57.38
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">7422
<hr></hr>
</td>
<td align="right" valign="bottom">927.75
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">1323
<hr></hr>
</td>
<td align="right" valign="bottom">23.93
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">31019
<hr></hr>
</td>
<td align="right" valign="bottom">31019
<hr></hr>
</td>
<td align="right" valign="bottom">31019
<hr></hr>
</td>
<td align="right" valign="bottom">31019
<hr></hr>
</td>
<td align="right" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="center"> </td>
<td align="left">
<bold>Mira</bold>
</td>
<td align="right">10</td>
<td align="right">34892</td>
<td align="right">3489.2</td>
<td align="right">6174</td>
<td align="right">9191</td>
<td align="right">112.48</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Only contigs of length ≥ 800 bp were selected. When an assembler generated only one contig, the N50 value and the mean size are equal to the size of that contig. Genome coverage was calculated by dividing the total length by the genome length (EBI).</p>
</table-wrap-foot>
</table-wrap>
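<p>The summary columns of Table 3 can be derived from a list of contig lengths, as in the following Python sketch. It assumes the commonly used definition of N50 (the length of the contig at which the running sum of the length-sorted contigs first reaches half of the total length); the paper does not spell out the exact definition it used. The function name <italic>assembly_stats</italic> and its parameters are hypothetical.</p>

```python
def assembly_stats(contig_lengths, genome_length, min_length=800):
    """Summarise an assembly from its contig lengths (all lengths in bp)."""
    # Keep only contigs that reach the length threshold, longest first.
    kept = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    if not kept:
        return {"contigs": 0, "total": 0, "mean": 0.0, "n50": None,
                "largest": 0, "coverage": 0.0}
    total = sum(kept)
    # Walk down the sorted contigs until half of the total length is reached.
    running = 0
    n50 = None
    for n in kept:
        running += n
        if 2 * running >= total:
            n50 = n
            break
    return {
        "contigs": len(kept),
        "total": total,
        "mean": total / len(kept),
        "n50": n50,
        "largest": kept[0],
        # Genome coverage as defined in the table footnote.
        "coverage": 100.0 * total / genome_length,
    }


# ABySS on White-tailed Deer Coronavirus: two contigs, the second length
# (9408) is inferred from the table's total (30943) and largest contig (21535).
stats = assembly_stats([21535, 9408], genome_length=31020)
```

<p>With these inputs the sketch gives an N50 of 21535 bp, a mean size of 15471.5 bp and a genome coverage of about 99.75%, in agreement with the corresponding ABySS row of Table 3.</p>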
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption>
<p>Comparison of Arapan-S with the other assemblers on three genomes composed of eight segments</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="center">
<bold>Species</bold>
</th>
<th align="left">
<bold>Assembler</bold>
</th>
<th align="right">
<bold>Contigs ≥ 400 bp</bold>
</th>
<th align="right">
<bold>Total Length</bold>
</th>
<th align="right">
<bold>Mean size (bp)</bold>
</th>
<th align="right">
<bold>N50 (bp)</bold>
</th>
<th align="right">
<bold>Largest contig (bp)</bold>
</th>
<th align="right">
<bold>Genome coverage (%)</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">
<bold>Influenza A Virus</bold>
<break></break>
<bold>A/Memphis/1/71(H3N2)</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">12598
<hr></hr>
</td>
<td align="right" valign="bottom">1574.75
<hr></hr>
</td>
<td align="right" valign="bottom">1584
<hr></hr>
</td>
<td align="right" valign="bottom">2311
<hr></hr>
</td>
<td align="right" valign="bottom">94.03
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">14
<hr></hr>
</td>
<td align="right" valign="bottom">12897
<hr></hr>
</td>
<td align="right" valign="bottom">921.21
<hr></hr>
</td>
<td align="right" valign="bottom">1280
<hr></hr>
</td>
<td align="right" valign="bottom">1801
<hr></hr>
</td>
<td align="right" valign="bottom">96.27
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">555
<hr></hr>
</td>
<td align="right" valign="bottom">555.00
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">555
<hr></hr>
</td>
<td align="right" valign="bottom">4.14
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">14
<hr></hr>
</td>
<td align="right" valign="bottom">10774
<hr></hr>
</td>
<td align="right" valign="bottom">769.57
<hr></hr>
</td>
<td align="right" valign="bottom">789
<hr></hr>
</td>
<td align="right" valign="bottom">1781
<hr></hr>
</td>
<td align="right" valign="bottom">80.42
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">17
<hr></hr>
</td>
<td align="right" valign="bottom">12570
<hr></hr>
</td>
<td align="right" valign="bottom">739.41
<hr></hr>
</td>
<td align="right" valign="bottom">700
<hr></hr>
</td>
<td align="right" valign="bottom">1828
<hr></hr>
</td>
<td align="right" valign="bottom">93.83
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">9
<hr></hr>
</td>
<td align="right" valign="bottom">13156
<hr></hr>
</td>
<td align="right" valign="bottom">1461.78
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom">2242
<hr></hr>
</td>
<td align="right" valign="bottom">98.20
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Mira</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">14
<hr></hr>
</td>
<td align="right" valign="bottom">14399
<hr></hr>
</td>
<td align="right" valign="bottom">1028.50
<hr></hr>
</td>
<td align="right" valign="bottom">1396
<hr></hr>
</td>
<td align="right" valign="bottom">2080
<hr></hr>
</td>
<td align="right" valign="bottom">107.48
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Influenza A Virus</bold>
<break></break>
<bold>A/Swine/Colorado/1/77/(H3N2)</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">13120
<hr></hr>
</td>
<td align="right" valign="bottom">1640
<hr></hr>
</td>
<td align="right" valign="bottom">2151
<hr></hr>
</td>
<td align="right" valign="bottom">2310
<hr></hr>
</td>
<td align="right" valign="bottom">99.13
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">9
<hr></hr>
</td>
<td align="right" valign="bottom">12478
<hr></hr>
</td>
<td align="right" valign="bottom">1386.44
<hr></hr>
</td>
<td align="right" valign="bottom">1634
<hr></hr>
</td>
<td align="right" valign="bottom">2262
<hr></hr>
</td>
<td align="right" valign="bottom">93.79
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">6
<hr></hr>
</td>
<td align="right" valign="bottom">4287
<hr></hr>
</td>
<td align="right" valign="bottom">714.50
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">1409
<hr></hr>
</td>
<td align="right" valign="bottom">32.22
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">12
<hr></hr>
</td>
<td align="right" valign="bottom">9783
<hr></hr>
</td>
<td align="right" valign="bottom">815.25
<hr></hr>
</td>
<td align="right" valign="bottom">494
<hr></hr>
</td>
<td align="right" valign="bottom">1867
<hr></hr>
</td>
<td align="right" valign="bottom">73.53
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">16
<hr></hr>
</td>
<td align="right" valign="bottom">9400
<hr></hr>
</td>
<td align="right" valign="bottom">587.50
<hr></hr>
</td>
<td align="right" valign="bottom">468
<hr></hr>
</td>
<td align="right" valign="bottom">1200
<hr></hr>
</td>
<td align="right" valign="bottom">70.65
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">13325
<hr></hr>
</td>
<td align="right" valign="bottom">1665.62
<hr></hr>
</td>
<td align="right" valign="bottom">2199
<hr></hr>
</td>
<td align="right" valign="bottom">2309
<hr></hr>
</td>
<td align="right" valign="bottom">100.16
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Mira</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">10
<hr></hr>
</td>
<td align="right" valign="bottom">14678
<hr></hr>
</td>
<td align="right" valign="bottom">1467.80
<hr></hr>
</td>
<td align="right" valign="bottom">1780
<hr></hr>
</td>
<td align="right" valign="bottom">2371
<hr></hr>
</td>
<td align="right" valign="bottom">110.33
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Influenza A Virus</bold>
<break></break>
<bold>A/Weiss/43/(H1N1)</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">13300
<hr></hr>
</td>
<td align="right" valign="bottom">1662.50
<hr></hr>
</td>
<td align="right" valign="bottom">2194
<hr></hr>
</td>
<td align="right" valign="bottom">2313
<hr></hr>
</td>
<td align="right" valign="bottom">99.47
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>ABySS</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">11
<hr></hr>
</td>
<td align="right" valign="bottom">13108
<hr></hr>
</td>
<td align="right" valign="bottom">1191.64
<hr></hr>
</td>
<td align="right" valign="bottom">1716
<hr></hr>
</td>
<td align="right" valign="bottom">2274
<hr></hr>
</td>
<td align="right" valign="bottom">98.03
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SSAKE</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">3
<hr></hr>
</td>
<td align="right" valign="bottom">1616
<hr></hr>
</td>
<td align="right" valign="bottom">538.67
<hr></hr>
</td>
<td align="right" valign="bottom">-
<hr></hr>
</td>
<td align="right" valign="bottom">572
<hr></hr>
</td>
<td align="right" valign="bottom">12.09
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Velvet</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">9
<hr></hr>
</td>
<td align="right" valign="bottom">9764
<hr></hr>
</td>
<td align="right" valign="bottom">1084.89
<hr></hr>
</td>
<td align="right" valign="bottom">1006
<hr></hr>
</td>
<td align="right" valign="bottom">1696
<hr></hr>
</td>
<td align="right" valign="bottom">73.02
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">16
<hr></hr>
</td>
<td align="right" valign="bottom">11755
<hr></hr>
</td>
<td align="right" valign="bottom">734.69
<hr></hr>
</td>
<td align="right" valign="bottom">573
<hr></hr>
</td>
<td align="right" valign="bottom">1916
<hr></hr>
</td>
<td align="right" valign="bottom">87.91
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">8
<hr></hr>
</td>
<td align="right" valign="bottom">13369
<hr></hr>
</td>
<td align="right" valign="bottom">1671.12
<hr></hr>
</td>
<td align="right" valign="bottom">2194
<hr></hr>
</td>
<td align="right" valign="bottom">2313
<hr></hr>
</td>
<td align="right" valign="bottom">99.98
<hr></hr>
</td>
</tr>
<tr>
<td align="center"> </td>
<td align="left">
<bold>Mira</bold>
</td>
<td align="right">11</td>
<td align="right">15139</td>
<td align="right">1376.27</td>
<td align="right">1583</td>
<td align="right">2359</td>
<td align="right">113.22</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Only contigs of length ≥ 400 bp were selected. The genome of each species consists of eight segments.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption>
<p>
<bold>Comparison of Arapan-S with the QSRA, Minimus and Mira assemblers on the</bold>
<bold>
<italic>Antelope coronavirus US/OH1/2003</italic>
</bold>
<bold>genome</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="center">
<bold>Species</bold>
</th>
<th align="left">
<bold>Assembler</bold>
</th>
<th align="right">
<bold>Contigs ≥ 400 bp</bold>
</th>
<th align="right">
<bold>Total Length</bold>
</th>
<th align="right">
<bold>Mean size (bp)</bold>
</th>
<th align="right">
<bold>N50 (bp)</bold>
</th>
<th align="right">
<bold>Largest contig (bp)</bold>
</th>
<th align="right">
<bold>Genome coverage (%)</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">
<bold>Antelope coronavirus US/OH1/2003</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Arapan-S</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">26280
<hr></hr>
</td>
<td align="right" valign="bottom">26280
<hr></hr>
</td>
<td align="right" valign="bottom">26280
<hr></hr>
</td>
<td align="right" valign="bottom">26280
<hr></hr>
</td>
<td align="right" valign="bottom">98.89
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>QSRA</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">0
<hr></hr>
</td>
<td align="right" valign="bottom">0
<hr></hr>
</td>
<td align="right" valign="bottom">0
<hr></hr>
</td>
<td align="right" valign="bottom">0
<hr></hr>
</td>
<td align="right" valign="bottom">0
<hr></hr>
</td>
<td align="right" valign="bottom">0
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Minimus</bold>
<hr></hr>
</td>
<td align="right" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom">30994
<hr></hr>
</td>
<td align="right" valign="bottom">30994
<hr></hr>
</td>
<td align="right" valign="bottom">30994
<hr></hr>
</td>
<td align="right" valign="bottom">30994
<hr></hr>
</td>
<td align="right" valign="bottom">116.63
<hr></hr>
</td>
</tr>
<tr>
<td align="center"> </td>
<td align="left">
<bold>Mira</bold>
</td>
<td align="right">5</td>
<td align="right">33793</td>
<td align="right">6758.6</td>
<td align="right">31042</td>
<td align="right">31042</td>
<td align="right">116.81</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Only contigs of length ≥ 400 bp were selected.</p>
</table-wrap-foot>
</table-wrap>
<p>The total length of each assembly was close to the genome length obtained from the EBI database, and all identities exceeded 99.2% (Table
<xref ref-type="table" rid="T2">2</xref>
). Moreover, to show the robustness of Arapan-S, we compared its results with those of other well-known assemblers: ABySS 1.2.7 [
<xref ref-type="bibr" rid="B13">13</xref>
], SSAKE 3.7 [
<xref ref-type="bibr" rid="B7">7</xref>
], Velvet 1.1.3 [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
] and QSRA [
<xref ref-type="bibr" rid="B15">15</xref>
]. The Overlap-Layout-Consensus-based assemblers that were included for comparison were: Minimus [
<xref ref-type="bibr" rid="B16">16</xref>
] and Mira [
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B17">17</xref>
]. We used the latest release of each assembler, except for SSAKE, for which we chose SSAKE 3.7 instead of SSAKE 3.8 because of installation problems. All assemblers were run with default parameters.</p>
</sec>
<sec>
<title>Comparison</title>
<p>Because of its architecture (de Bruijn graph), Arapan-S is classified as a de novo assembler. However, since our datasets are Sanger reads, we compared our assembler with both de novo assemblers and Overlap-Layout-Consensus assemblers. Note that the current version of the QSRA assembler cannot handle reads of different lengths. To work around this limitation, we used our tool,
<italic>kmerBuilder,</italic>
which is also in the Arapan package, to generate reads of the same length (200 bp for QSRA) from shotgun data.</p>
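<p>The paper does not describe how <italic>kmerBuilder</italic> works internally. The following Python sketch only illustrates one plausible way to derive equal-length reads from variable-length shotgun reads, namely by sliding a fixed-size window along each read. The function name <italic>fixed_length_reads</italic> and the <italic>step</italic> parameter are assumptions made for illustration and are not taken from the Arapan package.</p>

```python
def fixed_length_reads(reads, length=200, step=1):
    """Yield every window of exactly `length` bases, moving `step` bases at a time.

    Reads shorter than `length` produce no output.
    """
    if not (length > 0 and step > 0):
        raise ValueError("length and step must be positive")
    for read in reads:
        # The last valid window starts at len(read) - length.
        for start in range(0, len(read) - length + 1, step):
            yield read[start:start + length]


# Toy example with a 4 bp window instead of the 200 bp used for QSRA.
fragments = list(fixed_length_reads(["ACGTAC", "ACG"], length=4))
```

<p>In this toy example the first read yields three overlapping 4 bp fragments, while the second read is skipped because it is shorter than the window.</p>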
<sec>
<title>De novo assembler competitors</title>
<p>Among the de novo assemblers, the strongest competitor to Arapan-S was ABySS (Table
<xref ref-type="table" rid="T3">3</xref>
). Like Arapan-S, ABySS produced a single supercontig for the
<italic>Bovine Respiratory Coronavirus AH187</italic>
genome and the
<italic>Waterbuck Coronavirus US/OH-WD358-TC/1994</italic>
genome. However, in contrast to ABySS, Arapan-S produced a single supercontig in all four cases, with a genome coverage of at least 99.5% each time, and therefore the largest contigs among the de novo assemblers. The other de novo assemblers generated more contigs in most cases. SSAKE and QSRA, in particular, did not work well with small genomes: each covered less than 75% of the genome in three of the four cases.</p>
<p>The
<italic>Influenza A Virus</italic>
genome consists of eight segments (
<ext-link ext-link-type="uri" xlink:href="http://bioafrica.mrc.ac.za/rnavirusdb/virus.php?id=335341">http://bioafrica.mrc.ac.za/rnavirusdb/virus.php?id=335341</ext-link>
). Table
<xref ref-type="table" rid="T4">4</xref>
shows that Arapan-S recovered the eight segments as eight contigs for each of the Influenza A Virus genomes tested. In our experiments, SSAKE failed on these small viral genomes; its N50 values were not computed because its contigs did not cover half of the genome. ABySS was again the second-best assembler after Arapan-S. However, our assembler recovered the eight segments of every genome, so its N50 values and largest contigs were always the highest among the assemblers compared.</p>
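<p>For reference, the N50 statistic used above can be computed as in the following minimal Python sketch. It is an illustration rather than the code used in this study; the function name n50 and the optional genome_length argument are assumptions. When a genome length is supplied, the target is half of that length (a variant often called NG50), so the function returns None when the contigs do not cover half of the genome.</p>

```python
def n50(contig_lengths, genome_length=None):
    """Return the N50 of a set of contigs, or None if it is undefined.

    If genome_length is given (assumption), half of that length is the target,
    so an assembly covering under half of the genome has no N50 value.
    """
    total = genome_length if genome_length is not None else sum(contig_lengths)
    running = 0
    # Walk through the contigs from longest to shortest until half of the
    # target length has been accumulated.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None
```

<p>With this convention, a single contig spanning the whole genome yields an N50 equal to the contig length, which is why an assembler that returns one contig per segment obtains the highest values.</p>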
</sec>
<sec>
<title>Overlap-layout-consensus competitors</title>
<p>Among the Overlap-Layout-Consensus-based assemblers, Minimus was comparable to Arapan-S. Minimus failed in one case,
<italic>Influenza A Virus A/Memphis/1/71(H3N2)</italic>
, in which it produced nine contigs instead of eight (Table
<xref ref-type="table" rid="T4">4</xref>
). Our assembler produced results close to those of Minimus for the
<italic>Antelope coronavirus US/OH1/2003</italic>
genome (Table
<xref ref-type="table" rid="T5">5</xref>
). The two assemblers achieved almost the same results for the
<italic>Waterbuck Coronavirus US/OH-WD358-TC/1994</italic>
and the
<italic>White-tailed Deer Coronavirus US/OH-WD470/1994</italic>
genomes. On the other hand, Mira did not work well with small genomes, as shown in Tables 
<xref ref-type="table" rid="T3">3</xref>
,
<xref ref-type="table" rid="T4">4</xref>
and
<xref ref-type="table" rid="T5">5</xref>
.</p>
</sec>
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>We have relied on only one objective function, “<italic>the frequency function</italic>”, for the sequence assembly algorithm. One may also consider another function, “the <italic>k</italic>-mer length function”,
<inline-formula>
<mml:math id="M1" name="1756-0500-5-243-i1" overflow="scroll">
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mtext>,</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
such that
<inline-formula>
<mml:math id="M2" name="1756-0500-5-243-i2" overflow="scroll">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula>
is the set of <italic>k</italic>-mer lengths. This function is based on the assumption that nodes with longer
<italic>k</italic>
-mers are more likely to have been generated from trustworthy consecutive nodes, that is, from a chain with few or no sequencing errors. However, we have considered only the frequency function in the analysis presented here.</p>
<p>In the case of non-uniform coverage of some areas in the genome [
<xref ref-type="bibr" rid="B18">18</xref>
], the frequency function may be less accurate. On the other hand, we believe that the
<italic>k</italic>
-mer length function can be a good choice when coverage is non-uniform. An algorithm that combines the two objective functions and switches between them may lead to more accurate results; creating such an algorithm is an important issue for future research.</p>
<p>Note also that the assembly algorithm does not search for an optimal solution. It starts at the node whose associated
<italic>k</italic>
-mer is the longest, then moves forward and backward in the graph, at each step selecting the neighboring node with the highest local score (greatest frequency value), in order to construct a contiguous path in a given connected component.</p>
<p>We have noticed that most genome assemblers built for medium or large genomes could not successfully handle tiny and small genomes. Arapan-S, ABySS and Minimus were able to deal with such cases. In future work, a comparison of all genome assemblers would be worthwhile to determine the range of genome sizes over which each group of assemblers is effective.</p>
<p>Because our aim was to create a genome assembler for tiny genomes only, dealing with repeats was not an essential task: repeats do not regularly appear in very small genomes and, in our experience, tandem repeats do not generally mislead the assembly process. In the future, however, we aim to build another version of the Arapan-S assembler that can handle longer genomes.</p>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>Our experiments show that general-purpose assemblers are not always as effective as Arapan-S on tiny genomes. We used only long reads in our experiments, because raw data for small genomes are readily available in the NCBI Trace Archive. However, our assembler can work with data from other sequencing technologies, such as Illumina/Solexa, SOLiD and 454. The raw data are converted into a set of
<italic>k</italic>
-mers by kmerBuilder (
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/dnascissor/files/kmerBuilder/">http://sourceforge.net/projects/dnascissor/files/kmerBuilder/</ext-link>
). The user can run Arapan-S assembler by providing it with the
<italic>k</italic>
-mer file, which is another advantage of our assembler over other assemblers. Arapan-S is fast and uses little memory; however, because we are dealing with small genomes, the running time and memory usage of all assemblers were negligible. Our assembler is not designed for medium or large genomes.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<p>The assembly process consists of four major phases. In the first phase, the de Bruijn graph is straightforwardly constructed. The second phase (called the
<italic>cleaning process</italic>
) is a very important step in which the graph is simplified as much as possible by collapsing paths, removing tips and solving bubbles, as well as handling a few other structures in the graph. In the third phase, the connected components of the graph are detected, and in the fourth phase the assembly algorithm is run on them.</p>
<p>Our algorithm differs from previous works in the following ways:</p>
<p>1. The cleaning process simplifies the graph by a few iterations without incorporating time-consuming algorithms, such as the Dijkstra-like breadth-first search in Velvet [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
] and the Dijkstra algorithm in SOAPdenovo [
<xref ref-type="bibr" rid="B14">14</xref>
].</p>
<p>2. An algorithm was created to solve only simple bubbles (Figure
<xref ref-type="fig" rid="F3">3</xref>
), but by involving the other procedures (path collapsing, tip removal, etc.) all complex bubbles are solved after a few iterations of the cleaning algorithm.</p>
<p>3. The assembly algorithm uses the frequency values and lengths of
<italic>k</italic>
-mers in order to construct contigs as will be described below.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Flowchart.</bold>
The different phases of the cleaning algorithm. </p>
</caption>
<graphic xlink:href="1756-0500-5-243-1"></graphic>
</fig>
<p>Most de novo assemblers focus on assembling large genomes, which involves implementing time-consuming and very complicated algorithms. As a result, the construction of contigs becomes stricter, though such strictness is not needed for small genomes, as shown in the Results section.</p>
<sec>
<title>Input data and graph construction</title>
<p>The entire dataset of
<italic>k</italic>
-mers is recorded in hash tables in order to speed up further operations. The reverse complements are also recorded, without linking them to their original
<italic>k</italic>
-mers. A linear-time algorithm then suffices to construct the de Bruijn graph. Since the alphabet is composed of four nucleotide letters, each
<italic>k</italic>
-mer will be connected to four
<italic>k</italic>
-mers at most in each direction. All
<italic>k</italic>
-mers that include unknown ‘N’ nucleotides are discarded. The pseudo-code of the algorithm is shown below:</p>
<p>1.
<bold>deBruijnGraphBuilder(</bold>
HashTable
<italic>kmerList</italic>
<bold>, integer</bold>
<italic>K</italic>
<bold>)</bold>
</p>
<p>2.
<bold>Integer</bold>
<italic>N</italic>
:=|
<italic>kmerList</italic>
|; //the size of
<italic>kmerList</italic>
</p>
<p>3. <bold>String</bold> <italic>prefix</italic>, <italic>suffix</italic>;</p>
<p>4. <bold>for</bold> <italic>i</italic> :=1 <bold>to</bold> <italic>N</italic> <bold>do</bold></p>
<p>5. <bold>begin</bold></p>
<p>6. <italic>prefix</italic> := <italic>kmerList</italic>[<italic>i</italic>][1..<italic>K</italic>−1]; <italic>suffix</italic> := <italic>kmerList</italic>[<italic>i</italic>][2..<italic>K</italic>];</p>
<p>7. //forward connection</p>
<p>8. <bold>if</bold> <italic>suffix</italic>+“A” ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>i</italic>, <italic>kmerList</italic>.IndexOf(<italic>suffix</italic>+“A”));</p>
<p>9. <bold>if</bold> <italic>suffix</italic>+“T” ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>i</italic>, <italic>kmerList</italic>.IndexOf(<italic>suffix</italic>+“T”));</p>
<p>10. <bold>if</bold> <italic>suffix</italic>+“C” ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>i</italic>, <italic>kmerList</italic>.IndexOf(<italic>suffix</italic>+“C”));</p>
<p>11. <bold>if</bold> <italic>suffix</italic>+“G” ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>i</italic>, <italic>kmerList</italic>.IndexOf(<italic>suffix</italic>+“G”));</p>
<p>12. //backward connection</p>
<p>13. <bold>if</bold> “A”+<italic>prefix</italic> ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>kmerList</italic>.IndexOf(“A”+<italic>prefix</italic>), <italic>i</italic>);</p>
<p>14. <bold>if</bold> “T”+<italic>prefix</italic> ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>kmerList</italic>.IndexOf(“T”+<italic>prefix</italic>), <italic>i</italic>);</p>
<p>15. <bold>if</bold> “C”+<italic>prefix</italic> ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>kmerList</italic>.IndexOf(“C”+<italic>prefix</italic>), <italic>i</italic>);</p>
<p>16. <bold>if</bold> “G”+<italic>prefix</italic> ∈ <italic>kmerList</italic> <bold>then</bold> createArc(<italic>kmerList</italic>.IndexOf(“G”+<italic>prefix</italic>), <italic>i</italic>);</p>
<p>17. <bold>end</bold></p>
<p>Let <italic>K</italic> be the length of the <italic>k</italic>-mers. The variable <italic>prefix</italic> contains the first <italic>K</italic> − 1 characters of a given <italic>K</italic>-mer, and <italic>suffix</italic> its last <italic>K</italic> − 1 characters. The algorithm computes the out-neighbours by extending the suffix in the forward orientation, and the in-neighbours by extending the prefix in the opposite direction.</p>
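<p>As an illustration, the construction can be sketched in Python as follows. This is a minimal sketch, not the Arapan-S implementation: the function name build_de_bruijn_graph and the dictionary-of-sets representation are assumptions, and a Python set plays the role of the hash table. Each k-mer is linked to the k-mers obtained by dropping its first base and appending one of the four nucleotides; in-neighbours can be derived by inverting the resulting arcs.</p>

```python
def build_de_bruijn_graph(kmers):
    """Return {kmer: set of successor kmers} for a collection of k-mers.

    Nodes are k-mers; an arc x -> y exists when the last K-1 bases of x
    equal the first K-1 bases of y.
    """
    # k-mers containing unknown 'N' nucleotides are discarded.
    nodes = {kmer for kmer in kmers if "N" not in kmer}
    graph = {kmer: set() for kmer in nodes}
    for kmer in nodes:
        suffix = kmer[1:]
        # At most four successors exist, one per nucleotide.
        for base in "ATCG":
            successor = suffix + base
            if successor in nodes:  # constant-time hash lookup on average
                graph[kmer].add(successor)
    return graph
```

<p>Because each k-mer triggers at most four hash lookups, the running time of this sketch grows linearly with the number of k-mers.</p>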
</sec>
<sec>
<title>Cleaning process (simplifying the graph and solving errors)</title>
<p>Raw sequencing data always contain errors, and since the de Bruijn graph is based on the exact matching of
<italic>k</italic>
-mers, error correction (or removal) is essential when such graphs are used to represent and analyze sequencing data. Coverage plays a vital role in guiding the cleaning and assembly algorithms towards a more accurate result. After the graph is constructed, erroneous
<italic>k</italic>
-mers appear in it in different forms. The most common forms are the so-called tips, bubbles and chimeric connections; while analyzing the graph, however, we found other forms as well. We have implemented an iterative algorithm that simplifies the graph as far as possible. The pseudo-code of the algorithm is shown below and its flowchart is given in Figure
<xref ref-type="fig" rid="F1">1</xref>
.</p>
<p>1.
<bold>
<italic>cleaningAlgorithm</italic>
</bold>
<italic>()</italic>
</p>
<p>2.
<bold>
<italic>Boolean</italic>
</bold>
<italic>col, bub, intip, outip, less, great;</italic>
</p>
<p>3.
<bold>
<italic>Begin</italic>
</bold>
</p>
<p>4.
<bold>
<italic>do</italic>
</bold>
</p>
<p>5.
<italic>col := collapsePaths();</italic>
</p>
<p>6.
<italic>bub := solveBubbles();</italic>
</p>
<p>7.
<bold>
<italic>if</italic>
</bold>
<italic>col==false</italic>
<bold>
<italic>and</italic>
</bold>
<italic>bub==false</italic>
<bold>
<italic>then</italic>
</bold>
</p>
<p>8.
<bold>
<italic>begin</italic>
</bold>
</p>
<p>9.
<italic>intip := removeInTips();</italic>
</p>
<p>10.
<italic>outip := removeOutTips();</italic>
</p>
<p>11.
<italic>less := removeLessMarkTips();</italic>
</p>
<p>12.
<italic>great := removeGreatMarkTips();</italic>
</p>
<p>13.
<bold>
<italic>if</italic>
</bold>
<italic>intip==false</italic>
<bold>
<italic>and</italic>
</bold>
<italic>outip==false</italic>
<bold>
<italic>and</italic>
</bold>
</p>
<p>14.
<italic>less ==false</italic>
<bold>
<italic>and</italic>
</bold>
<italic>great==false</italic>
<bold>
<italic>then</italic>
</bold>
<italic>stop;</italic>
</p>
<p>15.
<bold>
<italic>end</italic>
</bold>
</p>
<p>16.
<bold>
<italic>while</italic>
</bold>
<italic>(true)</italic>
</p>
<p>17.
<italic>removeSingletons();</italic>
</p>
<p>18.
<bold>
<italic>End</italic>
</bold>
</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Switch node.</bold>
All contiguous nodes are merged into one node. This operation is called “<italic>path collapsing</italic>”.
</p>
</caption>
<graphic xlink:href="1756-0500-5-243-2"></graphic>
</fig>
<p>The
<italic>collapsePaths</italic>
() procedure returns false if it does not collapse any path; otherwise, it returns true. The other procedures follow the same convention as
<italic>collapsePaths</italic>
(). Each procedure invoked by the cleaning algorithm is explained below.</p>
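<p>The control flow of the cleaning algorithm can be sketched in Python as follows. The individual procedures are passed in as hypothetical callables that return True when they modify the graph; their names and signatures are assumptions made for this illustration only.</p>

```python
def clean_graph(graph, collapse_paths, solve_bubbles, tip_removers, remove_singletons):
    """Apply the cleaning procedures until none of them changes the graph."""
    while True:
        collapsed = collapse_paths(graph)
        bubbled = solve_bubbles(graph)
        if collapsed or bubbled:
            # Keep collapsing paths and solving bubbles while they make progress.
            continue
        # A list (not a generator inside any()) so that every tip remover runs.
        removed = [remove(graph) for remove in tip_removers]
        if not any(removed):
            break  # fixed point: no procedure changed the graph
    remove_singletons(graph)
    return graph
```

<p>Tip removal is attempted only when path collapsing and bubble solving have stalled, which mirrors the nested test in the pseudo-code.</p>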
<sec>
<title>Path collapsing</title>
<p>To simplify and shrink the graph before applying any cleaning procedure, a path collapsing algorithm should be run immediately after constructing the graph.</p>
<p>A path is a chain of nodes. Two nodes
<italic>X</italic>
and
<italic>Y</italic>
are merged if
<italic>X</italic>
has exactly one outgoing arc, which leads to
<italic>Y</italic>
, and <italic>Y</italic> has exactly one incoming arc. Their corresponding
<italic>k</italic>
-mers must be concatenated accordingly. Most of the resulting nodes (we call them
<italic>switch nodes</italic>
) are seen in Figure
<xref ref-type="fig" rid="F2">2</xref>
.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Bubbles.</bold>
This figure illustrates three simple bubbles and two complex bubbles. Simple bubbles are A-C, B-D and E-F. The first complex bubble starts at A and ends at D while the second one starts at D and ends at F. (X-Y is the subgraph that starts at X and ends at Y). Complex bubbles are solved by executing the simple bubble-solving algorithm and path-collapsing algorithm.</p>
</caption>
<graphic xlink:href="1756-0500-5-243-3"></graphic>
</fig>
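<p>A minimal Python sketch of path collapsing is given below. It assumes, for illustration, that the graph is stored as a dictionary succ mapping each node to the set of its successors and that seq maps each node to its sequence; these names, and the in-place update strategy, are not taken from the Arapan-S source code. Consecutive nodes are assumed to overlap by K − 1 bases, so only the non-overlapping part of Y is appended to X.</p>

```python
def collapse_paths(succ, seq, k):
    """Merge every X -> Y where X has one outgoing arc and Y one incoming arc.

    Returns True if at least one pair of nodes was merged.
    """
    pred = {node: set() for node in succ}
    for node, targets in succ.items():
        for target in targets:
            pred[target].add(node)
    changed = False
    for x in list(succ):
        if x not in succ:
            continue  # x was already absorbed into another node
        while len(succ[x]) == 1:
            (y,) = succ[x]
            if y == x or len(pred[y]) != 1:
                break
            seq[x] += seq[y][k - 1:]   # skip the K-1 overlapping bases
            succ[x] = succ.pop(y)      # x inherits the outgoing arcs of y
            for z in succ[x]:
                pred[z].discard(y)
                pred[z].add(x)
            del seq[y], pred[y]
            changed = True
    return changed
```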
</sec>
<sec>
<title>Bubble solving</title>
<p>In genome assembly, a bubble appears where two sequences initially align, then diverge in the middle, and align again at the end. Bubbles are caused by repeats or heterozygotes of diploid chromosomes [
<xref ref-type="bibr" rid="B14">14</xref>
], or created by errors or biological variants, such as SNPs, diploids or cloning artefacts prior to sequencing.</p>
<p>A path is a chain of nodes in a graph. We call a path a simple path if each internal node (i.e., each node between the start node and the end node of the path) has one outgoing edge and one incoming edge. A bubble is a subgraph that consists of multiple simple paths all of which share the same start node and the same end node. In the original graph, the start node must not have any outgoing edges other than those in the bubble, and the end node must not have any incoming edges other than those in the bubble.</p>
<p>In Velvet [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
], detection of bubbles was done by an algorithm based on a Dijkstra-like breadth-first search called “The Tour Bus Algorithm”. Similarly, Dijkstra’s algorithm is also used to detect bubbles in SOAPdenovo [
<xref ref-type="bibr" rid="B14">14</xref>
], in which the detected bubbles are merged into a single path if the sequences of the parallel paths are very similar, that is, if they differ by fewer than four base pairs and have more than 90% identity.</p>
<p>In Arapan-S, all bubbles are resolved by combining the cleaning procedures, without incorporating a time-consuming algorithm. After all paths have been collapsed, bubbles appear in the graph as shown in Figure
<xref ref-type="fig" rid="F3">3</xref>
. Within a bubble, the node with the highest coverage is kept (the algorithm can also be parameterized to keep the node with the longest
<italic>k</italic>
-mer instead).</p>
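<p>The coverage-based rule for simple bubbles can be illustrated with the following Python sketch. It is written under simplifying assumptions: each branch of a bubble is a single (already collapsed) node, the graph is a dictionary succ of successor sets, and coverage is a dictionary of per-node coverage values. All of these names are hypothetical.</p>

```python
def solve_simple_bubbles(succ, coverage):
    """Keep only the highest-coverage branch of every simple bubble.

    Returns True if at least one bubble was resolved.
    """
    pred = {node: set() for node in succ}
    for node, targets in succ.items():
        for target in targets:
            pred[target].add(node)
    changed = False
    for start in list(succ):
        if start not in succ:
            continue  # removed as a losing branch of an earlier bubble
        branches = set(succ[start])
        # Every branch must hang between the start node and one other node.
        if len(branches) >= 2 and all(
            pred[b] == {start} and len(succ[b]) == 1 for b in branches
        ):
            ends = {next(iter(succ[b])) for b in branches}
            if len(ends) != 1:
                continue
            (end,) = ends
            # The end node may not receive arcs from outside the bubble.
            if end == start or pred[end] != branches:
                continue
            best = max(branches, key=lambda b: coverage[b])
            for b in branches - {best}:
                del succ[b], pred[b]
                pred[end].discard(b)
            succ[start] = {best}
            changed = True
    return changed
```

<p>After a bubble is resolved, the start node, the surviving branch and the end node form a chain that a subsequent path-collapsing pass can merge.</p>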
</sec>
<sec>
<title>Tips removal</title>
<p>Tips generally result from errors at the end of reads. In the graph, a tip is a node connected only on one end (Figure
<xref ref-type="fig" rid="F4">4</xref>
. In Velvet, a tip is removed if it is shorter than 2<italic>k</italic>, where <italic>k</italic> is the <italic>k</italic>-mer length. After tips are removed, new collapsible paths appear in the graph. Almost all the remaining nodes then have a degree ≥ 2; we call such nodes
<italic>switch nodes</italic>
. The result of the cleaning process will be similar to what is shown in Figure
<xref ref-type="fig" rid="F5">5</xref>
.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Tips.</bold>
This figure shows some tips (i.e. C, D, F and I).</p>
</caption>
<graphic xlink:href="1756-0500-5-243-4"></graphic>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Graph visualization.</bold>
Part of two connected components of the White-tailed Deer Coronavirus genome graph after running the cleaning algorithm. Nodes represent
<italic>k</italic>
-mers and arrows represent the overlaps between
<italic>k</italic>
-mers. The picture was produced with the aiSee graph visualization software (
<ext-link ext-link-type="uri" xlink:href="www.aisee.com.">www.aisee.com.</ext-link>
)</p>
</caption>
<graphic xlink:href="1756-0500-5-243-5"></graphic>
</fig>
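<p>As a rough illustration of tip removal, the Python sketch below deletes dead-end nodes whose sequence is shorter than 2k. The length threshold is the Velvet rule quoted above and is used here only as an assumption; the criteria applied by the individual Arapan-S tip procedures may differ. The names remove_tips, succ and seq are hypothetical.</p>

```python
def remove_tips(succ, seq, k):
    """Delete short dead-end nodes; return True if any node was removed."""
    pred = {node: set() for node in succ}
    for node, targets in succ.items():
        for target in targets:
            pred[target].add(node)
    tips = [
        node for node in succ
        # Attached on exactly one side, and shorter than 2k.
        if bool(pred[node]) != bool(succ[node]) and 2 * k > len(seq[node])
    ]
    for node in tips:
        for parent in pred[node]:
            if parent in succ:  # the parent may itself be a removed tip
                succ[parent].discard(node)
        del succ[node], seq[node]
    return bool(tips)
```

<p>Long dead-end nodes are kept, since the true ends of a linear genome are also connected on one side only.</p>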
</sec>
</sec>
<sec>
<title>Connected components detection</title>
<p>Once the graph is reduced and contains only switch nodes, we determine its connected components. There are two reasons why the graph consists of several connected components. The first is the nature of the
<italic>k</italic>
-mers and their reverse complements. Since each
<italic>k</italic>
-mer was recorded along with its reverse complement, we obtain a graph composed of two subgraphs, one being the reverse complement of the other. The second is the sparseness of the graph, especially when the initial
<italic>k</italic>
-mer length is relatively large. Our assembly algorithm runs on each connected component separately, so detecting these components also allows the assembly to be run in parallel. Breadth-first or depth-first search can be applied to find the connected components in linear time. The search begins at an arbitrary node
<italic>v</italic>
from which the entire connected component including
<italic>v</italic>
is detected. A loop over all nodes of the graph is needed to find all the connected components; it runs until no unvisited node remains. The pseudo-code of the modified algorithm is as follows:</p>
<p>1.
<bold>
<italic>connectedComponent(VertexSet</italic>
</bold>
<italic>V</italic>
<bold>
<italic>, EdgeSet</italic>
</bold>
<italic>E</italic>
<bold>
<italic>, Node</italic>
</bold>
<italic>a</italic>
<bold>
<italic>)</italic>
</bold>
</p>
<p>2.
<bold>
<italic>Set</italic>
</bold>
<italic>X;</italic>
</p>
<p>3.
<bold>
<italic>Boolean</italic>
</bold>
<italic>visited[|V|];</italic>
</p>
<p>4.
<italic>//Step 1</italic>
</p>
<p>5.
<inline-formula>
<mml:math id="M3" name="1756-0500-5-243-i3" overflow="scroll">
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mi>a</mml:mi>
</mml:mfenced>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>6.
<inline-formula>
<mml:math id="M4" name="1756-0500-5-243-i4" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="italic">visited</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mi>x</mml:mi>
</mml:mfenced>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">false</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo>∀</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>V</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>7.
<italic>//Step 2</italic>
</p>
<p>8.
<bold>
<italic>while</italic>
</bold>
<inline-formula>
<mml:math id="M5" name="1756-0500-5-243-i5" overflow="scroll">
<mml:mrow>
<mml:mo>∃</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="bold-italic">X</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi mathvariant="bold-italic">visited</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mi mathvariant="bold-italic">x</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold-italic">false</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>
<italic>do</italic>
</bold>
</p>
<p>9.
<bold>
<italic>begin</italic>
</bold>
</p>
<p>10.
<inline-formula>
<mml:math id="M6" name="1756-0500-5-243-i6" overflow="scroll">
<mml:mrow>
<mml:mtext>select </mml:mtext>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="bold-italic">X</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi mathvariant="bold-italic">visited</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mi mathvariant="bold-italic">x</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold-italic">false</mml:mi>
<mml:mtext>; </mml:mtext>
<mml:mi mathvariant="bold-italic">visited</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mi mathvariant="bold-italic">x</mml:mi>
</mml:mfenced>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold-italic">true</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>11.
<inline-formula>
<mml:math id="M7" name="1756-0500-5-243-i7" overflow="scroll">
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mi>y</mml:mi>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mo>∀</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>∈</mml:mo>
<mml:mi>E</mml:mi>
<mml:mtext> or </mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>∈</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>∉</mml:mo>
<mml:mi>X</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>12.
<bold>
<italic>end</italic>
</bold>
</p>
<p>13.
<bold>
<italic>return</italic>
</bold>
<italic>X;</italic>
</p>
<p>The idea of this algorithm is to traverse the graph from an arbitrary node
<italic>a</italic>
, mark it as a visited node and record its neighbors in the set
<italic>X</italic>
. The same is done for each recorded node until no unvisited node remains in the set
<italic>X</italic>
. The algorithm returns the connected component that contains the node
<italic>a</italic>
. To find all connected components we apply the following algorithm:</p>
<p>1.
<bold>
<italic>allComponents(VertexSet</italic>
</bold>
<italic>V</italic>
<bold>
<italic>, EdgeSet</italic>
</bold>
<italic>E</italic>
<bold>
<italic>)</italic>
</bold>
</p>
<p>2.
<bold>
<italic>SetList</italic>
</bold>
<italic>C;</italic>
</p>
<p>3.
<bold>
<italic>Set</italic>
</bold>
<italic>X’ ;</italic>
</p>
<p>4.
<bold>
<italic>Integer</italic>
</bold>
<italic>i;</italic>
</p>
<p>5.
<italic>//Step1</italic>
</p>
<p>6.
<inline-formula>
<mml:math id="M8" name="1756-0500-5-243-i8" overflow="scroll">
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>'</mml:mo>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>V</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>7.
<inline-formula>
<mml:math id="M9" name="1756-0500-5-243-i9" overflow="scroll">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>8.
<italic>//Step 2</italic>
</p>
<p>9.
<bold>
<italic>while</italic>
</bold>
<inline-formula>
<mml:math id="M10" name="1756-0500-5-243-i10" overflow="scroll">
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>'</mml:mo>
<mml:mo>≠</mml:mo>
<mml:mo>Ø</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>
<italic>do</italic>
</bold>
</p>
<p>10.
<bold>
<italic>begin</italic>
</bold>
</p>
<p>11.
<italic>select an arbitrary x∈X’;</italic>
</p>
<p>12.
<inline-formula>
<mml:math id="M11" name="1756-0500-5-243-i11" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">connectedComponent</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>V</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>13.
<inline-formula>
<mml:math id="M12" name="1756-0500-5-243-i12" overflow="scroll">
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>'</mml:mo>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>'</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>14.
<inline-formula>
<mml:math id="M13" name="1756-0500-5-243-i13" overflow="scroll">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>15.
<bold>
<italic>end</italic>
</bold>
</p>
<p>16.
<bold>
<italic>return</italic>
</bold>
<italic>C;</italic>
</p>
<p>We only need to select an arbitrary node
<italic>x</italic>
and determine, using the
<italic>connectedComponent</italic>
() procedure, the connected component
<italic>C</italic>
<sub>
<italic>i</italic>
</sub>
that contains
<italic>x</italic>
. The nodes of this component are then removed from
<italic>X’</italic>
(Line 13). The same operation is repeated until no nodes remain in <italic>X’</italic>.</p>
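<p>Both procedures can be combined into a single breadth-first search, as in the following Python sketch. It is an illustration under assumed names (connected_components, succ) rather than the Arapan-S code; arc directions are ignored, and every node is assumed to appear as a key of succ.</p>

```python
from collections import deque


def connected_components(succ):
    """Return the weakly connected components of a directed graph as sets."""
    # Build an undirected neighbourhood: (x, y) and (y, x) both connect x and y.
    neighbors = {node: set(targets) for node, targets in succ.items()}
    for node, targets in succ.items():
        for target in targets:
            neighbors[target].add(node)
    unvisited = set(neighbors)
    components = []
    while unvisited:
        start = unvisited.pop()  # arbitrary node not yet assigned to a component
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors[node]:
                if other in unvisited:
                    unvisited.remove(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return components
```

<p>Each node and each arc is examined a constant number of times, so the search runs in time linear in the size of the graph.</p>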
</sec>
<sec>
<title>Assembly algorithm</title>
<p>Once the connected components are detected, we run the assembly algorithm for each component. The assembly algorithm can be run by using one of two parameters: the coverage (
<italic>k</italic>
-mer’s frequency), and the
<italic>k</italic>
-mer lengths. The latter parameter is obtained by the cleaning process, which provides us with
<italic>switch nodes</italic>
whose corresponding
<italic>k</italic>
-mers are longer as a result of the merging process.</p>
<p>Most of the previous work on genome assembly makes the following assumption: given a set of reads, the objective of the assembly program is to minimize the length of the assembled genome [
<xref ref-type="bibr" rid="B18">18</xref>
]. However, to our knowledge, there is no proof that the shortest path always faithfully represents the genome. The same holds for the longest path, the Hamiltonian path and the Eulerian path.</p>
<p>The assembly algorithm is greedy: it traverses the graph by moving, at each step, to the neighboring node with the highest frequency value. We chose this strategy on the assumption that
<italic>k</italic>
-mers with high frequency values are more likely to be free of sequencing errors (we call this criterion the “
<italic>frequency function</italic>
”). All procedures of the assembly algorithm are given below:</p>
<p>1.
<bold>
<italic>stringPath( Set</italic>
</bold>
<bold>
<italic>C</italic>
</bold>
<bold>
<italic>)</italic>
</bold>
</p>
<p>2.
<bold>
<italic>Ordered Set</italic>
</bold>
<italic>path;</italic>
</p>
<p>3.
<bold>
<italic>Set</italic>
</bold>
<italic>P, Visited;</italic>
</p>
<p>4.
<bold>
<italic>Node</italic>
</bold>
<italic>u, v;</italic>
</p>
<p>5.
<italic>//Step1: preprocessing</italic>
</p>
<p>6.
<italic>u := the index of the node that has the longest k-mer;</italic>
</p>
<p>7.
<inline-formula>
<mml:math id="M14" name="1756-0500-5-243-i14" overflow="scroll">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>u</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>8.
<inline-formula>
<mml:math id="M15" name="1756-0500-5-243-i15" overflow="scroll">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mi>u</mml:mi>
</mml:mfenced>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>9. <italic>Visited := Ø;</italic></p>
<p>10.
<italic>//Step 2: forward direction</italic>
</p>
<p>11.
<bold>
<italic>do forever</italic>
</bold>
</p>
<p>12.
<bold>
<italic>begin</italic>
</bold>
</p>
<p>13.
<italic>P := out_neighbors(u) − Visited;</italic>
</p>
<p>14.
<inline-formula>
<mml:math id="M16" name="1756-0500-5-243-i16" overflow="scroll">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>V</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
<italic>;</italic>
</p>
<p>15.
<bold>
<italic>if</italic>
</bold>
<italic>P=</italic>
Ø
<bold>
<italic>then</italic>
</bold>
<italic>stop;</italic>
</p>
<p>16.
<italic>u := bestNeighbor(u, P);</italic>
</p>
<p>17.
<inline-formula>
<mml:math id="M17" name="1756-0500-5-243-i17" overflow="scroll">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mi>u</mml:mi>
</mml:mfenced>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>18.
<bold>
<italic>end</italic>
</bold>
</p>
<p>19.
<italic>//Step 3: backward direction</italic>
</p>
<p>20.
<bold>
<italic>do forever</italic>
</bold>
</p>
<p>21.
<bold>
<italic>begin</italic>
</bold>
</p>
<p>22.
<italic>P := in_neighbors(v) − Visited;</italic>
</p>
<p>23.
<inline-formula>
<mml:math id="M18" name="1756-0500-5-243-i18" overflow="scroll">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>V</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mi>P</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>24.
<bold>
<italic>if</italic>
</bold>
<italic>P=Ø</italic>
<bold>
<italic>then</italic>
</bold>
<italic>stop;</italic>
</p>
<p>25.
<italic>v := bestNeighbor(v, P);</italic>
</p>
<p>26.
<inline-formula>
<mml:math id="M19" name="1756-0500-5-243-i19" overflow="scroll">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mi>v</mml:mi>
</mml:mfenced>
<mml:mo>∪</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>27.
<bold>
<italic>end</italic>
</bold>
</p>
<p>28.
<bold>
<italic>return</italic>
</bold>
<italic>path;</italic>
</p>
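<p>The stringPath() pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Arapan-S implementation: the graph is assumed to be stored as plain dictionaries (out_nbrs, in_nbrs), each node is assumed to carry a sequence (seq) and a single per-node frequency (freq), and all of these names are hypothetical. The best neighbor is taken to be the candidate with the highest frequency, and the current node is not passed to the selection step because this sketch does not need it.</p>

```python
def string_path(component, out_nbrs, in_nbrs, seq, freq):
    """Greedy two-direction walk over one connected component (sketch)."""
    # Step 1: start from the node that holds the longest sequence.
    start = max(component, key=lambda n: len(seq[n]))
    path = [start]
    visited = set()  # starts empty, as in the pseudocode

    # Step 2: forward direction, following the best unvisited out-neighbor.
    u = start
    while True:
        candidates = set(out_nbrs.get(u, ())) - visited
        visited |= candidates
        if not candidates:
            break
        # Assumed selection rule: highest frequency wins (ties arbitrary).
        u = max(candidates, key=lambda n: freq[n])
        path.append(u)

    # Step 3: backward direction, following the best unvisited in-neighbor.
    v = start
    while True:
        candidates = set(in_nbrs.get(v, ())) - visited
        visited |= candidates
        if not candidates:
            break
        v = max(candidates, key=lambda n: freq[n])
        path.insert(0, v)

    return path


# Toy component: D is a low-frequency branch off B (e.g. an erroneous k-mer).
component = {"A", "B", "C", "D", "E"}
out_nbrs = {"A": ["B"], "B": ["C", "D"], "C": ["E"]}
in_nbrs = {"B": ["A"], "C": ["B"], "D": ["B"], "E": ["C"]}
seq = {"A": "ACG", "B": "ACGTAC", "C": "GTA", "D": "GTT", "E": "TAC"}
freq = {"A": 10, "B": 12, "C": 11, "D": 1, "E": 9}

print(string_path(component, out_nbrs, in_nbrs, seq, freq))
# prints ['A', 'B', 'C', 'E']
```

<p>In this toy example the walk starts at B (the longest sequence), prefers C over the low-frequency branch D when going forward, and then prepends A when going backward.</p>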
<p>The set
<italic>C</italic>
represents a connected component of the graph. The resulting path is kept in the ordered set
<italic>path.</italic>
After variable initialization, the algorithm first walks forward, selecting the best out-neighbor at each step. It then walks backward from the starting node, selecting the best in-neighbor at each step. The
<italic>bestNeighbor</italic>
() function takes the current node and the set of its unvisited in- or out-neighbors. Since each node may be connected to several neighboring nodes, the best neighbor is the one with the highest frequency value. Each loop stops when no unvisited neighbor remains. To find all possible paths, we apply the following algorithm, called the
<italic>allPaths</italic>
() algorithm.</p>
<p>1.
<bold>allPaths()</bold>
</p>
<p>2.
<bold>SetList</bold>
<italic>C</italic>
; //components list</p>
<p>3.
<bold>SetList</bold>
<italic>P</italic>
; //paths list</p>
<p>4.
<bold>Integer</bold>
<italic>i</italic>
;</p>
<p>5. //Step 1</p>
<p>6.
<inline-formula>
<mml:math id="M20" name="1756-0500-5-243-i20" overflow="scroll">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>G</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>7. //Step 2</p>
<p>8.
<bold>for</bold>
<italic>i</italic>
<bold>:= 1 to |</bold>
<italic>C</italic>
<bold>| do</bold>
</p>
<p>9.
<bold>begin</bold>
</p>
<p>10.
<inline-formula>
<mml:math id="M21" name="1756-0500-5-243-i21" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mtext>;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>11.
<bold>end</bold>
</p>
<p>12.
<bold>return</bold>
<italic>P</italic>
;</p>
<p>By going through all connected components (determined by the
<italic>allComponents</italic>
() procedure) and applying the <italic>stringPath</italic>() algorithm to each of them, a path
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
will be constructed for each connected component
<italic>C</italic>
<sub>
<italic>i</italic>
</sub>.
</p>
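<p>The allPaths() procedure can be sketched in the same spirit. This is a hedged illustration only: the text does not say how allComponents() is computed, so the sketch assumes weakly connected components found by a breadth-first search over the graph with edge directions ignored. The function names, the edge-list representation and the string_path callable passed as a parameter are all hypothetical.</p>

```python
from collections import deque


def all_components(nodes, edges):
    """Split a directed graph into weakly connected components (assumed)."""
    # Ignore edge direction when grouping nodes together.
    adjacent = {n: set() for n in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)

    seen = set()
    components = []
    for n in nodes:
        if n in seen:
            continue
        # Breadth-first search collects everything reachable from n.
        seen.add(n)
        queue = deque([n])
        members = set()
        while queue:
            cur = queue.popleft()
            members.add(cur)
            for nxt in adjacent[cur] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(members)
    return components


def all_paths(nodes, edges, string_path):
    """Build one path per component: P_i := string_path(C_i)."""
    return [string_path(c) for c in all_components(nodes, edges)]


# Toy graph with two disconnected parts; `sorted` is only a stand-in
# for a real per-component path builder.
nodes = ["s1a", "s1b", "s2a", "s2b", "s2c"]
edges = [("s1a", "s1b"), ("s2a", "s2b"), ("s2b", "s2c")]
print(all_paths(nodes, edges, sorted))
# prints [['s1a', 's1b'], ['s2a', 's2b', 's2c']]
```

<p>Because each component is handled independently, a graph that falls into several disconnected parts yields one path per part.</p>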
</sec>
<sec>
<title>Availability and requirements</title>
<p>Arapan-S is open access and freely available. All questions, comments and requests should be sent by email to nihon.sahli@gmail.com.</p>
<p>Project name: Arapan project</p>
<p>Project home page:
<ext-link ext-link-type="uri" xlink:href="http://shibuyalab.hgc.jp/Arapan/">http://shibuyalab.hgc.jp/Arapan/</ext-link>
</p>
<p>Operating system(s): Windows, Linux (Ubuntu)</p>
<p>Programming language: C/C++</p>
<p>Other requirements: None</p>
<p>License: None required</p>
<p>Any restrictions to use by non-academics: None</p>
</sec>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>MS and TS conceived the research and wrote the article. MS conducted the research and implemented Arapan-S in the C++ programming language. All authors have read and approved the final manuscript.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>This work was partially supported by a Grant-in-Aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan. We thank Mr. Yassine Bouhmadi and Fouad Kharroubi for their corrections.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>White</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Adams</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Kerlavage</surname>
<given-names>AR</given-names>
</name>
<article-title>TIGR Assembler: a new tool for assembling large shotgun sequencing projects</article-title>
<source>Genome Sci Technol</source>
<year>1995</year>
<volume>1</volume>
<fpage>9</fpage>
<lpage>19</lpage>
<pub-id pub-id-type="doi">10.1089/gst.1995.1.9</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Madan</surname>
<given-names>A</given-names>
</name>
<article-title>CAP3: A DNA sequence assembly program</article-title>
<source>Genome Research</source>
<year>1999</year>
<volume>9</volume>
<fpage>868</fpage>
<lpage>877</lpage>
<pub-id pub-id-type="doi">10.1101/gr.9.9.868</pub-id>
<pub-id pub-id-type="pmid">10508846</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>SP</given-names>
</name>
<name>
<surname>Hillier</surname>
<given-names>L</given-names>
</name>
<article-title>PCAP: A whole-genome assembly program</article-title>
<source>Genome Research</source>
<year>2003</year>
<volume>13</volume>
<fpage>2164</fpage>
<lpage>2170</lpage>
<pub-id pub-id-type="doi">10.1101/gr.1390403</pub-id>
<pub-id pub-id-type="pmid">12952883</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<article-title>The fragment assembly string graph</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>ii79</fpage>
<lpage>ii85</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti1114</pub-id>
<pub-id pub-id-type="pmid">16204131</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Chevreux</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pfisterer</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Drescher</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Driesel</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Müller</surname>
<given-names>WEG</given-names>
</name>
<name>
<surname>Wetter</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Suhai</surname>
<given-names>S</given-names>
</name>
<article-title>Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs</article-title>
<source>Genome Research</source>
<year>2004</year>
<volume>14</volume>
<fpage>1147</fpage>
<lpage>1159</lpage>
<pub-id pub-id-type="doi">10.1101/gr.1917404</pub-id>
<pub-id pub-id-type="pmid">15140833</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
<source>Proc Natl Acad Sci</source>
<year>2001</year>
<volume>98</volume>
<fpage>9748</fpage>
<lpage>9753</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.171285098</pub-id>
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Warren</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Holt</surname>
<given-names>RA</given-names>
</name>
<article-title>Assembling millions of short DNA sequences using SSAKE</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>500</fpage>
<lpage>501</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl629</pub-id>
<pub-id pub-id-type="pmid">17158514</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Chaisson</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<article-title>Short read fragment assembly of bacterial genomes</article-title>
<source>Genome Research</source>
<year>2008</year>
<volume>18</volume>
<fpage>324</fpage>
<lpage>330</lpage>
<pub-id pub-id-type="doi">10.1101/gr.7088808</pub-id>
<pub-id pub-id-type="pmid">18083777</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Research</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>McEwen</surname>
<given-names>GK</given-names>
</name>
<name>
<surname>Margulies</surname>
<given-names>EH</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<article-title>Pebble and rock band: heuristic resolution of repeats and scaffolding in the Velvet short read de novo assembler</article-title>
<source>PLoS One</source>
<year>2009</year>
<volume>4</volume>
<fpage>e8407</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0008407</pub-id>
<pub-id pub-id-type="pmid">20027311</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Butler</surname>
<given-names>J</given-names>
</name>
<name>
<surname>MacCallum</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Kleber</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shlyakhter</surname>
<given-names>IA</given-names>
</name>
<name>
<surname>Belmonte</surname>
<given-names>MK</given-names>
</name>
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Jaffe</surname>
<given-names>DB</given-names>
</name>
<article-title>ALLPATHS: De novo assembly of whole genome shotgun microreads</article-title>
<source>Genome Research</source>
<year>2008</year>
<volume>18</volume>
<fpage>810</fpage>
<lpage>820</lpage>
<pub-id pub-id-type="doi">10.1101/gr.7337908</pub-id>
<pub-id pub-id-type="pmid">18340039</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>MacCallum</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Przybylski</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Gnerre</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Burton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Shlyakhter</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Gnirke</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Malek</surname>
<given-names>J</given-names>
</name>
<name>
<surname>McKernan</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ranade</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shea</surname>
<given-names>TP</given-names>
</name>
<etal></etal>
<article-title>ALLPATHS 2: Small genomes assembled accurately and with high continuity from short paired reads</article-title>
<source>Genome Biology</source>
<year>2009</year>
<volume>10</volume>
<fpage>R103</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-10-r103</pub-id>
<pub-id pub-id-type="pmid">19796385</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<article-title>ABySS: a parallel assembler for short read sequence data</article-title>
<source>Genome Research</source>
<year>2009</year>
<volume>19</volume>
<fpage>1117</fpage>
<lpage>1123</lpage>
<pub-id pub-id-type="doi">10.1101/gr.089532.108</pub-id>
<pub-id pub-id-type="pmid">19251739</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K</given-names>
</name>
<etal></etal>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Research</source>
<year>2010</year>
<volume>20</volume>
<fpage>265</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="doi">10.1101/gr.097261.109</pub-id>
<pub-id pub-id-type="pmid">20019144</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Bryant</surname>
<given-names>DW</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>WK</given-names>
</name>
<name>
<surname>Mockler</surname>
<given-names>TC</given-names>
</name>
<article-title>QSRA – a quality-value guided de novo short read assembler</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>69</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-69</pub-id>
<pub-id pub-id-type="pmid">19239711</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Sommer</surname>
<given-names>DD</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<article-title>Minimus: a fast, lightweight genome assembler</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>64</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-64</pub-id>
<pub-id pub-id-type="pmid">17324286</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<article-title>Genome Sequence Assembly Using Trace Signals and Additional Sequence Information</article-title>
<source>Computer Science and Biology: Proceedings of the German Conference on Bioinformatics</source>
<year>1999</year>
<volume>99</volume>
<fpage>45</fpage>
<lpage>56</lpage>
<comment>GCB</comment>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Brudno</surname>
<given-names>M</given-names>
</name>
<article-title>Maximum likelihood genome assembly</article-title>
<source>Journal of Computational Biology</source>
<year>2009</year>
<volume>16</volume>
<fpage>1</fpage>
<lpage>16</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2008.0137</pub-id>
<pub-id pub-id-type="pmid">19119992</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

This area was generated with Dilib version V0.6.33.
Data generation: Sat Mar 28 17:51:24 2020. Site generation: Sun Jan 31 15:35:48 2021