MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Konnector v2.0: pseudo-long reads from paired-end sequencing data

Authors: Benjamin P. Vandervalk; Chen Yang; Zhuyi Xue; Karthika Raghavan; Justin Chu; Hamid Mohamadi; Shaun D. Jackman; Readman Chiu; René L. Warren; Inanç Birol

RBID : PMC:4582294

Abstract

Background

Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool.

Results

Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences.

Conclusions

Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.


DOI: 10.1186/1755-8794-8-S3-S1
PubMed: 26399504
PubMed Central: 4582294

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Konnector v2.0: pseudo-long reads from paired-end sequencing data</title>
<author>
<name sortKey="Vandervalk, Benjamin P" sort="Vandervalk, Benjamin P" uniqKey="Vandervalk B" first="Benjamin P" last="Vandervalk">Benjamin P. Vandervalk</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yang, Chen" sort="Yang, Chen" uniqKey="Yang C" first="Chen" last="Yang">Chen Yang</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xue, Zhuyi" sort="Xue, Zhuyi" uniqKey="Xue Z" first="Zhuyi" last="Xue">Zhuyi Xue</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Raghavan, Karthika" sort="Raghavan, Karthika" uniqKey="Raghavan K" first="Karthika" last="Raghavan">Karthika Raghavan</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chu, Justin" sort="Chu, Justin" uniqKey="Chu J" first="Justin" last="Chu">Justin Chu</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mohamadi, Hamid" sort="Mohamadi, Hamid" uniqKey="Mohamadi H" first="Hamid" last="Mohamadi">Hamid Mohamadi</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jackman, Shaun D" sort="Jackman, Shaun D" uniqKey="Jackman S" first="Shaun D" last="Jackman">Shaun D. Jackman</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chiu, Readman" sort="Chiu, Readman" uniqKey="Chiu R" first="Readman" last="Chiu">Readman Chiu</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Warren, Rene L" sort="Warren, Rene L" uniqKey="Warren R" first="René L" last="Warren">René L. Warren</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Birol, Inanc" sort="Birol, Inanc" uniqKey="Birol I" first="Inanç" last="Birol">Inanç Birol</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26399504</idno>
<idno type="pmc">4582294</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4582294</idno>
<idno type="RBID">PMC:4582294</idno>
<idno type="doi">10.1186/1755-8794-8-S3-S1</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000959</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000959</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Konnector v2.0: pseudo-long reads from paired-end sequencing data</title>
<author>
<name sortKey="Vandervalk, Benjamin P" sort="Vandervalk, Benjamin P" uniqKey="Vandervalk B" first="Benjamin P" last="Vandervalk">Benjamin P. Vandervalk</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yang, Chen" sort="Yang, Chen" uniqKey="Yang C" first="Chen" last="Yang">Chen Yang</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xue, Zhuyi" sort="Xue, Zhuyi" uniqKey="Xue Z" first="Zhuyi" last="Xue">Zhuyi Xue</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Raghavan, Karthika" sort="Raghavan, Karthika" uniqKey="Raghavan K" first="Karthika" last="Raghavan">Karthika Raghavan</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chu, Justin" sort="Chu, Justin" uniqKey="Chu J" first="Justin" last="Chu">Justin Chu</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mohamadi, Hamid" sort="Mohamadi, Hamid" uniqKey="Mohamadi H" first="Hamid" last="Mohamadi">Hamid Mohamadi</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jackman, Shaun D" sort="Jackman, Shaun D" uniqKey="Jackman S" first="Shaun D" last="Jackman">Shaun D. Jackman</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chiu, Readman" sort="Chiu, Readman" uniqKey="Chiu R" first="Readman" last="Chiu">Readman Chiu</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Warren, Rene L" sort="Warren, Rene L" uniqKey="Warren R" first="René L" last="Warren">René L. Warren</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Birol, Inanc" sort="Birol, Inanc" uniqKey="Birol I" first="Inanç" last="Birol">Inanç Birol</name>
<affiliation>
<nlm:aff id="I1">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Medical Genomics</title>
<idno type="eISSN">1755-8794</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local
<italic>de novo </italic>
assembly tool that addresses this problem. Here we report on version 2.0 of our tool.</p>
</sec>
<sec>
<title>Results</title>
<p>Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Vandervalk, Bp" uniqKey="Vandervalk B">BP Vandervalk</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Raymond, A" uniqKey="Raymond A">A Raymond</name>
</author>
<author>
<name sortKey="Mohamadi, H" uniqKey="Mohamadi H">H Mohamadi</name>
</author>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Attali, Da" uniqKey="Attali D">DA Attali</name>
</author>
<author>
<name sortKey="Konnector" uniqKey="Konnector">Konnector</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, Bh" uniqKey="Bloom B">BH Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, Mj" uniqKey="Chaisson M">MJ Chaisson</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zimin, Av" uniqKey="Zimin A">AV Zimin</name>
</author>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Puiu, D" uniqKey="Puiu D">D Puiu</name>
</author>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Magoc, T" uniqKey="Magoc T">T Magoc</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Silver, Dh" uniqKey="Silver D">DH Silver</name>
</author>
<author>
<name sortKey="Ben Elazar, S" uniqKey="Ben Elazar S">S Ben-Elazar</name>
</author>
<author>
<name sortKey="Bogoslavsky, A" uniqKey="Bogoslavsky A">A Bogoslavsky</name>
</author>
<author>
<name sortKey="Yanai, I" uniqKey="Yanai I">I Yanai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nadalin, F" uniqKey="Nadalin F">F Nadalin</name>
</author>
<author>
<name sortKey="Vezzi, F" uniqKey="Vezzi F">F Vezzi</name>
</author>
<author>
<name sortKey="Policriti, A" uniqKey="Policriti A">A Policriti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Dew, Im" uniqKey="Dew I">IM Dew</name>
</author>
<author>
<name sortKey="Fasulo, Dp" uniqKey="Fasulo D">DP Fasulo</name>
</author>
<author>
<name sortKey="Flanigan, Mj" uniqKey="Flanigan M">MJ Flanigan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K Wong</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boisvert, S" uniqKey="Boisvert S">S Boisvert</name>
</author>
<author>
<name sortKey="Laviolette, F" uniqKey="Laviolette F">F Laviolette</name>
</author>
<author>
<name sortKey="Corbeil, J" uniqKey="Corbeil J">J Corbeil</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stranneheim, H" uniqKey="Stranneheim H">H Stranneheim</name>
</author>
<author>
<name sortKey="Kaller, M" uniqKey="Kaller M">M Kaller</name>
</author>
<author>
<name sortKey="Allander, T" uniqKey="Allander T">T Allander</name>
</author>
<author>
<name sortKey="Andersson, B" uniqKey="Andersson B">B Andersson</name>
</author>
<author>
<name sortKey="Arvestad, L" uniqKey="Arvestad L">L Arvestad</name>
</author>
<author>
<name sortKey="Lundeberg, J" uniqKey="Lundeberg J">J Lundeberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Rizk, G" uniqKey="Rizk G">G Rizk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Jr" uniqKey="Miller J">JR Miller</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Venter, E" uniqKey="Venter E">E Venter</name>
</author>
<author>
<name sortKey="Walenz, Bp" uniqKey="Walenz B">BP Walenz</name>
</author>
<author>
<name sortKey="Brownley, A" uniqKey="Brownley A">A Brownley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
<author>
<name sortKey="Shi, Y" uniqKey="Shi Y">Y Shi</name>
</author>
<author>
<name sortKey="Lu, J" uniqKey="Lu J">J Lu</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gurevich, A" uniqKey="Gurevich A">A Gurevich</name>
</author>
<author>
<name sortKey="Saveliev, V" uniqKey="Saveliev V">V Saveliev</name>
</author>
<author>
<name sortKey="Vyahhi, N" uniqKey="Vyahhi N">N Vyahhi</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quinlan, Ar" uniqKey="Quinlan A">AR Quinlan</name>
</author>
<author>
<name sortKey="Hall, Im" uniqKey="Hall I">IM Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Paulino, D" uniqKey="Paulino D">D Paulino</name>
</author>
<author>
<name sortKey="Warren, Rl" uniqKey="Warren R">RL Warren</name>
</author>
<author>
<name sortKey="Vandervalk, Bp" uniqKey="Vandervalk B">BP Vandervalk</name>
</author>
<author>
<name sortKey="Raymond, A" uniqKey="Raymond A">A Raymond</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boetzer, M" uniqKey="Boetzer M">M Boetzer</name>
</author>
<author>
<name sortKey="Pirovano, W" uniqKey="Pirovano W">W Pirovano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Raymond, A" uniqKey="Raymond A">A Raymond</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Pleasance, S" uniqKey="Pleasance S">S Pleasance</name>
</author>
<author>
<name sortKey="Coope, R" uniqKey="Coope R">R Coope</name>
</author>
<author>
<name sortKey="Taylor, Ga" uniqKey="Taylor G">GA Taylor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koboldt, Dc" uniqKey="Koboldt D">DC Koboldt</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Wylie, T" uniqKey="Wylie T">T Wylie</name>
</author>
<author>
<name sortKey="Larson, De" uniqKey="Larson D">DE Larson</name>
</author>
<author>
<name sortKey="Mclellan, Md" uniqKey="Mclellan M">MD McLellan</name>
</author>
<author>
<name sortKey="Mardis, Er" uniqKey="Mardis E">ER Mardis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bartenhagen, C" uniqKey="Bartenhagen C">C Bartenhagen</name>
</author>
<author>
<name sortKey="Dugas, M" uniqKey="Dugas M">M Dugas</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Med Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Med Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Medical Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1755-8794</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26399504</article-id>
<article-id pub-id-type="pmc">4582294</article-id>
<article-id pub-id-type="publisher-id">1755-8794-8-S3-S1</article-id>
<article-id pub-id-type="doi">10.1186/1755-8794-8-S3-S1</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Konnector v2.0: pseudo-long reads from paired-end sequencing data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Vandervalk</surname>
<given-names>Benjamin P</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>benv@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Yang</surname>
<given-names>Chen</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>cheny@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Xue</surname>
<given-names>Zhuyi</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>zxue@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>Raghavan</surname>
<given-names>Karthika</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>kraghavan@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A5">
<name>
<surname>Chu</surname>
<given-names>Justin</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>cjustin@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A6">
<name>
<surname>Mohamadi</surname>
<given-names>Hamid</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>hmohamadi@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A7">
<name>
<surname>Jackman</surname>
<given-names>Shaun D</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>sjackman@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A8">
<name>
<surname>Chiu</surname>
<given-names>Readman</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>rchiu@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" id="A9">
<name>
<surname>Warren</surname>
<given-names>René L</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>rwarren@bcgsc.ca</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A10">
<name>
<surname>Birol</surname>
<given-names>Inanç</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>ibirol@bcgsc.ca</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada</aff>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>23</day>
<month>9</month>
<year>2015</year>
</pub-date>
<volume>8</volume>
<issue>Suppl 3</issue>
<supplement>
<named-content content-type="supplement-title">Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics</named-content>
<named-content content-type="supplement-editor">Feng Luo</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has not been supported by sponsorship. Information about the source of funding for the publication charges can be found in the individual articles. Articles have undergone the journal's standard peer review process for supplements. The Supplement Editor declares that they have no competing interests.</named-content>
</supplement>
<fpage>S1</fpage>
<lpage>S1</lpage>
<permissions>
<copyright-statement>Copyright © 2015 Vandervalk et al.;</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>Vandervalk et al.;</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1755-8794/8/S3/S1"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local
<italic>de novo </italic>
assembly tool that addresses this problem. Here we report on version 2.0 of our tool.</p>
</sec>
<sec>
<title>Results</title>
<p>Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Bloom filter</kwd>
<kwd>de Bruijn graph</kwd>
<kwd>paired-end sequencing</kwd>
<kwd>
<italic>de novo </italic>
</kwd>
<kwd>genome assembly</kwd>
</kwd-group>
<conference>
<conf-date>2-5 November 2014</conf-date>
<conf-name>IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014)</conf-name>
<conf-loc>Belfast, UK</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>If genomes were composed of random sequences, a sequence of length L would be specific enough to describe a locus on a genome of length G when 4
<sup>L</sup>
≫ G. For instance, a typical HiSeq 4000 sequencer (Illumina, San Diego, CA) generates 150 base pair (bp) reads, for which 4
<sup>L </sup>
would be more than 80 orders of magnitude larger than the human genome. But, of course, genomes are not random sequences. They have structure; otherwise, we would not be here to write this paper, nor would you be there to read it.</p>
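<p>As a rough check of the arithmetic above, the following sketch compares 4 to the power L for L = 150 with a genome length G on a log10 scale. The genome size of 3.1 Gbp is an assumed round figure for the human genome, not a value taken from this paper, and the function name is purely illustrative.</p>

```python
import math

READ_LENGTH = 150        # bp, the read length quoted in the text
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate human genome size in bp


def orders_of_magnitude_excess(read_length: int, genome_size: float) -> float:
    """Return log10(4**read_length / genome_size), computed in log space."""
    return read_length * math.log10(4) - math.log10(genome_size)


excess = orders_of_magnitude_excess(READ_LENGTH, HUMAN_GENOME_BP)
print(f"4^{READ_LENGTH} exceeds G by about {excess:.1f} orders of magnitude")
```

<p>Working in log space avoids formatting a 91-digit integer; under the assumed genome size, the excess comes out just above 80, in line with the statement in the text.</p>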
<p>Long read lengths are desirable to reveal structures in genomes of interest. While sequencing technologies from Pacific Biosciences (Menlo Park, CA) and Oxford Nanopore Technologies (Oxford, UK) can generate reads that are several kilobases (kb) long, their low throughput and high error rates make them challenging to use in experiments that interrogate large targets.</p>
<p>Many experimental designs with short sequencing data use a paired-end tag (PET) sequencing strategy, where short sequences are determined from both ends of a DNA fragment. These PET sequences are then associated in downstream analysis to resolve structures up to the length of the fragments. Typically, these fragments are shorter than 1 kb, and ideally have unimodal length distributions. To resolve even longer structures, there are specialized library preparation protocols, such as Nextera and Moleculo from Illumina and GemCode from 10X Genomics (Pleasanton, CA).</p>
<p>In this study, we focus on PET reads. We describe Konnector v2.0, a tool that uses the coverage redundancy in a high-throughput sequencing experiment to reconstruct fragment sequences (pseudo-reads). Optionally, it also extends those fragment sequences in the 3' and 5' directions, as long as the extensions are unambiguous. The tool builds on our earlier implementation [
<xref ref-type="bibr" rid="B1">1</xref>
] that filled in the bases of the sequence gap between read pairs by navigating a de Bruijn graph [
<xref ref-type="bibr" rid="B2">2</xref>
]. Konnector represents a de Bruijn graph using a Bloom filter [
<xref ref-type="bibr" rid="B3">3</xref>
], a probabilistic and memory-efficient data structure.</p>
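<p>To make the idea of a Bloom filter standing in for a de Bruijn graph concrete, here is a minimal sketch. It is not Konnector's implementation: the class and function names are hypothetical, the hash scheme (salted BLAKE2b) is an arbitrary choice, one byte is used per slot where a real filter would pack bits, and reverse complements are ignored. Edges are never stored; the neighbours of a k-mer are found by querying the four possible one-base extensions.</p>

```python
import hashlib
from typing import Iterable, Iterator, List


class KmerBloomFilter:
    """Toy Bloom filter over k-mers (hypothetical class, not Konnector's API)."""

    def __init__(self, num_slots: int, num_hashes: int) -> None:
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # One byte per slot keeps the sketch readable; a real filter packs bits.
        self.slots = bytearray(num_slots)

    def _positions(self, kmer: str) -> Iterator[int]:
        # Derive num_hashes slot indices by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_slots

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.slots[pos] = 1

    def __contains__(self, kmer: str) -> bool:
        # May report false positives, but never false negatives.
        return all(self.slots[pos] for pos in self._positions(kmer))


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(reads: Iterable[str], k: int, num_slots: int, num_hashes: int) -> KmerBloomFilter:
    """Load the k-mers of all reads into a fresh filter."""
    bloom = KmerBloomFilter(num_slots, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return bloom


def successors(bloom: KmerBloomFilter, kmer: str) -> List[str]:
    """Implicit de Bruijn edges: k-mers overlapping kmer by k-1 bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in bloom]


bloom = build_filter(["ACGTAC", "GTACGG"], k=3, num_slots=4096, num_hashes=3)
print(successors(bloom, "ACG"))  # contains at least CGG and CGT (a branch)
```

<p>Because a Bloom filter can return false positives, the neighbour list may occasionally include a k-mer that was never inserted; this corresponds to the spurious branches that the graph search has to tolerate.</p>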
<p>The utility of long pseudo-reads has been demonstrated before [
<xref ref-type="bibr" rid="B4">4</xref>
], and forms the backbone of some
<italic>de novo </italic>
assembly tools [
<xref ref-type="bibr" rid="B5">5</xref>
]. Long pseudo-reads can be generated by merging overlapping PETs [
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B7">7</xref>
], or by localizing the sequence assembly problem around PETs [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
]. Our focus in this study is the latter problem.</p>
<p>For example, the ELOPER algorithm [
<xref ref-type="bibr" rid="B8">8</xref>
] identifies read pairs that share an overlap in both reads simultaneously, and uses these overlaps to generate "elongated paired-end reads". The GapFiller algorithm [
<xref ref-type="bibr" rid="B9">9</xref>
], on the other hand, formulates this problem as a collection of seed-and-extend local assembly problems. The latter concept has also been implemented within the MaSuRCA
<italic>de novo </italic>
assembly pipeline [
<xref ref-type="bibr" rid="B5">5</xref>
], a wrapper around the Celera Assembler software [
<xref ref-type="bibr" rid="B10">10</xref>
].</p>
<p>We benchmark Konnector v2.0 on simulated datasets and compare its performance against ELOPER [
<xref ref-type="bibr" rid="B8">8</xref>
], GapFiller [
<xref ref-type="bibr" rid="B9">9</xref>
], and a similar tool within MaSuRCA [
<xref ref-type="bibr" rid="B5">5</xref>
]. We demonstrate its utility for assembly finishing problems and variant calling. We show that, owing to its frugal memory use and algorithm implementation, Konnector v2.0 can handle large sequence datasets with over a billion reads from Gbp-scale genomes in a timely manner. Furthermore, we note that it consistently provides highly accurate results for a range of targets.</p>
</sec>
<sec>
<title>Implementation</title>
<p>Konnector creates long pseudo-reads from paired-end sequencing reads (Figure
<xref ref-type="fig" rid="F1">1</xref>
) by searching for connecting paths between read pairs using a Bloom filter representation of a de Bruijn graph. In addition to connecting read pairs, Konnector v2.0 can also extend connected or unconnected sequences by following paths from the ends of sequences up to the next branching point or dead end in the de Bruijn graph. When the sequence extension feature of Konnector v2.0 is enabled, an additional Bloom filter is employed to avoid the production of an intractable quantity of duplicate sequences. Figure
<xref ref-type="fig" rid="F2">2</xref>
provides a flowchart overview of the Konnector v2.0 algorithm.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>A connecting path between two non-overlapping paired-end sequencing reads within a de Bruijn graph</bold>
. Konnector joins the sequence provided by the input paired-end reads (green) by means of a graph search for a connecting path (blue). Sequencing errors in the input sequencing data produce bubbles and branches in the de Bruijn graph of up to k nodes in length (red). Bloom filter false positives produce additional branches (yellow) with lengths that are typically much shorter than the error branches.</p>
</caption>
<graphic xlink:href="1755-8794-8-S3-S1-1"></graphic>
</fig>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>The Konnector v2.0 algorithm</bold>
. (1): The algorithm builds a Bloom filter representation of the de Bruijn graph by loading all k-mers from the input paired-end sequencing data. (2): For each read pair, a graph search for connecting paths within the de Bruijn graph is performed. (3): If one or more connecting paths are found, a consensus sequence for the paths is built. (4): If no connecting paths are found, error-correction is attempted on reads 1 and 2. (5) and (6): The algorithm queries for the existence of either the consensus connecting sequence or the error-corrected reads in the "duplicate filter". The duplicate filter is an additional Bloom filter, separate from the Bloom filter de Bruijn graph, which tracks the parts of the genome that have already been assembled. (7) and (8): If one or more of the k-mers in the query sequence are not found in the duplicate filter, the sequence is extended outwards in the de Bruijn graph, until either a dead end or a branching point is encountered in the graph. Finally, the extended sequences are written to the output pseudo-reads file.</p>
</caption>
<graphic xlink:href="1755-8794-8-S3-S1-2"></graphic>
</fig>
<sec>
<title>Bloom filter de Bruijn graph</title>
<p>As the throughput of Illumina platforms has increased rapidly, reaching up to 1 Tb in a six-day run with the HiSeq SBS V4 kits, computational efficiency has become an important concern for pseudo-read generating tools. For related problems, bioinformatics tools have used strategies such as parallel computing [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
], FM indexing [
<xref ref-type="bibr" rid="B13">13</xref>
,
<xref ref-type="bibr" rid="B14">14</xref>
], and compressed data structures [
<xref ref-type="bibr" rid="B15">15</xref>
] for handling big data.</p>
<p>To fit large assembly problems in small memory, one recent approach has been the use of Bloom filters [
<xref ref-type="bibr" rid="B16">16</xref>
,
<xref ref-type="bibr" rid="B3">3</xref>
] to represent de Bruijn graphs, as demonstrated by the Minia assembler [
<xref ref-type="bibr" rid="B17">17</xref>
]. Konnector adopts a similar approach. Briefly, a Bloom filter is a bit array that acts as a compact representation of a set, where the presence or absence of an element in the set is indicated by the state of one or more bits in the array. The particular positions of the bits that correspond to each element are determined by a fixed set of hash functions. While Bloom filters are very memory-efficient, the principal challenge of developing Bloom filter algorithms is in dealing with the possibility of
<italic>false positives</italic>
. A false positive occurs when the bit positions of an element that is not in the set collide with the bit positions of an element that
<italic>is </italic>
in the set. In the context of Bloom filter de Bruijn graphs, false positives manifest themselves as false branches, as depicted by the yellow nodes in Figure
<xref ref-type="fig" rid="F1">1</xref>
.</p>
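<p>The Bloom filter behaviour described above can be illustrated with a minimal Python sketch. The class name, the parameters, and the use of salted SHA-256 digests as hash functions are assumptions made for illustration only; they are not taken from the Konnector source code, and one byte is used per bit purely for readability.</p>

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus a fixed set of hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=1000, num_hashes=3)
for kmer in ("ACGT", "CGTA", "GTAC"):
    bf.add(kmer)
print("ACGT" in bf)  # True: inserted elements are always found
```

<p>A filter that is too small for its contents answers "present" for elements that were never inserted, which is the false positive case discussed above.</p>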
<p>In the first step of the algorithm (Figure
<xref ref-type="fig" rid="F2">2</xref>
, step (1)), the Bloom filter de Bruijn graph is constructed by shredding the input reads into k-mers, and loading the k-mers into a Bloom filter. To diminish the effect of sequencing errors at later stages of the algorithm, k-mers are initially propagated between two Bloom filters, where the first Bloom filter contains k-mers that have been seen at least once, and the second Bloom filter contains k-mers that have been seen at least twice. At the end of k-mer loading, the first Bloom filter is discarded, and the second Bloom filter is kept for use in the rest of the algorithm. We note here that only the k-mers of the input reads, corresponding to the nodes in the de Bruijn graph, are stored in the Bloom filter whereas there is no explicit storage of edges. Instead, the neighbours of a k-mer are determined during graph traversal by querying for the presence of all four possible neighbours (i.e. single base extensions) at each step.</p>
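<p>The two-level loading scheme and the implicit edge look-up can be sketched as follows. This is an illustrative simplification: Python sets stand in for the two Bloom filters, reverse-complement strands are ignored, and all function names are hypothetical.</p>

```python
def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def load_two_level(reads, k):
    """Return the k-mers seen at least twice across all reads."""
    seen_once, seen_twice = set(), set()  # stand-ins for the two Bloom filters
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen_once:
                seen_twice.add(kmer)
            else:
                seen_once.add(kmer)
    return seen_twice  # the first-level filter is discarded


def successors(kmer, graph):
    """Edges are implicit: query all four single-base extensions to the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in graph]


def predecessors(kmer, graph):
    """Query all four single-base extensions to the left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in graph]


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
graph = load_two_level(reads, k=4)
print(sorted(graph))              # ['ACGT', 'CGTA', 'GTAC']
print(successors("ACGT", graph))  # ['CGTA']
```

<p>In this toy input, the k-mers CGTT and GTTT occur only once and are therefore dropped, as a singleton sequencing error would be.</p>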
</sec>
<sec>
<title>Searching for connecting paths</title>
<p>In a second pass over the input sequencing data, Konnector searches for connecting paths within the de Bruijn graph between each read pair (Figure
<xref ref-type="fig" rid="F2">2</xref>
, step (2)). The graph search is initiated by choosing a start k-mer in the first read and a goal k-mer in the second read, and is carried out by means of a depth-limited, bidirectional, breadth-first search between these two k-mers.</p>
<p>The start and goal k-mers are selected to reduce the probability of dead-end searches due to sequencing errors or Bloom filter false positives. First, the putative non-error k-mers of each read are identified by querying for their existence in the Bloom filter de Bruijn graph. (Recall that after the loading stage, this Bloom filter only contains k-mers that occur twice or more.) Next, the algorithm attempts to find a consecutive run of three non-error k-mers within the read, and chooses the k-mer on the distal end (i.e. 5' end) of the run as the start/goal k-mer. This method ensures that if the chosen start/goal k-mer is a Bloom filter false positive, the path search will still proceed through at least two more k-mers instead of stopping at a dead end. In the likely case that there are multiple runs of "good" k-mers within a read, the run that is closest to the 3' (gap-facing) end of the read is chosen, in order to reduce the depth of the subsequent path search. In the case that there are no runs of three good k-mers, the algorithm falls back to using the longest run found (i.e. two k-mers or a single k-mer).</p>
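<p>The seed selection rule can be sketched as below. The sketch assumes the read is given in 5' to 3' orientation, uses a Python set in place of the Bloom filter, and assumes that the 5' k-mer of the run is also used in the fallback cases; the function name is hypothetical.</p>

```python
def pick_seed(read, k, graph, run_len=3):
    """Return the index of the seed k-mer in `read`, or None if no k-mer is in the graph."""
    good = [read[i:i + k] in graph for i in range(len(read) - k + 1)]
    for length in range(run_len, 0, -1):  # try runs of 3, then fall back to 2, then 1
        # Scan windows from the 3' end so that the run nearest the gap wins.
        for start in range(len(good) - length, -1, -1):
            if all(good[start:start + length]):
                return start  # 5' (distal) k-mer of the run
    return None


read = "AAACCCGGGT"  # k-mers: AAAC AACC ACCC CCCG CCGG CGGG GGGT
graph = {"AAAC", "AACC", "ACCC", "CCGG", "CGGG", "GGGT"}  # CCCG is an "error" k-mer
print(pick_seed(read, 4, graph))  # 4: the run of three nearest the 3' end
```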
<p>Once the start and goal k-mers have been selected, Konnector performs the search for connecting paths. In order to maximize the accuracy of the sequence connecting the reads, it is important for the algorithm to consider
<italic>all </italic>
possible paths between the reads, up to the depth limit dictated by the DNA fragment length. For this reason, a breadth-first search is employed rather than a shortest path algorithm such as Dijkstra's algorithm or A*. Konnector implements a bidirectional version of breadth-first search, which improves performance by conducting two half-depth searches, thus reducing the overall expansion of the search frontier. The bidirectional search is implemented by alternating between two standard breadth-first searches that can "see" each other's visited node lists. If one search encounters a node that has already been visited by the other search, the edge leading to that node is recorded as a "common edge", and the search proceeds no further through that particular node. As the two searches proceed, all visited nodes and edges are added to a temporary, in-memory "search graph". This facilitates the final step, where the full set of connecting paths is constructed by performing an exhaustive search both backwards and forwards from each common edge towards the start and goal k-mers, respectively.</p>
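<p>The following sketch illustrates the idea of a depth-limited bidirectional search that reports every connecting path. It is a simplified variant and not the common-edge bookkeeping described above: two half-depth breadth-first searches are run to completion, and their visited sets are then used to prune an exhaustive path enumeration. A Python set stands in for the Bloom filter, and all names are hypothetical.</p>

```python
from collections import deque


def successors(kmer, graph):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in graph]


def predecessors(kmer, graph):
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in graph]


def bfs_depths(source, neighbours, graph, max_depth):
    """Plain breadth-first search: {node: depth} for nodes within max_depth."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue
        for nxt in neighbours(node, graph):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


def connecting_paths(start, goal, graph, max_depth):
    """List every path of at most max_depth edges from start to goal."""
    half = (max_depth + 1) // 2
    fwd = bfs_depths(start, successors, graph, half)                # forward half
    bwd = bfs_depths(goal, predecessors, graph, max_depth - half)   # backward half
    paths = []

    def walk(path):
        node, used = path[-1], len(path) - 1
        if node == goal:
            paths.append(list(path))
            return
        for nxt in successors(node, graph):
            steps = used + 1
            near_start = nxt in fwd and half >= steps
            near_goal = nxt in bwd and max_depth - steps >= bwd[nxt]
            if near_start or near_goal:  # otherwise nxt cannot lie on a valid path
                path.append(nxt)
                walk(path)
                path.pop()

    walk([start])
    return paths


def path_to_sequence(path):
    """Convert a k-mer path to DNA: first k-mer plus the last base of each later k-mer."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# A bubble between ACT and TCC, as a heterozygous SNP would produce (k = 3).
graph = {"ACT", "CTG", "TGA", "GAT", "ATC", "TCC", "CTT", "TTA", "TAT"}
found = connecting_paths("ACT", "TCC", graph, max_depth=5)
print([path_to_sequence(p) for p in found])  # ['ACTGATCC', 'ACTTATCC']
```

<p>With a depth limit of 4 the same call returns no paths, since both branches of the bubble need five edges to reach the goal.</p>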
<p>If the search algorithm finds a unique path between the start and goal k-mers, then the path is converted to a DNA sequence, and is used to join the read sequences into a single pseudo-read. In the case of multiple paths, a multiple sequence alignment is performed, and the resulting consensus sequence is used to join the reads instead (Figure
<xref ref-type="fig" rid="F2">2</xref>
, step (3)). In order to fine-tune the quality of the results, the user may specify limits with respect to the maximum number of paths that can be collapsed to a consensus and/or the maximum number of mismatches that should be tolerated between alternate paths.</p>
</sec>
<sec>
<title>Extending connected and unconnected sequences</title>
<p>Konnector v2.0 introduces a new capability to extend both connected and unconnected sequences by traversing from the ends of sequences to the next branching point or dead-end in the de Bruijn graph (Figure
<xref ref-type="fig" rid="F2">2</xref>
, steps (7) and (8)). If a read pair is successfully connected, the algorithm will extend the pseudo-read outwards in both directions; if the read pair is not successfully connected, each of the two reads will be extended independently, both inwards and outwards. The extensions are seeded in the same manner described above for the connecting path searches: a putative non-error k-mer is selected near the end of the sequence, where possible as part of a run of three consecutive non-error k-mers.</p>
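<p>A minimal sketch of the extension step is given below. It assumes that a branching point means more than one neighbour in the direction of travel, ignores the seeding and look-ahead logic, and uses a Python set in place of the Bloom filter; the function name and the cycle guard are illustrative additions.</p>

```python
def extend(seq, k, graph, max_steps=100000):
    """Extend seq in both directions until a branching point or dead end."""
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}  # guards against cycles
    for _ in range(max_steps):  # rightwards
        tail = seq[-k:]
        options = [b for b in "ACGT" if tail[1:] + b in graph]
        if len(options) != 1 or tail[1:] + options[0] in seen:
            break  # dead end (0 options), branch (2 or more), or revisited k-mer
        seen.add(tail[1:] + options[0])
        seq += options[0]
    for _ in range(max_steps):  # leftwards
        head = seq[:k]
        options = [b for b in "ACGT" if b + head[:-1] in graph]
        if len(options) != 1 or options[0] + head[:-1] in seen:
            break
        seen.add(options[0] + head[:-1])
        seq = options[0] + seq
    return seq


graph = {"ACT", "CTG", "TGA", "GAT", "ATC", "TCC", "CTT", "TTA", "TAT"}
print(extend("CTGA", 3, graph))  # ACTGATCC
print(extend("ACT", 3, graph))   # ACT (two successors, so no extension)
```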
<p>The extension of connected reads or unconnected reads that are contained within the same linear path of the de Bruijn graph results in identical sequences. For this reason, the algorithm uses an additional Bloom filter to track the k-mers of sequences that have already been assembled. (Hereafter this Bloom filter will be referred to as the "duplicate filter" in order to reduce confusion with the Bloom filter de Bruijn graph.)</p>
<p>The logic for tracking duplicate sequences differs for the cases of connected and unconnected read pairs. In the case of connected reads, only the k-mers of the connecting sequence are used to query the duplicate filter (Figure
<xref ref-type="fig" rid="F2">2</xref>
, step (5)). By virtue of being present in the Bloom filter de Bruijn graph, the connecting k-mers are putative non-error k-mers that have occurred at least twice in the input sequencing data, and thus a 100% match is expected in the case that the genomic region in question has already been covered. If one or more k-mers from the connecting sequence are not found in the duplicate filter, the pseudo-read is kept and is extended outwards to its full length (Figure
<xref ref-type="fig" rid="F2">2</xref>
, step (7)). The k-mers of the extended sequence are then added to the duplicate filter, and the sequence is written to the output pseudo-reads file.</p>
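<p>The duplicate-tracking logic for a single query can be sketched as follows. A Python set stands in for the duplicate Bloom filter, the extension step is passed in as a callable, and all names are hypothetical.</p>

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def emit_if_novel(query, pseudo_read, extend, k, duplicate_filter, output):
    """Extend and emit pseudo_read unless every k-mer of query was already assembled."""
    if all(kmer in duplicate_filter for kmer in kmers(query, k)):
        return False  # a 100% match: the region is already covered
    extended = extend(pseudo_read)  # extension only runs for novel sequence
    duplicate_filter.update(kmers(extended, k))
    output.append(extended)
    return True


duplicate_filter, output = set(), []  # a set stands in for the Bloom filter


def identity(seq):
    """Placeholder for the real extension step."""
    return seq


print(emit_if_novel("CGTA", "ACGTAC", identity, 4, duplicate_filter, output))  # True
print(emit_if_novel("GTAC", "CGTACC", identity, 4, duplicate_filter, output))  # False
```

<p>For connected read pairs the query would be the connecting sequence, while for unconnected reads it would be the error-corrected read itself.</p>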
<p>In the case of unconnected reads, the reads must first be corrected prior to querying the duplicate filter (Figure
<xref ref-type="fig" rid="F2">2</xref>
, step (4)). This is done by first extracting the longest contiguous sequence of non-error k-mers within the read, where k-mers that are present in the Bloom filter de Bruijn graph are considered to be putative non-error k-mers. An additional step is then performed to correct for recurrent read-errors that may have made it past the two-level Bloom filter. Starting from the rightmost k-mer of the selected subsequence, the algorithm steps left by k nodes, aborting the correction step if it encounters a branching point or dead-end before walking the full distance. As the longest branch that can be created by a single sequencing error is k nodes, this navigates out of any possible branch or bubble created by an error (red nodes of Figure
<xref ref-type="fig" rid="F1">1</xref>
). Finally, the algorithm steps right up to (k+1) nodes to generate a high confidence sequence for querying the duplicate filter. The second rightward step stops early upon encountering a branching point or dead-end, but any sequence generated up to that point is kept, and is still used to query the duplicate filter. Following error correction, the subsequent steps for handling unconnected reads are similar to the case for connected reads. If the high confidence sequence contains k-mers that are not found in the duplicate filter, the sequence is extended to its full length, added to the duplicate filter, and written to the output pseudo-reads file.</p>
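<p>The correction procedure for unconnected reads can be approximated by the sketch below. It interprets a branching point as more than one neighbour in the walking direction, uses a Python set in place of the Bloom filter de Bruijn graph, and uses hypothetical function names; the details of the real implementation may differ.</p>

```python
def preds(kmer, graph):
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in graph]


def succs(kmer, graph):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in graph]


def longest_good_run(read, k, graph):
    """Return (first, last) k-mer indices of the longest run of graph k-mers, or None."""
    best, start = None, None
    n = len(read) - k + 1
    for i in range(n + 1):  # the extra iteration closes a run that reaches the read end
        good = i != n and read[i:i + k] in graph
        if good and start is None:
            start = i
        elif not good and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best


def high_confidence_sequence(read, k, graph):
    """Walk k nodes left, then up to k+1 nodes right, from the end of the best run."""
    run = longest_good_run(read, k, graph)
    if run is None:
        return None
    node = read[run[1]:run[1] + k]  # rightmost k-mer of the run
    for _ in range(k):  # step left out of any error branch
        options = preds(node, graph)
        if len(options) != 1:
            return None  # branching point or dead end: abort the correction
        node = options[0]
    seq = node
    for _ in range(k + 1):  # step right; an early stop still keeps the sequence
        options = succs(node, graph)
        if len(options) != 1:
            break
        node = options[0]
        seq += node[-1]
    return seq


graph = {"ACT", "CTG", "TGA", "GAT", "ATC", "TCC"}  # linear path ACTGATCC, k = 3
print(high_confidence_sequence("TGATCG", 3, graph))  # CTGATCC
```

<p>In this example the final base of the read is an error (TCG is not in the graph), and the walk recovers the graph-supported sequence ending in TCC.</p>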
<p>Finally, some additional look-ahead logic is employed in the extension algorithm to handle the common cases of false positive branches and simple bubbles created by heterozygous SNPs. All branches shorter than or equal to three nodes in length are assumed to be false positive branches and are ignored during extension. Upon reaching a fork with two (non-false-positive) branches, a look-ahead of (k+1) nodes is performed to see if the branches re-converge. If so, the bubble is collapsed and the extension continues.</p>
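<p>The two look-ahead rules can be sketched as follows: branches of three nodes or fewer are treated as false positives, and two surviving branches are checked for re-convergence within (k+1) steps. The sketch uses a Python set in place of the Bloom filter and hypothetical function names, and it does not show how the collapsed bubble is written to the output.</p>

```python
def succs(kmer, graph):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in graph]


def reaches_depth(node, graph, depth):
    """True if some path starting at node contains at least `depth` nodes."""
    if depth == 1:
        return True
    return any(reaches_depth(n, graph, depth - 1) for n in succs(node, graph))


def true_branches(node, graph, max_fp_len=3):
    """Successors of node whose branch is longer than max_fp_len nodes."""
    return [n for n in succs(node, graph) if reaches_depth(n, graph, max_fp_len + 1)]


def linear_walk(node, graph, steps):
    """Follow unambiguous successors from node, returning the nodes visited."""
    seen = [node]
    for _ in range(steps):
        nxt = succs(seen[-1], graph)
        if len(nxt) != 1:
            break
        seen.append(nxt[0])
    return seen


def bubble_rejoins(a, b, graph, k):
    """True if the branches starting at a and b share a node within k+1 steps."""
    left = set(linear_walk(a, graph, k + 1))
    return bool(left.intersection(linear_walk(b, graph, k + 1)))


# SNP bubble between ACT and ATC, plus one short false-positive branch (CTC).
graph = {"ACT", "CTG", "TGA", "GAT", "ATC", "TCC", "CTT", "TTA", "TAT", "CTC"}
print(succs("ACT", graph))                     # ['CTC', 'CTG', 'CTT']
print(true_branches("ACT", graph))             # ['CTG', 'CTT']
print(bubble_rejoins("CTG", "CTT", graph, 3))  # True
```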
</sec>
</sec>
<sec>
<title>Results and discussion</title>
<sec>
<title>Read-elongation tools comparison</title>
<p>To evaluate Konnector v2.0, we performed a comparison with several other read-elongation tools: ELOPER [
<xref ref-type="bibr" rid="B8">8</xref>
], GapFiller [
<xref ref-type="bibr" rid="B9">9</xref>
], the MaSuRCA super-reads module [
<xref ref-type="bibr" rid="B5">5</xref>
], and the previously published version of Konnector [
<xref ref-type="bibr" rid="B1">1</xref>
].</p>
<p>ELOPER v1.2 (ELOngation of Paired-End Reads) [
<xref ref-type="bibr" rid="B8">8</xref>
] operates by calculating
<italic>gapped overlaps </italic>
between read pairs, where a gapped overlap requires simultaneous overlap of both reads across two read pairs. The main idea of the algorithm is that overlaps across read pairs yield higher-confidence sequence extensions than overlaps between individual reads alone. The program produces extended paired-end reads as output and single-end pseudo-reads in cases where paired reads can be extended far enough to overlap with their mates. The all-by-all computation of gapped overlaps between read pairs is realized using a hash table-based approach.</p>
<p>GapFiller v2.1.1 [
<xref ref-type="bibr" rid="B9">9</xref>
] fills the sequencing gap between paired-end reads using a seed-and-extend approach, where each input read is considered in turn as a seed. Reads are iteratively extended towards their mates by identifying overlapping reads, building a consensus sequence for the extension, and then repeating the process with the extended sequence. GapFiller uses the eventual overlap of an extended read with its mate as a correctness check, and chooses not to continue the extension beyond the fragment length in favour of higher confidence results. The algorithm for detecting overlaps is implemented by computing some
<italic>fingerprint </italic>
values for the prefixes and suffixes of each read, and storing the mapping between fingerprints and reads in a hash-table. In order to calculate the consensus sequences during extension, the full set of input read sequences is stored in memory using a compressed 2-bit representation.</p>
<p>MaSuRCA v2.2.1 [
<xref ref-type="bibr" rid="B5">5</xref>
] is an extension of the CABOG overlap-layout-consensus assembler [
<xref ref-type="bibr" rid="B18">18</xref>
] that preprocesses the input short sequencing reads to generate a highly-reduced set of "super-reads" for input to the Celera assembler [
<xref ref-type="bibr" rid="B10">10</xref>
]. Much like the extension feature of Konnector v2.0, the super-reads are generated by extending the reads outward to the next branching point or dead-end within a de Bruijn graph. These "k-unitig" sequences are then joined by spanning read pairs or bridging single-end reads, in cases where such links are unambiguous.</p>
<p>The previously published version of Konnector [
<xref ref-type="bibr" rid="B1">1</xref>
] uses the same concept for connecting read pairs as Konnector v2.0, but does not include the sequence extension or duplicate filtering logic. Its output format is most similar to GapFiller, in the sense that it generates one fragment-length sequence for each successfully connected read pair.</p>
<p>We compared the performance and results of the tools across four paired-end sequencing data sets from organisms with a wide range of genome sizes:
<italic>E. coli, S. cerevisiae, C. elegans</italic>
, and
<italic>H. sapiens </italic>
(Table
<xref ref-type="table" rid="T1">1</xref>
). The
<italic>E. coli </italic>
data set consists of 100 bp synthetic reads generated with the pIRS read simulator [
<xref ref-type="bibr" rid="B19">19</xref>
] using a 0.1% error rate, 50x coverage, and an insert size of 400 ± 50 bp, while the other three data sets are experimental paired-end Illumina sequencing data with coverage levels ranging from 26x to 89x.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Datasets analyzed.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Organism</th>
<th align="center">Genome Size</th>
<th align="center">NGS data source</th>
<th align="center">Read length (bp)</th>
<th align="center">Read pairs (M)</th>
<th align="center">Fragment size (bp)</th>
<th align="center">Fold coverage</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">
<italic>E. coli</italic>
<break></break>
K-12</td>
<td align="center">5 Mbp</td>
<td align="center">Simulated</td>
<td align="center">PE100</td>
<td align="center">1.2</td>
<td align="center">400</td>
<td align="center">50X</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">
<italic>S. cerevisiae</italic>
</td>
<td align="center">12 Mbp</td>
<td align="center">Experimental
<break></break>
SRA:ERR156523</td>
<td align="center">PE100</td>
<td align="center">1.6</td>
<td align="center">300</td>
<td align="center">26X</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">
<italic>C. elegans</italic>
</td>
<td align="center">97 Mbp</td>
<td align="center">Experimental
<break></break>
SRA:ERR294494</td>
<td align="center">PE100</td>
<td align="center">44.7</td>
<td align="center">450</td>
<td align="center">89X</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">
<italic>H. sapiens</italic>
<break></break>
NA19238</td>
<td align="center">3 Gbp</td>
<td align="center">Experimental
<break></break>
SRA:ERR309932</td>
<td align="center">PE250</td>
<td align="center">457.0</td>
<td align="center">550</td>
<td align="center">76X</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For each combination of data set and tool, we measured running time, peak memory usage, N50 length of the output pseudo-reads, sum length of the misassembled pseudo-reads, and percent coverage of the reference genome (Table
<xref ref-type="table" rid="T2">2</xref>
). The N50 length was calculated using the QUAST [
<xref ref-type="bibr" rid="B20">20</xref>
] assembly assessment tool, except for the human data set where the 'abyss-fac' utility (ABySS v1.5.2 [
<xref ref-type="bibr" rid="B11">11</xref>
]) was used instead. The "Misassembled Reads Length" column of Table
<xref ref-type="table" rid="T2">2</xref>
was also calculated by QUAST, and reports the sum length of all pseudo-reads that had split alignments to the reference with distance greater than 1 kb, overlap greater than 1 kb, or mappings to different strands/chromosomes. We found that QUAST was not able to scale to an analysis of the human Konnector and Konnector v2.0 pseudo-reads, and so those results were omitted from Table
<xref ref-type="table" rid="T2">2</xref>
. Finally, the genome coverage results were calculated by aligning the pseudo-reads to the reference with bwa mem v0.7.12 [
<xref ref-type="bibr" rid="B21">21</xref>
], with the multimapping option (-a) turned on, and then using the resulting BAM file as input to the bedtools v2.17.0 [
<xref ref-type="bibr" rid="B22">22</xref>
] 'genomecov' command.</p>
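<p>For reference, the N50 statistic reported above can be computed as in the following sketch. This is the textbook definition only; QUAST and 'abyss-fac' may apply their own conventions (for example, minimum-length thresholds), which are not reproduced here, and the function name is hypothetical.</p>

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


print(n50([2, 3, 4, 5, 6]))  # 5: the sequences of length 6 and 5 hold 11 of 20 bases
```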
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Comparative analysis of read elongation tools.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="left">Running time
<break></break>
(hms)</th>
<th align="left">Peak memory
<break></break>
(MB)</th>
<th align="left">N50
<break></break>
(bp)</th>
<th align="left">Misassembled Reads Length
<break></break>
(bp)</th>
<th align="left">Percent genome coverage</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="6">
<italic>E. coli </italic>
(synthetic)</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">ELOPER</td>
<td align="left">10m46s</td>
<td align="left">19013</td>
<td align="left">501</td>
<td align="left">16146</td>
<td align="left">
<bold>100.00</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">GapFiller</td>
<td align="left">32m39s</td>
<td align="left">476</td>
<td align="left">396</td>
<td align="left">14103</td>
<td align="left">
<bold>100.00</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">MaSuRCA super-reads</td>
<td align="left">
<bold>2m56s</bold>
</td>
<td align="left">4669</td>
<td align="left">
<bold>54103</bold>
</td>
<td align="left">
<bold>0</bold>
</td>
<td align="left">
<bold>100.00</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector (k = 50)</td>
<td align="left">6m15s</td>
<td align="left">
<bold>81</bold>
</td>
<td align="left">406</td>
<td align="left">3133</td>
<td align="left">
<bold>100.00</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector 2 (k = 70)</td>
<td align="left">5m0s</td>
<td align="left">100</td>
<td align="left">32012</td>
<td align="left">276</td>
<td align="left">99.99</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left" colspan="6">
<italic>S. cerevisiae</italic>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">ELOPER</td>
<td align="left">97h55m23s</td>
<td align="left">37119</td>
<td align="left">144</td>
<td align="left">71426</td>
<td align="left">
<bold>99.32</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">GapFiller</td>
<td align="left">3h13m18s</td>
<td align="left">666</td>
<td align="left">332</td>
<td align="left">96869</td>
<td align="left">97.04</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">MaSuRCA super-reads</td>
<td align="left">
<bold>4m2s</bold>
</td>
<td align="left">5294</td>
<td align="left">1684</td>
<td align="left">129828</td>
<td align="left">98.67</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector (k = 50)</td>
<td align="left">8m34s</td>
<td align="left">
<bold>231</bold>
</td>
<td align="left">315</td>
<td align="left">75435</td>
<td align="left">97.94</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector 2 (k = 40)</td>
<td align="left">7m1s</td>
<td align="left">232</td>
<td align="left">
<bold>4690</bold>
</td>
<td align="left">
<bold>67505</bold>
</td>
<td align="left">98.99</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left" colspan="6">
<italic>C. elegans</italic>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">ELOPER</td>
<td align="left" colspan="5">exceeds available memory (120GB)</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">GapFiller</td>
<td align="left" colspan="5">exceeds available memory (120GB)</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">MaSuRCA super-reads</td>
<td align="left">
<bold>2h2m17s</bold>
</td>
<td align="left">80742</td>
<td align="left">2554</td>
<td align="left">1925740</td>
<td align="left">
<bold>99.98</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector (k = 55)</td>
<td align="left">5h5m21s</td>
<td align="left">
<bold>1954</bold>
</td>
<td align="left">475</td>
<td align="left">NA</td>
<td align="left">99.80</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector 2 (k = 80)</td>
<td align="left">3h30m23s</td>
<td align="left">2193</td>
<td align="left">
<bold>6232</bold>
</td>
<td align="left">
<bold>837480</bold>
</td>
<td align="left">99.88</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left" colspan="6">
<italic>H. sapiens (NA19238)</italic>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">ELOPER</td>
<td align="left" colspan="5">not attempted</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">GapFiller</td>
<td align="left" colspan="5">not attempted</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">MaSuRCA super-reads</td>
<td align="left" colspan="5">exceeds available memory (120 GB)</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector (k = 150)</td>
<td align="left">4d9h15m48s</td>
<td align="left">
<bold>410381</bold>
</td>
<td align="left">556</td>
<td align="left">NA</td>
<td align="left">
<bold>94.15</bold>
</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Konnector 2 (k = 180)</td>
<td align="left">
<bold>20h47m24s</bold>
</td>
<td align="left">471905</td>
<td align="left">
<bold>3051</bold>
</td>
<td align="left">NA</td>
<td align="left">94.01</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The Konnector and Konnector v2.0 jobs for the comparison were run across a range of k-mer lengths to achieve the best possible results. For the previous version of Konnector, the run with the highest percentage of connected read pairs was selected, whereas for Konnector v2.0, a k-mer size was selected that provided a favourable combination of both N50 and misassembled reads length (Figure
<xref ref-type="fig" rid="F3">3</xref>
).</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Comparison of pseudo-read tools by N50 and misassembled reads length</bold>
. Results are shown for Konnector v2.0 and three other pseudo-read-generating tools across
<italic>E. coli </italic>
(synthetic),
<italic>S. cerevisiae</italic>
, and
<italic>C. elegans </italic>
sequencing data sets. The "misassembled reads length" on the x-axis of each plot denotes the sum length of all pseudo-reads reported as misassembled by QUAST. MaSuRCA performs best on the synthetic
<italic>E. coli </italic>
data set, producing the highest N50 and creating no misassemblies. However, on the experimental
<italic>S. cerevisiae </italic>
and
<italic>C. elegans </italic>
data sets, Konnector v2.0 outperforms the other tools in terms of both N50 and misassembled sequence, for a range of k-mer lengths.</p>
</caption>
<graphic xlink:href="1755-8794-8-S3-S1-3"></graphic>
</fig>
<p>From the results of Table
<xref ref-type="table" rid="T2">2</xref>
, we observe that MaSuRCA was generally the fastest tool. While Konnector and Konnector v2.0 showed competitive run times, ELOPER and GapFiller were notably slower, and did not scale well to larger data sets. In the category of memory usage, both versions of Konnector used the least memory on every data set where a comparison was possible, by more than an order of magnitude relative to ELOPER and MaSuRCA and by roughly three- to six-fold relative to GapFiller, owing to their use of Bloom filters rather than hash tables.</p>
<p>The N50 and Misassembled Reads Length results from Table
<xref ref-type="table" rid="T2">2</xref>
are plotted in Figure
<xref ref-type="fig" rid="F3">3</xref>
, with additional data points shown for alternate runs of Konnector v2.0 with different k-mer sizes. The plots show that Konnector v2.0 generated the longest pseudo-reads and the least misassembled sequence for the experimental
<italic>S. cerevisiae </italic>
and
<italic>C. elegans </italic>
data sets, while MaSuRCA generated the longest and most accurate pseudo-reads for the synthetic
<italic>E. coli </italic>
data set.</p>
<p>Working with a maximum of 120 GB available memory on any single machine, Konnector and Konnector v2.0 were the only tools that could be run on the largest of the four data sets in Table
<xref ref-type="table" rid="T2">2</xref>
(
<italic>H. sapiens</italic>
). One of the main advantages of Konnector for this data set was the ability to split work across machines. This was accomplished by first building a reusable Bloom filter file with the companion 'abyss-bloom' utility (ABySS v1.5.2), and then sharing this file across 20 parallel Konnector jobs, each processing a disjoint subset of the paired-end reads. The two-level Bloom filter size was 40 GB, and each of the jobs was run on a machine with 12 cores and 48 GB RAM. The wall clock time for the job was less than 24 hours, and the aggregate memory requirement for the job was just under 0.5 TB. We note that the larger memory usage of Konnector v2.0 is due to the use of an additional Bloom filter for tracking duplicate sequences. The large improvement in running time between Konnector and Konnector v2.0 on the
<italic>H. sapiens </italic>
data set is due primarily to the introduction of multithreaded Bloom filter construction in Konnector v2.0.</p>
</sec>
<sec>
<title>Sealer: a Konnector-based gap-closing application</title>
<p>A natural application of Konnector is the automated finishing of genomes, by systematically targeting all regions of unresolved bases, or gaps, in draft genomes of wide-ranging sizes. This is accomplished by first identifying these scaffold gaps, deriving flanking sequences on the 5' and 3' ends of each gap, running Konnector with a comprehensive short-read data set, and patching the gaps by placing successfully merged sequences in those regions. We have developed a stand-alone utility called Sealer for this specific application [
<xref ref-type="bibr" rid="B23">23</xref>
].</p>
<p>To test the utility of Konnector for filling scaffold gaps, we ran Sealer on an ABySS
<italic>E. coli </italic>
genome assembly (5 Mbp) and, to assess the scalability of the approach, on an ABySS
<italic>H. sapiens </italic>
(3.3 Gbp) draft assembly of next-generation Illumina sequences (SRA:ERR309932) derived from the 1000 Genomes Project (individual NA19238) (Table
<xref ref-type="table" rid="T3">3</xref>
). For
<italic>E. coli</italic>
, we were able to successfully close all but one gap using a single k-mer size of 90 bp. On the human assembly, gaps were closed with Sealer using 31 k-mer sizes (250 - 130 bp, decrementing by 10, and 125 - 40 bp, decrementing by 5; parameters for Konnector were -B 1000 -F 700 -P10), and we compared the result to two similar tools, GapFiller (v1.10) [
<xref ref-type="bibr" rid="B24">24</xref>
] and SOAPdenovo2 GapCloser (v1.12) [
<xref ref-type="bibr" rid="B25">25</xref>
]. Default settings were used for both tools in our tests, with the number of compute threads maximized where needed (-t 16 for GapCloser on the human data set). On the
<italic>H. sapiens </italic>
draft assembly, GapFiller was manually stopped after running for over 350 hours (approximately 14 days) without completion or output. All Sealer processes were executed on a 12-core computer running CentOS 5.4 with two Intel Xeon X5650 CPUs @ 2.67 GHz and 48 GB RAM. GapFiller and GapCloser were benchmarked on a machine running CentOS 5.10 with 16 cores @ 2.13 GHz and 125 GB RAM. The GapCloser run on the
<italic>H. sapiens </italic>
data was performed on a CentOS 5.9 machine with 16 cores @ 2.13 GHz and 236 GB RAM to accommodate its high memory requirement. We also compared the results of Sealer with two versions of Konnector on the
<italic>E. coli </italic>
and
<italic>H. sapiens </italic>
data sets, and noticed a marked improvement in speed of execution: ~12 h compared to ~29 h for runs on human data with Konnector v2.0 and Konnector v1.0, respectively. We also noted improved sensitivity in Sealer results when used in conjunction with Konnector v2.0 (6,566 more gaps closed, or 2.8% of all gaps). To test the limits of scalability of Konnector, we applied Sealer to a draft white spruce genome assembly of length 20 Gbp [
<xref ref-type="bibr" rid="B26">26</xref>
] (data not shown).</p>
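<p>As a rough illustration of the k-mer schedule described above for the human assembly, the following Python sketch enumerates the k-mer sizes (250 to 130 bp in steps of 10, then 125 to 40 bp in steps of 5) and checks that they number 31. The function name sealer_k_schedule is our own and is not part of Sealer or Konnector; the sketch only reproduces the arithmetic of the schedule, not how the tools are invoked.</p>
<preformat><![CDATA[
```python
def sealer_k_schedule():
    """Return the k-mer sizes (bp) in the order described in the text."""
    coarse = list(range(250, 129, -10))  # 250, 240, ..., 130
    fine = list(range(125, 39, -5))      # 125, 120, ..., 40
    return coarse + fine


if __name__ == "__main__":
    ks = sealer_k_schedule()
    # Number of k-mer sizes, followed by the largest and smallest size.
    print(len(ks), ks[0], ks[-1])
```
]]></preformat>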
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Performance evaluation of Sealer and other gap-filling applications for finishing draft genomes.</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<bold>Draft genome species</bold>
</td>
<td align="left">
<bold>Total gaps</bold>
</td>
<td align="left">
<bold>Software</bold>
</td>
<td align="left">
<bold>Gaps completely closed</bold>
</td>
<td align="left">
<bold>% Success</bold>
</td>
<td align="left">
<bold>Wall clock time (hh:mm)</bold>
</td>
<td align="left">
<bold>Memory (GB)</bold>
</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">
<italic>E. coli</italic>
</td>
<td align="center">18</td>
<td align="center">Sealer K2</td>
<td align="center">
<bold>17</bold>
</td>
<td align="center">
<bold>94.4</bold>
</td>
<td align="center">00:01</td>
<td align="center">0.5</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">Sealer K1</td>
<td align="center">
<bold>17</bold>
</td>
<td align="center">
<bold>94.4</bold>
</td>
<td align="center">00:20</td>
<td align="center">0.5</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">GapCloser</td>
<td align="center">2</td>
<td align="center">11.1</td>
<td align="center">
<bold>00:05</bold>
</td>
<td align="center">25.7</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">GapFiller</td>
<td align="center">15</td>
<td align="center">83.3</td>
<td align="center">00:43</td>
<td align="center">
<bold>0.4</bold>
</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">
<italic>H. sapiens</italic>
</td>
<td align="center">237,406</td>
<td align="center">Sealer K2</td>
<td align="center">
<bold>127,242</bold>
</td>
<td align="center">
<bold>53.6</bold>
</td>
<td align="center">
<bold>12:09</bold>
</td>
<td align="center">
<bold>22.2</bold>
</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">Sealer K1</td>
<td align="center">120,676</td>
<td align="center">50.8</td>
<td align="center">29:19</td>
<td align="center">
<bold>22.2</bold>
</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">GapCloser</td>
<td align="center">116,297</td>
<td align="center">48.9</td>
<td align="center">83:15</td>
<td align="center">178.1</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">GapFiller</td>
<td align="center" colspan="4">Incomplete. Terminated after 353 hours.</td>
</tr>
</tbody>
</table>
</table-wrap>
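<p>The comparison between the two Sealer configurations on the human assembly can be rechecked directly from the Table 3 values. The short Python sketch below does so; the variable and function names are ours and purely illustrative, and the numbers are copied from the Sealer K2 and Sealer K1 rows of the table.</p>
<preformat><![CDATA[
```python
# Values copied from the H. sapiens rows of Table 3.
TOTAL_GAPS = 237406
CLOSED = {"Sealer K2": 127242, "Sealer K1": 120676}
WALL_CLOCK = {"Sealer K2": "12:09", "Sealer K1": "29:19"}


def to_minutes(hhmm):
    """Convert an 'hh:mm' wall clock string to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def extra_gaps_closed():
    """Additional gaps closed with Konnector v2.0 relative to v1.0."""
    return CLOSED["Sealer K2"] - CLOSED["Sealer K1"]


def extra_gaps_percent():
    """The additional gaps as a percentage of all gaps in the assembly."""
    return round(100.0 * extra_gaps_closed() / TOTAL_GAPS, 1)


def speedup():
    """Ratio of the Sealer K1 run time to the Sealer K2 run time."""
    slow = to_minutes(WALL_CLOCK["Sealer K1"])
    fast = to_minutes(WALL_CLOCK["Sealer K2"])
    return round(slow / fast, 1)


if __name__ == "__main__":
    print(extra_gaps_closed(), extra_gaps_percent(), speedup())
```
]]></preformat>
<p>Under these inputs the difference is 6,566 gaps, which corresponds to 2.8% of the 237,406 gaps, and the run with Konnector v2.0 is roughly 2.4 times faster.</p>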
</sec>
<sec>
<title>KVarScan: a Konnector-based method for indel detection</title>
<p>Konnector long pseudo-reads can potentially improve the sensitivity of existing variant detection pipelines. To explore this idea, we conducted an experiment where we detected insertions and deletions (indels) using VarScan [
<xref ref-type="bibr" rid="B27">27</xref>
] 'pileup2indel' (version 2.3.7 with default parameters), and compared the results when using either regular reads or Konnector pseudo-reads as input. We refer to the two protocols as VarScan and KVarScan, respectively. For KVarScan, Konnector reads were generated by running Konnector on regular reads for a range of k-mer sizes from 90 bp to 30 bp, with a step size of 10 bp. Starting with the largest k-mer size of 90 bp, leftover read pairs that were not connected were used for the next run of Konnector with the next smaller k-mer size. After running Konnector with a k-mer size of 30 bp, any remaining unconnected reads were used together with the connected reads as input to the VarScan run. All connected reads output from the -p/--all-paths and -o/--output-prefix parameters of Konnector were used. The other parameters used for Konnector included: max path set to 4 (-P), max mismatches set to nolimit (-M), path identity set to 98 (-X), max branches set to 100 (-B), max fragment set to 525 (-F). Prior to the Konnector runs, a Bloom filter for each k-mer size was built using the abyss-bloom utility with the trim quality (-q) set to 15 and levels (-l) set to 2.</p>
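<p>The iterative k-mer sweep described above can be sketched as a simple loop in which only the read pairs left unconnected at one k-mer size are passed on to the next, smaller size. The Python sketch below is a minimal illustration under stated assumptions: iterative_connect, toy_connect and the toy read pairs are hypothetical names and data of our own, and the connect callable stands in for an actual Konnector run, which is not invoked here.</p>
<preformat><![CDATA[
```python
def iterative_connect(read_pairs, k_values, connect):
    """Try each k in turn, retrying only the pairs that failed so far.

    connect(pairs, k) is a caller-supplied callable (hypothetical here)
    that returns (connected, unconnected) for one run at k-mer size k.
    """
    remaining = list(read_pairs)
    connected = []
    for k in k_values:
        if not remaining:
            break
        merged, remaining = connect(remaining, k)
        connected.extend(merged)
    # Both lists would be passed on together to the variant caller.
    return connected, remaining


# k-mer sizes from 90 bp down to 30 bp in steps of 10 bp.
K_VALUES = list(range(90, 29, -10))


def toy_connect(pairs, k):
    """Stand-in for a real run: a pair 'connects' once k drops to its limit."""
    merged = [(name, kmax) for name, kmax in pairs if k <= kmax]
    unconnected = [(name, kmax) for name, kmax in pairs if k > kmax]
    return merged, unconnected


if __name__ == "__main__":
    toy_pairs = [("pairA", 90), ("pairB", 50), ("pairC", 10)]
    done, left = iterative_connect(toy_pairs, K_VALUES, toy_connect)
    print(done, left)
```
]]></preformat>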
<p>We performed a comparison of the VarScan and KVarScan methods on synthetic human data from chromosome 10 by simulating indels in the size range of 10 - 200 bp on hg19 using RSVSim v1.2.1 [
<xref ref-type="bibr" rid="B28">28</xref>
]. In total, 224 insertions (10 - 192 bp) and 216 deletions (10 - 144 bp) were generated. We used pIRS v1.1.1 [
<xref ref-type="bibr" rid="B19">19</xref>
] to generate a diploid human chromosome 10 sequence, and combined it with the rearranged sequence to simulate a 30x coverage library of 100 bp Illumina PET reads. The average insert size was set at 400 bp; default parameters were used otherwise.</p>
<p>Both VarScan and KVarScan were able to detect small indels as short as 10 bp, the shortest available in the simulated data. However, the maximum size of indels detected by VarScan was 30 bp, while it was 99 bp for the KVarScan protocol. As illustrated by the distributions of the sizes of indels detected in Figure
<xref ref-type="fig" rid="F4">4</xref>
, we note that the use of long pseudo-reads generated by Konnector expands the range of indel sizes that VarScan can detect. Hence, long pseudo-reads may find an application in profiling cancer genomes and other genomes that harbour structural variations that would otherwise be missed by shorter sequence reads.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Indels detected by VarScan using unaltered reads ("VarScan") or Konnector long pseudo-reads ("KVarScan") as input</bold>
. Results are shown for synthetic read data generated from hg19 chromosome 10 and containing 440 simulated indels. The indels detected by VarScan (green) range from 10 bp to 30 bp, whereas the indels detected by KVarScan (blue) range from 10 bp to 99 bp.</p>
</caption>
<graphic xlink:href="1755-8794-8-S3-S1-4"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>Long reads are highly desirable for both
<italic>de novo </italic>
assembly applications and reference-based applications such as variant calling. While long-read sequencing technologies such as those from Pacific Biosciences (Menlo Park, CA) and Oxford Nanopore Technologies (Oxford, UK) have yet to hit the mainstream, bioinformatics algorithms continue to be developed to better exploit the sequence and distance information captured by Illumina paired-end sequencing reads, currently ranging in length from 150 - 300 bp and spanning DNA fragments with sizes of 300 - 1000 bp.</p>
<p>In this paper we have presented Konnector v2.0, a tool for producing long "pseudo-reads" from Illumina paired-end libraries. While many tools exist to merge overlapping paired-end reads (e.g. [
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B7">7</xref>
]), our software addresses the more challenging problem of filling the sequencing gap between non-overlapping reads. Konnector accomplishes this by building a compact, Bloom filter based representation of a de Bruijn graph and performing a constrained path search between each pair of reads within the graph. Konnector v2.0 introduces a significant improvement to the algorithm by additionally extending sequences outwards within the de Bruijn graph, up to the point where such extensions are unambiguous. It also keeps the functionality of Konnector v1.0, as an option.</p>
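<p>To make the idea of a path search over an implicit de Bruijn graph concrete, the following Python sketch stores k-mers in a plain set (standing in for the Bloom filter, so it has no false positives), finds neighbours by querying the four possible one-base extensions, and runs a bounded breadth-first search between two k-mers. It is a simplified illustration under our own assumptions, not Konnector's implementation: the function names are hypothetical, reverse complements are ignored, and only a single shortest path is returned.</p>
<preformat><![CDATA[
```python
from collections import deque


def kmers(seq, k):
    """Return the set of all k-mers in seq (a stand-in for a Bloom filter)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def successors(kmer, kmer_set):
    """Find neighbours by querying the four possible one-base extensions."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmer_set]


def connect(start, goal, kmer_set, max_steps):
    """Bounded breadth-first search from start to goal.

    Returns the spelled-out sequence of the first path found, or None if
    the goal is not reachable within max_steps single-base extensions.
    """
    queue = deque([(start, start)])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) - len(start) >= max_steps:
            continue  # path length limit reached, do not extend further
        for nxt in successors(node, kmer_set):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + nxt[-1]))
    return None


if __name__ == "__main__":
    graph = kmers("ACGTTGCATGGA", 4)
    print(connect("ACGT", "TGGA", graph, 20))
```
]]></preformat>
<p>In this toy example the two flanking k-mers play the role of a read pair, and the path limit plays the role of a maximum fragment size.</p>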
<p>In a comparison of Konnector v2.0 against several similar tools, we have demonstrated that the software generates pseudo-reads with high accuracy, high yield, low memory usage, and fast run times. Owing to its use of a Bloom filter de Bruijn graph, Konnector was the only tool able to process 76x human sequencing data on a set of computing nodes with 48 GB of RAM, and was able to do so in under 24 hours.</p>
<p>While the long pseudo-read generating tools were all reported for their utility in
<italic>de novo </italic>
assembly applications in earlier studies [
<xref ref-type="bibr" rid="B6">6</xref>
-
<xref ref-type="bibr" rid="B8">8</xref>
], we demonstrated the utility of our tool on two novel use cases: assembly finishing and variant detection. With its scaling properties and broad applications, we think Konnector will be an enabling technology in many genomics studies.</p>
</sec>
<sec>
<title>Availability and requirements</title>
<p>
<bold>Project name: </bold>
Konnector</p>
<p>
<bold>Project home page: </bold>
<ext-link ext-link-type="uri" xlink:href="http://www.bcgsc.ca/platform/bioinfo/software/konnector">http://www.bcgsc.ca/platform/bioinfo/software/konnector</ext-link>
</p>
<p>
<bold>Source code for version evaluated in paper: </bold>
<ext-link ext-link-type="uri" xlink:href="https://github.com/bcgsc/abyss/tree/konnector2-prerelease">https://github.com/bcgsc/abyss/tree/konnector2-prerelease</ext-link>
</p>
<p>
<bold>Operating system(s): </bold>
Unix</p>
<p>
<bold>Programming language: </bold>
C++</p>
<p>
<bold>Other requirements: </bold>
Boost Graph Library; the Google sparsehash library is recommended</p>
<p>
<bold>License: </bold>
Free for academic use under the British Columbia Cancer Agency's academic license</p>
<p>
<bold>Any restrictions to use by non-academics: </bold>
Contact ibirol@bcgsc.ca for license</p>
</sec>
<sec>
<title>List of abbreviations used</title>
<p>ABySS: Assembly By Short Sequences; CABOG: Celera Assembler with the Best Overlap Graph; ELOPER: Elongation of Paired-end Reads; MaSuRCA: Maryland Super-Read Celera Assembler.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>BV implemented the search and extension algorithms for Konnector v2.0, and wrote descriptions of the algorithm and tools comparison. CY conducted analyses for the tools comparison and for Sealer. KR ran QUAST evaluations and other exploratory data analyses. ZX and RC did the analysis and writing for the KVarScan application. HM and JC provided improved Bloom filter algorithms and implementations. SDJ implemented the Bloom filter class and the de Bruijn graph interface for Konnector, and also implemented algorithmic improvements for the new version of Sealer. RLW did analysis and writing for the Sealer section, ran jobs for the tools comparison, and oversaw the planning and organization of the paper. IB designed the algorithms for Konnector v2.0 and oversaw the development, evaluation, and manuscript preparation.</p>
</sec>
</body>
<back>
<sec>
<title>Declarations</title>
<p>The authors thank the funding organizations, Genome Canada, British Columbia Cancer Foundation, and Genome British Columbia for their partial support of the publication. Research reported in this publication was also partly supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG007182. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or other funding organizations.</p>
<p>This article has been published as part of BMC
<italic>Medical Genomics </italic>
Volume 8 Supplement 3, 2015: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Medical Genomics. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S3">http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S3</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="other">
<name>
<surname>Vandervalk</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Raymond</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mohamadi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Attali</surname>
<given-names>DA</given-names>
</name>
<article-title>Konnector: Connecting paired-end reads using a bloom filter de Bruijn graph</article-title>
<source>Bioinformatics and Biomedicine (BIBM) 2014 IEEE International Conference</source>
<year>2014</year>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
<source>Proceedings of the National Academy of Sciences of the United States of America</source>
<year>2001</year>
<volume>98</volume>
<issue>17</issue>
<fpage>9748</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Bloom</surname>
<given-names>BH</given-names>
</name>
<article-title>Space/Time Tradeoffs in Hash Coding With Allowable Errors</article-title>
<source>Communications of the Acm</source>
<year>1970</year>
<volume>13</volume>
<issue>7</issue>
<fpage>422</fpage>
<comment>doi:10.1145/362686.362692</comment>
<pub-id pub-id-type="doi">10.1145/362686.362692</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Chaisson</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<article-title>Short read fragment assembly of bacterial genomes</article-title>
<source>Genome Research</source>
<year>2008</year>
<volume>18</volume>
<fpage>324</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="doi">10.1101/gr.7088808</pub-id>
<pub-id pub-id-type="pmid">18083777</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Zimin</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Puiu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Yorke</surname>
<given-names>JA</given-names>
</name>
<article-title>The MaSuRCA genome assembler</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>21</issue>
<fpage>2669</fpage>
<lpage>77</lpage>
<comment>doi:10.1093/bioinformatics/btt476</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt476</pub-id>
<pub-id pub-id-type="pmid">23990416</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Magoc</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>FLASH: fast length adjustment of short reads to improve genome assemblies</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>21</issue>
<fpage>2957</fpage>
<lpage>63</lpage>
<comment>doi:10.1093/bioinformatics/btr507</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr507</pub-id>
<pub-id pub-id-type="pmid">21903629</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
<article-title>COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>22</issue>
<fpage>2870</fpage>
<lpage>4</lpage>
<comment>doi:10.1093/bioinformatics/bts563</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts563</pub-id>
<pub-id pub-id-type="pmid">23044551</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Silver</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Ben-Elazar</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bogoslavsky</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Yanai</surname>
<given-names>I</given-names>
</name>
<article-title>ELOPER: elongation of paired-end reads as a pre-processing tool for improved de novo genome assembly</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>11</issue>
<fpage>1455</fpage>
<lpage>7</lpage>
<comment>doi:10.1093/bioinformatics/btt169</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt169</pub-id>
<pub-id pub-id-type="pmid">23603334</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="other">
<name>
<surname>Nadalin</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Vezzi</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Policriti</surname>
<given-names>A</given-names>
</name>
<article-title>GapFiller: a de novo assembly approach to fill the gap within paired reads</article-title>
<source>Bmc Bioinformatics</source>
<year>2012</year>
<fpage>13</fpage>
<comment>doi:10.1186/1471-2105-13-s14-s8</comment>
<pub-id pub-id-type="pmid">22264315</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Dew</surname>
<given-names>IM</given-names>
</name>
<name>
<surname>Fasulo</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Flanigan</surname>
<given-names>MJ</given-names>
</name>
<etal></etal>
<article-title>A whole-genome assembly of Drosophila</article-title>
<source>Science</source>
<year>2000</year>
<volume>287</volume>
<fpage>2196</fpage>
<lpage>204</lpage>
<pub-id pub-id-type="doi">10.1126/science.287.5461.2196</pub-id>
<pub-id pub-id-type="pmid">10731133</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<article-title>ABySS: a parallel assembler for short read sequence data</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<issue>6</issue>
<fpage>1117</fpage>
<lpage>23</lpage>
<comment>doi:10.1101/gr.089532.108</comment>
<pub-id pub-id-type="doi">10.1101/gr.089532.108</pub-id>
<pub-id pub-id-type="pmid">19251739</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Boisvert</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Laviolette</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Corbeil</surname>
<given-names>J</given-names>
</name>
<article-title>Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies</article-title>
<source>J Comput Biol</source>
<year>2010</year>
<volume>17</volume>
<issue>11</issue>
<fpage>1519</fpage>
<lpage>33</lpage>
<comment>doi:10.1089/cmb.2009.0238</comment>
<pub-id pub-id-type="doi">10.1089/cmb.2009.0238</pub-id>
<pub-id pub-id-type="pmid">20958248</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>14</issue>
<fpage>1754</fpage>
<lpage>60</lpage>
<comment>doi:10.1093/bioinformatics/btp324</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp324</pub-id>
<pub-id pub-id-type="pmid">19451168</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
<source>Genome biology</source>
<year>2009</year>
<volume>10</volume>
<issue>3</issue>
<fpage>R25</fpage>
<comment>doi:10.1186/gb-2009-10-3-r25</comment>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-3-r25</pub-id>
<pub-id pub-id-type="pmid">19261174</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>Efficient de novo assembly of large genomes using compressed data structures</article-title>
<source>Genome Research</source>
<year>2012</year>
<volume>22</volume>
<issue>3</issue>
<fpage>549</fpage>
<lpage>56</lpage>
<comment>doi:10.1101/gr.126953.111</comment>
<pub-id pub-id-type="doi">10.1101/gr.126953.111</pub-id>
<pub-id pub-id-type="pmid">22156294</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Stranneheim</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Kaller</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Allander</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Andersson</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Arvestad</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lundeberg</surname>
<given-names>J</given-names>
</name>
<article-title>Classification of DNA sequences using Bloom filters</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>13</issue>
<fpage>1595</fpage>
<lpage>600</lpage>
<comment>doi:10.1093/bioinformatics/btq230</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq230</pub-id>
<pub-id pub-id-type="pmid">20472541</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="other">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rizk</surname>
<given-names>G</given-names>
</name>
<article-title>Space-efficient and exact de Bruijn graph representation based on a Bloom filter</article-title>
<source>Algorithms for Molecular Biology</source>
<year>2013</year>
<fpage>8</fpage>
<comment>doi:10.1186/1748-7188-8-22</comment>
<pub-id pub-id-type="pmid">23497444</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Miller</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Venter</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Walenz</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Brownley</surname>
<given-names>A</given-names>
</name>
<etal></etal>
<article-title>Aggressive assembly of pyrosequencing reads with mates</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>2818</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn548</pub-id>
<pub-id pub-id-type="pmid">18952627</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Hu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
<article-title>pIRS: Profile-based Illumina pair-end reads simulator</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>11</issue>
<fpage>1533</fpage>
<lpage>5</lpage>
<comment>doi:10.1093/bioinformatics/bts187</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts187</pub-id>
<pub-id pub-id-type="pmid">22508794</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Gurevich</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Saveliev</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Vyahhi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Tesler</surname>
<given-names>G</given-names>
</name>
<article-title>QUAST: quality assessment tool for genome assemblies</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>8</issue>
<fpage>1072</fpage>
<lpage>5</lpage>
<comment>doi:10.1093/bioinformatics/btt086</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt086</pub-id>
<pub-id pub-id-type="pmid">23422339</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="other">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<article-title>Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM</article-title>
<source>arXiv preprint</source>
<year>2013</year>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Quinlan</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>IM</given-names>
</name>
<article-title>BEDTools: a flexible suite of utilities for comparing genomic features</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>6</issue>
<fpage>841</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq033</pub-id>
<pub-id pub-id-type="pmid">20110278</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Paulino</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Warren</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Vandervalk</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Raymond</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<article-title>Sealer: a scalable gap-closing application for finishing draft genomes</article-title>
<source>BMC Bioinformatics</source>
<year>2015</year>
<volume>16</volume>
<issue>230</issue>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Boetzer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pirovano</surname>
<given-names>W</given-names>
</name>
<article-title>Toward almost closed genomes with GapFiller</article-title>
<source>Genome biology</source>
<year>2012</year>
<volume>13</volume>
<issue>6</issue>
<fpage>R56</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2012-13-6-r56</pub-id>
<pub-id pub-id-type="pmid">22731987</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<etal></etal>
<article-title>SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler</article-title>
<source>Gigascience</source>
<year>2012</year>
<volume>1</volume>
<issue>1</issue>
<fpage>18</fpage>
<comment>doi:10.1186/2047-217X-1-18</comment>
<pub-id pub-id-type="doi">10.1186/2047-217X-1-18</pub-id>
<pub-id pub-id-type="pmid">23587118</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="other">
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Raymond</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Pleasance</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Coope</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Taylor</surname>
<given-names>GA</given-names>
</name>
<etal></etal>
<article-title>Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<comment>doi:10.1093/bioinformatics/btt178</comment>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="other">
<name>
<surname>Koboldt</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wylie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Larson</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>McLellan</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Mardis</surname>
<given-names>ER</given-names>
</name>
<etal></etal>
<article-title>VarScan: variant detection in massively parallel sequencing of individual and pooled samples</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<issue>25</issue>
<fpage>2283</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="pmid">19542151</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="other">
<name>
<surname>Bartenhagen</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Dugas</surname>
<given-names>M</given-names>
</name>
<article-title>RSVSim: an R/Bioconductor package for the simulation of structural variations</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<issue>btt198</issue>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000959 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000959 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4582294
   |texte=   Konnector v2.0: pseudo-long reads from paired-end sequencing data
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26399504" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021