Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000269 ( Pmc/Corpus ); précédent : 0002689; suivant : 0002700 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity</title>
<author>
<name sortKey="Kim, Chang Sik" sort="Kim, Chang Sik" uniqKey="Kim C" first="Chang Sik" last="Kim">Chang Sik Kim</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>The Hartree Centre and Scientific Computing Department, STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, WA4 4AD UK</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121662407</institution-id>
<institution-id institution-id-type="GRID">grid.5379.8</institution-id>
<institution>Present addresse Cancer Research UK Manchester Institute, The University of Manchester,</institution>
</institution-wrap>
M20 4BX, Manchester, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Winn, Martyn D" sort="Winn, Martyn D" uniqKey="Winn M" first="Martyn D." last="Winn">Martyn D. Winn</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>The Hartree Centre and Scientific Computing Department, STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, WA4 4AD UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sachdeva, Vipin" sort="Sachdeva, Vipin" uniqKey="Sachdeva V" first="Vipin" last="Sachdeva">Vipin Sachdeva</name>
<affiliation>
<nlm:aff id="Aff2">Computational Science Center, IBM T.J. Watson Research, Cambridge, MA USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff4">Present addresse Silicon Therapeutics, 300 A Street, Boston, MA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jordan, Kirk E" sort="Jordan, Kirk E" uniqKey="Jordan K" first="Kirk E." last="Jordan">Kirk E. Jordan</name>
<affiliation>
<nlm:aff id="Aff2">Computational Science Center, IBM T.J. Watson Research, Cambridge, MA USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29100493</idno>
<idno type="pmc">5670514</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5670514</idno>
<idno type="RBID">PMC:5670514</idno>
<idno type="doi">10.1186/s12859-017-1881-8</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000269</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000269</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity</title>
<author>
<name sortKey="Kim, Chang Sik" sort="Kim, Chang Sik" uniqKey="Kim C" first="Chang Sik" last="Kim">Chang Sik Kim</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>The Hartree Centre and Scientific Computing Department, STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, WA4 4AD UK</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121662407</institution-id>
<institution-id institution-id-type="GRID">grid.5379.8</institution-id>
<institution>Present addresse Cancer Research UK Manchester Institute, The University of Manchester,</institution>
</institution-wrap>
M20 4BX, Manchester, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Winn, Martyn D" sort="Winn, Martyn D" uniqKey="Winn M" first="Martyn D." last="Winn">Martyn D. Winn</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>The Hartree Centre and Scientific Computing Department, STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, WA4 4AD UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sachdeva, Vipin" sort="Sachdeva, Vipin" uniqKey="Sachdeva V" first="Vipin" last="Sachdeva">Vipin Sachdeva</name>
<affiliation>
<nlm:aff id="Aff2">Computational Science Center, IBM T.J. Watson Research, Cambridge, MA USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff4">Present addresse Silicon Therapeutics, 300 A Street, Boston, MA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jordan, Kirk E" sort="Jordan, Kirk E" uniqKey="Jordan K" first="Kirk E." last="Jordan">Kirk E. Jordan</name>
<affiliation>
<nlm:aff id="Aff2">Computational Science Center, IBM T.J. Watson Research, Cambridge, MA USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequence at once, resulting in the need for computer hardware with large shared memory.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements, when making use of a compute cluster.</p>
</sec>
<sec>
<title>Conclusions</title>
<p id="Par3">Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-017-1881-8) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Corney, Dc" uniqKey="Corney D">DC Corney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schliesky, S" uniqKey="Schliesky S">S Schliesky</name>
</author>
<author>
<name sortKey="Gowik, U" uniqKey="Gowik U">U Gowik</name>
</author>
<author>
<name sortKey="Weber, Apm" uniqKey="Weber A">APM Weber</name>
</author>
<author>
<name sortKey="Brautigam, A" uniqKey="Brautigam A">A Brautigam</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oshlack, A" uniqKey="Oshlack A">A Oshlack</name>
</author>
<author>
<name sortKey="Robinson, Md" uniqKey="Robinson M">MD Robinson</name>
</author>
<author>
<name sortKey="Young, Md" uniqKey="Young M">MD Young</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gunaratne, Ph" uniqKey="Gunaratne P">PH Gunaratne</name>
</author>
<author>
<name sortKey="Coarfa, C" uniqKey="Coarfa C">C Coarfa</name>
</author>
<author>
<name sortKey="Soibam, B" uniqKey="Soibam B">B Soibam</name>
</author>
<author>
<name sortKey="Tandon, A" uniqKey="Tandon A">A Tandon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ulitsky, I" uniqKey="Ulitsky I">I Ulitsky</name>
</author>
<author>
<name sortKey="Bartel, Dp" uniqKey="Bartel D">DP Bartel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Memczak, S" uniqKey="Memczak S">S Memczak</name>
</author>
<author>
<name sortKey="Jens, M" uniqKey="Jens M">M Jens</name>
</author>
<author>
<name sortKey="Elefsinioti, A" uniqKey="Elefsinioti A">A Elefsinioti</name>
</author>
<author>
<name sortKey="Torti, F" uniqKey="Torti F">F Torti</name>
</author>
<author>
<name sortKey="Krueger, J" uniqKey="Krueger J">J Krueger</name>
</author>
<author>
<name sortKey="Rybak, A" uniqKey="Rybak A">A Rybak</name>
</author>
<author>
<name sortKey="Maier, L" uniqKey="Maier L">L Maier</name>
</author>
<author>
<name sortKey="Mackowiak, Sd" uniqKey="Mackowiak S">SD Mackowiak</name>
</author>
<author>
<name sortKey="Gregersen, Lh" uniqKey="Gregersen L">LH Gregersen</name>
</author>
<author>
<name sortKey="Munschauer, M" uniqKey="Munschauer M">M Munschauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morin, Rd" uniqKey="Morin R">RD Morin</name>
</author>
<author>
<name sortKey="Bainbridge, M" uniqKey="Bainbridge M">M Bainbridge</name>
</author>
<author>
<name sortKey="Fejes, A" uniqKey="Fejes A">A Fejes</name>
</author>
<author>
<name sortKey="Hirst, M" uniqKey="Hirst M">M Hirst</name>
</author>
<author>
<name sortKey="Krzywinski, M" uniqKey="Krzywinski M">M Krzywinski</name>
</author>
<author>
<name sortKey="Pugh, Tj" uniqKey="Pugh T">TJ Pugh</name>
</author>
<author>
<name sortKey="Mcdonald, H" uniqKey="Mcdonald H">H McDonald</name>
</author>
<author>
<name sortKey="Varhol, R" uniqKey="Varhol R">R Varhol</name>
</author>
<author>
<name sortKey="Jones, Sjm" uniqKey="Jones S">SJM Jones</name>
</author>
<author>
<name sortKey="Marra, Ma" uniqKey="Marra M">MA Marra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Et" uniqKey="Wang E">ET Wang</name>
</author>
<author>
<name sortKey="Sandberg, R" uniqKey="Sandberg R">R Sandberg</name>
</author>
<author>
<name sortKey="Luo, S" uniqKey="Luo S">S Luo</name>
</author>
<author>
<name sortKey="Khrebtukova, I" uniqKey="Khrebtukova I">I Khrebtukova</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Mayr, C" uniqKey="Mayr C">C Mayr</name>
</author>
<author>
<name sortKey="Kingsmore, Sf" uniqKey="Kingsmore S">SF Kingsmore</name>
</author>
<author>
<name sortKey="Schroth, Gp" uniqKey="Schroth G">GP Schroth</name>
</author>
<author>
<name sortKey="Burge, Cb" uniqKey="Burge C">CB Burge</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fullwood, Mj" uniqKey="Fullwood M">MJ Fullwood</name>
</author>
<author>
<name sortKey="Wei, C L" uniqKey="Wei C">C-L Wei</name>
</author>
<author>
<name sortKey="Liu, Et" uniqKey="Liu E">ET Liu</name>
</author>
<author>
<name sortKey="Ruan, Y" uniqKey="Ruan Y">Y Ruan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yassour, M" uniqKey="Yassour M">M Yassour</name>
</author>
<author>
<name sortKey="Kaplan, T" uniqKey="Kaplan T">T Kaplan</name>
</author>
<author>
<name sortKey="Fraser, Hb" uniqKey="Fraser H">HB Fraser</name>
</author>
<author>
<name sortKey="Levin, Jz" uniqKey="Levin J">JZ Levin</name>
</author>
<author>
<name sortKey="Pfiffner, J" uniqKey="Pfiffner J">J Pfiffner</name>
</author>
<author>
<name sortKey="Adiconis, X" uniqKey="Adiconis X">X Adiconis</name>
</author>
<author>
<name sortKey="Schroth, G" uniqKey="Schroth G">G Schroth</name>
</author>
<author>
<name sortKey="Luo, S" uniqKey="Luo S">S Luo</name>
</author>
<author>
<name sortKey="Khrebtukova, I" uniqKey="Khrebtukova I">I Khrebtukova</name>
</author>
<author>
<name sortKey="Gnirke, A" uniqKey="Gnirke A">A Gnirke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guttman, M" uniqKey="Guttman M">M Guttman</name>
</author>
<author>
<name sortKey="Garber, M" uniqKey="Garber M">M Garber</name>
</author>
<author>
<name sortKey="Levin, Jz" uniqKey="Levin J">JZ Levin</name>
</author>
<author>
<name sortKey="Donaghey, J" uniqKey="Donaghey J">J Donaghey</name>
</author>
<author>
<name sortKey="Robinson, J" uniqKey="Robinson J">J Robinson</name>
</author>
<author>
<name sortKey="Adiconis, X" uniqKey="Adiconis X">X Adiconis</name>
</author>
<author>
<name sortKey="Fan, L" uniqKey="Fan L">L Fan</name>
</author>
<author>
<name sortKey="Koziol, Mj" uniqKey="Koziol M">MJ Koziol</name>
</author>
<author>
<name sortKey="Gnirke, A" uniqKey="Gnirke A">A Gnirke</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C Nusbaum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Williams, Ba" uniqKey="Williams B">BA Williams</name>
</author>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Mortazavi, A" uniqKey="Mortazavi A">A Mortazavi</name>
</author>
<author>
<name sortKey="Kwan, G" uniqKey="Kwan G">G Kwan</name>
</author>
<author>
<name sortKey="Van Baren, Mj" uniqKey="Van Baren M">MJ van Baren</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Wold, Bj" uniqKey="Wold B">BJ Wold</name>
</author>
<author>
<name sortKey="Pachter, L" uniqKey="Pachter L">L Pachter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sze, S H" uniqKey="Sze S">S-H Sze</name>
</author>
<author>
<name sortKey="Tarone, Am" uniqKey="Tarone A">AM Tarone</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Nielsen, Cb" uniqKey="Nielsen C">CB Nielsen</name>
</author>
<author>
<name sortKey="Qian, Jq" uniqKey="Qian J">JQ Qian</name>
</author>
<author>
<name sortKey="Varhol, R" uniqKey="Varhol R">R Varhol</name>
</author>
<author>
<name sortKey="Stazyk, G" uniqKey="Stazyk G">G Stazyk</name>
</author>
<author>
<name sortKey="Morin, Rd" uniqKey="Morin R">RD Morin</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Hirst, M" uniqKey="Hirst M">M Hirst</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K Wong</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Jones, Sjm" uniqKey="Jones S">SJM Jones</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Wu, G" uniqKey="Wu G">G Wu</name>
</author>
<author>
<name sortKey="Tang, J" uniqKey="Tang J">J Tang</name>
</author>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
<author>
<name sortKey="Patterson, J" uniqKey="Patterson J">J Patterson</name>
</author>
<author>
<name sortKey="Liu, S" uniqKey="Liu S">S Liu</name>
</author>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="He, G" uniqKey="He G">G He</name>
</author>
<author>
<name sortKey="Gu, S" uniqKey="Gu S">S Gu</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Qian, W" uniqKey="Qian W">W Qian</name>
</author>
<author>
<name sortKey="Fang, X" uniqKey="Fang X">X Fang</name>
</author>
<author>
<name sortKey="Shi, Z" uniqKey="Shi Z">Z Shi</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S Li</name>
</author>
<author>
<name sortKey="Shan, G" uniqKey="Shan G">G Shan</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K Kristiansen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabherr, Mg" uniqKey="Grabherr M">MG Grabherr</name>
</author>
<author>
<name sortKey="Haas, Bj" uniqKey="Haas B">BJ Haas</name>
</author>
<author>
<name sortKey="Yassour, M" uniqKey="Yassour M">M Yassour</name>
</author>
<author>
<name sortKey="Levin, Jz" uniqKey="Levin J">JZ Levin</name>
</author>
<author>
<name sortKey="Thompson, Da" uniqKey="Thompson D">DA Thompson</name>
</author>
<author>
<name sortKey="Amit, I" uniqKey="Amit I">I Amit</name>
</author>
<author>
<name sortKey="Adiconis, X" uniqKey="Adiconis X">X Adiconis</name>
</author>
<author>
<name sortKey="Fan, L" uniqKey="Fan L">L Fan</name>
</author>
<author>
<name sortKey="Raychowdhury, R" uniqKey="Raychowdhury R">R Raychowdhury</name>
</author>
<author>
<name sortKey="Zeng, Q" uniqKey="Zeng Q">Q Zeng</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, J" uniqKey="Liu J">J Liu</name>
</author>
<author>
<name sortKey="Li, G" uniqKey="Li G">G Li</name>
</author>
<author>
<name sortKey="Chang, Z" uniqKey="Chang Z">Z Chang</name>
</author>
<author>
<name sortKey="Yu, T" uniqKey="Yu T">T Yu</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Mcmullen, R" uniqKey="Mcmullen R">R McMullen</name>
</author>
<author>
<name sortKey="Chen, P" uniqKey="Chen P">P Chen</name>
</author>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Q Y" uniqKey="Zhao Q">Q-Y Zhao</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Kong, Y M" uniqKey="Kong Y">Y-M Kong</name>
</author>
<author>
<name sortKey="Luo, D" uniqKey="Luo D">D Luo</name>
</author>
<author>
<name sortKey="Li, X" uniqKey="Li X">X Li</name>
</author>
<author>
<name sortKey="Hao, P" uniqKey="Hao P">P Hao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sachdeva, V" uniqKey="Sachdeva V">V Sachdeva</name>
</author>
<author>
<name sortKey="Kim, Cs" uniqKey="Kim C">CS Kim</name>
</author>
<author>
<name sortKey="Jordan, Ke" uniqKey="Jordan K">KE Jordan</name>
</author>
<author>
<name sortKey="Winn, Md" uniqKey="Winn M">MD Winn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
<author>
<name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
<author>
<name sortKey="Pyrkosz, Ab" uniqKey="Pyrkosz A">AB Pyrkosz</name>
</author>
<author>
<name sortKey="Brom, Th" uniqKey="Brom T">TH Brom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author>
<name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mckenna, A" uniqKey="Mckenna A">A McKenna</name>
</author>
<author>
<name sortKey="Hanna, M" uniqKey="Hanna M">M Hanna</name>
</author>
<author>
<name sortKey="Banks, E" uniqKey="Banks E">E Banks</name>
</author>
<author>
<name sortKey="Sivachenko, A" uniqKey="Sivachenko A">A Sivachenko</name>
</author>
<author>
<name sortKey="Kristian Cibulskis, Ak" uniqKey="Kristian Cibulskis A">AK Kristian Cibulskis</name>
</author>
<author>
<name sortKey="Garimella, K" uniqKey="Garimella K">K Garimella</name>
</author>
<author>
<name sortKey="Altshuler, D" uniqKey="Altshuler D">D Altshuler</name>
</author>
<author>
<name sortKey="Gabriel, S" uniqKey="Gabriel S">S Gabriel</name>
</author>
<author>
<name sortKey="Daly, M" uniqKey="Daly M">M Daly</name>
</author>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mohammed, Ea" uniqKey="Mohammed E">EA Mohammed</name>
</author>
<author>
<name sortKey="Far, Bh" uniqKey="Far B">BH Far</name>
</author>
<author>
<name sortKey="Naugler, C" uniqKey="Naugler C">C Naugler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xu, B" uniqKey="Xu B">B Xu</name>
</author>
<author>
<name sortKey="Gao, J" uniqKey="Gao J">J Gao</name>
</author>
<author>
<name sortKey="Li, C" uniqKey="Li C">C Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hopcroft, Je" uniqKey="Hopcroft J">JE Hopcroft</name>
</author>
<author>
<name sortKey="Tarjan, Re" uniqKey="Tarjan R">RE Tarjan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Plimpton, Sj" uniqKey="Plimpton S">SJ Plimpton</name>
</author>
<author>
<name sortKey="Devine, Kd" uniqKey="Devine K">KD Devine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Ruotti, V" uniqKey="Ruotti V">V Ruotti</name>
</author>
<author>
<name sortKey="Stewart, Rm" uniqKey="Stewart R">RM Stewart</name>
</author>
<author>
<name sortKey="Thomson, Ja" uniqKey="Thomson J">JA Thomson</name>
</author>
<author>
<name sortKey="Dewey, Cn" uniqKey="Dewey C">CN Dewey</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Fillmore, N" uniqKey="Fillmore N">N Fillmore</name>
</author>
<author>
<name sortKey="Bai, Y" uniqKey="Bai Y">Y Bai</name>
</author>
<author>
<name sortKey="Collins, M" uniqKey="Collins M">M Collins</name>
</author>
<author>
<name sortKey="Thomson, Ja" uniqKey="Thomson J">JA Thomson</name>
</author>
<author>
<name sortKey="Stewart, R" uniqKey="Stewart R">R Stewart</name>
</author>
<author>
<name sortKey="Dewey, C" uniqKey="Dewey C">C Dewey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pedersen, Bs" uniqKey="Pedersen B">BS Pedersen</name>
</author>
<author>
<name sortKey="Yang, Iv" uniqKey="Yang I">IV Yang</name>
</author>
<author>
<name sortKey="De, S" uniqKey="De S">S De</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author>
<name sortKey="Hintze, A" uniqKey="Hintze A">A Hintze</name>
</author>
<author>
<name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author>
<name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Parkinson, J" uniqKey="Parkinson J">J Parkinson</name>
</author>
<author>
<name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29100493</article-id>
<article-id pub-id-type="pmc">5670514</article-id>
<article-id pub-id-type="publisher-id">1881</article-id>
<article-id pub-id-type="doi">10.1186/s12859-017-1881-8</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Kim</surname>
<given-names>Chang Sik</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Winn</surname>
<given-names>Martyn D.</given-names>
</name>
<address>
<email>martyn.winn@stfc.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sachdeva</surname>
<given-names>Vipin</given-names>
</name>
<xref ref-type="aff" rid="Aff2">2</xref>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Jordan</surname>
<given-names>Kirk E.</given-names>
</name>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>The Hartree Centre and Scientific Computing Department, STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, WA4 4AD UK</aff>
<aff id="Aff2">
<label>2</label>
Computational Science Center, IBM T.J. Watson Research, Cambridge, MA USA</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000000121662407</institution-id>
<institution-id institution-id-type="GRID">grid.5379.8</institution-id>
<institution>Present addresse Cancer Research UK Manchester Institute, The University of Manchester,</institution>
</institution-wrap>
M20 4BX, Manchester, UK</aff>
<aff id="Aff4">
<label>4</label>
Present addresse Silicon Therapeutics, 300 A Street, Boston, MA USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>3</day>
<month>11</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>3</day>
<month>11</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>18</volume>
<elocation-id>467</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>6</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>10</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s). 2017</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequence at once, resulting in the need for computer hardware with large shared memory.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements, when making use of a compute cluster.</p>
</sec>
<sec>
<title>Conclusions</title>
<p id="Par3">Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-017-1881-8) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>MapReduce</kwd>
<kwd>De novo sequence assembly</kwd>
<kwd>RNA-Seq</kwd>
<kwd>Trinity</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par4">Quantifying the expression of genes under different conditions is fundamental to understanding the behaviour and response of organisms to internal and external stimuli. With the arrival of Next Generation massively parallel sequencing technologies, the ability to monitor gene expression has been transformed [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
]. Direct sequencing of mRNA from expressed genes (RNA-Seq) is now feasible, and has several advantages over microarray technology [
<xref ref-type="bibr" rid="CR3">3</xref>
]. Most notably, it removes the need to have a priori knowledge of the transcribed regions, so that novel genes can be identified, or novel variants of known genes. This has led to a rapid increase in the number of studies looking at gene expression in non-model organisms. RNA-Seq is also increasingly used to study non-coding RNAs, such as microRNAs [
<xref ref-type="bibr" rid="CR4">4</xref>
], lincRNAs [
<xref ref-type="bibr" rid="CR5">5</xref>
], and circRNAs [
<xref ref-type="bibr" rid="CR6">6</xref>
] which play various regulatory roles.</p>
<p id="Par5">Nevertheless, it is widely recognised that the improvement in sequencing technology has shifted the bottleneck to down-stream data analysis. In the case of RNA-Seq, sequencing can be complicated by the presence of contaminant RNA, paralogous genes, and especially for higher organisms the prevalence of alternative splicing [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. Paired-end sequencing and strand-specific sequencing can help to resolve sequencing ambiguities, but must be included explicitly in the data analysis. Finally, and as we address in this study, the sheer size of datasets can cause practical problems in sequence assembly. In particular, the computational complexity due to the typical size of RNA-Seq datasets limits the ability to try multiple methods or multiple parameter choices, in order to optimise the quality of the results obtained.</p>
<p id="Par6">Initial approaches to the high throughput analysis of transcriptome sequence data were based on the alignment of RNA-Seq reads to reference genomes [
<xref ref-type="bibr" rid="CR9">9</xref>
<xref ref-type="bibr" rid="CR14">14</xref>
]. Such approaches are limited by the availability of suitable reference genomes, and by the structural alterations that can be detected, particularly when input reads are relatively short. Subsequently, de novo genome assemblers were adapted to the analysis of transcriptome data in the absence of a reference, by postprocessing draft contigs to identify transcripts. Examples of transcriptome assemblers based on genome assemblers include Oases [
<xref ref-type="bibr" rid="CR15">15</xref>
] and
<italic>postprocess</italic>
[
<xref ref-type="bibr" rid="CR16">16</xref>
] based on Velvet [
<xref ref-type="bibr" rid="CR17">17</xref>
], TransABySS [
<xref ref-type="bibr" rid="CR18">18</xref>
] based on ABySS [
<xref ref-type="bibr" rid="CR19">19</xref>
], and SOAPdenovo-Trans [
<xref ref-type="bibr" rid="CR20">20</xref>
] based on SOAPdenovo [
<xref ref-type="bibr" rid="CR21">21</xref>
]. In contrast, the Trinity [
<xref ref-type="bibr" rid="CR22">22</xref>
] pipeline which we consider below was developed specifically for de novo transcriptome assembly. More recent examples hybridizing previous de novo assembly algorithms include Bridger [
<xref ref-type="bibr" rid="CR23">23</xref>
] based on Trinity [
<xref ref-type="bibr" rid="CR22">22</xref>
] and SOAPdenovo-Trans [
<xref ref-type="bibr" rid="CR20">20</xref>
], BinPacker [
<xref ref-type="bibr" rid="CR24">24</xref>
] based on Bridger [
<xref ref-type="bibr" rid="CR23">23</xref>
] and bin-packing strategy [
<xref ref-type="bibr" rid="CR25">25</xref>
], and DRAP [
<xref ref-type="bibr" rid="CR26">26</xref>
] based on Trinity [
<xref ref-type="bibr" rid="CR22">22</xref>
] and Oases [
<xref ref-type="bibr" rid="CR15">15</xref>
].</p>
<p id="Par7">Most de novo transcriptome assembly methods are based on
<italic>de Bruijn</italic>
graphs of k-mers, where a k-mer is a sub-sequence of an input read with k base calls. For a chosen value of k, the assembler creates a k-mer graph, where the set of nodes correspond to all unique k-mers present in the input reads, and the edges represent “suffix-to-prefix” overlaps between k-mers. Most de novo transcriptome assembly algorithms store all unique k-mers from the input reads in shared memory, in order to facilitate edge detection and graph construction, and this can lead to extremely large RAM usage [
<xref ref-type="bibr" rid="CR27">27</xref>
]. For example Velvet, as used by Oases, starts by creating two large hashmap tables in memory storing the information for all k-mers. TransABySS/ABySS is the only parallel algorithm for de novo transcriptome assembly, which starts by distributing k-mers onto multiple compute nodes with a simple hash function. The Trinity pipeline consists of three independent software modules;
<italic>Inchworm</italic>
,
<italic>Chrysalis</italic>
and
<italic>Butterfly</italic>
.
<italic>Inchworm</italic>
initially creates a large hashmap table to store all unique k-mers from the input RNA-seq reads, and then it selects k-mers from the hashmap to construct linear contigs using a greedy k-mer extension approach.
<italic>Chrysalis</italic>
clusters Inchworm contigs into sets of connected components that are linked by pair-end reads, and creates
<italic>de Bruijn</italic>
graphs for each set.
<italic>Butterfly</italic>
then reconstructs the full-length transcripts based on the
<italic>de Bruijn</italic>
graphs from
<italic>Chrysalis,</italic>
taking into account possible alternative splicing
<italic>.</italic>
In our previous study [
<xref ref-type="bibr" rid="CR28">28</xref>
], we identified the
<italic>Chrysalis</italic>
module as the main bottleneck in terms of runtime, and alleviated this bottleneck by parallelising the processing over multiple compute nodes using MPI. We also confirmed that the
<italic>Inchworm</italic>
module of Trinity requires relatively high physical memory usage.</p>
<p id="Par8">The memory requirements of these packages increase for larger and more complex transcriptomes, which generate larger numbers of k-mers and hence larger graphs, and can exceed the computational resources available. One strategy that is commonly used is to normalize the read data [
<xref ref-type="bibr" rid="CR29">29</xref>
]. Redundant reads are removed from regions with high sequencing coverage, while reads are retained in regions of low coverage. In this way, up to 90% of input reads can be removed, which in turn leads to the elimination of a large fraction of erroneous k-mers associated with these reads [
<xref ref-type="bibr" rid="CR29">29</xref>
]. While this is believed to work well, it introduces an additional processing step, which can in itself require large memory.</p>
<p id="Par9">The fundamental task of de novo transcriptome assembly (in contrast to genome assembly) is to separate the full sequence data into many disjoint sets. Each set corresponds to a collection of gene variants sharing k-mers due to alternative splicing or gene duplication. In other words, a transcriptome can be represented as multiple distinct
<italic>de Bruijn</italic>
graphs (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
), each of which contains several paths corresponding to alternative gene products. Intuitively, de novo transcriptome assembly could be performed for every connected sub-graph separately. In the case of genome-guided transcriptome assembly, generation of sub-graphs is directed by the reference genome. In the absence of such a method for de novo assembly, however, most assemblers [
<xref ref-type="bibr" rid="CR15">15</xref>
,
<xref ref-type="bibr" rid="CR20">20</xref>
,
<xref ref-type="bibr" rid="CR22">22</xref>
] work with all unique k-mers obtained from the input reads, resulting in the requirement for a large amount of available memory.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>A few selected de Bruijn graphs of transcripts from whitefly RNA-Seq data. Each node represents one of the unique k-mers present in the input reads, and the edges represent
<italic>suffix-to-prefix</italic>
overlap between k-mers. Examples of branching and looping are visible (data source:
<ext-link ext-link-type="uri" xlink:href="http://evomics.org/learning/genomics/trinity">http://evomics.org/learning/genomics/trinity</ext-link>
)</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p id="Par10">In this work, we present a reference-free method for generating connected sub-graphs from datasets of RNA-Seq reads. We employ the MapReduce formulation [
<xref ref-type="bibr" rid="CR30">30</xref>
] for distributing the analysis of large datasets over many compute nodes. The MapReduce approach was popularized by Google for handling massively distributed queries, but has since been applied in a wide range of domains, including genome analysis [
<xref ref-type="bibr" rid="CR31">31</xref>
<xref ref-type="bibr" rid="CR33">33</xref>
]. A typical MapReduce implementation is based on
<italic>map()</italic>
and
<italic>reduce()</italic>
operations that work on a local subset of the data, but the power of the approach comes from an intermediate step called
<italic>shuffle()</italic>
or
<italic>collate()</italic>
which is responsible for re-distributing the data across the compute nodes. In the context of transcriptome assembly, the MapReduce approach distributes the sequence data over the available nodes, thus reducing the per-node memory requirement. The iterative application of
<italic>map()</italic>
,
<italic>collate()</italic>
and
<italic>reduce()</italic>
steps leads to clustering of the k-mers, such that the desired sub-graphs are each physically located on a single compute node.</p>
<p id="Par11">While distributing the sequence data across nodes of a compute cluster should lead to faster runtimes and reduced per-node memory requirements, this must be balanced against the cost of inter-node communication and transfer of data. We make use of an established MapReduce software library [
<xref ref-type="bibr" rid="CR34">34</xref>
] that handles communication via the Message Passing Interface (MPI) protocol. Using this library, we have developed software that can cluster k-mers, and then launch multiple
<italic>Inchworm</italic>
jobs for the resulting sub-graphs. The procedure can be linked with the rest of the Trinity pipeline, for selected components of which we have also developed an MPI-based parallelisation [
<xref ref-type="bibr" rid="CR28">28</xref>
], so that the entire assembly workflow can be run on a commodity cluster. Use of the MapReduce-MPI software library [
<xref ref-type="bibr" rid="CR34">34</xref>
] means that specialised MapReduce installations such as Hadoop are not required. The only requirement is an MPI installation which, while requiring some setup and management, is well-established and omnipresent on high performance computing platforms.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<sec id="Sec3">
<title>MapReduce-MPI library</title>
<p id="Par12">The MapReduce [
<xref ref-type="bibr" rid="CR30">30</xref>
] programming paradigm consists of two core operations, namely a “map” operation followed by “reduce” operation. These are highly parallel operations working on distributed data, which wrap around an intermediate data-shuffling operation that requires inter-processor communication. The basic data structures for MapReduce operations are key/value (KV) pairs, and key/multivalue (KMV) pairs that consist of a unique key and a set of associated values. There are many implementations of the MapReduce idea, see for examples [
<xref ref-type="bibr" rid="CR35">35</xref>
,
<xref ref-type="bibr" rid="CR36">36</xref>
]. In the MapReduce-MPI library [
<xref ref-type="bibr" rid="CR34">34</xref>
], which we utilise here, KV and KMV pairs are stored within MapReduce objects, and user defined algorithms consist of operations on these objects.</p>
<p id="Par13">A typical algorithm using the MapReduce-MPI library is built upon three basic functions operating on MapReduce objects, namely
<italic>map()</italic>
,
<italic>collate()</italic>
and
<italic>reduce()</italic>
. In
<italic>map()</italic>
, KV pairs are generated by reading data from files or processing existing KV pairs to create new ones. The
<italic>collate()</italic>
operation extracts unique keys and maps all the values associated with these keys to create KMV pairs. The
<italic>reduce()</italic>
operation processes KMV pairs to produce new KV pairs as input to the following steps of the algorithm. In a parallel environment, the
<italic>map()</italic>
and
<italic>reduce()</italic>
operations work on local data, while the
<italic>collate()</italic>
operation builds KMV pairs using values stored on all processors. Since KV pairs with the same key could be located on many different processors, there is a choice about where to store the resulting KMV pair. In the MapReduce-MPI library, each KMV pair is distributed onto a processor by hashing its key into a 32-bit value whose remainder modulo the number of processors is the owning processor rank.</p>
<p id="Par14">The MapReduce-MPI library allows user-defined functions to be invoked for
<italic>map()</italic>
or
<italic>reduce()</italic>
operations, while the
<italic>collate()</italic>
operation and the general housekeeping of MapReduce objects are handled automatically. The
<italic>map()</italic>
and
<italic>reduce()</italic>
operations are called via pointers to functions supplied by the application program. Each user-defined function is invoked multiple times as a callback for each KV or KMV pair that is processed.</p>
<p id="Par15">
<italic>Out-of-core</italic>
processing is an important feature of the MapReduce-MPI library, and is initiated when KV or KMV pairs owned by a processor do not fit in the physical memory. When this happens, each processor writes one or more temporary files to disk and reads the data back in when required. Specifically, a
<italic>pagesize</italic>
is defined by the user, which is the maximum size of MapReduce objects that can be held in memory and used in MapReduce operations. This allows the MapReduce-library to handle data objects larger than the available memory, at the expense of additional I/O to disk, and we give examples later.</p>
</sec>
<sec id="Sec4">
<title>Finding connected components</title>
<p id="Par16">A connected component of an undirected graph is a sub-graph where any two nodes are connected by a path of edges. A transcriptome can be represented as a k-mer graph with multiple
<italic>connected components</italic>
, where ideally the number of sub-graphs equals the number of genes (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
). The identification of connected components can be done using a depth-first search [
<xref ref-type="bibr" rid="CR37">37</xref>
]. Starting from a seed node, the procedure searches for the entire connected component by repeatedly looping through neighbour nodes, and creates new paths between nodes as extensions of pre-existing paths.</p>
<p id="Par17">An algorithm implementing the above search in a MapReduce framework starts with the assignment of unique “zone” IDs to each graph node stored in a MapReduce object [
<xref ref-type="bibr" rid="CR38">38</xref>
]. In each iteration, the size of a zone may increase by one layer of its neighbours. As zone IDs between two nodes conflict by sharing edge, a winner is chosen and the losing nodes are then merged into the winning zone. When the final iteration is reached, all nodes in a connected component will have been assigned to the same zone, and the MapReduce object will contain the zone assignments for all fully connected components. More details of the algorithm and its implementation in the MapReduce-MPI library are given in [
<xref ref-type="bibr" rid="CR38">38</xref>
]. For the current application, we need to define the nodes and edges of the full (disconnected) graph to be analysed, which we do in the next subsection.</p>
</sec>
<sec id="Sec5">
<title>MapReduce-inchworm</title>
<p id="Par18">We have implemented a multi-step procedure for clustering k-mers as the initial stages of transcriptome assembly in Trinity [
<xref ref-type="bibr" rid="CR22">22</xref>
] (see Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
). In the first step, input sequence reads are decomposed into a list of unique k-mers, together with their abundances, as a single MapReduce cycle (Algorithm 1 in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
). In the second step, edges representing k-1 overlaps between k-mers are extracted in a single MapReduce operation (Algorithm 2 in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
). This pre-collection of edge information is an important feature of our algorithm. The third step filters out edges where a k-mer node has multiple candidates in the 3′ or 5′ directions, and is introduced to make the later
<italic>Inchworm</italic>
runs more robust (Algorithm 3 in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
). Inchworm builds contigs by extending a seed k-mer using the overlapping k-mer with the highest abundance, and extension continues until no more overlapping k-mers exist in the dataset. Our filtering step makes sure that the edge or edges with the highest abundance are kept in the cluster, and so available to Inchworm, while others are removed. Without this filtering operation, the subsequent step tends to produce k-mer clusters with highly diverse sizes, and leads to load balancing issues for high performance computing clusters. Having prepared the k-mer and k-mer overlap data, the fourth step (Algorithm 4 in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
) performs the k-mer clustering by finding connected components, as described above. The steps are described in detail in the Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Workflow summarising the MapReduce-Inchworm algorithm. The steps are described in the main text and the Additional file 1. In this figure,
<italic>V</italic>
represents k-mer nodes with abundances
<italic>C</italic>
,
<italic>E</italic>
represents edges with abundances
<italic>CE</italic>
, and
<italic>Z</italic>
represents zone IDs</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p id="Par19">The original C++ code of
<italic>Inchworm</italic>
for constructing contigs is implemented as step 5 of the algorithm, and is executed as a callback function by each set of clustered k-mers (Algorithm 5 in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
). The input consists of two MapReduce objects, the zone assignment of k-mers from the previous step and the list of k-mers with their abundance values. These two input objects are concatenated into a single MapReduce object, followed by a
<italic>collate()</italic>
operation using k-mers as key. This creates KMV pairs with the k-mer as key and the pair of zone ID and abundance value as the multivalue. The following
<italic>reduce()</italic>
operation creates new KV pairs, this time with the zone ID as key and the corresponding pair of k-mer and abundance as the value. Another
<italic>collate()</italic>
operation with zone ID as key produces KMV pairs with each zone ID linked to a list of k-mer/abundant value pairs.</p>
<p id="Par20">The final
<italic>reduce()</italic>
operation creates a
<italic>hash_map</italic>
table for each zone ID, i.e. for each cluster. This table has the k-mers
<italic>V</italic>
<sub>
<italic>i</italic>
</sub>
as keys and the abundance
<italic>C</italic>
<sub>
<italic>i</italic>
</sub>
as values. This
<italic>hash_map</italic>
table is an input to the
<italic>Inchworm</italic>
module, which constructs contigs for that cluster. The final
<italic>collate()</italic>
operation evenly distributes the k-mer clusters across the allocated nodes of the computer. Each compute node will then run multiple
<italic>Inchworm</italic>
jobs, according to the number of k-mer clusters residing on that compute node. The resulting files of Inchworm contigs can be merged for input to Chrysalis.</p>
</sec>
</sec>
<sec id="Sec6">
<title>Results</title>
<p id="Par21">This section presents our evaluation of MapReduce-Inchworm, in comparison to the original Inchworm. The primary aim of our work is to circumvent the high-memory requirements of the original Inchworm, while a secondary aim is to reduce the runtime required. It is vital, of course, that performance improvements do not lead to loss of accuracy, and so we begin by presenting a detailed characterization of the transcripts generated by the Trinity pipeline when MapReduce-Inchworm is used to generate the initial contigs. Next, we present performance results in terms of runtime and scalability, followed by results for the physical memory usage of MapReduce-Inchworm. Finally, we present a performance comparison using RNA-Seq datasets from several different organisms.</p>
<p id="Par22">The datasets and computing resources used in our evaluations are listed in Table 
<xref rid="Tab1" ref-type="table">1</xref>
. The results presented here for MapReduce-Inchworm were obtained on an IBM iDataplex-Nextscale cluster, consisting of nodes with 2 × 12-core Intel Xeon processors and 64GB of RAM. For the original version of Inchworm, the code is necessarily run on a single node, and only a single thread was used. For the mouse dataset, a single node of the iDataplex-Nextscale cluster was used. For the larger sugarbeet dataset, jobs were run on a high-memory (256GB) node of a slightly older iDataplex cluster. For the most complex transcriptome, the wheat dataset, ScaleMP software (
<ext-link ext-link-type="uri" xlink:href="http://www.scalemp.com">http://www.scalemp.com</ext-link>
/) was used to create a virtual symmetric multiprocessing node on the iDataplex cluster to meet the high memory requirement of the original Inchworm.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>RNA-Seq datasets and computing resources used for each RNA-Seq data</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>organism</th>
<th># of reads</th>
<th># of unique k-mers</th>
<th colspan="2">Computing resource</th>
<th>data source</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>MR-Inchworm</td>
<td>Original Inchworm</td>
<td></td>
</tr>
<tr>
<td>mouse</td>
<td>105,290,476</td>
<td>746,811,557</td>
<td>iDataplex-nextscale</td>
<td>iDataplex-nextscale:single node (64GB mem)</td>
<td>[
<xref ref-type="bibr" rid="CR22">22</xref>
]</td>
</tr>
<tr>
<td>sugarbeet</td>
<td>129,832,549</td>
<td>2,213,519,875</td>
<td>iDataplex-nextscale</td>
<td>iDataplex:single node (256GB mem)</td>
<td>unpublished data</td>
</tr>
<tr>
<td>wheat</td>
<td>1,468,701,119</td>
<td>5,775,799,648</td>
<td>iDataplex-nextscale</td>
<td>iDataplex:single vSMP node (4 TB mem) cerated by ScaleMP</td>
<td>unpublished data</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>All datasets are pair-end datasets, in which only mouse dataset is strand-specific.iDataplex-nextscale cluster is known as
<italic>BlueWonder-NextScale</italic>
, consisting of 360 nodes each with 2 × 12 core Intel Xeon processors (E5-2697v2 2.7GHz) and 64GB RAM making total 8640 cores in total. iDataplex cluster is known as “BlueWonder”, consisting of 512 nodes each with 2 × 8 core Intel SandyBridge processors (2.6 Ghz) making 8192 cores in total. Original Inchworm with sugarbeet dataset was run using a single iDataplex node with 256GB memory. Original Inchworm with wheat dataset was run using a single vSMP node with 4 Tb memory created by ScalewMP software (
<ext-link ext-link-type="uri" xlink:href="http://www.scalemp.com">http://www.scalemp.com</ext-link>
) on iDataplex. ScaleMP creates a virtual symmetric multiprocessing (vSMP) node for shared memory by aggregating multiple compute nodes</p>
</table-wrap-foot>
</table-wrap>
</p>
<sec id="Sec7">
<title>Accuracy assessment</title>
<p id="Par23">To evaluate the accuracy of the MapReduce procedure, we compared the final transcripts generated by the Trinity pipeline when either the MapReduce-Inchworm or the original Inchworm is used. We focus on the final transcripts since these are the biologically relevant objects, while the intermediate contigs from each version of Inchworm can be quite different. We performed these tests using a mouse RNA-Seq dataset consisting of 105 M pair-end reads taken from [
<xref ref-type="bibr" rid="CR22">22</xref>
]. To generate additional datasets, we used the
<italic>rsem-simulate-reads</italic>
program from RSEM [
<xref ref-type="bibr" rid="CR39">39</xref>
,
<xref ref-type="bibr" rid="CR40">40</xref>
] to simulate RNA-Seq read data based on parameters learned from the real dataset. The simulation was done in 3 steps as follows. First, we ran Trinity (using the original Inchworm) on the downloaded set of reads to produce 80,867 transcripts. These transcripts act as the set of
<italic>reference transcripts</italic>
for our trials. Second, RSEM was executed using the mouse RNA-Seq data together with the
<italic>reference transcripts</italic>
to obtain parameters for simulation of RNA-Seq reads. Third, RNA-Seq read data was simulated by executing
<italic>rsem-simulate-reads</italic>
with the
<italic>reference transcripts</italic>
and parameters from the previous RSEM run. Three simulated datasets were generated consisting of 100 M, 150 M and 200 M pair-end reads, compared to the original experimental dataset with 105 M pair-end reads, and contain approximately 5% background reads.</p>
<p id="Par24">We ran both versions of
<italic>Inchworm</italic>
on the three simulated datasets to produce Fasta-formatted files of Inchworm contigs. The remainder of the Trinity pipeline was run from these Inchworm contigs, producing two sets of transcripts derived from MapReduce-Inchworm and the original Inchworm. The REF-EVAL module from DETONATE [
<xref ref-type="bibr" rid="CR41">41</xref>
] was used to assess both sets against the “reference transcripts”, giving assembly recall and precision scores for each version of the transcriptome. Initially, all significant local alignments between assembled transcripts and reference transcripts are found using BLAT [
<xref ref-type="bibr" rid="CR42">42</xref>
]. At the contig level, REF-EVAL counts the number of transcripts that align with at least a predefined level of accuracy in a one-to-one mapping. We varied the required level of accuracy to get a range of statistics. At the nucleotide level, it counts the number of correctly assembled nucleotides without requiring “one-to-one” mapping; that is, it takes partially assembled transcripts into account as true positives. Recall is defined as the fraction of reference transcripts that are correctly recovered by an assembly. Precision is defined as the fraction of assembled transcripts that correctly recover a reference transcript.</p>
<p id="Par25">We also evaluated the two quantities N1 and N2, as given by the analysis script
<italic>FL_trans_analysis_pipeline.pl</italic>
distributed with the Trinity software. This tool looks at the alignment of reconstructed transcripts onto the set of reference transcripts. If at least 99% of a reconstructed transcript is aligned to the reference, and the aligned sections have at least 99% identity, then it is considered a full-length match. The focus is on the quality of the reconstructed transcript, rather than recovery of the reference transcripts (cf. REF-EVAL above). The N1 statistic represents the total number of assembled transcripts that give full-length matches to the reference. The N2 statistic represents the number of assembled transcripts that align to multiple reference transcripts, and are thus
<italic>fused</italic>
transcripts.</p>
<p id="Par26">The results (Table 
<xref rid="Tab2" ref-type="table">2</xref>
) show that Trinity run with MapReduce-Inchworm gives consistently higher values for Recall, Precision and N1 for the three simulated datasets. The number of fused transcripts, given by N2, is also lower. Thus, parallelisation of the initial step in the Trinity pipeline actually leads to a slight increase in assembly accuracy. In fact, the improvement of Recall and Precision at the nucleotide level is only marginal, and the absolute values are close to 1.0, indicating that both versions of Inchworm lead to transcripts that are highly similar to the reference transcripts. Recall and Precision at the contig level are lower, roughly in line with the N1 values, indicating small differences in the transcripts that lead to some reference transcripts not being fully recovered or matched. In this case, the MapReduce-Inchworm leads to a more significant improvement.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Accuracy assessment of MapReduce-Inchworm compared to the original Inchworm using three simulated read datasets for mouse RNA-Seq</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th colspan="3">Number of pair-end reads</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>100 M</td>
<td>150 M</td>
<td>200 M</td>
</tr>
<tr>
<td rowspan="8">REF-EVAL</td>
<td rowspan="4">Contig</td>
<td rowspan="2">Recall</td>
<td>Original</td>
<td>0.3098</td>
<td>0.3122</td>
<td>0.3126</td>
</tr>
<tr>
<td>MapReduce</td>
<td>0.3274</td>
<td>0.3308</td>
<td>0.3314</td>
</tr>
<tr>
<td rowspan="2">Precision</td>
<td>Original</td>
<td>0.3258</td>
<td>0.3283</td>
<td>0.3280</td>
</tr>
<tr>
<td>MapReduce</td>
<td>0.3389</td>
<td>0.3419</td>
<td>0.3422</td>
</tr>
<tr>
<td rowspan="4">Nucleotide</td>
<td rowspan="2">Recall</td>
<td>Original</td>
<td>0.9752</td>
<td>0.9764</td>
<td>0.9779</td>
</tr>
<tr>
<td>MapReduce</td>
<td>0.9763</td>
<td>0.9783</td>
<td>0.9793</td>
</tr>
<tr>
<td rowspan="2">Precision</td>
<td>Original</td>
<td>0.9847</td>
<td>0.9845</td>
<td>0.9845</td>
</tr>
<tr>
<td>MapReduce</td>
<td>0.9870</td>
<td>0.9869</td>
<td>0.9869</td>
</tr>
<tr>
<td></td>
<td></td>
<td rowspan="2">N1</td>
<td>Original</td>
<td>32,712</td>
<td>39,273</td>
<td>43,862</td>
</tr>
<tr>
<td></td>
<td></td>
<td>MapReduce</td>
<td>33,687</td>
<td>40,452</td>
<td>45,344</td>
</tr>
<tr>
<td></td>
<td></td>
<td rowspan="2">N2</td>
<td>Original</td>
<td>16</td>
<td>20</td>
<td>26</td>
</tr>
<tr>
<td></td>
<td></td>
<td>MapReduce</td>
<td>4</td>
<td>12</td>
<td>8</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Statistics from the REF-EVAL component of DENONATE [
<xref ref-type="bibr" rid="CR41">41</xref>
], for three simulated read datasets. Recall is the fraction of reference elements that are correctly recovered by an assembly. Precision is the fraction of assembly elements that correctly recover a reference element. At the Contig level, a 99% alignment cutoff has been used to identify a recovered transcript (left-hand bars in Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
). Original refers to the results of Trinity run with the original version of Inchworm. MapReduce refers to the results of Trinity run with the MapReduce-Inchworm method presented here. Also shown are the N1 and N2 statistics, as given by the script
<italic>FL_trans_analysis_pipeline.pl</italic>
. N1 represents the total number of assembled transcripts that give full-length matches to the reference. N2 represents the number of fused transcripts. For comparison, there are 80,867 reference transcripts</p>
</table-wrap-foot>
</table-wrap>
</p>
<p id="Par27">Figure 
<xref rid="Fig3" ref-type="fig">3(a)</xref>
and
<xref rid="Fig3" ref-type="fig">3(c)</xref>
show the variation of the Recall and Precision statistics at the contig level, as a function of required alignment accuracy, for the simulated dataset with 100 M reads. If the cutoff is reduced from 99% to 90%, so that transcripts align with high but not complete overlap, then most of the reference transcripts can be recovered from the simulated dataset. Although the absolute numbers are similar, MapReduce-Inchworm gives higher values of Recall and Precision for all cutoffs.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Assessment of the reconstruction accuracy of MapReduce-Inchworm (red bars) compared to the original Inchworm program (light blue bars), as given by the REF-EVAL tool of DETONATE [
<xref ref-type="bibr" rid="CR41">41</xref>
]. Plots (
<bold>a</bold>
) and (
<bold>c</bold>
) give results for the dataset of 100 M simulated pair-end reads (see main text), while plots (
<bold>b</bold>
) and (
<bold>d</bold>
) give the corresponding results for the original mouse RNA-Seq dataset of experimental reads. Plots (
<bold>a</bold>
) and (
<bold>b</bold>
) show the Recall statistic, which is the fraction of reference transcripts that are correctly recovered by an assembly. Plots (
<bold>c</bold>
) and (
<bold>d</bold>
) show the Precision statistic, which is the fraction of assembled transcripts that correctly recover a reference transcript. Recovery of a reference transcript by a particular assembly is measured at the “contig” level, which requires almost complete alignment in a one-to-one mapping between the assembly and the reference. Each plot is given as a function of the alignment cutoff used to identify a recovered transcript</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p id="Par28">With the simulated data, we are testing the ability of Trinity to recover the transcripts from which the simulated reads were generated. As a further test, we used REF-EVAL to compare the transcript sets that we generate to a mouse transcriptome downloaded from the UCSC genome-browser database. We used the CruzDB programmatic interface [
<xref ref-type="bibr" rid="CR43">43</xref>
] to obtain a set of 22,403 coding transcripts. Statistics for the two sets of transcripts are given Table 
<xref rid="Tab3" ref-type="table">3</xref>
, and the similarity between them is quantified in Table 
<xref rid="Tab4" ref-type="table">4</xref>
. Specifically, we compared transcripts generated from the downloaded set of 105 M pair-end reads using Trinity run with MapReduce-Inchworm and the original Inchworm.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Basic statistics of Trinity transcripts using original Inchworm and MapReduce-Inchworm using the mouse RNA-seq data [
<xref ref-type="bibr" rid="CR22">22</xref>
]</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th>original Inchworm</th>
<th>MapReduce-Inchworm</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total trinity
<italic>genes</italic>
</td>
<td>55,498</td>
<td>55,047</td>
</tr>
<tr>
<td>Total trinity transcripts</td>
<td>80,825</td>
<td>78,719</td>
</tr>
<tr>
<td>Percent GC</td>
<td>49.4</td>
<td>49.46</td>
</tr>
<tr>
<td>Contig N10</td>
<td>6282</td>
<td>6163</td>
</tr>
<tr>
<td>Contig N20</td>
<td>4725</td>
<td>4621</td>
</tr>
<tr>
<td>Contig N30</td>
<td>3946</td>
<td>3831</td>
</tr>
<tr>
<td>Contig N40</td>
<td>3275</td>
<td>3183</td>
</tr>
<tr>
<td>Contig N50</td>
<td>2693</td>
<td>2630</td>
</tr>
<tr>
<td>Median contig length</td>
<td>551</td>
<td>558</td>
</tr>
<tr>
<td>Average contig</td>
<td>1284.53</td>
<td>1267.98</td>
</tr>
<tr>
<td>Total assembled bases</td>
<td>103,822,088</td>
<td>99,814,020</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The statistics were calculated using the perl script
<italic>TrinityStats.pl</italic>
, included in the original Trinity pipeline</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>The number of similar Trinity transcripts between original Inchworm and MapReduce-Inchworm using the mouse RNA-seq data [
<xref ref-type="bibr" rid="CR22">22</xref>
]</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>cutoff for transcript similarity (%)</th>
<th>number of similar transcripts</th>
</tr>
</thead>
<tbody>
<tr>
<td>100</td>
<td>47,816</td>
</tr>
<tr>
<td>99</td>
<td>57,926</td>
</tr>
<tr>
<td>95</td>
<td>64,109</td>
</tr>
<tr>
<td>90</td>
<td>67,178</td>
</tr>
<tr>
<td>85</td>
<td>69,002</td>
</tr>
<tr>
<td>80</td>
<td>70,398</td>
</tr>
<tr>
<td>75</td>
<td>71,390</td>
</tr>
<tr>
<td>70</td>
<td>72,285</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Two sets of transcripts from original Inchworm and MapReduce-Inchworm were compared using BLAT [
<xref ref-type="bibr" rid="CR42">42</xref>
]; Transcripts from original Inchworm was used as
<italic>target</italic>
and transcripts from MapReduce-Inchworm was used as
<italic>query</italic>
for input parameters to BLAT. The perl
<italic>script blat_top_hit_extractor.pl</italic>
, included in Trinity pipeline, was used to extract the most top hit for each transcript in
<italic>query</italic>
against
<italic>target</italic>
. The first column refers to the cutoff of transcript similarity, which was quantified using two similarity score defined as follows: 1) 1 - (query_sequence_size - number_of_matching_bases)/query_sequence_size 2) 1 - (target_sequence_size - number_of_matching_bases)/target_sequence_size. If these two similarity scores between two transcripts from both methods were greater than or equal to the cutoff value, those were considered as similar transcripts. The second column refers to the number of similar transcripts between original and MapReduce-Inchworm according to the cutoff value. Note the total number of transcripts from both methods can be found in Table
<xref rid="Tab3" ref-type="table">3</xref>
</p>
</table-wrap-foot>
</table-wrap>
</p>
<p id="Par29">Results for contig-level Recall and Precision are shown in Fig.
<xref rid="Fig3" ref-type="fig">3(b)</xref>
and
<xref rid="Fig3" ref-type="fig">3(d)</xref>
, as a function of the required alignment accuracy. The Recall is generally lower than for the simulated datasets, as the read data used probably doesn’t have the coverage to fully explain the UCSC transcriptome. Nevertheless, the Recall does approach 1.0 when the required accuracy is relaxed. The Precision also improves as the required alignment accuracy is relaxed, but remains less than 0.5 reflecting the fact that some of the read data used derives from transcripts not included in the UCSC set of coding transcripts. In the context of the current study, it is reassuring to see that again the MapReduce-Inchworm approach gives slightly improved statistics in most cases, compared to the original Inchworm (Table 
<xref rid="Tab5" ref-type="table">5</xref>
).
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Comparison of mouse transcripts assembled using MapReduce-Inchworm or the original Inchworm with a reference mouse transcriptome</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th>alignment cutoff</th>
<th>MapReduce</th>
<th>Original</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4">Number of reference transcripts recovered</td>
<td>95</td>
<td>6230</td>
<td>6231</td>
</tr>
<tr>
<td>90</td>
<td>12,198</td>
<td>11,600</td>
</tr>
<tr>
<td>85</td>
<td>14,912</td>
<td>14,546</td>
</tr>
<tr>
<td>80</td>
<td>17,340</td>
<td>15,995</td>
</tr>
<tr>
<td rowspan="4">Number of transcripts that map to reference</td>
<td>95</td>
<td>12,627</td>
<td>13,601</td>
</tr>
<tr>
<td>90</td>
<td>24,816</td>
<td>24,770</td>
</tr>
<tr>
<td>85</td>
<td>30,923</td>
<td>32,397</td>
</tr>
<tr>
<td>80</td>
<td>33,966</td>
<td>35,208</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Statistics from the REF-EVAL component of DENONATE [
<xref ref-type="bibr" rid="CR41">41</xref>
] using mouse RNA-seq data [
<xref ref-type="bibr" rid="CR22">22</xref>
]. Dividing the number of reference transcripts recovered by the total number of reference transcripts (22402) gives the Recall shown in Fig.
<xref rid="Fig3" ref-type="fig">3(b)</xref>
. Dividing the number of transcripts that map to reference by the total number of assembled transcripts (78,719 for MapReduce-Inchworm and 80,825 for original Inchworm) gives the Precision shown in Fig.
<xref rid="Fig3" ref-type="fig">3(d)</xref>
. The recovery rate was measured at the contig level, which requires certain amount of complete alignment in a one-to-one mapping between the assembly and the reference. Alignment cutoff refers to the minimum required alignment for the recovery rate calculation. MapReduce refers to the results of Trinity run with the MapReduce-Inchworm method presented here. Original refers to the results of Trinity run with the original version of Inchworm</p>
</table-wrap-foot>
</table-wrap>
</p>
<p id="Par30">We believe that the reason for the slightly improved accuracy of MapReduce-Inchworm is the inclusion of additional edge information, which is obtained in step 2 of the procedure from pairs of k-mers appearing consecutively in input reads (see Methods). With this edge information, MapReduce-Inchworm clusters k-mers into multiple groups, each of which should contain k-mers from same gene. Inchworm contigs are constructed within each cluster, and the output is implicitly guided by the input reads via this initial segregation. On the other hand, the original Inchworm uses all unique k-mers extracted from the input reads, and the construction of Inchworm contigs is done without any additional supporting information from the input reads. Both methods produce a similar total number of Inchworm contigs (data not shown), but there are clearly differences in the resulting transcripts.</p>
</sec>
<sec id="Sec8">
<title>Runtime improvement</title>
<p id="Par31">Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
shows the scaling of the MapReduce-Inchworm runtime with increasing number of compute nodes, for the experimental mouse dataset. Plots are displayed for different choices of the
<italic>pagesize</italic>
parameter, which determines the physical memory usage (see the
<italic>MapReduce-MPI library</italic>
section in Methods for a detailed explanation). For each plot, the runtime of the original
<italic>Inchworm</italic>
(5431 s) is displayed as a dashed line for comparison. The number of compute nodes was varied from 32 to 192, with each node running a single MPI process, while the
<italic>pagesize</italic>
was varied from 1 GB to 4 GB. The runtimes obtained using all 192 compute nodes are 1093, 1067, 1034, and 1034 s for the four choices of
<italic>pagesize</italic>
, corresponding to a speed-up by a factor of about 5 compared to the original Inchworm.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Scaling of the runtime of MapReduce-Inchworm (black lines, left-hand axis) as a function of the number of compute nodes used, for the experimental mouse dataset (see Table
<xref rid="Tab1" ref-type="table">1</xref>
). The runtime is for the MapReduce-Inchworm step only, and does not include the remainder of the Trinity pipeline. The runtime of the corresponding serial Inchworm is shown as a horizontal dashed line. Results are shown with
<italic>pagesize</italic>
set to (
<bold>a</bold>
) 1 GB, (
<bold>b</bold>
) 2 GB, (
<bold>c</bold>
) 3 GB and (
<bold>d</bold>
) 4 GB. The cumulative I/O to disk, due to out-of-core processing, is also shown (blue line, right-hand axis)</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p id="Par32">There are also speed-ups for smaller numbers of compute nodes, except for the cases of 32 nodes with a
<italic>pagesize</italic>
parameter of 1 or 2 GB. In these cases, the memory requirements exceed the chosen
<italic>pagesize</italic>
leading to significant “out-of-core” processing (see Methods). The cumulative file I/O (Tb) is also plotted in Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
, which confirms the significant paging to disk in these cases. Thus, the
<italic>pagesize</italic>
setting should be large enough (within the constraints of the available physical memory) or the number of nodes large enough (in order to distribute the memory requirements), otherwise there is an adverse effect on the runtime.</p>
<p id="Par33">We stratified the runtime in terms of the major steps in both versions of
<italic>Inchworm</italic>
, as shown in Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
. The original
<italic>Inchworm</italic>
consists of 3 principal steps: 1)
<italic>jellyfish</italic>
, 2)
<italic>parsing k-mers</italic>
, and 3)
<italic>inchworm contig construction</italic>
. The first step involves counting the occurrence of every unique k-mer in the set of input reads using the program Jellyfish [
<xref ref-type="bibr" rid="CR44">44</xref>
], and writing the output to a disk file. In the second step,
<italic>Inchworm</italic>
reads the output file back into physical memory by storing each k-mer and its count into a hashmap table as a key-value pair. In the final step, the algorithm creates draft contigs using the hashmap table of unique k-mers. We divide the MapReduce-Inchworm algorithm into an initial
<italic>splitting input reads</italic>
step, followed by the five steps described in Methods. The initial step consists of evenly splitting the input file of reads into multiple files, according to the number of allocated compute nodes. Each file is then read into a compute node in preparation for subsequent steps.
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>Stratification of the runtime in terms of individual steps within both versions of Inchworm, for the experimental mouse dataset (see Table
<xref rid="Tab1" ref-type="table">1</xref>
). OI represents Original Inchworm; MR represents MapReduce-Inchworm. On the X-axis,
<italic>original</italic>
represents the original version of Inchworm, while 32–192 represent the numbers of compute nodes allocated for MapReduce-Inchworm. The original Inchworm is divided into three steps: step1 corresponds to
<italic>Jellyfish</italic>
[
<xref ref-type="bibr" rid="CR44">44</xref>
]; step 2 corresponds to
<italic>parsing kmers</italic>
; and step3 corresponds to
<italic>Inchworm contig construction</italic>
. MapReduce-Inchworm is divided into six steps: an initial step for
<italic>splitting input reads</italic>
and steps 1–5. The initial step splits an input file (containing the RNA-Seq reads) into multiple files according to the number of allocated compute nodes. Steps 1 to 5 of the main algorithm are described in detail in Methods. Results are given with
<italic>pagesize</italic>
assigned to 2GB, cf. Fig.
<xref rid="Fig3" ref-type="fig">3(b)</xref>
</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
<p id="Par34">Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
shows that the first two steps of the original Inchworm, which could be categorised as k-mer preparation steps, take a significant fraction of the total runtime. These steps are equivalent to the
<italic>splitting input reads</italic>
and step 1 of MapReduce-Inchworm. The latter steps are however much quicker because they avoid storing k-mers on disc. The remaining runtime of the original Inchworm involves construction of contigs. In the MapReduce-Inchworm implementation, this is done individually for each cluster, and is very fast (MR: step 5 in Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
). The bulk of the runtime for MapReduce-Inchworm is taken by the clustering algorithm (MR: step 4 in Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
), and this scales well with the number of nodes used. As mentioned above, super-linear scaling is achieved in going from 32 nodes to 64 nodes because of the reduction in out-of-core processing, while going from 64 to 128 nodes gives a speedup of 1.9, and from 64 to 192 nodes a speedup of 2.6.</p>
</sec>
<sec id="Sec9">
<title>Physical memory requirement</title>
<p id="Par35">The main objective of our work is to remove the need for large shared memory, by distributing the overall memory requirement over multiple computer nodes. With the ability to do that, the per-node memory requirement can always be reduced by adding more compute nodes, albeit with the expense of increased inter-node communication. The physical memory available on each node is controlled by the
<italic>pagesize</italic>
parameter in the underlying MapReduce-MPI library. In this section, we look at the memory requirements of MapReduce-Inchworm, as a function of the number of compute nodes and the
<italic>pagesize</italic>
parameter.</p>
<p id="Par36">Firstly, we assessed the memory requirements as a function of the number of allocated compute nodes, using the mouse dataset (see Table
<xref rid="Tab1" ref-type="table">1</xref>
). Within the 5 main steps of MapReduce-Inchworm, we collected the number of KV/KMV pairs generated from each of the three basic MapReduce functions:
<italic>map</italic>
,
<italic>collate</italic>
, or
<italic>reduce</italic>
. These values were converted into data object sizes in GB, and averaged over all compute nodes. Fig. 
<xref rid="Fig6" ref-type="fig">6(a)</xref>
shows that the data size per compute node, and hence the memory requirement, decreases with increasing number of nodes, as expected. The values for step 4 are also averaged over the iterations of the k-mer clustering algorithm, of which there are 47 for the mouse dataset. The figure shows clearly that step 2 of the MapReduce algorithm, which extracts edges from the input read data, is the most memory demanding.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Data objects created by MapReduce-Inchworm for the experimental mouse dataset (see Table
<xref rid="Tab1" ref-type="table">1</xref>
). (
<bold>a</bold>
) The size of data objects generated by each MapReduce function on each compute node, as calculated from the number of KV/KMV pairs involved. The data sizes for each MapReduce function (
<italic>map()</italic>
,
<italic>collate()</italic>
, and
<italic>reduce()</italic>
) were averaged over nodes and iterations within each of the 5 main steps of MapReduce-Inchworm. (
<bold>b</bold>
-
<bold>e</bold>
) The corresponding cumulative I/O to disk, due to out-of-core processing, per compute node. Results are shown with
<italic>pagesize</italic>
set to: (
<bold>b</bold>
) 1 GB, (
<bold>c</bold>
) 2 GB, (
<bold>d</bold>
) 3 GB and (
<bold>e</bold>
) 4 GB. For all graphs, the Y-axis gives the data size in GB</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
<p id="Par37">The values in Fig.
<xref rid="Fig6" ref-type="fig">6(a)</xref>
give an estimate of the per-node memory requirements of MapReduce-Inchworm. When these exceed the physical memory allocated according to the
<italic>pagesize</italic>
parameter, then pages of data are written as temporary files on disk. Paging for each of the steps is shown in Fig.
<xref rid="Fig6" ref-type="fig">6(b)-(e)</xref>
for four choices of the
<italic>pagesize</italic>
parameter. For example, the data sizes of KV pairs obtained from the
<italic>map</italic>
operation of step 2 are 11.0~GB, 5.5~GB, 2.75~GB and 1.83~GB when run on 32, 64, 128 and 192 nodes respectively (see Fig.
<xref rid="Fig6" ref-type="fig">6(a)</xref>
). For a small
<italic>pagesize</italic>
of 1 GB (Fig.
<xref rid="Fig6" ref-type="fig">6(b)</xref>
), there is always some out-of-core processing. Increasing the
<italic>pagesize</italic>
to 2 GB (Fig.
<xref rid="Fig6" ref-type="fig">6(c)</xref>
) means that, in the case of 192 compute nodes, the KV pairs can fit in memory and there is no paging. Increasing the
<italic>pagesize</italic>
to 3 GB (Fig.
<xref rid="Fig6" ref-type="fig">6(d)</xref>
) means that out-of-core processing is eliminated when using 128 or 192 compute nodes. However, there is always paging for the smaller compute clusters considered, even for the largest tested
<italic>pagesize</italic>
of 4 GB.</p>
<p id="Par38">Out-of-core processing in the
<italic>map</italic>
operation of step 2 initiates
<italic>out-of-core</italic>
processing in the following
<italic>collate</italic>
step, because the KV pairs written to disk by the
<italic>map</italic>
operation need to be read back for the
<italic>collate</italic>
step. Fig.
<xref rid="Fig6" ref-type="fig">6(b)-(e)</xref>
shows that file I/O occurs for the
<italic>collate</italic>
operation of step 2 whenever it is present for the
<italic>map</italic>
operation. It also shows that total I/O is substantially higher for the
<italic>collate</italic>
operation. This arises because the core
<italic>collate</italic>
algorithm needs to cycle through the KV pairs multiple times, as it finds matches and builds up the KMV objects, necessitating multiple reading and writing of pages to disk. Thus, step 2 of the MapReduce algorithm is particularly sensitive to the choice of the
<italic>pagesize</italic>
parameter. On the other hand, Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
shows that the runtime is dominated by step 4, and so the performance hit caused by paging in step 2 is perhaps not so important.</p>
<p id="Par39">The underlying MapReduce-MPI library tends to evenly distribute the KMV pairs produced by
<italic>collate</italic>
operations by hashing each of its key for assignment onto available MPI-processors. In the present application, where the number of KMV pairs is much larger than the number of compute nodes, this is expected to lead to good load balancing. In fact, the minimum and maximum values for data size over all compute nodes are indistinguishable from the average values shown in Fig.
<xref rid="Fig6" ref-type="fig">6(a)</xref>
.</p>
</sec>
<sec id="Sec10">
<title>Performance comparison for RNA-Seq datasets from complex organisms</title>
<p id="Par40">We next tested our approach on some more challenging RNA-Seq datasets obtained from sugarbeet and wheat samples (see Table
<xref rid="Tab1" ref-type="table">1</xref>
). The memory requirement of the original
<italic>Inchworm</italic>
depends on the transcriptome complexity and is expected to roughly correlate with the number of unique k-mers from the input reads. Figure 
<xref rid="Fig7" ref-type="fig">7(a)</xref>
shows that the mouse, sugarbeet and wheat datasets require 46.7GB, 141.5GB and 373.9GB of memory respectively, and these values do indeed correlate with the total number of unique k-mers listed in Table
<xref rid="Tab1" ref-type="table">1</xref>
. In fact, the required memory for the wheat dataset exceeded the physical memory on any single node of our available compute platforms. In order to run the original
<italic>Inchworm</italic>
, we used ScaleMP software (
<ext-link ext-link-type="uri" xlink:href="http://www.scalemp.com">http://www.scalemp.com</ext-link>
/) to aggregate 32 nodes, each providing 128GB memory, to create a vSMP node with a 4 TB address space.
<fig id="Fig7">
<label>Fig. 7</label>
<caption>
<p>(
<bold>a</bold>
) High-water mark for memory usage in GB over all compute nodes, (
<bold>b</bold>
) runtime in minutes, and (
<bold>c</bold>
) cumulative I/O in TB. Results are given for the mouse, sugarbeet and wheat datasets described in Table
<xref rid="Tab1" ref-type="table">1</xref>
, and using the computing resources listed. All jobs with MapReduce-Inchworm used a 4GB
<italic>pagesize</italic>
parameter. Bars marked
<italic>original</italic>
represent runs of the original Inchworm, MR-64 represents runs of MapReduce-Inchworm using 64 compute nodes, MR-128 represents runs using 128 compute nodes, and MR-192 represents runs using 192 compute nodes</p>
</caption>
<graphic xlink:href="12859_2017_1881_Fig7_HTML" id="MO7"></graphic>
</fig>
</p>
<p id="Par41">Figure
<xref rid="Fig7" ref-type="fig">7(b)</xref>
shows the total runtime to produce
<italic>Inchworm</italic>
contigs for
<italic>MapReduce-Inchworm</italic>
, compared with a run using the original
<italic>Inchworm</italic>
. With 64 compute nodes available, the
<italic>MapReduce-Inchworm</italic>
procedure yields a faster runtime, and increasing the number of nodes to 128 or 192 gives further improvements. Although we have only tested a small range of node counts, the scaling of the speedup is very good, suggesting that more nodes could be used. For example, tripling the number of nodes from 64 to 192 yields speedups of 2.4, 1.9 and 2.9 for mouse, sugarbeet and wheat respectively. The original
<italic>Inchworm</italic>
is particularly slow for the wheat dataset, taking over 6 days, because of the use of ScaleMP to provide the required amount of shared memory. While ScaleMP was required to be able to process the wheat dataset at all, its software-based aggregation of memory clearly incurs a significant overhead. Such problems are avoided in the distributed memory implementation of
<italic>MapReduce-Inchworm</italic>
.</p>
<p id="Par42">Figure
<xref rid="Fig7" ref-type="fig">7(c)</xref>
shows the cumulative I/O for the runs using
<italic>MapReduce-Inchworm</italic>
, reflecting the
<italic>out-of-core</italic>
processing of MapReduce-MPI. The wheat dataset, with 1.5~billion input reads and 5.8~billion unique k-mers, is particularly large and the MapReduce objects do not fit into the aggregate physical memory. Even with 192 nodes, there is 45 TB of I/O over the course of the run. This could be reduced by allocating further compute nodes, or by using higher memory nodes that could accommodate a larger pagesize. For the smaller sugarbeet dataset, the situation is much better, but there is nevertheless still a small amount of paging to disk.</p>
<p id="Par43">In conclusion, the
<italic>MapReduce-Inchworm</italic>
procedure scales well to large and complex datasets. Increasing the number of compute nodes leads to a reduction in runtime, and reduced paging to disk as the per-node memory requirements are lowered. In particular,
<italic>MapReduce-Inchworm</italic>
allowed us to process the large wheat dataset in less than a day, while the original
<italic>Inchworm</italic>
required an advanced platform solution to run at all.</p>
</sec>
</sec>
<sec id="Sec11">
<title>Discussion</title>
<p id="Par44">In this study, we enabled the parallelization of the
<italic>Inchworm</italic>
module of Trinity by using a MapReduce-based approach to pre-cluster the k-mers obtained from the input reads. An instance of
<italic>Inchworm</italic>
is run on each k-mer cluster, yielding a set of contigs per cluster. Contigs from all clusters are pooled together and passed to the
<italic>Chrysalis</italic>
module for re-clustering according to the original Trinity scheme. The
<italic>Inchworm</italic>
module of Trinity is known to be the most memory-intensive step [
<xref ref-type="bibr" rid="CR28">28</xref>
], and is often a barrier to processing large or complex RNA-Seq datasets. In our scheme, the computational load is passed to the pre-clustering step, where the well-established MapReduce procedure allows the load to be distributed over a commodity compute cluster. Our approach is distinct from other recent developments, which seek to MPI-parallelise
<italic>Inchworm</italic>
itself (Brian Haas, personal communication).</p>
<p id="Par45">Pell et al. [
<xref ref-type="bibr" rid="CR45">45</xref>
] have introduced a Bloom filter which provides memory efficient storage of kmer graphs. Chikhi and Rizk [
<xref ref-type="bibr" rid="CR46">46</xref>
] added an additional data structure holding the false positives which might arise through trial kmer extensions, as well as a marking structure holding complex nodes for use in graph traversal. In contrast, we store an explicit list of kmer nodes and edges in a set of MapReduce objects. There is no reduction in the total memory required, but rather we focus on the ability to distribute these MapReduce objects over a large number of nodes to reduce the per node memory requirement. Note that by storing edges explicitly, we do not make trial extensions from kmer nodes, which can lead to false edges in the Bloom filter method.</p>
<p id="Par46">We expect that there should be a correspondence between the k-mer clusters we generate, and the contig clusters (
<italic>components</italic>
) produced by
<italic>Chrysalis</italic>
, in that they both relate to a set of gene products. It may be that further efficiency gains can be achieved by merging these steps, but we have not investigated this possibility in the current work, adopting instead a conservative approach. If, for example, reads from a transcribed gene yield two k-mer clusters, and hence two sets of
<italic>Inchworm</italic>
contigs, then the
<italic>Chrysalis</italic>
module should in principle find
<italic>welds</italic>
between them, and recover the correct graph.</p>
<p id="Par47">The assessment of the assembly accuracy using simulated and experimental RNA-Seq datasets shows that our parallelized
<italic>Inchworm</italic>
provides the final transcripts from the Trinity pipeline with marginally more accuracy compared to the original inchworm. The difference in accuracy comes from the utilization of additional edge information in MapReduce-Inchworm, which clusters k-mers guided by edge objects which link pairs of k-mers appearing consecutively in input reads. On the other hand, the original
<italic>Inchworm</italic>
constructs contigs directly from the set of all k-mers. Contigs are extended by searching for appropriately overlapping k-mers, rather than using pre-calculated edges. This pre-collection of edge information is feasible in our approach because of the distributed nature of the algorithm.</p>
<p id="Par48">We have presented performance results for a range of experimental read datasets. The total runtime required to produce the complete set of
<italic>Inchworm</italic>
contigs could be reduced below that required for the original
<italic>Inchworm</italic>
, provided a moderate number of compute nodes are available. It may be debatable whether this is necessary for small datasets, such as the mouse dataset included here, but there are clearly significant gains for larger and more complex datasets (Fig.
<xref rid="Fig7" ref-type="fig">7</xref>
). More importantly, the memory requirement on each node can be reduced by distributing the job over sufficient nodes. In this way, commodity compute clusters can be used, and there is no need for high memory nodes or specialised solutions for aggregating node memory into a single address space.</p>
</sec>
<sec id="Sec12">
<title>Conclusion</title>
<p id="Par49">The results of this study indicate that the MapReduce framework has great potential for processing high throughput sequencing datasets more efficiently. The proposed approach could be applied as a pre-processing step for other de novo transcriptome assemblers, by implementing the chosen assembly code as a callback function in the final
<italic>reduce()</italic>
step, as we have done for
<italic>Inchworm</italic>
in the current study. Specifically, we plan to investigate the parallelization of Oases [
<xref ref-type="bibr" rid="CR15">15</xref>
] and SOAPdenovo-Trans [
<xref ref-type="bibr" rid="CR20">20</xref>
] via this approach. It is also worth exploring the feasibility of the pre-clustering approach for de novo metagenome and metatranscriptome assembly, which is more complex due to the presence of multiple genomes or transcriptomes from different species. For example, the de novo metagenome assembler [
<xref ref-type="bibr" rid="CR47">47</xref>
] starts by partitioning the
<italic>de Bruijn</italic>
graph into isolated components corresponding to different species. Then for each component, it reconstructs the slight variants of the genomes of subspecies within the same species using multiple sequence alignments. A similar approach has been developed for de novo metatransciptome assembly [
<xref ref-type="bibr" rid="CR48">48</xref>
]. Our proposed approach could be adapted to these pipelines to provide a memory-efficient method for the initial partitioning.</p>
<p id="Par50">In conclusion, we have presented a computationally efficient method for clustering k-mers derived from RNA-Seq datasets. Applied to the Trinity pipeline, the approach avoids the large memory requirements of the original Inchworm, enabling the analysis of large datasets on commodity compute clusters. We expect that this general approach will have applications for other assembly problems.</p>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec13">
<title>Additional file</title>
<p id="Par59">
<media position="anchor" xlink:href="12859_2017_1881_MOESM1_ESM.pdf" id="MOESM1">
<label>Additional file 1:</label>
<caption>
<p>Details of the MapReduce kmer clustering approach, and explicit algorithms for each step of the procedure. (PDF 73 kb)</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<fn-group>
<fn>
<p>
<bold>Electronic supplementary material</bold>
</p>
<p>The online version of this article (10.1186/s12859-017-1881-8) contains supplementary material, which is available to authorized users.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>We would like to thank Brian Haas of the Broad Institute for his advice on the Trinity software and source code. We thank Keywan Hassani-Pak of Rothamsted Research, UK for providing the sugarbeet and wheat datasets, as well as many useful discussions. We thank Steve Plimpton for making the MapReduce-MPI library freely available, and for answering technical questions on its use. We also wish to acknowledge the Hartree Centre at the STFC Daresbury Laboratory, UK for providing the computational resources used for this work.</p>
<sec id="FPar1">
<title>Funding</title>
<p id="Par51">This work was internally funded by the Hartree Centre at STFC Daresbury Laboratory.</p>
</sec>
<sec id="FPar2">
<title>Availability of data and materials</title>
<p id="Par52">
<bold>Data</bold>
</p>
<p id="Par53">The datasets analysed during the current study are not publicly available due to the fact that they are collaborator’s datasets but are available from the corresponding author on reasonable request.</p>
<p id="Par54">
<bold>Software</bold>
</p>
<p id="Par55">The code used in this study is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/kimosaby2001/MR-Inchworm/">https://github.com/kimosaby2001/MR-Inchworm/</ext-link>
.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>
<bold>Authors’ contribution</bold>
s</title>
<p>CSK and MDW conceived the study. CSK implemented the method. CSK and MDW designed the research. All authors analysed the data. The benchmarking and repository maintenance were done by CSK and MDW. All authors wrote, read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="FPar3">
<title>Ethics approval and consent to participate</title>
<p id="Par56">Not applicable.</p>
</sec>
<sec id="FPar4">
<title>Consent for publication</title>
<p id="Par57">Not applicable.</p>
</sec>
<sec id="FPar5">
<title>Competing interests</title>
<p id="Par58">The authors declare that they have no competing interests.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Corney</surname>
<given-names>DC</given-names>
</name>
</person-group>
<article-title>RNA-Seq using next generation sequencing</article-title>
<source>Materials and Methods</source>
<year>2013</year>
<volume>3</volume>
<fpage>203</fpage>
<pub-id pub-id-type="doi">10.13070/mm.en.3.203</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schliesky</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gowik</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Weber</surname>
<given-names>APM</given-names>
</name>
<name>
<surname>Brautigam</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>RNA-Seq assembly - are we there yet?</article-title>
<source>Front Plant Sci</source>
<year>2012</year>
<volume>3</volume>
<fpage>220</fpage>
<pub-id pub-id-type="doi">10.3389/fpls.2012.00220</pub-id>
<pub-id pub-id-type="pmid">23056003</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oshlack</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Robinson</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>From RNA-Seq reads to differential expression results</article-title>
<source>Genome Biol</source>
<year>2010</year>
<volume>11</volume>
<fpage>220</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2010-11-12-220</pub-id>
<pub-id pub-id-type="pmid">21176179</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gunaratne</surname>
<given-names>PH</given-names>
</name>
<name>
<surname>Coarfa</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Soibam</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tandon</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>miRNA data analysis: next-gen sequencing</article-title>
<source>Methods Mol Biol</source>
<year>2012</year>
<volume>822</volume>
<fpage>273</fpage>
<lpage>288</lpage>
<pub-id pub-id-type="doi">10.1007/978-1-61779-427-8_19</pub-id>
<pub-id pub-id-type="pmid">22144206</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ulitsky</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Bartel</surname>
<given-names>DP</given-names>
</name>
</person-group>
<article-title>lincRNAs: genomics, evolution, and mechanisms</article-title>
<source>Cell</source>
<year>2013</year>
<volume>154</volume>
<fpage>26</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2013.06.020</pub-id>
<pub-id pub-id-type="pmid">23827673</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Memczak</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jens</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Elefsinioti</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Torti</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Krueger</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rybak</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Maier</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Mackowiak</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Gregersen</surname>
<given-names>LH</given-names>
</name>
<name>
<surname>Munschauer</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Circular RNAs are a large class of animal RNAs with regulatory potency</article-title>
<source>Nature</source>
<year>2013</year>
<volume>495</volume>
<fpage>333</fpage>
<lpage>338</lpage>
<pub-id pub-id-type="doi">10.1038/nature11928</pub-id>
<pub-id pub-id-type="pmid">23446348</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<mixed-citation publication-type="other">Reddy ASN, Rogers MF, Richardson DN, Hamilton M, Ben-Hur A. Deciphering the plant splicing code: experimental and computational approaches for predicting alternative splicing and splicing regulatory elements. Front Plant Sci. 2012;3</mixed-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<mixed-citation publication-type="other">Hooper JE. A survey of software for genome-wide discovery of differential splicing in RNA-Seq data. Human Genomics. 2014;8</mixed-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morin</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Bainbridge</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fejes</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Krzywinski</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pugh</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>McDonald</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Varhol</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJM</given-names>
</name>
<name>
<surname>Marra</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing</article-title>
<source>BioTechniques</source>
<year>2008</year>
<volume>45</volume>
<fpage>81</fpage>
<lpage>94</lpage>
<pub-id pub-id-type="doi">10.2144/000112900</pub-id>
<pub-id pub-id-type="pmid">18611170</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>ET</given-names>
</name>
<name>
<surname>Sandberg</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Khrebtukova</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Mayr</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kingsmore</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Schroth</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Burge</surname>
<given-names>CB</given-names>
</name>
</person-group>
<article-title>Alternative isoform regulation in human tissue transcriptions</article-title>
<source>Nature</source>
<year>2008</year>
<volume>456</volume>
<fpage>470</fpage>
<lpage>476</lpage>
<pub-id pub-id-type="doi">10.1038/nature07509</pub-id>
<pub-id pub-id-type="pmid">18978772</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fullwood</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>C-L</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>ET</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>4</volume>
<fpage>521</fpage>
<lpage>532</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074906.107</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yassour</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kaplan</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Fraser</surname>
<given-names>HB</given-names>
</name>
<name>
<surname>Levin</surname>
<given-names>JZ</given-names>
</name>
<name>
<surname>Pfiffner</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Adiconis</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Schroth</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Khrebtukova</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Gnirke</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2009</year>
<volume>9</volume>
<fpage>3264</fpage>
<lpage>3269</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0812841106</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guttman</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Garber</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Levin</surname>
<given-names>JZ</given-names>
</name>
<name>
<surname>Donaghey</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Robinson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Adiconis</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Koziol</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Gnirke</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exomic structure of LineRNA</article-title>
<source>Nat Biotechnol</source>
<year>2010</year>
<volume>28</volume>
<fpage>503</fpage>
<lpage>510</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1633</pub-id>
<pub-id pub-id-type="pmid">20436462</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mortazavi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kwan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>van Baren</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Wold</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Pachter</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Transcripts assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation</article-title>
<source>Nat Biotechnol</source>
<year>2010</year>
<volume>28</volume>
<fpage>511</fpage>
<lpage>515</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1621</pub-id>
<pub-id pub-id-type="pmid">20436464</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schulz</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Vingron</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<fpage>1086</fpage>
<lpage>1092</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts094</pub-id>
<pub-id pub-id-type="pmid">22368243</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sze</surname>
<given-names>S-H</given-names>
</name>
<name>
<surname>Tarone</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>A memory-efficient algorithm to obtain splicing graphs and de novo expression estimates from de Bruijn graphs of RNA-Seq data</article-title>
<source>BMC Genomics</source>
<year>2014</year>
<volume>15</volume>
<fpage>S6</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-15-S5-S6</pub-id>
<pub-id pub-id-type="pmid">25082000</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>CB</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>JQ</given-names>
</name>
<name>
<surname>Varhol</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Stazyk</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Morin</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo transcriptome assembly with ABySS</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>2872</fpage>
<lpage>2877</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp367</pub-id>
<pub-id pub-id-type="pmid">19528083</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJM</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>ABySS: a parallel assembler for short read sequence data</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<fpage>1117</fpage>
<lpage>1123</lpage>
<pub-id pub-id-type="doi">10.1101/gr.089532.108</pub-id>
<pub-id pub-id-type="pmid">19251739</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>He</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SOAPdenovo-trans: de novo transcriptome assembly with short RNA-Seq reads</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<fpage>1660</fpage>
<lpage>1666</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu077</pub-id>
<pub-id pub-id-type="pmid">24532719</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<fpage>265</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="doi">10.1101/gr.097261.109</pub-id>
<pub-id pub-id-type="pmid">20019144</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grabherr</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Haas</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Yassour</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Levin</surname>
<given-names>JZ</given-names>
</name>
<name>
<surname>Thompson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Amit</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Adiconis</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Raychowdhury</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>Q</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Full-length transcriptome assembly from RNA-seq data without a reference genome</article-title>
<source>Nat Biotechnol</source>
<year>2011</year>
<volume>29</volume>
<fpage>644</fpage>
<lpage>652</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1883</pub-id>
<pub-id pub-id-type="pmid">21572440</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<mixed-citation publication-type="other">Chang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, Cramer CL, Huang X. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 2015;16</mixed-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>McMullen</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>BinPacker: packing-based de novo Transcriptome assembly from RNA-seq data</article-title>
<source>PLoS Comput Biol</source>
<year>2016</year>
<volume>12</volume>
<fpage>e1004772</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1004772</pub-id>
<pub-id pub-id-type="pmid">26894997</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<mixed-citation publication-type="other">Martello S, Toth P. Knapsack problems: algorithms and computer implementations: John Wiley and Sons; 1990.</mixed-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<mixed-citation publication-type="other">Cabau C, Escudié F, Djari A, Guiguen Y, Bobe J, Klopp C. Compacting and correcting trinity and oases RNA-Seq de novo assemblies. PeerJ. 2017;5</mixed-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Q-Y</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>Y-M</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hao</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>S2</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-S14-S2</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sachdeva</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Winn</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>Parallelization of the trinity pipeline for de novo transcriptome assembly</article-title>
<source>Parallel & distributed processing symposium workshops (IPDPSW), 2014 IEEE international; may 19–23</source>
<year>2014</year>
<publisher-loc>USA</publisher-loc>
<publisher-name>Phoenix, AZ</publisher-name>
<fpage>566</fpage>
<lpage>575</lpage>
</element-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Howe</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Pyrkosz</surname>
<given-names>AB</given-names>
</name>
<name>
<surname>Brom</surname>
<given-names>TH</given-names>
</name>
</person-group>
<article-title>A reference-free algorithm for computational normalization of shotgun sequencing data</article-title>
<source>arXiv:12034802</source>
<year>2012</year>
</element-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dean</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ghemawat</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>MapReduce: simplified data processing on large clusters</article-title>
<source>Computing in Science and Engineering</source>
<year>2009</year>
<volume>11</volume>
<fpage>29</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="doi">10.1109/MCSE.2009.120</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McKenna</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hanna</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Banks</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Sivachenko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kristian Cibulskis</surname>
<given-names>AK</given-names>
</name>
<name>
<surname>Garimella</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Altshuler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Gabriel</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Daly</surname>
<given-names>M</given-names>
</name>
<name>
<surname>DePristo</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<fpage>1297</fpage>
<lpage>1303</lpage>
<pub-id pub-id-type="doi">10.1101/gr.107524.110</pub-id>
<pub-id pub-id-type="pmid">20644199</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mohammed</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Far</surname>
<given-names>BH</given-names>
</name>
<name>
<surname>Naugler</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends</article-title>
<source>Biodata Mining</source>
<year>2014</year>
<volume>7</volume>
<fpage>22</fpage>
<pub-id pub-id-type="doi">10.1186/1756-0381-7-22</pub-id>
<pub-id pub-id-type="pmid">25383096</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>An efficient algorithm for DNA fragment assembly in MapReduce</article-title>
<source>Biochem Biophys Res Commun</source>
<year>2012</year>
<volume>426</volume>
<fpage>395</fpage>
<lpage>398</lpage>
<pub-id pub-id-type="doi">10.1016/j.bbrc.2012.08.101</pub-id>
<pub-id pub-id-type="pmid">22960169</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34.</label>
<mixed-citation publication-type="other">MapReduce-MPI Library.</mixed-citation>
</ref>
<ref id="CR35">
<label>35.</label>
<mixed-citation publication-type="other">White T.
<italic>Hadoop: The Definitive Guide.</italic>
O'Reilly. Media. 2009;</mixed-citation>
</ref>
<ref id="CR36">
<label>36.</label>
<mixed-citation publication-type="other">Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems.
<italic>IEEE 13th International Symposium on High Performance Computer</italic>
Architecture. 2007:13–24.</mixed-citation>
</ref>
<ref id="CR37">
<label>37.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hopcroft</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Tarjan</surname>
<given-names>RE</given-names>
</name>
</person-group>
<article-title>Efficient algorithms for graph manipulation</article-title>
<source>Commun ACM</source>
<year>1973</year>
<volume>16</volume>
<fpage>372</fpage>
<lpage>378</lpage>
<pub-id pub-id-type="doi">10.1145/362248.362272</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<label>38.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Plimpton</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Devine</surname>
<given-names>KD</given-names>
</name>
</person-group>
<article-title>MapReduce in MPI for large-scale graph algorithms</article-title>
<source>Parallel Comput</source>
<year>2011</year>
<volume>37</volume>
<fpage>610</fpage>
<lpage>632</lpage>
<pub-id pub-id-type="doi">10.1016/j.parco.2011.02.004</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ruotti</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Thomson</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Dewey</surname>
<given-names>CN</given-names>
</name>
</person-group>
<article-title>RNA-Seq gene expression estimation with read mapping uncertainty</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>493</fpage>
<lpage>500</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp692</pub-id>
<pub-id pub-id-type="pmid">20022975</pub-id>
</element-citation>
</ref>
<ref id="CR40">
<label>40.</label>
<mixed-citation publication-type="other">Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12</mixed-citation>
</ref>
<ref id="CR41">
<label>41.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Fillmore</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Collins</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Thomson</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dewey</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Evaluation of de novo transcriptome assemblies from RNA-Seq data</article-title>
<source>Genome Biol</source>
<year>2014</year>
<volume>15</volume>
<fpage>553</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-014-0553-5</pub-id>
<pub-id pub-id-type="pmid">25608678</pub-id>
</element-citation>
</ref>
<ref id="CR42">
<label>42.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>BLAT-the BLAST-like alignment tool</article-title>
<source>Genome Res</source>
<year>2002</year>
<volume>12</volume>
<fpage>654</fpage>
<lpage>664</lpage>
<pub-id pub-id-type="doi">10.1101/gr.229202</pub-id>
</element-citation>
</ref>
<ref id="CR43">
<label>43.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pedersen</surname>
<given-names>BS</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>IV</given-names>
</name>
<name>
<surname>De</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>CruzDB: software for annotation of genomic intervals with UCSC genome-browser database</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<fpage>3003</fpage>
<lpage>3006</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt534</pub-id>
<pub-id pub-id-type="pmid">24037212</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<label>44.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="CR45">
<label>45.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hintze</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Canino-Koning</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Howe</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>CT</given-names>
</name>
</person-group>
<article-title>Scaling metagenome sequence assembly with probabilistic de Bruijn graphs</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2012</year>
<volume>109</volume>
<fpage>13272</fpage>
<lpage>13277</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1121464109</pub-id>
<pub-id pub-id-type="pmid">22847406</pub-id>
</element-citation>
</ref>
<ref id="CR46">
<label>46.</label>
<mixed-citation publication-type="other">Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms for Molecular Biology. 2013;8</mixed-citation>
</ref>
<ref id="CR47">
<label>47.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FYL</given-names>
</name>
</person-group>
<article-title>Meta-IDBA: a de novo assembler for metagenomic data</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>i94</fpage>
<lpage>i101</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr216</pub-id>
<pub-id pub-id-type="pmid">21685107</pub-id>
</element-citation>
</ref>
<ref id="CR48">
<label>48.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Parkinson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FYL</given-names>
</name>
</person-group>
<article-title>IDBA-MT: de novo assembler for Metatranscriptomic data generated from next-generation sequencing technology</article-title>
<source>J Comp Biol</source>
<year>2013</year>
<volume>20</volume>
<fpage>540</fpage>
<lpage>550</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2013.0042</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000269  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000269  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021