MERS exploration server

Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms

Internal identifier: 000944 (Pmc/Corpus); previous: 000943; next: 000945

Authors: Berat Z. Haznedaroglu; Darryl Reeves; Hamid Rismani-Yazdi; Jordan Peccia

RBID : PMC:3489510

Abstract

Background

The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, evaluations of assembly strategies do not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CAs) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly.
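
To make the role of the k-mer length concrete, the following is a minimal Python sketch of how reads are split into k-mers and linked into a de Bruijn graph. It is an illustration only, not the implementation used by any of the assemblers discussed here; the function names (`kmers`, `de_bruijn_edges`) and the toy reads are assumptions introduced for this example.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer contributes one edge: prefix -> suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads (hypothetical, not data from the study).
reads = ["ACGTAC", "CGTACG"]
for k in (3, 5):
    graph = de_bruijn_edges(reads, k)
    print(k, len(graph), "nodes with outgoing edges")
```

Changing `k` changes which overlaps are visible in the graph, which is one way to see why assemblies built with different k-mer lengths can yield different contigs.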

Results

Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented.
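
The comparisons described above can be pictured as set operations on annotation identifiers. The sketch below is a hedged illustration, not the workflow or scripts provided by the authors; the function names and the toy KOI values are hypothetical, and it assumes each assembly has already been reduced to a set of KOIs.

```python
def unique_to_each_k(single_k_kois):
    """For each k, return KOIs found only in that single k-mer assembly."""
    result = {}
    for k, kois in single_k_kois.items():
        # Union of the KOIs from every other single k-mer assembly.
        others = set().union(*(v for j, v in single_k_kois.items() if j != k))
        result[k] = sorted(kois - others)
    return result


def kois_missing_from_ca(single_k_kois, ca_kois):
    """For each k, return KOIs present in that assembly but absent from the CA."""
    return {k: sorted(kois - ca_kois) for k, kois in single_k_kois.items()}


# Toy identifiers (hypothetical, not results from the study).
single_k_kois = {19: {"K00001", "K00002"}, 63: {"K00002", "K00003"}}
ca_kois = {"K00001", "K00002"}

print(unique_to_each_k(single_k_kois))
print(kois_missing_from_ca(single_k_kois, ca_kois))
```

Any identifiers reported by the second function correspond to annotations that would be lost if only the clustered assembly were annotated.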

Conclusions

This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly, which affects the biological information retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, by clustering multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA.


DOI: 10.1186/1471-2105-13-170
PubMed: 22808927
PubMed Central: 3489510

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Optimization of
<italic>de novo</italic>
transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms</title>
<author>
<name sortKey="Haznedaroglu, Berat Z" sort="Haznedaroglu, Berat Z" uniqKey="Haznedaroglu B" first="Berat Z" last="Haznedaroglu">Berat Z. Haznedaroglu</name>
<affiliation>
<nlm:aff id="I1">Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Reeves, Darryl" sort="Reeves, Darryl" uniqKey="Reeves D" first="Darryl" last="Reeves">Darryl Reeves</name>
<affiliation>
<nlm:aff id="I2">Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rismani Yazdi, Hamid" sort="Rismani Yazdi, Hamid" uniqKey="Rismani Yazdi H" first="Hamid" last="Rismani-Yazdi">Hamid Rismani-Yazdi</name>
<affiliation>
<nlm:aff id="I1">Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Now at the Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Peccia, Jordan" sort="Peccia, Jordan" uniqKey="Peccia J" first="Jordan" last="Peccia">Jordan Peccia</name>
<affiliation>
<nlm:aff id="I1">Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22808927</idno>
<idno type="pmc">3489510</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3489510</idno>
<idno type="RBID">PMC:3489510</idno>
<idno type="doi">10.1186/1471-2105-13-170</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000944</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000944</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Optimization of
<italic>de novo</italic>
transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms</title>
<author>
<name sortKey="Haznedaroglu, Berat Z" sort="Haznedaroglu, Berat Z" uniqKey="Haznedaroglu B" first="Berat Z" last="Haznedaroglu">Berat Z. Haznedaroglu</name>
<affiliation>
<nlm:aff id="I1">Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Reeves, Darryl" sort="Reeves, Darryl" uniqKey="Reeves D" first="Darryl" last="Reeves">Darryl Reeves</name>
<affiliation>
<nlm:aff id="I2">Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rismani Yazdi, Hamid" sort="Rismani Yazdi, Hamid" uniqKey="Rismani Yazdi H" first="Hamid" last="Rismani-Yazdi">Hamid Rismani-Yazdi</name>
<affiliation>
<nlm:aff id="I1">Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Now at the Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Peccia, Jordan" sort="Peccia, Jordan" uniqKey="Peccia J" first="Jordan" last="Peccia">Jordan Peccia</name>
<affiliation>
<nlm:aff id="I1">Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The
<italic>k</italic>
-mer hash length is a key factor affecting the output of
<italic>de novo</italic>
transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single
<italic>k</italic>
-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single
<italic>k</italic>
-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, evaluations of assembly strategies do not consider the impact of
<italic>k</italic>
-mer selection on the annotation output. This study provides an in-depth
<italic>k</italic>
-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual
<italic>k</italic>
-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual
<italic>k</italic>
-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a
<italic>de novo</italic>
transcriptome assembly.</p>
</sec>
<sec>
<title>Results</title>
<p>Analyses of single
<italic>k</italic>
-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of
<italic>k</italic>
-mers (
<italic>k-</italic>
19 to
<italic>k-</italic>
63). For each
<italic>k</italic>
-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other
<italic>k</italic>
-mer assemblies. Producing a non-redundant CA of
<italic>k</italic>
-mers 19 to 63 resulted in a more complete functional annotation than any single
<italic>k</italic>
-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual
<italic>k</italic>
-mers (
<italic>k-</italic>
19 to
<italic>k-</italic>
63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>This study demonstrated that different
<italic>k</italic>
-mer choices result in various quantities of unique contigs per single
<italic>k</italic>
-mer assembly, which affects the biological information retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, by clustering multi-
<italic>k</italic>
assemblies with redundancy removal. The complete extraction of biological information in
<italic>de novo</italic>
transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual
<italic>k</italic>
-mer assemblies but not in the CA.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Iyer, Mk" uniqKey="Iyer M">MK Iyer</name>
</author>
<author>
<name sortKey="Chinnaiyan, Am" uniqKey="Chinnaiyan A">AM Chinnaiyan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Martin, Ja" uniqKey="Martin J">JA Martin</name>
</author>
<author>
<name sortKey="Wang, Z" uniqKey="Wang Z">Z Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Bruijn, Ng" uniqKey="De Bruijn N">NG De Bruijn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robertson, G" uniqKey="Robertson G">G Robertson</name>
</author>
<author>
<name sortKey="Schein, J" uniqKey="Schein J">J Schein</name>
</author>
<author>
<name sortKey="Chiu, R" uniqKey="Chiu R">R Chiu</name>
</author>
<author>
<name sortKey="Corbett, R" uniqKey="Corbett R">R Corbett</name>
</author>
<author>
<name sortKey="Field, M" uniqKey="Field M">M Field</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Mungall, K" uniqKey="Mungall K">K Mungall</name>
</author>
<author>
<name sortKey="Lee, S" uniqKey="Lee S">S Lee</name>
</author>
<author>
<name sortKey="Okada, Hm" uniqKey="Okada H">HM Okada</name>
</author>
<author>
<name sortKey="Qian, Jq" uniqKey="Qian J">JQ Qian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Qian, W" uniqKey="Qian W">W Qian</name>
</author>
<author>
<name sortKey="Fang, X" uniqKey="Fang X">X Fang</name>
</author>
<author>
<name sortKey="Shi, Z" uniqKey="Shi Z">Z Shi</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S Li</name>
</author>
<author>
<name sortKey="Shan, G" uniqKey="Shan G">G Shan</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K Kristiansen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabherr, Mg" uniqKey="Grabherr M">MG Grabherr</name>
</author>
<author>
<name sortKey="Haas, Bj" uniqKey="Haas B">BJ Haas</name>
</author>
<author>
<name sortKey="Yassour, M" uniqKey="Yassour M">M Yassour</name>
</author>
<author>
<name sortKey="Levin, Jz" uniqKey="Levin J">JZ Levin</name>
</author>
<author>
<name sortKey="Thompson, Da" uniqKey="Thompson D">DA Thompson</name>
</author>
<author>
<name sortKey="Amit, I" uniqKey="Amit I">I Amit</name>
</author>
<author>
<name sortKey="Adiconis, X" uniqKey="Adiconis X">X Adiconis</name>
</author>
<author>
<name sortKey="Fan, L" uniqKey="Fan L">L Fan</name>
</author>
<author>
<name sortKey="Raychowdhury, R" uniqKey="Raychowdhury R">R Raychowdhury</name>
</author>
<author>
<name sortKey="Zeng, Q" uniqKey="Zeng Q">Q Zeng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bao, S" uniqKey="Bao S">S Bao</name>
</author>
<author>
<name sortKey="Jiang, R" uniqKey="Jiang R">R Jiang</name>
</author>
<author>
<name sortKey="Kwan, W" uniqKey="Kwan W">W Kwan</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Ma, X" uniqKey="Ma X">X Ma</name>
</author>
<author>
<name sortKey="Song, Y Q" uniqKey="Song Y">Y-Q Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Narzisi, G" uniqKey="Narzisi G">G Narzisi</name>
</author>
<author>
<name sortKey="Mishra, B" uniqKey="Mishra B">B Mishra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, W" uniqKey="Zhang W">W Zhang</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Yang, Y" uniqKey="Yang Y">Y Yang</name>
</author>
<author>
<name sortKey="Tang, Y" uniqKey="Tang Y">Y Tang</name>
</author>
<author>
<name sortKey="Shang, J" uniqKey="Shang J">J Shang</name>
</author>
<author>
<name sortKey="Shen, B" uniqKey="Shen B">B Shen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Surget Groba, Y" uniqKey="Surget Groba Y">Y Surget-Groba</name>
</author>
<author>
<name sortKey="Montoya Burgos, Ji" uniqKey="Montoya Burgos J">JI Montoya-Burgos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Liang, F" uniqKey="Liang F">F Liang</name>
</author>
<author>
<name sortKey="Antonescu, V" uniqKey="Antonescu V">V Antonescu</name>
</author>
<author>
<name sortKey="Sultana, R" uniqKey="Sultana R">R Sultana</name>
</author>
<author>
<name sortKey="Karamycheva, S" uniqKey="Karamycheva S">S Karamycheva</name>
</author>
<author>
<name sortKey="Lee, Y" uniqKey="Lee Y">Y Lee</name>
</author>
<author>
<name sortKey="White, J" uniqKey="White J">J White</name>
</author>
<author>
<name sortKey="Cheung, F" uniqKey="Cheung F">F Cheung</name>
</author>
<author>
<name sortKey="Parvizi, B" uniqKey="Parvizi B">B Parvizi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Griffiths, M" uniqKey="Griffiths M">M Griffiths</name>
</author>
<author>
<name sortKey="Harrison, S" uniqKey="Harrison S">S Harrison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Horsman, M" uniqKey="Horsman M">M Horsman</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Wu, N" uniqKey="Wu N">N Wu</name>
</author>
<author>
<name sortKey="Lan, C" uniqKey="Lan C">C Lan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pruvost, J" uniqKey="Pruvost J">J Pruvost</name>
</author>
<author>
<name sortKey="Van Vooren, G" uniqKey="Van Vooren G">G Van Vooren</name>
</author>
<author>
<name sortKey="Cogne, G" uniqKey="Cogne G">G Cogne</name>
</author>
<author>
<name sortKey="Legrand, J" uniqKey="Legrand J">J Legrand</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Andrews, S" uniqKey="Andrews S">S Andrews</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cox, M" uniqKey="Cox M">M Cox</name>
</author>
<author>
<name sortKey="Peterson, D" uniqKey="Peterson D">D Peterson</name>
</author>
<author>
<name sortKey="Biggs, P" uniqKey="Biggs P">P Biggs</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Garg, R" uniqKey="Garg R">R Garg</name>
</author>
<author>
<name sortKey="Patel, Rk" uniqKey="Patel R">RK Patel</name>
</author>
<author>
<name sortKey="Tyagi, Ak" uniqKey="Tyagi A">AK Tyagi</name>
</author>
<author>
<name sortKey="Jain, M" uniqKey="Jain M">M Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feldmeyer, B" uniqKey="Feldmeyer B">B Feldmeyer</name>
</author>
<author>
<name sortKey="Wheat, C" uniqKey="Wheat C">C Wheat</name>
</author>
<author>
<name sortKey="Krezdorn, N" uniqKey="Krezdorn N">N Krezdorn</name>
</author>
<author>
<name sortKey="Rotter, B" uniqKey="Rotter B">B Rotter</name>
</author>
<author>
<name sortKey="Pfenninger, M" uniqKey="Pfenninger M">M Pfenninger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Moriya, Y" uniqKey="Moriya Y">Y Moriya</name>
</author>
<author>
<name sortKey="Itoh, M" uniqKey="Itoh M">M Itoh</name>
</author>
<author>
<name sortKey="Okuda, S" uniqKey="Okuda S">S Okuda</name>
</author>
<author>
<name sortKey="Yoshizawa, Ac" uniqKey="Yoshizawa A">AC Yoshizawa</name>
</author>
<author>
<name sortKey="Kanehisa, M" uniqKey="Kanehisa M">M Kanehisa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Aoki Kinoshita, Kf" uniqKey="Aoki Kinoshita K">KF Aoki-Kinoshita</name>
</author>
<author>
<name sortKey="Kanehisa, M" uniqKey="Kanehisa M">M Kanehisa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22808927</article-id>
<article-id pub-id-type="pmc">3489510</article-id>
<article-id pub-id-type="publisher-id">1471-2105-13-170</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-13-170</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Optimization of
<italic>de novo</italic>
transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Haznedaroglu</surname>
<given-names>Berat Z</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>berat.haznedaroglu@yale.edu</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Reeves</surname>
<given-names>Darryl</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>darryl.reeves@yale.edu</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Rismani-Yazdi</surname>
<given-names>Hamid</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>hrismani@mit.edu</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A4">
<name>
<surname>Peccia</surname>
<given-names>Jordan</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>jordan.peccia@yale.edu</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06511, USA</aff>
<aff id="I2">
<label>2</label>
Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA</aff>
<aff id="I3">
<label>3</label>
Now at the Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>18</day>
<month>7</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<fpage>170</fpage>
<lpage>170</lpage>
<history>
<date date-type="received">
<day>25</day>
<month>2</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>6</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2012 Haznedaroglu et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Haznedaroglu et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/13/170"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>The
<italic>k</italic>
-mer hash length is a key factor affecting the output of
<italic>de novo</italic>
transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single
<italic>k</italic>
-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single
<italic>k</italic>
-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, evaluations of assembly strategies do not consider the impact of
<italic>k</italic>
-mer selection on the annotation output. This study provides an in-depth
<italic>k</italic>
-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual
<italic>k</italic>
-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual
<italic>k</italic>
-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a
<italic>de novo</italic>
transcriptome assembly.</p>
</sec>
<sec>
<title>Results</title>
<p>Analyses of single
<italic>k</italic>
-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of
<italic>k</italic>
-mers (
<italic>k-</italic>
19 to
<italic>k-</italic>
63). For each
<italic>k</italic>
-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other
<italic>k</italic>
-mer assemblies. Producing a non-redundant CA of
<italic>k</italic>
-mers 19 to 63 resulted in a more complete functional annotation than any single
<italic>k</italic>
-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual
<italic>k</italic>
-mers (
<italic>k-</italic>
19 to
<italic>k-</italic>
63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>This study demonstrated that different
<italic>k</italic>
-mer choices result in various quantities of unique contigs per single
<italic>k</italic>
-mer assembly, which affects the biological information retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, by clustering multi-
<italic>k</italic>
assemblies with redundancy removal. The complete extraction of biological information in
<italic>de novo</italic>
transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual
<italic>k</italic>
-mer assemblies but not in the CA.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Transcriptomic studies using high-throughput sequencing data have enabled researchers to study the global and specific gene expression of many different organisms without the need for a fully sequenced and annotated genome [
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B2">2</xref>
]. Recently, de Bruijn graph-based [
<xref ref-type="bibr" rid="B3">3</xref>
] software packages such as Oases [
<xref ref-type="bibr" rid="B4">4</xref>
], Trans-ABySS [
<xref ref-type="bibr" rid="B5">5</xref>
], SOAPdenovo [
<xref ref-type="bibr" rid="B6">6</xref>
], and Trinity [
<xref ref-type="bibr" rid="B7">7</xref>
] have been developed to facilitate the transcriptome assembly of massive amounts of short read sequences produced using next generation DNA sequencing technologies. The power and robustness of these packages for forming contiguous sequences (contigs) has been tested, and comparative evaluations on computational resources such as execution time and parallelization, storage, and memory usage have been documented [
<xref ref-type="bibr" rid="B8">8</xref>
-
<xref ref-type="bibr" rid="B10">10</xref>
]. The choice of
<italic>k</italic>
-mer length (the length parameter defining the sequence overlap between two reads forming a contig) significantly affects the final assembly product [
<xref ref-type="bibr" rid="B5">5</xref>
]. Shorter
<italic>k</italic>
-mer values might be a better choice in low-coverage studies to prevent the formation of complex overlapping nodes; whereas a larger
<italic>k</italic>
-mer choice would be more practical for high-coverage sequencing projects [
<xref ref-type="bibr" rid="B11">11</xref>
] to improve assembly accuracy. As an alternative to a single best
<italic>k</italic>
-mer value selection, multi-
<italic>k</italic>
value based methods have been adopted to compile different
<italic>k</italic>
-mer assemblies in order to improve performance, sensitivity, and specificity of the overall
<italic>de novo</italic>
transcriptome assemblies [
<xref ref-type="bibr" rid="B2">2</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
]. Multi-
<italic>k</italic>
value based transcriptome assemblies introduce additional complexities, requiring algorithms to efficiently cluster homologous sequences from each single-
<italic>k</italic>
assembly and to remove redundant contigs to generate the final non-redundant clustered assembly (CA). Several algorithms such as CD-HIT-EST [
<xref ref-type="bibr" rid="B13">13</xref>
], VMATCH [
<xref ref-type="bibr" rid="B14">14</xref>
], and TGI Clustering tools [
<xref ref-type="bibr" rid="B15">15</xref>
] have been developed to obtain an optimal assembly clustering.</p>
<p>To date, the optimization studies for both single
<italic>k</italic>
-mer and clustered multi
<italic>k</italic>
-mer assemblies have largely focused on the length and number of contigs produced as a metric to evaluate the quality of the assembly output. There is, however, a limited understanding of how functional annotation—a primary goal of
<italic>de novo</italic>
transcriptome analysis—is affected by
<italic>k</italic>
-mer selection and clustering of multi-
<italic>k</italic>
assemblies. In this study, we report the significance of
<italic>k</italic>
-mer selection in the
<italic>de novo</italic>
assembly and annotation of a non-model eukaryotic organism’s transcriptome with no reference genome information available. We document the variations in uniqueness and the degree of functional annotations obtained under single
<italic>k</italic>
-mer and multi-
<italic>k</italic>
clustering methods, and present an assembly strategy to optimize the functional annotation to generate the gene catalogue of a non-model eukaryotic organism. The analysis was performed on Illumina short-read sequencing data of mRNA transcripts from the microalga
<italic>Neochloris oleoabundans</italic>
, a candidate species for the production of microalgae-based biofuels [
<xref ref-type="bibr" rid="B16">16</xref>
,
<xref ref-type="bibr" rid="B17">17</xref>
].</p>
<p>Herein, we also demonstrate that the combination of individual
<italic>k</italic>
-mer assemblies improves, but does not complete, the annotation of all available unique contigs produced in an assembly. A workflow and useful scripts are provided to allow retrieval of additional biological information from contigs that are present in individual
<italic>k</italic>
-mer assemblies, but not in the clustered
<italic>k</italic>
-mer assembly.</p>
</sec>
<sec>
<title>Results and discussion</title>
<sec>
<title>Sequencing and
<italic>de novo</italic>
transcriptome assembly</title>
<p>Following the removal of short and low-quality reads, the remaining read set was assembled using the combined Velvet and Oases packages [
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B11">11</xref>
] with single-
<italic>k</italic>
value selection of odd numbers ranging from 19 to 63. The assembly metrics are provided in Table
<xref ref-type="table" rid="T1">1</xref>
for the representative
<italic>k</italic>
-mers: 19, 21, 23, 27, 33, 37, 43, 53, and 63. The number of reads assembled ranged from ~16.9 M (for
<italic>k</italic>
-27 assembly) to ~30.2 M (for
<italic>k</italic>
-53 assembly) and was highest for the three largest representative values (43, 53, and 63), whilst the number of reads mapped was within the range of ~27.8 to 33.4 M for all assemblies. Contig numbers, length distributions, and length-weighted medians (N50 and N90) were comparable among all assemblies, except the
<italic>k</italic>
-19 assembly (Table
<xref ref-type="table" rid="T1">1</xref>
). The highest number of contigs produced per assembly was 99,438 for the
<italic>k</italic>
-19 assembly. The contig number steadily decreased to 32,780 as the
<italic>k</italic>
-mer value increased to 63. Individual contig length and count frequencies are also depicted in Figure
<xref ref-type="fig" rid="F1">1</xref>
for the same representative set of
<italic>k</italic>
-mers from 19 through 63. These length data (calculated as N50 and N90 in Table
<xref ref-type="table" rid="T1">1</xref>
) and the contig length distribution histograms (Figure
<xref ref-type="fig" rid="F1">1</xref>
) demonstrate that greater contiguity was achieved in mid-range assemblies with
<italic>k</italic>
-mer selection of 21 to 43 as compared to
<italic>k</italic>
-19,
<italic>k</italic>
-53, and
<italic>k</italic>
-63 assemblies. The average contig length for
<italic>k</italic>
-21 to
<italic>k</italic>
-43 assemblies was approximately 1.4 times longer than that of
<italic>k</italic>
-19,
<italic>k</italic>
-53, and
<italic>k</italic>
-63 assemblies. Additionally, the longest average contig length assembled was 1,463 bp in the
<italic>k</italic>
-23 assembly. </p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Transcriptome sequencing and assembly summary</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="right"></col>
<col align="right"></col>
<col align="right"></col>
<col align="left"></col>
<col align="right"></col>
<col align="right"></col>
<col align="right"></col>
<col align="right"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left"> </th>
<th align="left">
<bold>
<italic>k</italic>
</bold>
<bold>-19</bold>
</th>
<th align="left">
<bold>
<italic>k-</italic>
</bold>
<bold>21</bold>
</th>
<th align="left">
<bold>
<italic>k-</italic>
</bold>
<bold>23</bold>
</th>
<th align="left">
<bold>
<italic>k-</italic>
</bold>
<bold>27</bold>
</th>
<th align="left">
<bold>
<italic>k-</italic>
</bold>
<bold>33</bold>
</th>
<th align="left">
<bold>
<italic>k-</italic>
</bold>
<bold>37</bold>
</th>
<th align="left">
<bold>
<italic>k-</italic>
</bold>
<bold>43</bold>
</th>
<th align="left">
<bold>
<italic>k</italic>
</bold>
<bold>-53</bold>
</th>
<th align="left">
<bold>
<italic>k</italic>
</bold>
<bold>-63</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">
<bold>Sequencing</bold>
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Raw sequencing reads
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">44,568,122
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Read length
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">99
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Pre-assembly</bold>
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Reads requiring trimming
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">29,264,547
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Minimum read length
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">1
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Lower quartile read length
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">65
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Median read length
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">87
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Upper quartile read length
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">99
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Maximum read length
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom">99
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Assembly</bold>
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="left" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
<td align="right" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of reads assembled
<hr></hr>
</td>
<td align="right" valign="bottom">18,097,635
<hr></hr>
</td>
<td align="right" valign="bottom">18,481,043
<hr></hr>
</td>
<td align="right" valign="bottom">18,018,855
<hr></hr>
</td>
<td align="right" valign="bottom">16,878,820
<hr></hr>
</td>
<td align="right" valign="bottom">16,918,312
<hr></hr>
</td>
<td align="right" valign="bottom">17,188,419
<hr></hr>
</td>
<td align="right" valign="bottom">26,970,166
<hr></hr>
</td>
<td align="right" valign="bottom">30,231,540
<hr></hr>
</td>
<td align="right" valign="bottom">28,058,816
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of reads mapped
<hr></hr>
</td>
<td align="right" valign="bottom">28,449,200
<hr></hr>
</td>
<td align="right" valign="bottom">31,893,197
<hr></hr>
</td>
<td align="right" valign="bottom">33,199,219
<hr></hr>
</td>
<td align="right" valign="bottom">32,903,930
<hr></hr>
</td>
<td align="right" valign="bottom">33,395,304
<hr></hr>
</td>
<td align="right" valign="bottom">32,523,018
<hr></hr>
</td>
<td align="right" valign="bottom">31,939,290
<hr></hr>
</td>
<td align="right" valign="bottom">30,267,756
<hr></hr>
</td>
<td align="right" valign="bottom">27,808,027
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of contigs (≥ 100 bp)
<hr></hr>
</td>
<td align="right" valign="bottom">98,094
<hr></hr>
</td>
<td align="right" valign="bottom">64,000
<hr></hr>
</td>
<td align="right" valign="bottom">47,448
<hr></hr>
</td>
<td align="right" valign="bottom">46,461
<hr></hr>
</td>
<td align="right" valign="bottom">40,965
<hr></hr>
</td>
<td align="right" valign="bottom">46,442
<hr></hr>
</td>
<td align="right" valign="bottom">34,489
<hr></hr>
</td>
<td align="right" valign="bottom">33,344
<hr></hr>
</td>
<td align="right" valign="bottom">32,639
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of contigs (≥ 5,000 bp)
<hr></hr>
</td>
<td align="right" valign="bottom">470
<hr></hr>
</td>
<td align="right" valign="bottom">1,296
<hr></hr>
</td>
<td align="right" valign="bottom">1,587
<hr></hr>
</td>
<td align="right" valign="bottom">1,315
<hr></hr>
</td>
<td align="right" valign="bottom">1,025
<hr></hr>
</td>
<td align="right" valign="bottom">742
<hr></hr>
</td>
<td align="right" valign="bottom">636
<hr></hr>
</td>
<td align="right" valign="bottom">253
<hr></hr>
</td>
<td align="right" valign="bottom">115
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of contigs (≥ 8,000 bp)
<hr></hr>
</td>
<td align="right" valign="bottom">42
<hr></hr>
</td>
<td align="right" valign="bottom">155
<hr></hr>
</td>
<td align="right" valign="bottom">215
<hr></hr>
</td>
<td align="right" valign="bottom">165
<hr></hr>
</td>
<td align="right" valign="bottom">119
<hr></hr>
</td>
<td align="right" valign="bottom">72
<hr></hr>
</td>
<td align="right" valign="bottom">105
<hr></hr>
</td>
<td align="right" valign="bottom">17
<hr></hr>
</td>
<td align="right" valign="bottom">22
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Average length of contigs
<hr></hr>
</td>
<td align="right" valign="bottom">700
<hr></hr>
</td>
<td align="right" valign="bottom">1,114
<hr></hr>
</td>
<td align="right" valign="bottom">1,463
<hr></hr>
</td>
<td align="right" valign="bottom">1,356
<hr></hr>
</td>
<td align="right" valign="bottom">1,383
<hr></hr>
</td>
<td align="right" valign="bottom">1,115
<hr></hr>
</td>
<td align="right" valign="bottom">1,402
<hr></hr>
</td>
<td align="right" valign="bottom">1,120
<hr></hr>
</td>
<td align="right" valign="bottom">914
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Longest contig length
<hr></hr>
</td>
<td align="right" valign="bottom">46,754
<hr></hr>
</td>
<td align="right" valign="bottom">29,394
<hr></hr>
</td>
<td align="right" valign="bottom">16,393
<hr></hr>
</td>
<td align="right" valign="bottom">14,115
<hr></hr>
</td>
<td align="right" valign="bottom">13,754
<hr></hr>
</td>
<td align="right" valign="bottom">13,685
<hr></hr>
</td>
<td align="right" valign="bottom">12,571
<hr></hr>
</td>
<td align="right" valign="bottom">9,484
<hr></hr>
</td>
<td align="right" valign="bottom">12,582
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">N50
<hr></hr>
</td>
<td align="right" valign="bottom">1,594
<hr></hr>
</td>
<td align="right" valign="bottom">2,415
<hr></hr>
</td>
<td align="right" valign="bottom">2,745
<hr></hr>
</td>
<td align="right" valign="bottom">2,624
<hr></hr>
</td>
<td align="right" valign="bottom">2,497
<hr></hr>
</td>
<td align="right" valign="bottom">2,202
<hr></hr>
</td>
<td align="right" valign="bottom">2,349
<hr></hr>
</td>
<td align="right" valign="bottom">1,836
<hr></hr>
</td>
<td align="right" valign="bottom">1,498
<hr></hr>
</td>
</tr>
<tr>
<td align="left">N90</td>
<td align="right">249</td>
<td align="right">470</td>
<td align="right">795</td>
<td align="right">696</td>
<td align="right">733</td>
<td align="right">488</td>
<td align="right">730</td>
<td align="right">515</td>
<td align="right">372</td>
</tr>
</tbody>
</table>
</table-wrap>
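<p>As an illustration of how the length-weighted medians in Table 1 are commonly defined, the following minimal Python sketch computes N50 and N90 from a list of contig lengths. It assumes the conventional definition (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches 50% or 90% of the total assembly length). The function name nx and the example lengths are hypothetical and are not taken from the assemblies in this study.</p>

```python
def nx(lengths, fraction=0.5):
    """Return the length-weighted quantile of contig lengths.

    fraction=0.5 gives N50 and fraction=0.9 gives N90, under the
    conventional definition assumed here.
    """
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    threshold = fraction * sum(ordered)      # share of total assembly length
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contig lengths supplied")


# Hypothetical contig lengths in bp, for illustration only.
contigs = [8000, 8000, 4000, 3000, 3000, 2000, 2000, 2000]
n50 = nx(contigs, 0.5)
n90 = nx(contigs, 0.9)
print("N50:", n50, "N90:", n90)
```

<p>Because both statistics depend only on the sorted contig lengths, the same function could be applied to each single-k assembly in turn to compare contiguity across k-mer values.</p>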
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>Cumulative contig length frequency distributions for individual assemblies.</p>
</caption>
<graphic xlink:href="1471-2105-13-170-1"></graphic>
</fig>
</sec>
<sec>
<title>Effects of
<italic>k</italic>
-mer selection on mapping, functional annotation, and coverage</title>
<p>To compare the attainable functional annotation among assemblies, the contigs originating from single-
<italic>k</italic>
value assemblies were separately mapped to the KEGG gene and protein families, and the number of unique KEGG Ortholog Identifiers (KOIs) was determined. The number of KOIs identified for a single
<italic>k</italic>
-mer value reflected the trend previously observed with the contig number (Figure
<xref ref-type="fig" rid="F2">2</xref>
). The highest number of KOIs was generated from the
<italic>k</italic>
-19 assembly, and the number of identified KOIs decreased as the
<italic>k</italic>
-mer value increased to 63.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Total contig and KOI counts for each </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assembly.</bold>
</p>
</caption>
<graphic xlink:href="1471-2105-13-170-2"></graphic>
</fig>
<p>To investigate if each assembly contained a distinct collection of identified genes, the KOIs unique to each
<italic>k</italic>
-mer assembly were identified and their quantities are presented in Figure
<xref ref-type="fig" rid="F3">3</xref>
. This matrix displays the number of unique KOIs (for a specific row
<italic>k-</italic>
mer value assembly) not found in the set of column
<italic>k</italic>
-mer assemblies in a pair-wise comparison. Moving down the
<italic>k</italic>
-mer column in Figure
<xref ref-type="fig" rid="F3">3</xref>
, the
<italic>k</italic>
-63 assembly resulted in the highest number of unique KOIs that were missing in the other assemblies, followed by
<italic>k</italic>
-61,
<italic>k</italic>
-59,
<italic>k</italic>
-57, and
<italic>k</italic>
-19. The number of missing KOIs decreased as
<italic>k</italic>
-mer value increased from 19 to 37 and then increased afterwards.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Comparative matrix of number of unique KOIs missing in each single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assembly.</bold>
Each value represents the number of unique KOIs (for a specific row
<italic>k</italic>
-mer value assembly) not identified in the set of column
<italic>k</italic>
-mer assemblies.</p>
</caption>
<graphic xlink:href="1471-2105-13-170-3"></graphic>
</fig>
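<p>The pair-wise comparison summarized in Figure 3 can be expressed as set differences between the KOI sets of two assemblies. The sketch below is a minimal illustration under that assumption and is not the script used in this study. The function names and the KOI identifiers are hypothetical placeholders.</p>

```python
def missing_matrix(koi_sets):
    """Count, for each ordered pair (row, col), the KOIs found in the
    row assembly that are absent from the column assembly."""
    keys = sorted(koi_sets)
    return {
        (row, col): len(koi_sets[row] - koi_sets[col])
        for row in keys
        for col in keys
        if row != col
    }


def unique_to(row, koi_sets):
    """Return the KOIs found only in the row assembly and in no other."""
    others = set().union(*(s for k, s in koi_sets.items() if k != row))
    return koi_sets[row] - others


# Hypothetical KOI sets keyed by k-mer value; identifiers are placeholders.
koi_sets = {
    19: {"K00001", "K00002", "K00003"},
    37: {"K00002", "K00003", "K00004"},
    63: {"K00003", "K00005"},
}
matrix = missing_matrix(koi_sets)
print(matrix[(19, 63)], unique_to(63, koi_sets))
```

<p>The same set difference, taken between a single-k assembly and a clustered assembly, gives the kind of missing-KOI counts used when comparing single-k assemblies with CAs.</p>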
<p>This analysis showed that the number of missing KOIs was minimal for mid-range
<italic>k</italic>
-mers (21 to 41) and more pronounced for short and long
<italic>k</italic>
-mer sizes (19, and 43 and above). The highest quantities of missing KOIs corresponded to the highest and lowest numbers of generated contigs, in the
<italic>k</italic>
-19 and
<italic>k-</italic>
63 assemblies, respectively. This was not surprising, as these two extreme assemblies likely contained more unannotated contigs than the other single
<italic>k</italic>
-mer assemblies, in which higher accuracy in biological annotation is achieved through an optimal mid-range
<italic>k</italic>
-mer length and, ultimately, contig length.</p>
<p>To further characterize the relationship between the single
<italic>k</italic>
-mer assemblies and the quantities of generated contigs, trimmed reads were mapped to individual
<italic>k</italic>
-19 to
<italic>k</italic>
-63 transcriptome assemblies and the fold coverage for each assembly was determined (Figure
<xref ref-type="fig" rid="F4">4</xref>
). The results plotted in Figure
<xref ref-type="fig" rid="F4">4</xref>
for the representative
<italic>k</italic>
-mer set demonstrate that under all mismatch parameters tested (i.e. 0, 1, and 2) the coverage was above 1600× for all
<italic>k</italic>
-mer values except
<italic>k-</italic>
19. When one or two mismatches were allowed, more than 2300× coverage was obtained for
<italic>k</italic>
-mers 23 to 53 except 37. Although lower, the contig coverage for the
<italic>k</italic>
-19 assembly was still greater than 1000× (Figure
<xref ref-type="fig" rid="F4">4</xref>
).</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Contig coverages of representative single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies as a function of the number of mismatches allowed by Bowtie.</bold>
</p>
</caption>
<graphic xlink:href="1471-2105-13-170-4"></graphic>
</fig>
<p>Although
<italic>k</italic>
-mer 19 resulted in a lower-quality assembly in terms of coverage and missing annotations, as discussed above, the <italic>k</italic>-19 assembly might still have value in annotating the transcriptome. Overall, lower
<italic>k</italic>
-mer assemblies are more successful in capturing transcripts with lower abundances, but as
<italic>k</italic>
-mer length increases, transcripts with higher abundances are more likely to be detected. Therefore, all individual assemblies of
<italic>k</italic>
-19 to
<italic>k</italic>
-63 were utilized to generate a multi-
<italic>k</italic>
based CA [
<xref ref-type="bibr" rid="B11">11</xref>
].</p>
</sec>
<sec>
<title>Assembly clustering and optimization</title>
<p>CAs were generated using three different sequence clustering programs: Oases (through its own multi-
<italic>k</italic>
option), CD-HIT-EST, and VMATCH. The CAs obtained with varying clustering scenarios allowed by these packages were annotated and the number of KOIs present in individual
<italic>k</italic>
-mer assemblies but not in the CA was determined (Table
<xref ref-type="table" rid="T2">2</xref>
). The best-performing package in this regard was CD-HIT-EST, with 456 missing KOIs in total when a sequence identity threshold of 1.0 was chosen. The Oases package produced similar results, with 635 missing KOIs in total when its multi-
<italic>k</italic>
option was enabled (Table
<xref ref-type="table" rid="T2">2</xref>
). Furthermore, a reversed comparative analysis was conducted based on the Oases multi-
<italic>k</italic>
and CD-HIT-EST (1.0) results to determine the number of KOIs annotated in the CA, but not present in individual
<italic>k</italic>
-mer assemblies (Figure
<xref ref-type="fig" rid="F5">5</xref>
). This analysis demonstrated that the CAs had considerably fewer missing KOIs than did the individual assemblies. In addition, the number of missing unique KOIs was ~2.5-7 times lower for the CA generated by CD-HIT-EST and ~5-40 times lower for the CA generated by the Oases multi-
<italic>k</italic>
compared to the single assemblies of
<italic>k</italic>
-mers 19 to 63.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>
<bold>Number of KOIs present in individual </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies but missing from the combined assemblies generated with different programs</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="char"></col>
<col align="char"></col>
<col align="char"></col>
<col align="char"></col>
<col align="char"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>
<italic>k</italic>
</bold>
</th>
<th align="char" char=".">
<bold>Oases multi-</bold>
<bold>
<italic>k </italic>
</bold>
<bold>merged</bold>
</th>
<th align="char" char=".">
<bold>CD-HIT-EST (0.90)</bold>
<sup>
<bold>1</bold>
</sup>
</th>
<th align="char" char=".">
<bold>CD-HIT-EST (0.95)</bold>
<sup>
<bold>1</bold>
</sup>
</th>
<th align="char" char=".">
<bold>CD-HIT-EST (1.0)</bold>
<sup>
<bold>1</bold>
</sup>
</th>
<th align="char" char=".">
<bold>VMATCH</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">
<bold>19</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">58
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">108
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">86
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">33
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">481
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>21</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">19
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">55
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">41
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">9
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">433
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>23</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">9
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">37
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">23
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">3
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">402
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>25</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">7
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">44
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">28
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">4
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">409
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>27</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">8
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">45
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">29
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">5
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">415
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>29</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">8
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">53
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">36
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">7
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">420
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>31</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">8
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">47
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">30
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">8
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">413
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>33</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">10
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">43
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">27
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">9
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">413
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>35</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">13
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">51
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">35
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">8
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">422
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>37</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">17
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">60
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">40
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">10
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">432
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>39</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">19
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">57
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">38
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">9
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">412
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>41</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">17
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">59
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">45
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">9
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">412
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>43</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">22
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">63
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">49
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">14
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">413
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>45</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">22
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">67
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">52
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">14
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">420
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>47</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">24
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">76
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">58
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">19
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">426
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>49</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">33
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">81
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">63
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">25
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">432
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>51</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">36
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">89
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">73
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">28
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">446
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>53</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">38
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">95
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">75
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">33
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">447
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>55</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">42
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">100
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">80
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">35
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">449
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>57</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">46
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">103
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">83
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">37
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">460
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>59</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">55
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">113
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">93
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">42
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">473
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>61</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">56
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">112
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">97
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">43
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">468
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>63</bold>
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">68
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">125
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">110
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">52
<hr></hr>
</td>
<td align="char" valign="bottom" char=".">472
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<bold>TOTAL</bold>
</td>
<td align="char" char=".">
<bold>635</bold>
</td>
<td align="char" char=".">
<bold>1683</bold>
</td>
<td align="char" char=".">
<bold>1291</bold>
</td>
<td align="char" char=".">
<bold>456</bold>
</td>
<td align="char" char=".">
<bold>9970</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<sup>1</sup>
Represents sequence identity.</p>
</table-wrap-foot>
</table-wrap>
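<p>The column totals in Table 2 can be recomputed directly from the per-k values, which makes the ranking of the clustering scenarios easy to check. The following Python sketch transcribes the five columns of Table 2 and sums them. The variable names are illustrative only, and the ranking criterion (fewest missing KOIs in total) is the one used in the text.</p>

```python
# Missing-KOI counts transcribed from Table 2, one list per clustering
# scenario, ordered by k-mer value from 19 to 63 in steps of 2.
missing = {
    "Oases multi-k": [58, 19, 9, 7, 8, 8, 8, 10, 13, 17, 19, 17,
                      22, 22, 24, 33, 36, 38, 42, 46, 55, 56, 68],
    "CD-HIT-EST (0.90)": [108, 55, 37, 44, 45, 53, 47, 43, 51, 60, 57, 59,
                          63, 67, 76, 81, 89, 95, 100, 103, 113, 112, 125],
    "CD-HIT-EST (0.95)": [86, 41, 23, 28, 29, 36, 30, 27, 35, 40, 38, 45,
                          49, 52, 58, 63, 73, 75, 80, 83, 93, 97, 110],
    "CD-HIT-EST (1.0)": [33, 9, 3, 4, 5, 7, 8, 9, 8, 10, 9, 9,
                         14, 14, 19, 25, 28, 33, 35, 37, 42, 43, 52],
    "VMATCH": [481, 433, 402, 409, 415, 420, 413, 413, 422, 432, 412, 412,
               413, 420, 426, 432, 446, 447, 449, 460, 473, 468, 472],
}
k_values = list(range(19, 65, 2))  # 19, 21, ..., 63

# Total missing KOIs per scenario, and the scenario with the fewest.
totals = {name: sum(column) for name, column in missing.items()}
best = min(totals, key=totals.get)

for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(name, total)
```

<p>The computed totals reproduce the TOTAL row of Table 2, with CD-HIT-EST at a sequence identity of 1.0 giving the fewest missing KOIs, followed by the Oases multi-k merge.</p>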
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Number of missing KOIs compared in reverse between clustered assemblies and single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies.</bold>
Data show a reverse comparative analysis: the number of KOIs annotated in the CAs but missing in single
<italic>k</italic>
-mer assemblies (open triangles for CD-HIT-EST 1.0 and squares for Oases), and the number of KOIs annotated in the single
<italic>k</italic>
-mer assemblies but missing in the CAs (closed triangles for CD-HIT-EST 1.0 and squares for Oases).</p>
</caption>
<graphic xlink:href="1471-2105-13-170-5"></graphic>
</fig>
<p>Nevertheless, Table
<xref ref-type="table" rid="T2">2</xref>
and Figure
<xref ref-type="fig" rid="F5">5</xref>
collectively indicated that there were still missing KOIs in both CAs and single
<italic>k-</italic>
mer assemblies. Although the number of missing KOIs in the CA was low compared with the total number of KOIs identified, some relevant biological information was still lost during the clustering step. This loss was attributed to the heuristic-based searches that both CD-HIT-EST and Oases use to reduce computation time and memory usage, which can cause minor inconsistencies during the removal of redundant sequences.</p>
<p>To further characterize the degree of lost biological information, the missing KOIs identified during the pair-wise comparisons in Figure
<xref ref-type="fig" rid="F5">5</xref>
were subjected to full biological annotation using KEGG BRITE gene and protein families. The missing KOI lists and their full biological annotations are presented as Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
and Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
[annotated KOIs missing in single
<italic>k</italic>
-mer assemblies otherwise present in CAs, corresponding to an average 0.19% of total KOIs, identified with CD-HIT-EST 1.0 scenario (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
) and 0.27% in Oases multi-
<italic>k</italic>
(Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
)], and Additional file
<xref ref-type="supplementary-material" rid="S3">3</xref>
and Additional file
<xref ref-type="supplementary-material" rid="S4">4</xref>
[annotated KOIs missing in CAs otherwise present in single
<italic>k</italic>
-mer assemblies, corresponding to an average 3.43% of total KOIs, identified with CD-HIT-EST 1.0 scenario (Additional file
<xref ref-type="supplementary-material" rid="S3">3</xref>
) and 3.48% in Oases multi-
<italic>k</italic>
(Additional file
<xref ref-type="supplementary-material" rid="S4">4</xref>
)]. A detailed discussion of the lost biological information is beyond the scope of this paper, as its nature and value would differ for each researcher and assembly goal. Nevertheless, an important general interpretation is that relevant genes encoding enzymes and proteins (of particular interest for the lipid-producing microalgal species in this study) were identified as missing in single
<italic>k</italic>
-mer assemblies but present in CAs (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
and Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
) and vice-versa (Additional file
<xref ref-type="supplementary-material" rid="S3">3</xref>
and Additional file
<xref ref-type="supplementary-material" rid="S4">4</xref>
). This suggests that comprehensive annotation should include, in addition to the CA, an interrogation of unique genes in the assemblies of individual
<italic>k</italic>
-mers from 19 to 63.</p>
</sec>
<sec>
<title>Suggested workflow for optimizing
<italic>de novo</italic>
transcriptome annotation</title>
<p>Although the generated CA provided the best annotation results, comparison with single
<italic>k</italic>
-mer assemblies suggested that this approach still results in the loss of some biological information as discussed above. A workflow is presented in Additional file
<xref ref-type="supplementary-material" rid="S5">5</xref>
, along with several useful scripts, as a guide to improve the annotation of
<italic>de novo</italic>
assembled transcriptomes. The workflow first quantifies the number of annotations that can be generated from single
<italic>k</italic>
-mer assemblies via quick annotation services (such as KAAS) to determine the optimal
<italic>k</italic>
-mer value range targeted to capture the most comprehensive functional annotation. Next, a clustered assembly should be generated using these
<italic>k</italic>
-mer values to produce the full set of non-redundant contigs. Finally, pairwise comparisons are performed to identify the unique contigs that are not present in the multi-
<italic>k</italic>
clustered assembly (but are detected in single
<italic>k</italic>
-mer assemblies). These missing contigs should be incorporated into the final assembly product prior to annotation.</p>
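<p>The final step of this workflow can be illustrated with a minimal Python sketch. It is not the code supplied with the Additional files; it assumes that KOI assignments are already available as in-memory mappings, and the function name, data layout, and example identifiers are hypothetical. A KOI that is absent from the CA is used here as a proxy for a contig that the CA lacks.</p>
<preformat>
```python
def contigs_to_recover(single_k_hits, ca_kois):
    """Pick contigs from single k-mer assemblies whose KOI is absent from the CA.

    single_k_hits: dict mapping k (int) to a dict of contig id to KOI string
    ca_kois: set of KOIs already annotated in the clustered assembly (CA)
    Returns a dict mapping each missing KOI to one (k, contig id) pair.
    """
    recovered = {}
    for k in sorted(single_k_hits):
        for contig_id, koi in sorted(single_k_hits[k].items()):
            # Keep only the first contig encountered for each KOI the CA lacks.
            if koi not in ca_kois and koi not in recovered:
                recovered[koi] = (k, contig_id)
    return recovered


if __name__ == "__main__":
    # Hypothetical toy data: two single k-mer assemblies and a small CA.
    hits = {
        19: {"c19_1": "K00001", "c19_2": "K00002"},
        21: {"c21_1": "K00002", "c21_2": "K00003"},
    }
    for koi, (k, contig_id) in sorted(contigs_to_recover(hits, {"K00001"}).items()):
        print(koi, k, contig_id)
```
</preformat>
<p>Keeping a single representative contig per missing KOI is a design choice of this sketch; all contigs carrying a missing KOI could be retained instead.</p>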
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>For the
<italic>de novo</italic>
transcriptome assembly of non-model organisms from short read sequencing data, de Bruijn graph based algorithms use
<italic>k</italic>
-mer hash lengths to accommodate transcripts with different sizes. Here, we provide an in-depth analysis of the effects of individual
<italic>k</italic>
-mer length and multiple
<italic>k</italic>
-mer assembly methods on transcriptome annotation. Results demonstrate that different
<italic>k</italic>
-mer choices result in different quantities of unique contigs per single
<italic>k</italic>
-mer assembly, which in turn impact the amount of biological information that is retrievable from the transcriptome. Although this undesirable effect could be minimized with clustering of multi-
<italic>k</italic>
assemblies, it is not completely eliminated because of limitations in the heuristic algorithms used for redundancy removal when generating clustered
<italic>k</italic>
-mer assemblies. We present useful scripts and a workflow to retrieve some of the missing biological information. With high-throughput DNA sequencing methods removing limitations in transcriptome coverage, assembly-based optimization is important for continually improving the completeness of transcriptomes, particularly in non-model organisms for which no reference genome is available. Taken together, our results provide important guidance on selecting and combining
<italic>k</italic>
-mer lengths to improve the extraction of biological information from
<italic>de novo</italic>
transcriptome assemblies.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Algae growth, cDNA sequencing, read trimming, and assembly</title>
<p>
<italic>Neochloris oleoabundans</italic>
(a green microalga of the class Chlorophyceae) was grown in batch cultures under nitrogen-stressed and unstressed conditions [
<xref ref-type="bibr" rid="B17">17</xref>
,
<xref ref-type="bibr" rid="B18">18</xref>
]. Total RNA was extracted after 11 days of growth using the RNeasy Plant Mini Kit (Qiagen, Germantown, MD). Library preparation was conducted using the mRNA-Seq Kit supplied by Illumina (Illumina, Inc., San Diego, CA). Briefly, the mRNA fraction was isolated from total RNA using two rounds of hybridization to Dynal oligo(dT) magnetic beads (Invitrogen, Carlsbad, CA). The mRNA was then fragmented in the presence of divalent cations at 94°C and subsequently converted into double-stranded cDNA through first- and second-strand cDNA synthesis using random hexamer primers. After polishing the ends of the cDNA using T4 DNA polymerase and Klenow DNA polymerase for 30 min at 20°C, a single adenine base was added to the 3’ ends of the cDNA molecules. Illumina mRNA-Seq Kit specific adaptors were then ligated to the cDNA 3’ ends. Subsequently, the cDNA was PCR-amplified for 15 cycles and the amplicons were purified using the Qiagen PCR purification kit (Qiagen, Germantown, MD). The size and concentration of the cDNA libraries were determined on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Each cDNA library was loaded onto a lane of the Illumina flow cell and sequenced at the Yale Center for Genome Analysis using a Genome Analyzer IIx and the 99 bp single-read recipe. An additional lane was also used to run sequencing controls. Raw sequencing reads (44,568,122 reads; 99 bp single-end) of cDNA were analyzed with the FastQC quality control tool (v0.10.0) to evaluate the read sequence quality [
<xref ref-type="bibr" rid="B19">19</xref>
]. Low-quality reads with a Phred score of 13 or less were removed using the SolexaQA software package (v1.1) [
<xref ref-type="bibr" rid="B20">20</xref>
]. After trimming, the FastQC analysis was conducted again to ensure quality measures were met in the remaining reads.</p>
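<p>To make the quality threshold concrete, the following Python sketch keeps the longest stretch of a read in which every base scores above the cutoff. It illustrates threshold-based trimming in general and is not a reimplementation of SolexaQA; the function names are hypothetical, and Phred+64 encoding is assumed because the Bowtie settings reported below include a phred64 option.</p>
<preformat>
```python
def phred64_scores(quality):
    """Decode a Phred+64 quality string into a list of integer scores."""
    return [ord(ch) - 64 for ch in quality]


def longest_high_quality_segment(quality, cutoff=13):
    """Return (start, end) of the longest run of bases scoring above cutoff."""
    best_start, best_end = 0, 0
    run_start = None
    scores = phred64_scores(quality)
    # A sentinel equal to the cutoff closes a run that reaches the read end.
    for i, score in enumerate(scores + [cutoff]):
        if score > cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_end - best_start:
                best_start, best_end = run_start, i
            run_start = None
    return best_start, best_end


def trim_read(sequence, quality, cutoff=13):
    """Trim a read and its quality string to the best-scoring segment."""
    start, end = longest_high_quality_segment(quality, cutoff)
    return sequence[start:end], quality[start:end]
```
</preformat>
<p>A read whose bases all score 13 or less yields an empty segment and would be discarded entirely under this scheme.</p>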
<p>Based on their common application in
<italic>de novo </italic>
transcriptomic studies using Illumina reads [
<xref ref-type="bibr" rid="B21">21</xref>
,
<xref ref-type="bibr" rid="B22">22</xref>
], the Velvet (v1.2.03) [
<xref ref-type="bibr" rid="B11">11</xref>
] and Oases (v0.2.06) [
<xref ref-type="bibr" rid="B4">4</xref>
] packages were used to assemble the high-quality reads. To investigate the impact of
<italic>k</italic>
-mer choice on the assembly dynamics, separate assemblies were performed for odd
<italic>k</italic>
-mer values ranging from 19 to 63 using the “oases_pipeline.py” script provided in the Oases package. The raw sequence data used in this study have been submitted to the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) and are publicly available under accession numbers SRR391512.1 and SRR391513.1.</p>
</sec>
<sec>
<title>KOI assignment and functional annotation</title>
<p>The resulting contigs from each individual
<italic>k</italic>
-mer assembly were submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) (v1.6a) [
<xref ref-type="bibr" rid="B23">23</xref>
] for KOI assignment using the default settings with the single-directional best hit (SBH) method and reference databases that included several eukaryotic organisms (including the green microalga
<italic>C. reinhardtii</italic>
) [
<xref ref-type="bibr" rid="B23">23</xref>
]. Functional annotation of the KOI assignments was derived from the KEGG BRITE genes and protein families database [
<xref ref-type="bibr" rid="B24">24</xref>
]. Comparison of KOIs for each
<italic>k</italic>
-mer assembly to KOIs for all other individual
<italic>k</italic>
-mer assemblies was performed to determine the number of KOIs unique to each
<italic>k</italic>
-mer assembly. This analysis was performed using a custom-built script written in the R programming language, which is provided in Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
.</p>
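<p>The unique-KOI comparison described above can be sketched as follows. The authors' script was written in R; this Python version is only an illustration with hypothetical function names. It also assumes that each annotation result is a tab-separated list of contig identifier and KO number, with unannotated contigs carrying no second field.</p>
<preformat>
```python
def parse_ko_lines(lines):
    """Collect KOIs from 'contig_id TAB K-number' lines; unannotated lines are skipped."""
    kois = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            kois.add(fields[1])
    return kois


def unique_kois(kois_by_k):
    """Map each k to the KOIs found in that assembly and in no other assembly."""
    unique = {}
    for k, kois in kois_by_k.items():
        others = set()
        for other_k, other_kois in kois_by_k.items():
            if other_k != k:
                others |= other_kois
        unique[k] = kois - others
    return unique


if __name__ == "__main__":
    # Hypothetical toy input for three single k-mer assemblies.
    example = {19: {"K1", "K2"}, 21: {"K2", "K3"}, 23: {"K2"}}
    for k, kois in sorted(unique_kois(example).items()):
        print(k, sorted(kois))
```
</preformat>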
</sec>
<sec>
<title>Mapping reads to the assembled transcriptome</title>
<p>To understand the relationship between coverage and generated contigs, trimmed reads were mapped against each assembly (
<italic>k</italic>
-19 to
<italic>k</italic>
-63) using Bowtie [
<xref ref-type="bibr" rid="B25">25</xref>
] and the contig coverage was estimated. Prior to mapping, any read shorter than the
<italic>k</italic>
-value of the assembly was removed from the set of trimmed reads. This ensured that reads that were not used in the assembly were not mapped to it. Bowtie produced all alignments [
<xref ref-type="bibr" rid="B5">5</xref>
] with 0, 1, and 2 mismatches allowed and utilized the following settings: -a --phred64-quals --suppress 1,2,4,5,6,7,8 -q --best. Fold coverage was calculated based on the average number of reads mapped per contig in a given
<italic>k</italic>
-mer assembly. This calculation was performed using a custom-designed Python script (provided in Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
).</p>
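<p>The read filtering and coverage calculation can be approximated by the short Python sketch below. It is not the script from Additional file 6. It assumes Bowtie's default output layout, in which the suppress setting reported above leaves only the reference (contig) name on each alignment line, and it counts contigs without alignments as zero; both points and all function names are assumptions of this example.</p>
<preformat>
```python
from collections import Counter


def filter_reads_by_length(reads, k):
    """Keep reads of at least k bases; shorter reads cannot contribute a k-mer."""
    return [read for read in reads if len(read) >= k]


def mean_reads_per_contig(alignment_lines, contig_ids):
    """Average number of reported alignments per contig in one assembly."""
    if not contig_ids:
        raise ValueError("contig_ids must not be empty")
    # Each non-empty line is assumed to hold only the name of the contig hit.
    counts = Counter(line.strip() for line in alignment_lines if line.strip())
    total = sum(counts[contig] for contig in contig_ids)
    return total / len(contig_ids)


if __name__ == "__main__":
    # Hypothetical toy data: three alignments over three contigs.
    print(filter_reads_by_length(["ACGT", "ACGTACGT"], 5))
    print(mean_reads_per_contig(["c1\n", "c1\n", "c2\n"], ["c1", "c2", "c3"]))
```
</preformat>
<p>Because all valid alignments are reported, a read that aligns to several contigs is counted once for each of them in this sketch.</p>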
</sec>
<sec>
<title>Assembly clustering and optimization</title>
<p>Clustered assemblies (CA) were generated from the single
<italic>k</italic>
-mer assemblies for values of 19 to 63. Clustering of contigs and redundancy removal were performed using three programs: Oases (using its built-in multi-
<italic>k</italic>
option), CD-HIT-EST (v4.0-2010-04-20) [
<xref ref-type="bibr" rid="B13">13</xref>
], and VMATCH (v2.1.6) [
<xref ref-type="bibr" rid="B14">14</xref>
]. CD-HIT-EST was run with the following parameters: -n 8 -T 4, and three values for the sequence identity parameter '-c' (0.9, 0.95, and 1.0). VMATCH was run with the following parameters: -d -p -l 18 -dbcluster 100 0 -v -nonredundant for vmatch and -allout -pl -dna for mkvtree. Contigs from the combined assemblies were submitted to KAAS for KOI assignment as previously described.</p>
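<p>As an illustration of the three CD-HIT-EST scenarios, the Python sketch below builds one argument list per identity threshold without executing anything. The input and output file names are hypothetical, and the -i and -o options, which name the input and output FASTA files, are not listed in the parameters above but are needed for any run.</p>
<preformat>
```python
def cd_hit_est_commands(combined_fasta, identities=(0.9, 0.95, 1.0)):
    """Build one cd-hit-est argument list per identity threshold (nothing is run)."""
    commands = []
    for identity in identities:
        # Hypothetical output name that records the threshold used.
        output = "clustered_c{}.fasta".format(identity)
        commands.append([
            "cd-hit-est",
            "-i", combined_fasta,   # pooled contigs from all single k-mer assemblies
            "-o", output,
            "-c", str(identity),    # sequence identity threshold
            "-n", "8",              # word length reported in the Methods
            "-T", "4",              # number of threads reported in the Methods
        ])
    return commands


if __name__ == "__main__":
    for command in cd_hit_est_commands("all_kmer_contigs.fasta"):
        print(" ".join(command))
```
</preformat>
<p>Each list could be passed to a process runner such as subprocess.run once the pooled contig file exists.</p>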
<p>Pair-wise comparison of single
<italic>k</italic>
-mer assemblies versus CA was performed to determine if unique KOIs existed in specific
<italic>k</italic>
-mer assemblies but not in the clustered assemblies, and vice versa. Scripts written in the Python programming language were developed to make these comparisons (see Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
).</p>
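<p>The two-way comparison can be expressed with set differences, as in the hedged Python sketch below. It is not the code from Additional file 6; the function names are hypothetical, and the denominator used for the percentage is left to the caller because the definition of total KOIs is not restated here.</p>
<preformat>
```python
def compare_with_ca(kois_by_k, ca_kois):
    """For each k, list KOIs missing from the single assembly and from the CA."""
    report = {}
    for k in sorted(kois_by_k):
        single = kois_by_k[k]
        report[k] = {
            # Annotated in the clustered assembly but not in this k-mer assembly.
            "missing_in_single": sorted(ca_kois - single),
            # Annotated in this k-mer assembly but not in the clustered assembly.
            "missing_in_ca": sorted(single - ca_kois),
        }
    return report


def percent_of_total(missing, total_kois):
    """Express a list of missing KOIs as a percentage of a chosen total."""
    return 100.0 * len(missing) / total_kois


if __name__ == "__main__":
    # Hypothetical toy data for two single k-mer assemblies and one CA.
    toy = {19: {"K1", "K2"}, 21: {"K2", "K4"}}
    for k, row in compare_with_ca(toy, {"K1", "K2", "K3"}).items():
        print(k, row["missing_in_single"], row["missing_in_ca"])
```
</preformat>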
</sec>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests. The software packages considered in these analyses were chosen based on their commonly reported usage in the
<italic>de novo</italic>
transcriptome assembly literature.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>BZH, DR, and HRY contributed equally to this study. BZH and DR conducted bioinformatics analyses, designed, and prepared the manuscript; HRY conceived and designed the study, and performed the molecular bench work. JP contributed to manuscript preparation and carefully reviewed it. All authors read and approved the final manuscript.</p>
</sec>
<sec>
<title>Authors’ information</title>
<p>Berat Z. Haznedaroglu, Darryl Reeves, and Hamid Rismani-Yazdi share equal authorship.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>This spreadsheet contains the list of annotated KOIs missing in single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies (provided as separate tabs), but present in the clustered assembly obtained by CD-HIT-EST with 1.0 sequence identity.</bold>
</p>
</caption>
<media xlink:href="1471-2105-13-170-S1.xls" mimetype="application" mime-subtype="vnd.ms-excel">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional file 2</title>
<p>
<bold>This spreadsheet contains the list of annotated KOIs missing in single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies (provided as separate tabs), but present in the clustered assembly obtained by Oases multi-</bold>
<bold>
<italic>k </italic>
</bold>
<bold>option.</bold>
</p>
</caption>
<media xlink:href="1471-2105-13-170-S2.xls" mimetype="application" mime-subtype="vnd.ms-excel">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S3">
<caption>
<title>Additional file 3</title>
<p>
<bold>This spreadsheet contains the list of annotated KOIs missing in the clustered assembly obtained by CD-HIT-EST with 1.0 sequence identity, but present in the corresponding single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies (provided as separate tabs).</bold>
</p>
</caption>
<media xlink:href="1471-2105-13-170-S3.xls" mimetype="application" mime-subtype="vnd.ms-excel">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S4">
<caption>
<title>Additional file 4</title>
<p>
<bold>This spreadsheet contains the list of annotated KOIs missing in the clustered assembly obtained by Oases multi-</bold>
<bold>
<italic>k </italic>
</bold>
<bold>option, but present in the corresponding single </bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer assemblies (provided as separate tabs).</bold>
</p>
</caption>
<media xlink:href="1471-2105-13-170-S4.xls" mimetype="application" mime-subtype="vnd.ms-excel">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S5">
<caption>
<title>Additional file 5</title>
<p>
<bold>This file provides the reader with a representative workflow to generate an optimized </bold>
<bold>
<italic>de novo </italic>
</bold>
<bold>transcriptome assembly.</bold>
</p>
</caption>
<media xlink:href="1471-2105-13-170-S5.ppt" mimetype="application" mime-subtype="vnd.ms-powerpoint">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S6">
<caption>
<title>Additional file 6</title>
<p>This file contains in-house designed scripts used during the course of the study.</p>
</caption>
<media xlink:href="1471-2105-13-170-S6.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>This research was supported by the Connecticut Center for Advanced Technologies under a Fuel Diversification Grant and by the National Science Foundation, Grant #0854322 awarded to JP. BZH was supported by a joint postdoctoral fellowship from the Yale Climate and Energy Institute and Yale Institute for Biospheric Studies. DR was supported in part by the National Library of Medicine, Grant #T15 LM07056. We acknowledge the use of computational facilities at the Yale University Biomedical High Performance Computing Center (NIH Grant #RR19895).</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Iyer</surname>
<given-names>MK</given-names>
</name>
<name>
<surname>Chinnaiyan</surname>
<given-names>AM</given-names>
</name>
<article-title>RNA-Seq unleashed</article-title>
<source>Nat Biotech</source>
<year>2011</year>
<volume>29</volume>
<issue>7</issue>
<fpage>599</fpage>
<lpage>600</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1915</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Martin</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z</given-names>
</name>
<article-title>Next-generation transcriptome assembly</article-title>
<source>Nat Rev Genet</source>
<year>2011</year>
<volume>12</volume>
<issue>10</issue>
<fpage>671</fpage>
<lpage>682</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3068</pub-id>
<pub-id pub-id-type="pmid">21897427</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>De Bruijn</surname>
<given-names>NG</given-names>
</name>
<article-title>A combinatorial problem</article-title>
<source>Koninklijke Nederlandse Akademie v Wetenschappen</source>
<year>1946</year>
<volume>46</volume>
<fpage>758</fpage>
<lpage>764</lpage>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="other">
<name>
<surname>Schulz</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Vingron</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<article-title>Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts094</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Robertson</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chiu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Corbett</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Field</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Mungall</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Okada</surname>
<given-names>HM</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>JQ</given-names>
</name>
<etal></etal>
<article-title>De novo assembly and analysis of RNA-seq data</article-title>
<source>Nat Meth</source>
<year>2010</year>
<volume>7</volume>
<issue>11</issue>
<fpage>909</fpage>
<lpage>912</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1517</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K</given-names>
</name>
<etal></etal>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<issue>2</issue>
<fpage>265</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="doi">10.1101/gr.097261.109</pub-id>
<pub-id pub-id-type="pmid">20019144</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Grabherr</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Haas</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Yassour</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Levin</surname>
<given-names>JZ</given-names>
</name>
<name>
<surname>Thompson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Amit</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Adiconis</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Raychowdhury</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>Q</given-names>
</name>
<etal></etal>
<article-title>Full-length transcriptome assembly from RNA-Seq data without a reference genome</article-title>
<source>Nat Biotech</source>
<year>2011</year>
<volume>29</volume>
<issue>7</issue>
<fpage>644</fpage>
<lpage>652</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1883</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Bao</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kwan</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>Y-Q</given-names>
</name>
<article-title>Evaluation of next-generation sequencing software in mapping and assembly</article-title>
<source>J Hum Genet</source>
<year>2011</year>
<volume>56</volume>
<issue>6</issue>
<fpage>406</fpage>
<lpage>414</lpage>
<pub-id pub-id-type="doi">10.1038/jhg.2011.43</pub-id>
<pub-id pub-id-type="pmid">21525877</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Narzisi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mishra</surname>
<given-names>B</given-names>
</name>
<article-title>Comparing De novo genome assembly: the long and short of it</article-title>
<source>PLoS One</source>
<year>2011</year>
<volume>6</volume>
<issue>4</issue>
<fpage>e19175</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0019175</pub-id>
<pub-id pub-id-type="pmid">21559467</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Shang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>B</given-names>
</name>
<article-title>A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies</article-title>
<source>PLoS One</source>
<year>2011</year>
<volume>6</volume>
<issue>3</issue>
<fpage>e17915</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0017915</pub-id>
<pub-id pub-id-type="pmid">21423806</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<article-title>Velvet: Algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>5</issue>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Surget-Groba</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Montoya-Burgos</surname>
<given-names>JI</given-names>
</name>
<article-title>Optimization of de novo transcriptome assembly from next-generation sequencing data</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<issue>10</issue>
<fpage>1432</fpage>
<lpage>1440</lpage>
<pub-id pub-id-type="doi">10.1101/gr.103846.109</pub-id>
<pub-id pub-id-type="pmid">20693479</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
<article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>13</issue>
<fpage>1658</fpage>
<lpage>1659</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl158</pub-id>
<pub-id pub-id-type="pmid">16731699</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="other">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<collab>Vmatch</collab>
<source>Large scale sequence analysis software</source>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.vmatch.de/">http://www.vmatch.de/</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Antonescu</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Sultana</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Karamycheva</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>White</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cheung</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Parvizi</surname>
<given-names>B</given-names>
</name>
<etal></etal>
<article-title>TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>5</issue>
<fpage>651</fpage>
<lpage>652</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg034</pub-id>
<pub-id pub-id-type="pmid">12651724</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Griffiths</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Harrison</surname>
<given-names>S</given-names>
</name>
<article-title>Lipid productivity as a key characteristic for choosing algal species for biodiesel production</article-title>
<source>J Appl Phycol</source>
<year>2009</year>
<volume>21</volume>
<issue>5</issue>
<fpage>493</fpage>
<lpage>507</lpage>
<pub-id pub-id-type="doi">10.1007/s10811-008-9392-7</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Horsman</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Lan</surname>
<given-names>C</given-names>
</name>
<article-title>Effects of nitrogen sources on cell growth and lipid accumulation of green alga Neochloris oleoabundans</article-title>
<source>Appl Microbiol Biotech</source>
<year>2008</year>
<volume>81</volume>
<issue>4</issue>
<fpage>629</fpage>
<lpage>636</lpage>
<pub-id pub-id-type="doi">10.1007/s00253-008-1681-1</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Pruvost</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Van Vooren</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Cogne</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Legrand</surname>
<given-names>J</given-names>
</name>
<article-title>Investigation of biomass and lipids production with Neochloris oleoabundans in photobioreactor</article-title>
<source>Bioresource Technol</source>
<year>2009</year>
<volume>100</volume>
<issue>23</issue>
<fpage>5988</fpage>
<lpage>5995</lpage>
<pub-id pub-id-type="doi">10.1016/j.biortech.2009.06.004</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="other">
<name>
<surname>Andrews</surname>
<given-names>S</given-names>
</name>
<collab>FastQC</collab>
<source>A quality control tool for high throughput sequence data</source>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">http://www.bioinformatics.babraham.ac.uk/projects/fastqc/</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Cox</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Biggs</surname>
<given-names>P</given-names>
</name>
<article-title>SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>1</issue>
<fpage>485</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-11-485</pub-id>
<pub-id pub-id-type="pmid">20875133</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Garg</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Tyagi</surname>
<given-names>AK</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>M</given-names>
</name>
<article-title>De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification</article-title>
<source>DNA Res</source>
<year>2011</year>
<volume>18</volume>
<issue>1</issue>
<fpage>53</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1093/dnares/dsq028</pub-id>
<pub-id pub-id-type="pmid">21217129</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Feldmeyer</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wheat</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Krezdorn</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Rotter</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pfenninger</surname>
<given-names>M</given-names>
</name>
<article-title>Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance</article-title>
<source>BMC Genomics</source>
<year>2011</year>
<volume>12</volume>
<issue>1</issue>
<fpage>317</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-12-317</pub-id>
<pub-id pub-id-type="pmid">21679424</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Moriya</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Itoh</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Okuda</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yoshizawa</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Kanehisa</surname>
<given-names>M</given-names>
</name>
<article-title>KAAS: an automatic genome annotation and pathway reconstruction server</article-title>
<source>Nucl Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>suppl 2</issue>
<fpage>W182</fpage>
<lpage>W185</lpage>
<pub-id pub-id-type="pmid">17526522</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="book">
<name>
<surname>Aoki-Kinoshita</surname>
<given-names>KF</given-names>
</name>
<name>
<surname>Kanehisa</surname>
<given-names>M</given-names>
</name>
<person-group person-group-type="editor">Bergman NH</person-group>
<article-title>Gene annotation and pathway mapping in KEGG</article-title>
<source>Comparative Genomics. Volume 2</source>
<year>2007</year>
<publisher-name>Totowa, New Jersey: Humana Press</publisher-name>
<fpage>71</fpage>
<lpage>91</lpage>
<comment>vol. 396</comment>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
<article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>3</issue>
<fpage>R25</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-3-r25</pub-id>
<pub-id pub-id-type="pmid">19261174</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000944 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000944 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3489510
   |texte=   Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:22808927" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021