<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data</title>
<author><name sortKey="Okamura, Yasunobu" sort="Okamura, Yasunobu" uniqKey="Okamura Y" first="Yasunobu" last="Okamura">Yasunobu Okamura</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Graduate School of Information Sciences,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff2"><institution-wrap><institution-id institution-id-type="ISNI">0000 0004 1763 9951</institution-id>
<institution-id institution-id-type="GRID">grid.459769.0</institution-id>
<institution>Mitsubishi Space Software Co., Ltd,</institution>
</institution-wrap>
Amagasaki, Hyogo Japan</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kinoshita, Kengo" sort="Kinoshita, Kengo" uniqKey="Kinoshita K" first="Kengo" last="Kinoshita">Kengo Kinoshita</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Graduate School of Information Sciences,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff3"><institution-wrap><institution-id institution-id-type="GRID">grid.410829.6</institution-id>
<institution>Tohoku Medical Megabank Organization,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff4"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Institute of Development,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">30012088</idno>
<idno type="pmc">6048772</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6048772</idno>
<idno type="RBID">PMC:6048772</idno>
<idno type="doi">10.1186/s12859-018-2279-y</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000274</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000274</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data</title>
<author><name sortKey="Okamura, Yasunobu" sort="Okamura, Yasunobu" uniqKey="Okamura Y" first="Yasunobu" last="Okamura">Yasunobu Okamura</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Graduate School of Information Sciences,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff2"><institution-wrap><institution-id institution-id-type="ISNI">0000 0004 1763 9951</institution-id>
<institution-id institution-id-type="GRID">grid.459769.0</institution-id>
<institution>Mitsubishi Space Software Co., Ltd,</institution>
</institution-wrap>
Amagasaki, Hyogo Japan</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kinoshita, Kengo" sort="Kinoshita, Kengo" uniqKey="Kinoshita K" first="Kengo" last="Kinoshita">Kengo Kinoshita</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Graduate School of Information Sciences,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff3"><institution-wrap><institution-id institution-id-type="GRID">grid.410829.6</institution-id>
<institution>Tohoku Medical Megabank Organization,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff4"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Institute of Development,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p id="Par1">Data generated by RNA sequencing (RNA-Seq) is now accumulating in vast amounts in public repositories, especially for human and mouse genomes. Reanalyzing these data has emerged as a promising approach to identify gene modules or pathways. Although meta-analyses of gene expression data are frequently performed using microarray data, meta-analyses using RNA-Seq data are still rare. This lag is partly due to the limitations in reanalyzing RNA-Seq data, which requires extensive computational resources. Moreover, it is nearly impossible to calculate the gene expression levels of all samples in a public repository using currently available methods. Here, we propose a novel method, Matataki, for rapidly estimating gene expression levels from RNA-Seq data.</p>
</sec>
<sec><title>Results</title>
<p id="Par2">The proposed method uses k-mers that are unique to each gene to map fragments to genes. Because aligning fragments to reference sequences is computationally expensive, our method reduces the calculation cost by focusing on k-mers that are unique to each gene and by skipping uninformative regions. Indeed, Matataki outperformed conventional methods with regard to speed while demonstrating sufficient accuracy.</p>
</sec>
<sec><title>Conclusions</title>
<p id="Par3">Matataki can overcome current limitations in reanalyzing RNA-Seq data, improving the potential for discovering genes and pathways associated with disease at reduced computational cost. As a result, the main bottleneck of RNA-Seq analyses has shifted to the decompression of sequenced data. The implementation of Matataki is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/informationsea/Matataki">https://github.com/informationsea/Matataki</ext-link>
.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2279-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author><name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author><name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author><name sortKey="Pimentel, H" uniqKey="Pimentel H">H Pimentel</name>
</author>
<author><name sortKey="Kelly, R" uniqKey="Kelly R">R Kelly</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author><name sortKey="Pachter, L" uniqKey="Pachter L">L Pachter</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author><name sortKey="Roberts, A" uniqKey="Roberts A">A Roberts</name>
</author>
<author><name sortKey="Goff, L" uniqKey="Goff L">L Goff</name>
</author>
<author><name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author><name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author><name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author><name sortKey="Dewey, Cn" uniqKey="Dewey C">CN Dewey</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Roberts, A" uniqKey="Roberts A">A Roberts</name>
</author>
<author><name sortKey="Pachter, L" uniqKey="Pachter L">L Pachter</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author><name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author><name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Patro, R" uniqKey="Patro R">R Patro</name>
</author>
<author><name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author><name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author><name sortKey="Wang, W" uniqKey="Wang W">W Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Janzen, D" uniqKey="Janzen D">D Janzen</name>
</author>
<author><name sortKey="Tiourin, E" uniqKey="Tiourin E">E Tiourin</name>
</author>
<author><name sortKey="Salehi, J" uniqKey="Salehi J">J Salehi</name>
</author>
<author><name sortKey="Paik, Dy" uniqKey="Paik D">DY Paik</name>
</author>
<author><name sortKey="Lu, J" uniqKey="Lu J">J Lu</name>
</author>
<author><name sortKey="Pellegrini, M" uniqKey="Pellegrini M">M Pellegrini</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Madan, B" uniqKey="Madan B">B Madan</name>
</author>
<author><name sortKey="Ke, Z" uniqKey="Ke Z">Z Ke</name>
</author>
<author><name sortKey="Harmston, N" uniqKey="Harmston N">N Harmston</name>
</author>
<author><name sortKey="Ho, Sy" uniqKey="Ho S">SY Ho</name>
</author>
<author><name sortKey="Frois, Ao" uniqKey="Frois A">AO Frois</name>
</author>
<author><name sortKey="Alam, J" uniqKey="Alam J">J Alam</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cacchiarelli, D" uniqKey="Cacchiarelli D">D Cacchiarelli</name>
</author>
<author><name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author><name sortKey="Ziller, Mj" uniqKey="Ziller M">MJ Ziller</name>
</author>
<author><name sortKey="Soumillon, M" uniqKey="Soumillon M">M Soumillon</name>
</author>
<author><name sortKey="Cesana, M" uniqKey="Cesana M">M Cesana</name>
</author>
<author><name sortKey="Karnik, R" uniqKey="Karnik R">R Karnik</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lu, H" uniqKey="Lu H">H Lu</name>
</author>
<author><name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author><name sortKey="Zhang, W" uniqKey="Zhang W">W Zhang</name>
</author>
<author><name sortKey="Schulze Gahmen, U" uniqKey="Schulze Gahmen U">U Schulze-Gahmen</name>
</author>
<author><name sortKey="Xue, Y" uniqKey="Xue Y">Y Xue</name>
</author>
<author><name sortKey="Zhou, Q" uniqKey="Zhou Q">Q Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wu, Y" uniqKey="Wu Y">Y Wu</name>
</author>
<author><name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author><name sortKey="Wu, F" uniqKey="Wu F">F Wu</name>
</author>
<author><name sortKey="Huang, R" uniqKey="Huang R">R Huang</name>
</author>
<author><name sortKey="Xue, F" uniqKey="Xue F">F Xue</name>
</author>
<author><name sortKey="Liang, G" uniqKey="Liang G">G Liang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author><name sortKey="Lieu, Yk" uniqKey="Lieu Y">YK Lieu</name>
</author>
<author><name sortKey="Ali, Am" uniqKey="Ali A">AM Ali</name>
</author>
<author><name sortKey="Penson, A" uniqKey="Penson A">A Penson</name>
</author>
<author><name sortKey="Reggio, Ks" uniqKey="Reggio K">KS Reggio</name>
</author>
<author><name sortKey="Rabadan, R" uniqKey="Rabadan R">R Rabadan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Okamura, Y" uniqKey="Okamura Y">Y Okamura</name>
</author>
<author><name sortKey="Aoki, Y" uniqKey="Aoki Y">Y Aoki</name>
</author>
<author><name sortKey="Obayashi, T" uniqKey="Obayashi T">T Obayashi</name>
</author>
<author><name sortKey="Tadaka, S" uniqKey="Tadaka S">S Tadaka</name>
</author>
<author><name sortKey="Ito, S" uniqKey="Ito S">S Ito</name>
</author>
<author><name sortKey="Narise, T" uniqKey="Narise T">T Narise</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Obayashi, T" uniqKey="Obayashi T">T Obayashi</name>
</author>
<author><name sortKey="Okamura, Y" uniqKey="Okamura Y">Y Okamura</name>
</author>
<author><name sortKey="Ito, S" uniqKey="Ito S">S Ito</name>
</author>
<author><name sortKey="Tadaka, S" uniqKey="Tadaka S">S Tadaka</name>
</author>
<author><name sortKey="Motoike, In" uniqKey="Motoike I">IN Motoike</name>
</author>
<author><name sortKey="Kinoshita, K" uniqKey="Kinoshita K">K Kinoshita</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Soneson, C" uniqKey="Soneson C">C Soneson</name>
</author>
<author><name sortKey="Love, Mi" uniqKey="Love M">MI Love</name>
</author>
<author><name sortKey="Robinson, Md" uniqKey="Robinson M">MD Robinson</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Harrow, J" uniqKey="Harrow J">J Harrow</name>
</author>
<author><name sortKey="Frankish, A" uniqKey="Frankish A">A Frankish</name>
</author>
<author><name sortKey="Gonzalez, Jm" uniqKey="Gonzalez J">JM Gonzalez</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Chen, Ya" uniqKey="Chen Y">YA Chen</name>
</author>
<author><name sortKey="Tripathi, Lp" uniqKey="Tripathi L">LP Tripathi</name>
</author>
<author><name sortKey="Mizuguchi, K" uniqKey="Mizuguchi K">K Mizuguchi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author><name sortKey="Hendrickson, Dg" uniqKey="Hendrickson D">DG Hendrickson</name>
</author>
<author><name sortKey="Sauvageau, M" uniqKey="Sauvageau M">M Sauvageau</name>
</author>
<author><name sortKey="Goff, L" uniqKey="Goff L">L Goff</name>
</author>
<author><name sortKey="Rinn, Jl" uniqKey="Rinn J">JL Rinn</name>
</author>
<author><name sortKey="Pachter, L" uniqKey="Pachter L">L Pachter</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">30012088</article-id>
<article-id pub-id-type="pmc">6048772</article-id>
<article-id pub-id-type="publisher-id">2279</article-id>
<article-id pub-id-type="doi">10.1186/s12859-018-2279-y</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Okamura</surname>
<given-names>Yasunobu</given-names>
</name>
<address><email>okamura@ingem.oas.tohoku.ac.jp</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">http://orcid.org/0000-0003-3453-2171</contrib-id>
<name><surname>Kinoshita</surname>
<given-names>Kengo</given-names>
</name>
<address><email>kengo@ecei.tohoku.ac.jp</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<aff id="Aff1"><label>1</label>
<institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Graduate School of Information Sciences,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</aff>
<aff id="Aff2"><label>2</label>
<institution-wrap><institution-id institution-id-type="ISNI">0000 0004 1763 9951</institution-id>
<institution-id institution-id-type="GRID">grid.459769.0</institution-id>
<institution>Mitsubishi Space Software Co., Ltd,</institution>
</institution-wrap>
Amagasaki, Hyogo Japan</aff>
<aff id="Aff3"><label>3</label>
<institution-wrap><institution-id institution-id-type="GRID">grid.410829.6</institution-id>
<institution>Tohoku Medical Megabank Organization,</institution>
</institution-wrap>
Sendai, Miyagi Japan</aff>
<aff id="Aff4"><label>4</label>
<institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2248 6943</institution-id>
<institution-id institution-id-type="GRID">grid.69566.3a</institution-id>
<institution>Institute of Development,</institution>
<institution>Tohoku University,</institution>
</institution-wrap>
Sendai, Miyagi Japan</aff>
</contrib-group>
<pub-date pub-type="epub"><day>16</day>
<month>7</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release"><day>16</day>
<month>7</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection"><year>2018</year>
</pub-date>
<volume>19</volume>
<elocation-id>266</elocation-id>
<history><date date-type="received"><day>12</day>
<month>11</month>
<year>2017</year>
</date>
<date date-type="accepted"><day>9</day>
<month>7</month>
<year>2018</year>
</date>
</history>
<permissions><copyright-statement>© The Author(s). 2018</copyright-statement>
<license license-type="OpenAccess"><license-p><bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1"><sec><title>Background</title>
<p id="Par1">Data generated by RNA sequencing (RNA-Seq) is now accumulating in vast amounts in public repositories, especially for human and mouse genomes. Reanalyzing these data has emerged as a promising approach to identify gene modules or pathways. Although meta-analyses of gene expression data are frequently performed using microarray data, meta-analyses using RNA-Seq data are still rare. This lag is partly due to the limitations in reanalyzing RNA-Seq data, which requires extensive computational resources. Moreover, it is nearly impossible to calculate the gene expression levels of all samples in a public repository using currently available methods. Here, we propose a novel method, Matataki, for rapidly estimating gene expression levels from RNA-Seq data.</p>
</sec>
<sec><title>Results</title>
<p id="Par2">The proposed method uses k-mers that are unique to each gene to map fragments to genes. Because aligning fragments to reference sequences is computationally expensive, our method reduces the calculation cost by focusing on k-mers that are unique to each gene and by skipping uninformative regions. Indeed, Matataki outperformed conventional methods with regard to speed while demonstrating sufficient accuracy.</p>
</sec>
<sec><title>Conclusions</title>
<p id="Par3">Matataki can overcome current limitations in reanalyzing RNA-Seq data, improving the potential for discovering genes and pathways associated with disease at reduced computational cost. As a result, the main bottleneck of RNA-Seq analyses has shifted to the decompression of sequenced data. The implementation of Matataki is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/informationsea/Matataki">https://github.com/informationsea/Matataki</ext-link>
.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2279-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en"><title>Keywords</title>
<kwd>RNA-Seq</kwd>
<kwd>Mapping</kwd>
<kwd>Gene expression</kwd>
</kwd-group>
<funding-group><award-group><funding-source><institution-wrap><institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100001691</institution-id>
<institution>Japan Society for the Promotion of Science</institution>
</institution-wrap>
</funding-source>
<award-id>15H02773</award-id>
</award-group>
</funding-group>
<custom-meta-group><custom-meta><meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body><sec id="Sec1"><title>Background</title>
<p id="Par9">The number of published studies on RNA sequencing (RNA-Seq) data is rapidly increasing owing to improvements in RNA-Seq measurement technology. Thus, meta-analyses of publicly available data have become a new promising approach to obtain novel insights into biological systems. However, merging quantified expression data provided by authors is generally difficult because of the use of different reference sequences, ID systems, and quantification methods among individual studies. These variations make it impossible to distinguish true biological differences from calculation protocol biases when comparing gene expression profiles quantified using different methods. Therefore, quantification using raw sequences for all data is an important step for RNA-Seq meta-analyses.</p>
<p id="Par10">Many quantification methods for RNA-Seq data have been proposed to date, including the most common pipeline method using TopHat2 [<xref ref-type="bibr" rid="CR1">1</xref>
, <xref ref-type="bibr" rid="CR2">2</xref>
] and Cufflinks [<xref ref-type="bibr" rid="CR3">3</xref>
]. This method aligns sequenced reads to a reference genome, counts the number of fragments mapped onto gene regions, and estimates gene expression as transcript levels. Importantly, this method can be applied to species without a reference transcript and can predict transcript candidates. Some other methods such as RSEM [<xref ref-type="bibr" rid="CR4">4</xref>
] and eXpress [<xref ref-type="bibr" rid="CR5">5</xref>
] map sequences to the transcript reference; since they require only reference transcript sequences, they can be applied to species without a reference genome. The output of a de novo transcript assembler or an expressed sequence tag database can serve as the reference transcript sequences in place of curated reference transcript databases. Both RSEM and eXpress employ Bowtie [<xref ref-type="bibr" rid="CR6">6</xref>
] to map a read sequence to a transcript, and some read sequences are mapped to multiple transcripts due to splicing variants. RSEM and eXpress use the Expectation-Maximization (EM) algorithm to resolve the problem of assigning multi-mapped reads to transcripts for quantifying the expression level of transcripts.</p>
<p id="Par11">Despite their advantages for quantification, these alignment-based methods require extensive computational resources. When quantifying the expression levels of an RNA-Seq sample, alignment is an optional step because the position of a read is not essential for quantification. Thus, several methods have also been proposed to reduce the calculation cost for large RNA-Seq analyses and avoid the mapping step by focusing on the k-mers in transcripts. For example, Sailfish [<xref ref-type="bibr" rid="CR7">7</xref>
] uses all k-mers that appear in the reference transcript, creates a transcript table containing the k-mers, counts the number of occurrences of each k-mer in the RNA-Seq data, and finally estimates the most probable expression level of each transcript from the counts using the EM algorithm. RNA-Skim [<xref ref-type="bibr" rid="CR8">8</xref>
] uses a similar but more efficient approach by introducing <italic>sig-mers</italic>
 that appear only once in a subset of reference transcripts, counts the number of occurrences of the <italic>sig-mers</italic>
 while processing the RNA-Seq data, and then estimates the most probable expression levels using the EM algorithm. Kallisto [<xref ref-type="bibr" rid="CR9">9</xref>
] also uses k-mers, and further reduces the calculation cost by skipping fragments when searching an index. When a k-mer appears, the next k-mer is limited to one or a few patterns. If the next k-mer is limited to one pattern, hashing the k-mer is not required to determine the source isoform. Kallisto then skips these non-informative k-mers, resulting in a faster estimation process.</p>
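<p>The k-mer counting idea shared by these alignment-free methods can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names and the toy input are hypothetical, the code is not taken from Sailfish, RNA-Skim, or Kallisto, and the EM estimation that those tools apply to the counts is omitted.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, one per starting position."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_reference_kmers(transcripts, reads, k):
    """Count, over all reads, the k-mers that occur in any reference transcript.

    transcripts: dict mapping a (hypothetical) transcript id to its sequence.
    reads: iterable of read sequences.
    """
    # Table of all k-mers present in the reference transcripts.
    reference = {km for seq in transcripts.values() for km in kmers(seq, k)}
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in reference:
                counts[km] += 1
    return counts


# Toy example (assumed data): only k-mers found in the reference are counted.
example = count_reference_kmers({"t1": "ACGTAC"}, ["ACGT", "TTTT"], k=3)
```
]]></preformat>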
<p id="Par12">The speed of quantification is a critical factor in developing a method to process thousands of publicly available RNA-Seq reads. Although alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires an even faster method for data management and reanalysis. In addition, all of these alignment-free methods rely on transcript-level quantification, although gene-level expression data contain sufficient information for most analyses. Moreover, several RNA-Seq studies [<xref ref-type="bibr" rid="CR10">10</xref>
–<xref ref-type="bibr" rid="CR13">13</xref>
] do not include isoform-specific expression data; even if isoform-specific expression is relevant, these analyses typically only focus on a few splicing changes [<xref ref-type="bibr" rid="CR14">14</xref>
, <xref ref-type="bibr" rid="CR15">15</xref>
]. For example, Wu et al. [<xref ref-type="bibr" rid="CR14">14</xref>
] performed gene-level quantification for all genes initially, followed by isoform-level quantification. Therefore, gene-level expression data are useful in many cases. In particular, large-scale reanalysis of human and mouse RNA-Seq data such as in gene co-expression analysis [<xref ref-type="bibr" rid="CR16">16</xref>
] or comparison of similar expression profiles does not require precise expression data at the transcript level. For example, increasing the number of expression profiles yields a better-quality gene co-expression dataset [<xref ref-type="bibr" rid="CR17">17</xref>
]. In this case, simple gene-level quantification is sufficiently accurate, which can then be improved by transcript-level estimation [<xref ref-type="bibr" rid="CR18">18</xref>
].</p>
<p id="Par13">To further enhance large-scale meta-analyses of RNA-Seq data, we here propose a new quantification algorithm called Matataki. Similar to RNA-Skim, our method uses k-mers that appear only once in a gene and quantifies expression from the number of <italic>unique</italic>
 k-mers. However, our method further reduces computational cost by integrating two novel approaches. First, by focusing on the gene level, Matataki quantifies expression directly without using the EM algorithm. Second, our method checks fragments of a read at fixed intervals, skipping ahead even when a k-mer is not indexed. Because k-mers unique to a gene usually occur consecutively, hashing every fragment of a read does not improve performance. Thus, Matataki provides a novel approach for ultra-fast RNA-Seq quantification based on k-mers unique to each gene. More specifically, our method consists of only two steps: an index-building step, which searches a set of transcripts for all k-mers that appear only once in a gene, and an expression quantification step. Here, we describe the proposed method and its implementation, and compare its performance against available methods using reference sequence and simulation datasets as test data.</p>
</sec>
<sec id="Sec2"><title>Methods</title>
<sec id="Sec3"><title>Index building step</title>
<p id="Par14">To achieve a fast mapping process, Matataki has to search for all k-mers that are unique to each gene. When multiple transcripts are available for a gene, the selected k-mers should be present in all isoforms of the gene so that differential expression of isoforms does not affect the result.</p>
<p id="Par15">First, Matataki searches for all k-mers unique to each gene by considering every k-mer in every transcript sequence. To judge the uniqueness of the k-mers, Matataki stores them in a hash table. Unless the reads are strand-specific, the reverse complements of all k-mers are also considered. Second, Matataki checks whether every isoform of a gene contains each candidate k-mer. Because Matataki quantifies expression at the gene level, differences in isoform-specific expression are ignored. In other words, Matataki builds an index of k-mers that are unique to a gene and are found in all isoforms of that gene. Finally, Matataki counts the number of indexed k-mers for each gene, which is used to determine the fragments per kilobase of exon per million mapped fragments (FPKM) and transcripts per million (TPM) values in the quantification step. The pseudocode is shown in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Method S1. This building step is required only once for each species before using Matataki.</p>
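<p>The index building step described above can be sketched as follows. This is a minimal Python illustration, not the Matataki implementation or the pseudocode of Additional file 1: the function names, the input layout (a dictionary mapping each gene to its isoform sequences), and the reading of "unique" as "occurring in exactly one gene" are assumptions made for this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(seq, k, stranded=False):
    """Set of k-mers in seq; reverse complements are added unless stranded."""
    found = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if not stranded:
        found |= {reverse_complement(km) for km in found}
    return found


def build_index(genes, k, stranded=False):
    """Build a k-mer index from genes (gene id -> list of isoform sequences).

    Returns (index, kmer_counts). index maps each k-mer that occurs in
    exactly one gene and in every isoform of that gene to the gene id;
    kmer_counts maps each gene id to its number of indexed k-mers.
    """
    owners = defaultdict(set)  # k-mer -> genes in which it occurs
    shared = {}                # gene -> k-mers present in all its isoforms
    for gene, isoforms in genes.items():
        per_isoform = [kmer_set(seq, k, stranded) for seq in isoforms]
        for kms in per_isoform:
            for km in kms:
                owners[km].add(gene)
        shared[gene] = set.intersection(*per_isoform) if per_isoform else set()

    index = {}
    kmer_counts = {gene: 0 for gene in genes}
    for gene, kms in shared.items():
        for km in kms:
            if len(owners[km]) == 1:  # unique to this gene
                index[km] = gene
                kmer_counts[gene] += 1
    return index, kmer_counts
```
]]></preformat>
<p>In this sketch an unstranded index stores both orientations of every k-mer, so the per-gene k-mer count covers both strands; whether the count should include both is left open here.</p>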
</sec>
<sec id="Sec4"><title>Quantification step</title>
<p id="Par16">The quantification step can be divided into two sub-steps: counting the k-mers, and calculating FPKM and TPM values from the read counts.</p>
<p id="Par17">First, Matataki searches the indexed k-mers in a short read obtained through a next-generation sequencing experiment. When a read has k-mers associated with a gene, it is assigned to that gene. Matataki then counts the number of reads assigned to each gene. When a read has k-mers from two or more genes, the read will be excluded from further analyses.</p>
<p id="Par18">In the first sub-step, matching k-mers tend to be found consecutively along a read; we therefore reasoned that it is not necessary to search every k-mer of a read. Accordingly, Matataki creates k-mers at intervals of the step size (<italic>S</italic>
) bases instead of creating all possible k-mers from a sequenced read, which reduces the number of k-mer searches and ultimately the computational time and cost. We also introduced the “accept-count” parameter <italic>M</italic>
, which is the minimum number of matched k-mers required to select a gene, to avoid the noise caused by fragments of a read sequence that match an indexed k-mer by chance. A read with fewer than <italic>M</italic>
 matches to a gene is discarded because it is considered to have potentially matched by chance. Since some reads might contain a sequencing error, mutation, or insertion/deletion, a fragment of a read might incorrectly match an indexed k-mer. Usually, these incorrect matches are not found consecutively in a read; thus, the accept-count parameter <italic>M</italic>
 helps to avoid this type of incorrect match. When processing a paired-end sequenced file, each read is processed separately. The pseudocode is shown in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Method S2.</p>
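<p>The read assignment with the step size and accept-count parameters might look like the following Python sketch. It assumes an index given as a plain dictionary from k-mer to gene, and it assumes that a read matching two or more genes is discarded before the accept-count test is applied; the source does not state the order of these two checks, and the function names are hypothetical.</p>
<preformat preformat-type="code">
```python
from collections import Counter


def assign_read(read, index, k, step, accept_count):
    """Return the gene a read is assigned to, or None if it is discarded."""
    hits = Counter()
    # Sample k-mers every `step` bases instead of at every position.
    for start in range(0, len(read) - k + 1, step):
        gene = index.get(read[start:start + k])
        if gene is not None:
            hits[gene] += 1
    if len(hits) != 1:
        return None  # no match, or k-mers from two or more genes
    gene, matched = next(iter(hits.items()))
    # Require at least `accept_count` matched k-mers to accept the gene.
    return gene if matched >= accept_count else None


def count_reads(reads, index, k, step, accept_count):
    """Count assigned reads per gene over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        gene = assign_read(read, index, k, step, accept_count)
        if gene is not None:
            counts[gene] += 1
    return counts
```
</preformat>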
<p id="Par19">In the next step, Matataki calculates FPKM and TPM from gene-specific read counts using the following formulas:<disp-formula id="Equ1"><label>1</label>
<alternatives><tex-math id="M1">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ {F}_i=\frac{C_i/{K}_i}{\sum_j{C}_j}{10}^9 $$\end{document}</tex-math>
<mml:math id="M2" display="block"><mml:msub><mml:mi>F</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msub><mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:msub><mml:mi>K</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow><mml:msub><mml:mo>∑</mml:mo>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub><mml:mi>C</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:msup><mml:mn>10</mml:mn>
<mml:mn>9</mml:mn>
</mml:msup>
</mml:math>
<graphic xlink:href="12859_2018_2279_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
<disp-formula id="Equ2"><label>2</label>
<alternatives><tex-math id="M3">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ {T}_i=\frac{C_i/{K}_i}{\sum_j\left({C}_j/{K}_j\right)}{10}^6 $$\end{document}</tex-math>
<mml:math id="M4" display="block"><mml:msub><mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msub><mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:msub><mml:mi>K</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow><mml:msub><mml:mo>∑</mml:mo>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mfenced close=")" open="("><mml:mrow><mml:msub><mml:mi>C</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:msub><mml:mi>K</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:msup><mml:mn>10</mml:mn>
<mml:mn>6</mml:mn>
</mml:msup>
</mml:math>
<graphic xlink:href="12859_2018_2279_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where <italic>F</italic>
<sub><italic>i</italic>
</sub>
 is FPKM, <italic>T</italic>
<sub><italic>i</italic>
</sub>
 is TPM, <italic>C</italic>
<sub><italic>i</italic>
</sub>
 is the count of gene-specific reads, and <italic>K</italic>
<sub><italic>i</italic>
</sub>
 is the number of indexed k-mers in a gene. Because Matataki uses only gene-specific k-mers, neither the expectation-maximization (EM) algorithm nor any similar algorithm is needed to calculate the expression levels.</p>
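<p>As a worked illustration of Eqs. 1 and 2, the following Python sketch computes both values from per-gene read counts and per-gene numbers of indexed k-mers. The function name and the dictionary inputs are assumptions made for the example, and the sketch assumes that at least one read was assigned.</p>
<preformat preformat-type="code">
```python
def fpkm_and_tpm(read_counts, indexed_kmers):
    """Apply Eqs. 1 and 2 to per-gene read counts C and k-mer counts K.

    read_counts maps gene -> C_i; indexed_kmers maps gene -> K_i (K_i >= 1).
    Returns two dictionaries: (fpkm, tpm).
    """
    genes = list(indexed_kmers)
    total_reads = sum(read_counts.get(g, 0) for g in genes)          # sum_j C_j
    rates = {g: read_counts.get(g, 0) / indexed_kmers[g] for g in genes}  # C_i / K_i
    total_rate = sum(rates.values())                                  # sum_j C_j / K_j
    fpkm = {g: rate / total_reads * 1e9 for g, rate in rates.items()}
    tpm = {g: rate / total_rate * 1e6 for g, rate in rates.items()}
    return fpkm, tpm
```
</preformat>
<p>By construction, the TPM values of all indexed genes sum to one million, whereas the FPKM values have no fixed sum.</p>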
</sec>
<sec id="Sec5"><title>Implementation</title>
<p id="Par20">We implemented Matataki with C++03, autotools, and KyotoCabinet [<xref ref-type="bibr" rid="CR19">19</xref>
]. To reduce memory usage and increase speed, the hash table format was optimized for RNA/DNA k-mers. The first 4 KB contain the header of an index, including the number of entries, the size of the hash table, and <italic>k</italic>
; the k-mers and corresponding gene indexes are written after the header. Each entry has two subsections: a gene index and a k-mer. A k-mer is compressed into a 2-bit-per-base representation of nucleic acids to reduce memory usage and hash value calculation time. Because every k-mer in one index has the same fixed length, the entries do not contain length data. The hash function is also important for enabling quick searches of items in the table. We used the fast and widely accepted hash function MurmurHash3 for the hash table. Since building an index requires substantial computational resources, we distribute pre-calculated indexes for publicly available human and mouse sequences.</p>
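<p>The 2-bit compression of a k-mer can be illustrated with the short Python sketch below. The base-to-code mapping (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration; the actual on-disk layout of the Matataki index is not specified here.</p>
<preformat preformat-type="code">
```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, 2 bits per base
BASES = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + CODES[base]  # shift by 2 bits, then add the code
    return value


def decode_kmer(value, k):
    """Recover the k-mer string; k is needed because no length is stored."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value % 4])
        value //= 4
    return "".join(reversed(bases))
```
</preformat>
<p>With this packing, a 32-mer fits exactly into one 64-bit integer, which is one reason a fixed <italic>k</italic> of 32 or less is convenient for hashing.</p>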
<p id="Par21">The source code, pre-built binaries, and pre-calculated indexes for human and mouse data are available on GitHub (<ext-link ext-link-type="uri" xlink:href="https://github.com/informationsea/Matataki">https://github.com/informationsea/Matataki</ext-link>
) and Additional file <xref rid="MOESM2" ref-type="media">2</xref>
.</p>
</sec>
<sec id="Sec6"><title>Comparison with other software products</title>
<p id="Par22">We compared the performance of Matataki with that of the currently available quantification methods bowtie 1.1.2 [<xref ref-type="bibr" rid="CR6">6</xref>
]/eXpress 1.5.1 [<xref ref-type="bibr" rid="CR5">5</xref>
], RSEM 1.2.22 [<xref ref-type="bibr" rid="CR4">4</xref>
], Sailfish 0.10.0 [<xref ref-type="bibr" rid="CR7">7</xref>
], and Kallisto 0.44.0 [<xref ref-type="bibr" rid="CR9">9</xref>
]. These comparisons were carried out using the default parameters of each program. We used binary-distributed files for bowtie/eXpress. Matataki, Sailfish, Kallisto, and RSEM were compiled with GCC 5.2.0. All running times and memory usage were measured on cluster machines. Each cluster node had two Intel® Xeon® Silver 4116 CPUs (2.10 GHz) and 96 GB of RAM.</p>
</sec>
<sec id="Sec7"><title>Test dataset</title>
<p id="Par23">To create a reference database, we used RefSeq and gene2refseq [<xref ref-type="bibr" rid="CR20">20</xref>
], which were downloaded on June 26, 2015 from the Human Genome Center, a mirror site of the National Center for Biotechnology Information. In the human RefSeq, 25,894 genes and 55,100 transcripts were available at the time of download. We also used GENCODE version 28 to create a reference database [<xref ref-type="bibr" rid="CR21">21</xref>
].</p>
<p id="Par24">To examine the quantification quality, we used ERR188125. This run is a part of ERS185259, “RNA-sequencing of 465 lymphoblastoid cell lines from the 1000 Genomes.” The read length in ERR188125 was 75 bases, and the number of reads was 28,810,860.</p>
<p id="Par25">We also compared quantification quality using simulation data. To create the simulation data, we used rsem-simulate-reads, which is included in RSEM. The simulation models were created by quantifying ERR188074, ERR188125, ERR188171, and ERR188362 with RSEM.</p>
</sec>
</sec>
<sec id="Sec8"><title>Results and Discussion</title>
<sec id="Sec9"><title>Statistics of indexed k-mers</title>
<sec id="Sec10"><title>Number of genes with indexed k-mers</title>
<p id="Par26">We first checked the number of genes with indexed k-mers in human, mouse, and <italic>Arabidopsis</italic>
 genomes when the parameter <italic>k</italic>
 in the considered k-mer was varied from 10 to 100. To effectively compare the results for different species, the numbers were converted to the ratio of genes (i.e., the gene coverage), which are shown in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1A. For <italic>k</italic>
 = 10, only a few genes had unique k-mers in any of the species, while for <italic>k</italic>
 = 14, 96.8% of the human genes in RefSeq had indexed k-mers. The coverage of indexed genes reached a maximum at <italic>k</italic>
 = 34. However, <italic>k</italic>
 values that were too large resulted in lower gene coverage because some genes had only short transcripts.</p>
<p id="Par27">Similarly, we evaluated the nucleotide coverage of indexed k-mers, defined as the ratio of the total number of indexed positions in each transcript to the total length of the transcripts (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1B). For the human data, <italic>k</italic>
 = 14 did not allow for sufficient coverage of sequences with indexed k-mer regions, and the nucleotide coverage almost reached its maximum at <italic>k</italic>
 = 18. This observation suggested that <italic>k</italic>
 should be larger than 18 to cover a sufficient number of gene-specific regions. Similar trends were observed in the mouse and <italic>Arabidopsis</italic>
 datasets. Because the average length of genes in <italic>Arabidopsis</italic>
 is shorter than that in human and mouse, both the gene and nucleotide coverage for <italic>Arabidopsis</italic>
 at <italic>k</italic>
 = 10 and 12 were better than those for the other species.</p>
</sec>
<sec id="Sec11"><title>Distribution of indexed k-mers in human transcript sequences</title>
<p id="Par28">To check the distribution of unique k-mers in each gene, we calculated the nucleotide coverage for each human gene at <italic>k</italic>
 = 32 (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2A). Most human genes (86.4%) had a coverage rate higher than 50%, and 61% of the human genes had coverage rates higher than 90%, indicating the existence of successive unique k-mers. As this pattern is reminiscent of islands in the sea, we call such a continuous region of nucleotides formed by successive indexed k-mers a “cover island”.</p>
<p id="Par29">To clarify the nature of the cover islands, we checked the number of cover islands and their lengths for each gene (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2B, S2C). We found that 60% of the genes had only one or two cover islands, and the median length of the second longest cover island for each gene (327) was much smaller than that of the longest cover island (1262). Thus, most genes have one main cover island and several small satellite cover islands. Because the longest cover island of each gene was sufficiently longer than <italic>k</italic>
 and the step size <italic>S</italic>
 used in this study, introducing the step size <italic>S</italic>
 did not interfere with the quantification accuracy. Note that all unique k-mers must be listed in the index to implement the idea of a step size, which means that fast heuristic methods such as Bloom filters [<xref ref-type="bibr" rid="CR22">22</xref>
] cannot be applied to build the index, as such methods could miss some hits of unique k-mers. Listing all unique k-mers therefore makes index construction slower; however, for large-scale meta-analyses, the speed of quantification is more important than the speed of building the index. Importantly, our method depends on the quality and completeness of the transcript database. For this assessment, we used RefSeq instead of GENCODE, because GENCODE includes less reliable transcripts that are not our target [<xref ref-type="bibr" rid="CR23">23</xref>
].</p>
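<p>To make the notion of a cover island concrete, the following Python sketch merges the positions covered by indexed k-mers of one transcript into maximal contiguous intervals and derives the nucleotide coverage from them. Treating overlapping or directly adjacent k-mers as one island is an assumption of this example, as are the function names.</p>
<preformat preformat-type="code">
```python
def cover_islands(transcript, indexed, k):
    """Return half-open (start, end) intervals covered by indexed k-mers.

    `indexed` is a set of indexed k-mers; overlapping or touching k-mer
    hits are merged into a single island (an assumption of this sketch).
    """
    islands = []
    for start in range(len(transcript) - k + 1):
        if transcript[start:start + k] not in indexed:
            continue
        end = start + k
        if islands and islands[-1][1] >= start:
            islands[-1][1] = end          # extend the current island
        else:
            islands.append([start, end])  # open a new island
    return [tuple(island) for island in islands]


def nucleotide_coverage(transcript, indexed, k):
    """Fraction of transcript bases that lie inside a cover island."""
    covered = sum(end - start for start, end in cover_islands(transcript, indexed, k))
    return covered / len(transcript)
```
</preformat>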
</sec>
</sec>
<sec id="Sec12"><title>Comparison of quantification quality using simulation data</title>
<p id="Par30">We also compared TPM among eXpress, RSEM, Sailfish, Kallisto, and our method using simulation data. In this comparison, we used <italic>k</italic>
 = 32, <italic>S</italic>
 = 12, and <italic>M</italic>
 = 2 for Matataki, and default parameters were used for the other methods. The results (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3, Fig. <xref rid="Fig1" ref-type="fig">1</xref>
) indicated that our method had the second best performance with respect to correlation (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3A, C, E, G and I; Fig. <xref rid="Fig1" ref-type="fig">1a</xref>
 except MatatakiSubset) and the smallest mean absolute difference among the alignment-free methods (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3B, D, F, H and J; Fig. <xref rid="Fig1" ref-type="fig">1b</xref>
 except MatatakiSubset). Because RSEM had the best performance for both correlation and error, the result from this alignment-based method would be the best reference for evaluating prediction performance if the calculation costs are acceptable. In this analysis, we used all genes; however, some genes did not have any indexed k-mers and therefore cannot be quantified by our method. Therefore, as a practical reference, we have provided the results obtained when excluding the genes without any indexed k-mers in Fig. <xref rid="Fig1" ref-type="fig">1a</xref>
 and <xref rid="Fig1" ref-type="fig">b</xref>
 as the MatatakiSubset. Since we used RSEM’s RNA-Seq simulator for this evaluation, the comparison with RSEM was not appropriate. Therefore, for the comparison with real data we used eXpress, which was the best-performing tool aside from RSEM and our method.<fig id="Fig1"><label>Fig. 1</label>
<caption><p>Summary of the results using simulation data. <bold>a</bold>
 Spearman correlation coefficient between the expected and estimated expression values for each method. “Matataki” indicates the results of the proposed method, and “MatatakiSubset” indicates the results of the proposed method without uncovered genes. To compare the gene-level expression profile and transcript-level expression profile, the sum of TPM by each gene was calculated. <bold>b</bold>
 Means of absolute difference from the expected expression levels</p>
</caption>
<graphic xlink:href="12859_2018_2279_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
</sec>
<sec id="Sec13"><title>Comparison of quantification quality using real data</title>
<sec id="Sec14"><title>Comparison of TPM</title>
<p id="Par31">Figure <xref rid="Fig2" ref-type="fig">2</xref>
 shows the comparisons of TPM values obtained with our method and eXpress for different <italic>k</italic>
 values. Our method gave similar TPM values for all <italic>k</italic>
 values, and larger <italic>k</italic>
 values provided better Spearman correlation coefficient (SCC) values, reaching up to 0.949 with <italic>k</italic>
 = 56<italic>.</italic>
 These results indicated that higher <italic>k</italic>
 values are preferable for better estimation; however, a large <italic>k</italic>
 is not always the best choice for a given analysis. For example, in the Short Read Archive, 9.2% of human RNA-Seq data have reads shorter than 50 bases. Accordingly, to cover 99% of human RNA-Seq data, <italic>k</italic>
 should be smaller than 34. Therefore, we used <italic>k</italic>
 = 32 in the following analyses, for which the SCC of TPM values between our method and eXpress was 0.931. Because eXpress reports transcript-level values, we summed its TPM values over the transcripts of each gene for comparison with Matataki’s gene-level TPM.<fig id="Fig2"><label>Fig. 2</label>
<caption><p>Comparison of TPM when <italic>k</italic>
 was varied. The x-axis shows the TPM values of eXpress, the y-axis shows the TPM values of our method, and the color indicates the indexed k-mer coverage of each gene when changing <italic>k</italic>
 from 16 to 56 with a step of 8</p>
</caption>
<graphic xlink:href="12859_2018_2279_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
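<p>The comparison described above, in which transcript-level TPM values are summed per gene and then compared by the Spearman correlation coefficient, can be sketched in plain Python as follows. The function names and dictionary inputs are hypothetical, and ties are handled with average ranks, which is one common convention and not necessarily the one used for the reported values.</p>
<preformat preformat-type="code">
```python
from collections import defaultdict


def gene_level_tpm(transcript_tpm, transcript_to_gene):
    """Sum transcript-level TPM values over the transcripts of each gene."""
    totals = defaultdict(float)
    for transcript, value in transcript_tpm.items():
        totals[transcript_to_gene[transcript]] += value
    return dict(totals)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while len(order) > i:
        j = i
        while len(order) > j + 1 and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```
</preformat>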
<p id="Par32">We also examined how the correlation of TPM values between eXpress and Matataki was affected by changing the step size parameter <italic>S</italic>
 from 1 to 16 with a step of 4 (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4). Overall, larger <italic>S</italic>
 values produced better correlations based on SCC values, suggesting that introducing the step size parameter <italic>S</italic>
 can reduce accidental matches of indexed k-mers with short reads. Usually, indexed k-mers match consecutively and form a few cover islands, whereas accidental matches show a different pattern and can therefore be eliminated by skipping k-mers between steps. Similar to the considerations for selecting <italic>k</italic>
 values, an <italic>S</italic>
 value that is too large will be problematic; therefore, we used <italic>S</italic>
 = 12 for the following analyses as a representative value showing a sufficient degree of correlation with the existing method.</p>
<p id="Par33">We further checked the effects of the accept-count <italic>M</italic>
 parameter by changing it from 1 to 4 (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S5). This parameter was introduced with the aim of avoiding the mis-assignment of some reads to genes due to accidental matches between indexed k-mers and the reads. We found that the SCC value was better with <italic>M</italic>
 > 1 than with <italic>M</italic>
 = 1, indicating that some reads were indeed mis-assigned to genes when a single match was accepted. However, the SCC value was worse at <italic>M</italic>
 = 4 than at <italic>M</italic>
 = 3. These results indicated that a certain level of mis-assignment should be allowed for more accurate quantification.</p>
<p id="Par34">The mapping rate is also an important measure for evaluating the performance of the method. We compared mapping rates by varying <italic>k</italic>
, <italic>S,</italic>
 and <italic>M</italic>
. As expected, the mapping rate became smaller as <italic>k</italic>
 became larger, because the matching condition was stricter for larger <italic>k</italic>
 values (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6A). When <italic>k</italic>
 = 16, the mapping rate exceeded the rate of bowtie, indicating that <italic>k</italic>
 = 16 may be too small to avoid accidental matches of indexed k-mers and the resulting mis-assignment of the read to genes. In a similar way, larger <italic>M</italic>
 values resulted in lower mapping rates, as expected (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6C). In particular, the mapping rate dropped rapidly at <italic>M</italic>
 = 4, suggesting that <italic>M</italic>
 = 4 may be too strict for these data. By contrast, <italic>S</italic>
 only had a minimal effect on the mapping rates (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6B), and selection of the <italic>S</italic>
 parameter was not problematic in this case. Thus, selection of the best combination of <italic>k</italic>
, step size <italic>S,</italic>
 and accept-count <italic>M</italic>
 is one of the problems that must be addressed in implementing the method, which will depend on the read length and experimental qualities.</p>
<p id="Par35">When <italic>k</italic>
 = 32, the number of genes without indexed k-mers was 717. The details of these uncovered genes are shown in Table <xref rid="Tab1" ref-type="table">1</xref>
, and the full coverage list of transcripts is shown in Additional file <xref rid="MOESM3" ref-type="media">3</xref>
: Table S1. About half of the uncovered genes were non-coding genes. Because non-coding genes are not amplified in the translation step, a high copy number in the genome is required for functional activity, which may explain the lack of unique k-mers. The other half of the uncovered genes were protein-coding genes. Note that paralogous genes can be one of the causes of non-unique k-mers; however, according to the HomoloGene groups, only 21.1% of paralogous genes were uncovered (see Additional file <xref rid="MOESM3" ref-type="media">3</xref>
: Table S2). We also performed enrichment analysis of the uncovered genes with TargetMine [<xref ref-type="bibr" rid="CR24">24</xref>
], which revealed five biological-process Gene Ontology (GO) terms (Additional file <xref rid="MOESM3" ref-type="media">3</xref>
: Table S3) and four molecular function GO terms (Additional file <xref rid="MOESM3" ref-type="media">3</xref>
: Table S4) that were significantly enriched. Since genes related to ubiquitin and defense response have many paralogous genes, these GO terms were particularly enriched.<table-wrap id="Tab1"><label>Table 1</label>
<caption><p>Details of the uncovered genes</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th>Type of Gene</th>
<th>Number of uncovered genes</th>
<th>Total number of genes</th>
<th>Percentage of uncovered genes</th>
</tr>
</thead>
<tbody><tr><td>Non-coding RNA</td>
<td>393</td>
<td>6250</td>
<td>6.3%</td>
</tr>
<tr><td> MicroRNA</td>
<td>233</td>
<td>1880</td>
<td>11.9%</td>
</tr>
<tr><td> Ribosomal RNA</td>
<td>19</td>
<td>21</td>
<td>90.5%</td>
</tr>
<tr><td> Small nuclear RNA</td>
<td>35</td>
<td>109</td>
<td>32.1%</td>
</tr>
<tr><td> Small nucleolar RNA</td>
<td>45</td>
<td>390</td>
<td>11.5%</td>
</tr>
<tr><td> Other non-coding RNA</td>
<td>61</td>
<td>3850</td>
<td>1.6%</td>
</tr>
<tr><td>Pseudo-gene</td>
<td>21</td>
<td>927</td>
<td>2.7%</td>
</tr>
<tr><td>Protein-coding gene</td>
<td>303</td>
<td>18,720</td>
<td>1.6%</td>
</tr>
<tr><td> Paralogous gene</td>
<td>137</td>
<td>505</td>
<td>27.1%</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</sec>
<sec id="Sec15"><title>Comparison of CPU time and memory usage</title>
<p id="Par36">We compared the CPU time and memory usage of five existing methods with those of Matataki using real data in four runs, ERR188074, ERR188125, ERR188171, and ERR188362. In this comparison, we used <italic>k</italic>
 = 32, <italic>S</italic>
 = 12, and <italic>M</italic>
 = 3 as the parameters. The results confirmed that our method was much faster than the alignment-based methods (bowtie without quantification, RSEM, and eXpress) and more than twice as fast as the alignment-free methods Sailfish and Kallisto (Table <xref rid="Tab2" ref-type="table">2</xref>
, Fig. <xref rid="Fig3" ref-type="fig">3</xref>
). With respect to memory usage, Matataki used 3.5 GB of RAM, while the other methods used 3.8 GB or more. It should be noted that Matataki was also faster than decompression with gzip (~ 55 s) and bzip2 (~ 285 s).<table-wrap id="Tab2"><label>Table 2</label>
<caption><p>Comparison of running times among methods</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th colspan="2">Run accession</th>
<th>ERR188074</th>
<th>ERR188125</th>
<th>ERR188171</th>
<th>ERR188362</th>
</tr>
</thead>
<tbody><tr><td rowspan="3">Run and mapping statistics</td>
<td>Number of reads</td>
<td>31,540,813</td>
<td>28,810,860</td>
<td>30,386,179</td>
<td>26,255,381</td>
</tr>
<tr><td>Length of reads</td>
<td>75</td>
<td>75</td>
<td>75</td>
<td>75</td>
</tr>
<tr><td>Bowtie mapping rate</td>
<td>84.7%</td>
<td>80.2%</td>
<td>84.6%</td>
<td>80.4%</td>
</tr>
<tr><td rowspan="6">CPU time comparison (s)<sup>a</sup>
</td>
<td>eXpress</td>
<td>14,546.6</td>
<td>24,036.1</td>
<td>13,429.5</td>
<td>23,103.9</td>
</tr>
<tr><td>RSEM</td>
<td>22,700.6</td>
<td>20,545.9</td>
<td>21,753.1</td>
<td>18,842.2</td>
</tr>
<tr><td>Bowtie</td>
<td>1487.8</td>
<td>1477.5</td>
<td>1472.6</td>
<td>1319.5</td>
</tr>
<tr><td>Sailfish</td>
<td>299.0</td>
<td>281.0</td>
<td>294.2</td>
<td>285.5</td>
</tr>
<tr><td>Kallisto</td>
<td>138.7</td>
<td>144.2</td>
<td>136.7</td>
<td>129.5</td>
</tr>
<tr><td>Matataki</td>
<td>57.2</td>
<td>46.4</td>
<td>43.9</td>
<td>42.5</td>
</tr>
<tr><td rowspan="5">Acceleration rate compared with existing methods</td>
<td>eXpress</td>
<td>254</td>
<td>517</td>
<td>305</td>
<td>543</td>
</tr>
<tr><td>RSEM</td>
<td>397</td>
<td>442</td>
<td>495</td>
<td>443</td>
</tr>
<tr><td>Bowtie</td>
<td>26.0</td>
<td>31.8</td>
<td>33.5</td>
<td>31.0</td>
</tr>
<tr><td>Sailfish</td>
<td>5.23</td>
<td>6.05</td>
<td>6.69</td>
<td>6.71</td>
</tr>
<tr><td>Kallisto</td>
<td>2.43</td>
<td>3.11</td>
<td>3.11</td>
<td>3.05</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><sup>a</sup>
Values represent the median for 10 measurements</p>
</table-wrap-foot>
</table-wrap>
<fig id="Fig3"><label>Fig. 3</label>
<caption><p>Comparison of CPU time for different methods</p>
</caption>
<graphic xlink:href="12859_2018_2279_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p id="Par37">It should be noted that our approach is not designed for precise quantification of individual transcripts or of lowly expressed genes. In our method, the speed of quantification takes priority over these limitations because increasing the amount of RNA-Seq data improves the value of reanalysis, for example the quality of gene co-expression networks [<xref ref-type="bibr" rid="CR17">17</xref>
].</p>
</sec>
<sec id="Sec16"><title>Expected use-cases and limitations</title>
<p id="Par38">Since Matataki was designed with the objective of improving the speed of quantifying RNA-Seq data, the accuracy of quantification can be worse than that of other methods. Therefore, Matataki is suitable for large-scale reanalysis such as searching for similar gene expression profiles or gene co-expression analysis. As shown in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S7, a larger number of samples in gene co-expression improves the accuracy of GO term prediction. Since the amount of RNA-Seq data is rapidly increasing in public databases, it is important to increase the number of reanalyzed samples to determine gene co-expression patterns.</p>
<p id="Par39">Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution alters every k-mer that overlaps it, that is, <italic>k</italic> k-mers spanning 2 <italic>k</italic>
 − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value. It was also previously reported that transcript-level abundance inference improves gene-level expression estimation, both theoretically [<xref ref-type="bibr" rid="CR25">25</xref>
] and practically [<xref ref-type="bibr" rid="CR18">18</xref>
]. Another weak point of this method is that the ratio of uncovered genes was over half when we used GENCODE version 28 [<xref ref-type="bibr" rid="CR21">21</xref>
] to create the index, because the comprehensive GENCODE annotation includes many incomplete transcripts without a start codon and stop codon (see Additional file <xref rid="MOESM3" ref-type="media">3</xref>
: Table S5). Since Matataki requires k-mers that are unique between genes and common among the transcripts of a gene, major transcripts should be selected as reference transcripts. For these reasons, the expected use-case of Matataki is the large-scale reanalysis of RNA-Seq data, such as gene co-expression analysis or searching for similar expression profiles.</p>
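<p>The effect of a single substitution on k-mers mentioned above can be checked with a small Python sketch. The function name and the toy sequence are assumptions for illustration; the sketch lists the start positions of all k-mers that differ between the original and the mutated sequence.</p>
<preformat preformat-type="code">
```python
def kmers_changed_by_substitution(seq, pos, new_base, k):
    """Start positions of the k-mers altered by replacing seq[pos] with new_base."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    return [i for i in range(len(seq) - k + 1)
            if seq[i:i + k] != mutated[i:i + k]]
```
</preformat>
<p>For a substitution far enough from both ends of the sequence, exactly <italic>k</italic> k-mers change, and together they span 2<italic>k</italic> − 1 bases; none of these k-mers can match the index unless the mutated k-mer happens to be indexed as well.</p>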
</sec>
</sec>
<sec id="Sec17"><title>Conclusion</title>
<p id="Par40">We present Matataki, a fast and user-friendly quantification method for RNA-Seq data analysis. In our benchmark, this method was approximately 250 to 540 times faster than the alignment-based method bowtie/eXpress and more than two times faster than other alignment-free methods, and it had smaller memory requirements. In addition, its quantification accuracy was comparable to that of alignment-based methods and better than that of the other alignment-free methods. Because Matataki was even faster than decompression with gzip and bzip2, its improved computational cost and speed resolve one of the major limitations of RNA-Seq analyses, shifting the bottleneck from read mapping to decompression.</p>
</sec>
<sec sec-type="supplementary-material"><title>Additional files</title>
<sec id="Sec18"><p><supplementary-material content-type="local-data" id="MOESM1"><media xlink:href="12859_2018_2279_MOESM1_ESM.docx"><label>Additional file 1:</label>
<caption><p>Supplementary methods (pseudocode and mapping) and figures. (DOCX 1581 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM2"><media xlink:href="12859_2018_2279_MOESM2_ESM.gz"><label>Additional file 2:</label>
<caption><p>Source code of Matataki. (GZ 7760 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM3"><media xlink:href="12859_2018_2279_MOESM3_ESM.xlsx"><label>Additional file 3:</label>
<caption><p><bold>Table S1.</bold>
 Numbers of indexed k-mer for each transcript. <bold>Table S2.</bold>
 List of paralogous genes and number of indexed k-mers. <bold>Table S3.</bold>
 List of enriched biological process GO terms in uncovered genes. <bold>Table S4.</bold>
 List of enriched molecular function GO terms in uncovered genes. <bold>Table S5.</bold> Details of the uncovered genes in GENCODE transcripts. (XLSX 3579 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back><glossary><title>Abbreviations</title>
<def-list><def-item><term>EM</term>
<def><p id="Par5">Expectation maximization</p>
</def>
</def-item>
<def-item><term>FPKM</term>
<def><p id="Par6">Fragments per kilobase of transcript per million mapped reads</p>
</def>
</def-item>
<def-item><term>SCC</term>
<def><p id="Par7">Spearman’s correlation coefficient</p>
</def>
</def-item>
<def-item><term>TPM</term>
<def><p id="Par8">Transcripts per million</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group><fn><p><bold>Electronic supplementary material</bold>
</p>
<p>The online version of this article (10.1186/s12859-018-2279-y) contains supplementary material, which is available to authorized users.</p>
</fn>
</fn-group>
<ack><title>Acknowledgements</title>
<p>The super-computing resource was provided by the Human Genome Center of the University of Tokyo.</p>
<sec id="FPar1"><title>Funding</title>
<p id="Par41">This research was supported by the Platform Project for Supporting Drug Discovery and Life Science Research [Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)] from the Japan Agency for Medical Research and Development (AMED; grant number JP18am0101067) and a Grant-in-Aid for Challenging Exploratory Research (grant number 16K12519) from the Japan Society for the Promotion of Science (JSPS). The funding bodies did not play any role in the design of the study; in the collection, analysis, and interpretation of data; or in writing the manuscript.</p>
</sec>
<sec id="FPar2"><title>Availability of data and materials</title>
<p id="Par42">Not applicable.</p>
</sec>
</ack>
<notes notes-type="author-contribution"><title>Authors’ contributions</title>
<p>YO and KK designed the study and wrote the paper. YO performed the programming and evaluation of the method. Both authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement"><sec id="FPar3"><title>Ethics approval and consent to participate</title>
<p id="Par43">Not applicable.</p>
</sec>
<sec id="FPar4"><title>Consent for publication</title>
<p id="Par44">Not applicable.</p>
</sec>
<sec id="FPar5"><title>Competing interests</title>
<p id="Par45">The authors declare that they have no competing interests.</p>
</sec>
<sec id="FPar6"><title>Publisher’s Note</title>
<p id="Par46">Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1"><title>References</title>
<ref id="CR1"><label>1.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname>
<given-names>D</given-names>
</name>
<name><surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name><surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name><surname>Pimentel</surname>
<given-names>H</given-names>
</name>
<name><surname>Kelley</surname>
<given-names>R</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions</article-title>
<source>Genome Biol</source>
<year>2013</year>
<volume>14</volume>
<fpage>R36</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2013-14-4-r36</pub-id>
<pub-id pub-id-type="pmid">23618408</pub-id>
</element-citation>
</ref>
<ref id="CR2"><label>2.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name><surname>Pachter</surname>
<given-names>L</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>TopHat: discovering splice junctions with RNA-Seq</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1105</fpage>
<lpage>1111</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp120</pub-id>
<pub-id pub-id-type="pmid">19289445</pub-id>
</element-citation>
</ref>
<ref id="CR3"><label>3.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name><surname>Roberts</surname>
<given-names>A</given-names>
</name>
<name><surname>Goff</surname>
<given-names>L</given-names>
</name>
<name><surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name><surname>Kim</surname>
<given-names>D</given-names>
</name>
<name><surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and cufflinks</article-title>
<source>Nat Protoc</source>
<year>2012</year>
<volume>7</volume>
<fpage>562</fpage>
<lpage>578</lpage>
<pub-id pub-id-type="doi">10.1038/nprot.2012.016</pub-id>
</element-citation>
</ref>
<ref id="CR4"><label>4.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>B</given-names>
</name>
<name><surname>Dewey</surname>
<given-names>CN</given-names>
</name>
</person-group>
<article-title>RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>323</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-323</pub-id>
<pub-id pub-id-type="pmid">21816040</pub-id>
</element-citation>
</ref>
<ref id="CR5"><label>5.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Roberts</surname>
<given-names>A</given-names>
</name>
<name><surname>Pachter</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Streaming fragment assignment for real-time analysis of sequencing experiments</article-title>
<source>Nat Methods</source>
<year>2013</year>
<volume>10</volume>
<fpage>71</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2251</pub-id>
<pub-id pub-id-type="pmid">23160280</pub-id>
</element-citation>
</ref>
<ref id="CR6"><label>6.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name><surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name><surname>Pop</surname>
<given-names>M</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<fpage>R25</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-3-r25</pub-id>
<pub-id pub-id-type="pmid">19261174</pub-id>
</element-citation>
</ref>
<ref id="CR7"><label>7.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Patro</surname>
<given-names>R</given-names>
</name>
<name><surname>Mount</surname>
<given-names>SM</given-names>
</name>
<name><surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms</article-title>
<source>Nat Biotechnol</source>
<year>2014</year>
<volume>32</volume>
<fpage>462</fpage>
<lpage>464</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2862</pub-id>
<pub-id pub-id-type="pmid">24752080</pub-id>
</element-citation>
</ref>
<ref id="CR8"><label>8.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name><surname>Wang</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>RNA-skim: a rapid method for RNA-Seq quantification at transcript level</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<fpage>i283</fpage>
<lpage>i292</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu288</pub-id>
<pub-id pub-id-type="pmid">24931995</pub-id>
</element-citation>
</ref>
<ref id="CR9"><label>9.</label>
<mixed-citation publication-type="other">Bray N, Pimentel H, Melsted P, Pachter L. Near-optimal RNA-seq quantification. arXiv. 2015; <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1505.02710">http://arxiv.org/abs/1505.02710</ext-link>
</mixed-citation>
</ref>
<ref id="CR10"><label>10.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Janzen</surname>
<given-names>D</given-names>
</name>
<name><surname>Tiourin</surname>
<given-names>E</given-names>
</name>
<name><surname>Salehi</surname>
<given-names>J</given-names>
</name>
<name><surname>Paik</surname>
<given-names>DY</given-names>
</name>
<name><surname>Lu</surname>
<given-names>J</given-names>
</name>
<name><surname>Pellegrini</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>An apoptosis-enhancing drug overcomes platinum resistance in a tumour-initiating subpopulation of ovarian cancer</article-title>
<source>Nat Commun</source>
<year>2015</year>
<volume>6</volume>
<fpage>7956</fpage>
<pub-id pub-id-type="doi">10.1038/ncomms8956</pub-id>
<pub-id pub-id-type="pmid">26234182</pub-id>
</element-citation>
</ref>
<ref id="CR11"><label>11.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Madan</surname>
<given-names>B</given-names>
</name>
<name><surname>Ke</surname>
<given-names>Z</given-names>
</name>
<name><surname>Harmston</surname>
<given-names>N</given-names>
</name>
<name><surname>Ho</surname>
<given-names>SY</given-names>
</name>
<name><surname>Frois</surname>
<given-names>AO</given-names>
</name>
<name><surname>Alam</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Wnt addiction of genetically defined cancers reversed by PORCN inhibition</article-title>
<source>Oncogene</source>
<year>2016</year>
<volume>35</volume>
<fpage>2197</fpage>
<lpage>2207</lpage>
<pub-id pub-id-type="doi">10.1038/onc.2015.280</pub-id>
<pub-id pub-id-type="pmid">26257057</pub-id>
</element-citation>
</ref>
<ref id="CR12"><label>12.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cacchiarelli</surname>
<given-names>D</given-names>
</name>
<name><surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name><surname>Ziller</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Soumillon</surname>
<given-names>M</given-names>
</name>
<name><surname>Cesana</surname>
<given-names>M</given-names>
</name>
<name><surname>Karnik</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Integrative analyses of human reprogramming reveal dynamic nature of induced pluripotency</article-title>
<source>Cell</source>
<year>2015</year>
<volume>162</volume>
<fpage>412</fpage>
<lpage>424</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2015.06.016</pub-id>
<pub-id pub-id-type="pmid">26186193</pub-id>
</element-citation>
</ref>
<ref id="CR13"><label>13.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname>
<given-names>H</given-names>
</name>
<name><surname>Li</surname>
<given-names>Z</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>W</given-names>
</name>
<name><surname>Schulze-Gahmen</surname>
<given-names>U</given-names>
</name>
<name><surname>Xue</surname>
<given-names>Y</given-names>
</name>
<name><surname>Zhou</surname>
<given-names>Q</given-names>
</name>
</person-group>
<article-title>Gene target specificity of the super elongation complex (SEC) family: how HIV-1 tat employs selected SEC members to activate viral transcription</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<fpage>5868</fpage>
<lpage>5879</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkv541</pub-id>
<pub-id pub-id-type="pmid">26007649</pub-id>
</element-citation>
</ref>
<ref id="CR14"><label>14.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname>
<given-names>Y</given-names>
</name>
<name><surname>Wang</surname>
<given-names>X</given-names>
</name>
<name><surname>Wu</surname>
<given-names>F</given-names>
</name>
<name><surname>Huang</surname>
<given-names>R</given-names>
</name>
<name><surname>Xue</surname>
<given-names>F</given-names>
</name>
<name><surname>Liang</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Transcriptome profiling of the cancer, adjacent non-tumor and distant normal tissues from a colorectal cancer patient by deep sequencing</article-title>
<source>PLoS One</source>
<year>2012</year>
<volume>7</volume>
<fpage>e41001</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0041001</pub-id>
<pub-id pub-id-type="pmid">22905095</pub-id>
</element-citation>
</ref>
<ref id="CR15"><label>15.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name><surname>Lieu</surname>
<given-names>YK</given-names>
</name>
<name><surname>Ali</surname>
<given-names>AM</given-names>
</name>
<name><surname>Penson</surname>
<given-names>A</given-names>
</name>
<name><surname>Reggio</surname>
<given-names>KS</given-names>
</name>
<name><surname>Rabadan</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Disease-associated mutation in SRSF2 misregulates splicing by altering RNA-binding affinities</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2015</year>
<volume>112</volume>
<fpage>E4726</fpage>
<lpage>E4734</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1514105112</pub-id>
<pub-id pub-id-type="pmid">26261309</pub-id>
</element-citation>
</ref>
<ref id="CR16"><label>16.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Okamura</surname>
<given-names>Y</given-names>
</name>
<name><surname>Aoki</surname>
<given-names>Y</given-names>
</name>
<name><surname>Obayashi</surname>
<given-names>T</given-names>
</name>
<name><surname>Tadaka</surname>
<given-names>S</given-names>
</name>
<name><surname>Ito</surname>
<given-names>S</given-names>
</name>
<name><surname>Narise</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<article-title>COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<fpage>D82</fpage>
<lpage>D86</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gku1163</pub-id>
<pub-id pub-id-type="pmid">25392420</pub-id>
</element-citation>
</ref>
<ref id="CR17"><label>17.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Obayashi</surname>
<given-names>T</given-names>
</name>
<name><surname>Okamura</surname>
<given-names>Y</given-names>
</name>
<name><surname>Ito</surname>
<given-names>S</given-names>
</name>
<name><surname>Tadaka</surname>
<given-names>S</given-names>
</name>
<name><surname>Motoike</surname>
<given-names>IN</given-names>
</name>
<name><surname>Kinoshita</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>COXPRESdb: a database of comparative gene coexpression networks of eleven species for mammals</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
<volume>41</volume>
<fpage>D1014</fpage>
<lpage>D1020</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gks1014</pub-id>
<pub-id pub-id-type="pmid">23203868</pub-id>
</element-citation>
</ref>
<ref id="CR18"><label>18.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Soneson</surname>
<given-names>C</given-names>
</name>
<name><surname>Love</surname>
<given-names>MI</given-names>
</name>
<name><surname>Robinson</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences</article-title>
<source>F1000 Res</source>
<year>2016</year>
<volume>4</volume>
<fpage>1521</fpage>
<pub-id pub-id-type="doi">10.12688/f1000research.7563.2</pub-id>
</element-citation>
</ref>
<ref id="CR19"><label>19.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><collab>FAL Labs</collab>
</person-group>
<source>KyotoCabinet</source>
<year>2011</year>
</element-citation>
</ref>
<ref id="CR20"><label>20.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><collab>NCBI Resource Coordinators</collab>
</person-group>
<article-title>Database resources of the national center for biotechnology information</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<fpage>D6</fpage>
<lpage>17</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gku1130</pub-id>
<pub-id pub-id-type="pmid">25398906</pub-id>
</element-citation>
</ref>
<ref id="CR21"><label>21.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Harrow</surname>
<given-names>J</given-names>
</name>
<name><surname>Frankish</surname>
<given-names>A</given-names>
</name>
<name><surname>Gonzalez</surname>
<given-names>JM</given-names>
</name>
</person-group>
<article-title>GENCODE: the reference human genome annotation for the ENCODE project</article-title>
<source>Genome Res</source>
<year>2012</year>
<volume>22</volume>
<fpage>1760</fpage>
<lpage>1774</lpage>
<pub-id pub-id-type="doi">10.1101/gr.135350.111</pub-id>
<pub-id pub-id-type="pmid">22955987</pub-id>
</element-citation>
</ref>
<ref id="CR22"><label>22.</label>
<mixed-citation publication-type="other">Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM. 1970; 10.1145/362686.362692.</mixed-citation>
</ref>
<ref id="CR23"><label>23.</label>
<mixed-citation publication-type="other">Frankish A, Uszczynska B, Ritchie GRS, Gonzalez JM, Pervouchine D, Petryszak R, et al. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics. 2015; 10.1186/1471-2164-16-S8-S2.</mixed-citation>
</ref>
<ref id="CR24"><label>24.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname>
<given-names>YA</given-names>
</name>
<name><surname>Tripathi</surname>
<given-names>LP</given-names>
</name>
<name><surname>Mizuguchi</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery</article-title>
<source>PLoS One</source>
<year>2011</year>
<volume>6</volume>
<fpage>e17844</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0017844</pub-id>
<pub-id pub-id-type="pmid">21408081</pub-id>
</element-citation>
</ref>
<ref id="CR25"><label>25.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name><surname>Hendrickson</surname>
<given-names>DG</given-names>
</name>
<name><surname>Sauvageau</surname>
<given-names>M</given-names>
</name>
<name><surname>Goff</surname>
<given-names>L</given-names>
</name>
<name><surname>Rinn</surname>
<given-names>JL</given-names>
</name>
<name><surname>Pachter</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Differential analysis of gene regulation at transcript resolution with RNA-seq</article-title>
<source>Nat Biotechnol</source>
<year>2013</year>
<volume>31</volume>
<fpage>46</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2450</pub-id>
<pub-id pub-id-type="pmid">23222703</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000274  | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000274  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021
