MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition

Internal identifier: 000367 (Pmc/Corpus); previous: 000366; next: 000368

Authors: Jérôme Audoux; Nicolas Philippe; Rayan Chikhi; Mikaël Salson; Mélina Gallopin; Marc Gabriel; Jérémy Le Coz; Emilie Drouineau; Thérèse Commes; Daniel Gautheret

RBID : PMC:5747171

Abstract

We introduce a k-mer-based computational protocol, DE-kupl, for capturing local RNA variation in a set of RNA-seq libraries, independently of a reference genome or transcriptome. DE-kupl extracts all k-mers with differential abundance directly from the raw data files. This enables the retrieval of virtually all variation present in an RNA-seq data set. This variation is subsequently assigned to biological events or entities such as differential long non-coding RNAs, splice and polyadenylation variants, introns, repeats, editing or mutation events, and exogenous RNA. Applying DE-kupl to human RNA-seq data sets identified multiple types of novel events, reproducibly across independent RNA-seq experiments.
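
As a rough illustration of the idea summarized in the abstract, the sketch below decomposes reads into k-mers and keeps those whose counts differ between two conditions. It is a toy example under stated assumptions, not DE-kupl's actual procedure: the function names (`count_kmers`, `differential_kmers`), the pseudocount, and the log2 fold-change threshold are all hypothetical, and the sketch omits library-size normalization, reverse-complement handling, replicates, and any statistical testing.

```python
from collections import Counter
from math import log2


def count_kmers(reads, k):
    """Count every substring of length k across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(reads_a, reads_b, k, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {k-mer: log2 fold change of B over A} for k-mers whose
    abundance differs by at least min_abs_log2fc (hypothetical threshold)."""
    counts_a = count_kmers(reads_a, k)
    counts_b = count_kmers(reads_b, k)
    selected = {}
    # Consider every k-mer seen in either condition; a Counter returns 0
    # for k-mers that are absent from one side.
    for kmer in counts_a.keys() | counts_b.keys():
        lfc = log2((counts_b[kmer] + pseudocount) / (counts_a[kmer] + pseudocount))
        if abs(lfc) >= min_abs_log2fc:
            selected[kmer] = lfc
    return selected


if __name__ == "__main__":
    # Two made-up "libraries" that differ by a single base substitution.
    condition_a = ["ACGTACGT"]
    condition_b = ["ACGTTCGT"]
    for kmer, lfc in sorted(differential_kmers(condition_a, condition_b, k=4).items()):
        print(kmer, round(lfc, 2))
```

Because no alignment is involved, a single-base difference shows up as a small group of k-mers specific to one condition, which is the sense in which such an approach can work without a reference genome or transcriptome.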

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-017-1372-2) contains supplementary material, which is available to authorized users.


DOI: 10.1186/s13059-017-1372-2
PubMed: 29284518
PubMed Central: 5747171

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">DE-kupl: exhaustive capture of biological variation in RNA-seq data through
<italic>k</italic>
-mer decomposition</title>
<author>
<name sortKey="Audoux, Jerome" sort="Audoux, Jerome" uniqKey="Audoux J" first="Jérôme" last="Audoux">Jérôme Audoux</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>INSERM U1183 IRMB,</institution>
<institution>Université de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295 France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Philippe, Nicolas" sort="Philippe, Nicolas" uniqKey="Philippe N" first="Nicolas" last="Philippe">Nicolas Philippe</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>Institut de Biologie Computationnelle,</institution>
<institution>Université Montpellier,</institution>
</institution-wrap>
Montpellier, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9961 060X</institution-id>
<institution-id institution-id-type="GRID">grid.157868.5</institution-id>
<institution>SeqOne, IRMB,</institution>
<institution>CHRU de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, Montpellier, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
<affiliation>
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2112 9282</institution-id>
<institution-id institution-id-type="GRID">grid.4444.0</institution-id>
<institution>Univ. Lille, CNRS, Inria,</institution>
</institution-wrap>
UMR 9189 - CRIStAL - F-59000, Lille, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Salson, Mikael" sort="Salson, Mikael" uniqKey="Salson M" first="Mikaël" last="Salson">Mikaël Salson</name>
<affiliation>
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2112 9282</institution-id>
<institution-id institution-id-type="GRID">grid.4444.0</institution-id>
<institution>Univ. Lille, CNRS, Inria,</institution>
</institution-wrap>
UMR 9189 - CRIStAL - F-59000, Lille, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gallopin, Melina" sort="Gallopin, Melina" uniqKey="Gallopin M" first="Mélina" last="Gallopin">Mélina Gallopin</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gabriel, Marc" sort="Gabriel, Marc" uniqKey="Gabriel M" first="Marc" last="Gabriel">Marc Gabriel</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff6">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2284 9388</institution-id>
<institution-id institution-id-type="GRID">grid.14925.3b</institution-id>
<institution>Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655,</institution>
</institution-wrap>
Villejuif, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Le Coz, Jeremy" sort="Le Coz, Jeremy" uniqKey="Le Coz J" first="Jérémy" last="Le Coz">Jérémy Le Coz</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Drouineau, Emilie" sort="Drouineau, Emilie" uniqKey="Drouineau E" first="Emilie" last="Drouineau">Emilie Drouineau</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Commes, Therese" sort="Commes, Therese" uniqKey="Commes T" first="Thérèse" last="Commes">Thérèse Commes</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>INSERM U1183 IRMB,</institution>
<institution>Université de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295 France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>Institut de Biologie Computationnelle,</institution>
<institution>Université Montpellier,</institution>
</institution-wrap>
Montpellier, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gautheret, Daniel" sort="Gautheret, Daniel" uniqKey="Gautheret D" first="Daniel" last="Gautheret">Daniel Gautheret</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff6">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2284 9388</institution-id>
<institution-id institution-id-type="GRID">grid.14925.3b</institution-id>
<institution>Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655,</institution>
</institution-wrap>
Villejuif, France</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29284518</idno>
<idno type="pmc">5747171</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5747171</idno>
<idno type="RBID">PMC:5747171</idno>
<idno type="doi">10.1186/s13059-017-1372-2</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000367</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000367</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">DE-kupl: exhaustive capture of biological variation in RNA-seq data through
<italic>k</italic>
-mer decomposition</title>
<author>
<name sortKey="Audoux, Jerome" sort="Audoux, Jerome" uniqKey="Audoux J" first="Jérôme" last="Audoux">Jérôme Audoux</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>INSERM U1183 IRMB,</institution>
<institution>Université de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295 France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Philippe, Nicolas" sort="Philippe, Nicolas" uniqKey="Philippe N" first="Nicolas" last="Philippe">Nicolas Philippe</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>Institut de Biologie Computationnelle,</institution>
<institution>Université Montpellier,</institution>
</institution-wrap>
Montpellier, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9961 060X</institution-id>
<institution-id institution-id-type="GRID">grid.157868.5</institution-id>
<institution>SeqOne, IRMB,</institution>
<institution>CHRU de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, Montpellier, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
<affiliation>
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2112 9282</institution-id>
<institution-id institution-id-type="GRID">grid.4444.0</institution-id>
<institution>Univ. Lille, CNRS, Inria,</institution>
</institution-wrap>
UMR 9189 - CRIStAL - F-59000, Lille, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Salson, Mikael" sort="Salson, Mikael" uniqKey="Salson M" first="Mikaël" last="Salson">Mikaël Salson</name>
<affiliation>
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2112 9282</institution-id>
<institution-id institution-id-type="GRID">grid.4444.0</institution-id>
<institution>Univ. Lille, CNRS, Inria,</institution>
</institution-wrap>
UMR 9189 - CRIStAL - F-59000, Lille, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gallopin, Melina" sort="Gallopin, Melina" uniqKey="Gallopin M" first="Mélina" last="Gallopin">Mélina Gallopin</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gabriel, Marc" sort="Gabriel, Marc" uniqKey="Gabriel M" first="Marc" last="Gabriel">Marc Gabriel</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff6">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2284 9388</institution-id>
<institution-id institution-id-type="GRID">grid.14925.3b</institution-id>
<institution>Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655,</institution>
</institution-wrap>
Villejuif, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Le Coz, Jeremy" sort="Le Coz, Jeremy" uniqKey="Le Coz J" first="Jérémy" last="Le Coz">Jérémy Le Coz</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Drouineau, Emilie" sort="Drouineau, Emilie" uniqKey="Drouineau E" first="Emilie" last="Drouineau">Emilie Drouineau</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Commes, Therese" sort="Commes, Therese" uniqKey="Commes T" first="Thérèse" last="Commes">Thérèse Commes</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>INSERM U1183 IRMB,</institution>
<institution>Université de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295 France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>Institut de Biologie Computationnelle,</institution>
<institution>Université Montpellier,</institution>
</institution-wrap>
Montpellier, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gautheret, Daniel" sort="Gautheret, Daniel" uniqKey="Gautheret D" first="Daniel" last="Gautheret">Daniel Gautheret</name>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff6">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2284 9388</institution-id>
<institution-id institution-id-type="GRID">grid.14925.3b</institution-id>
<institution>Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655,</institution>
</institution-wrap>
Villejuif, France</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Genome Biology</title>
<idno type="ISSN">1474-7596</idno>
<idno type="eISSN">1474-760X</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We introduce a
<italic>k</italic>
-mer-based computational protocol, DE-kupl, for capturing local RNA variation in a set of RNA-seq libraries, independently of a reference genome or transcriptome. DE-kupl extracts all
<italic>k</italic>
-mers with differential abundance directly from the raw data files. This enables the retrieval of virtually all variation present in an RNA-seq data set. This variation is subsequently assigned to biological events or entities such as differential long non-coding RNAs, splice and polyadenylation variants, introns, repeats, editing or mutation events, and exogenous RNA. Applying DE-kupl to human RNA-seq data sets identified multiple types of novel events, reproducibly across independent RNA-seq experiments.</p>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13059-017-1372-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, C" uniqKey="Zhang C">C Zhang</name>
</author>
<author>
<name sortKey="Zhang, B" uniqKey="Zhang B">B Zhang</name>
</author>
<author>
<name sortKey="Lin, Ll" uniqKey="Lin L">LL Lin</name>
</author>
<author>
<name sortKey="Zhao, S" uniqKey="Zhao S">S Zhao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Soneson, C" uniqKey="Soneson C">C Soneson</name>
</author>
<author>
<name sortKey="Matthes, Kl" uniqKey="Matthes K">KL Matthes</name>
</author>
<author>
<name sortKey="Nowicka, M" uniqKey="Nowicka M">M Nowicka</name>
</author>
<author>
<name sortKey="Law, Cw" uniqKey="Law C">CW Law</name>
</author>
<author>
<name sortKey="Robinson, Md" uniqKey="Robinson M">MD Robinson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teng, M" uniqKey="Teng M">M Teng</name>
</author>
<author>
<name sortKey="Love, Mi" uniqKey="Love M">MI Love</name>
</author>
<author>
<name sortKey="Davis, Ca" uniqKey="Davis C">CA Davis</name>
</author>
<author>
<name sortKey="Djebali, S" uniqKey="Djebali S">S Djebali</name>
</author>
<author>
<name sortKey="Dobin, A" uniqKey="Dobin A">A Dobin</name>
</author>
<author>
<name sortKey="Graveley, Br" uniqKey="Graveley B">BR Graveley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kanitz, A" uniqKey="Kanitz A">A Kanitz</name>
</author>
<author>
<name sortKey="Gypas, F" uniqKey="Gypas F">F Gypas</name>
</author>
<author>
<name sortKey="Gruber, Aj" uniqKey="Gruber A">AJ Gruber</name>
</author>
<author>
<name sortKey="Gruber, Ar" uniqKey="Gruber A">AR Gruber</name>
</author>
<author>
<name sortKey="Martin, G" uniqKey="Martin G">G Martin</name>
</author>
<author>
<name sortKey="Zavolan, M" uniqKey="Zavolan M">M Zavolan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Middleton, R" uniqKey="Middleton R">R Middleton</name>
</author>
<author>
<name sortKey="Gao, D" uniqKey="Gao D">D Gao</name>
</author>
<author>
<name sortKey="Thomas, A" uniqKey="Thomas A">A Thomas</name>
</author>
<author>
<name sortKey="Singh, B" uniqKey="Singh B">B Singh</name>
</author>
<author>
<name sortKey="Au, A" uniqKey="Au A">A Au</name>
</author>
<author>
<name sortKey="Wong, Jj" uniqKey="Wong J">JJ Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pimentel, H" uniqKey="Pimentel H">H Pimentel</name>
</author>
<author>
<name sortKey="Kelley, R" uniqKey="Kelley R">R Kelley</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benelli, M" uniqKey="Benelli M">M Benelli</name>
</author>
<author>
<name sortKey="Pescucci, C" uniqKey="Pescucci C">C Pescucci</name>
</author>
<author>
<name sortKey="Marseglia, G" uniqKey="Marseglia G">G Marseglia</name>
</author>
<author>
<name sortKey="Severgnini, M" uniqKey="Severgnini M">M Severgnini</name>
</author>
<author>
<name sortKey="Torricelli, F" uniqKey="Torricelli F">F Torricelli</name>
</author>
<author>
<name sortKey="Magi, A" uniqKey="Magi A">A Magi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Memczak, S" uniqKey="Memczak S">S Memczak</name>
</author>
<author>
<name sortKey="Jens, M" uniqKey="Jens M">M Jens</name>
</author>
<author>
<name sortKey="Elefsinioti, A" uniqKey="Elefsinioti A">A Elefsinioti</name>
</author>
<author>
<name sortKey="Torti, F" uniqKey="Torti F">F Torti</name>
</author>
<author>
<name sortKey="Krueger, J" uniqKey="Krueger J">J Krueger</name>
</author>
<author>
<name sortKey="Rybak, A" uniqKey="Rybak A">A Rybak</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deelen, P" uniqKey="Deelen P">P Deelen</name>
</author>
<author>
<name sortKey="Zhernakova, Dv" uniqKey="Zhernakova D">DV Zhernakova</name>
</author>
<author>
<name sortKey="De Haan, M" uniqKey="De Haan M">M de Haan</name>
</author>
<author>
<name sortKey="Van Der Sijde, M" uniqKey="Van Der Sijde M">M van der Sijde</name>
</author>
<author>
<name sortKey="Bonder, Mj" uniqKey="Bonder M">MJ Bonder</name>
</author>
<author>
<name sortKey="Karjalainen, J" uniqKey="Karjalainen J">J Karjalainen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weinstein, Jn" uniqKey="Weinstein J">JN Weinstein</name>
</author>
<author>
<name sortKey="Collisson, Ea" uniqKey="Collisson E">EA Collisson</name>
</author>
<author>
<name sortKey="Mills, Gb" uniqKey="Mills G">GB Mills</name>
</author>
<author>
<name sortKey="Shaw, Krm" uniqKey="Shaw K">KRM Shaw</name>
</author>
<author>
<name sortKey="Ozenberger, Ba" uniqKey="Ozenberger B">BA Ozenberger</name>
</author>
<author>
<name sortKey="Ellrott, K" uniqKey="Ellrott K">K Ellrott</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rodriguez, Jm" uniqKey="Rodriguez J">JM Rodriguez</name>
</author>
<author>
<name sortKey="Maietta, P" uniqKey="Maietta P">P Maietta</name>
</author>
<author>
<name sortKey="Ezkurdia, I" uniqKey="Ezkurdia I">I Ezkurdia</name>
</author>
<author>
<name sortKey="Pietrelli, A" uniqKey="Pietrelli A">A Pietrelli</name>
</author>
<author>
<name sortKey="Wesselink, Jj" uniqKey="Wesselink J">JJ Wesselink</name>
</author>
<author>
<name sortKey="Lopez, G" uniqKey="Lopez G">G Lopez</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benjamini, Y" uniqKey="Benjamini Y">Y Benjamini</name>
</author>
<author>
<name sortKey="Hochberg, Y" uniqKey="Hochberg Y">Y Hochberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Genome Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">Genome Biol</journal-id>
<journal-title-group>
<journal-title>Genome Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1474-7596</issn>
<issn pub-type="epub">1474-760X</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29284518</article-id>
<article-id pub-id-type="pmc">5747171</article-id>
<article-id pub-id-type="publisher-id">1372</article-id>
<article-id pub-id-type="doi">10.1186/s13059-017-1372-2</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Method</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>DE-kupl: exhaustive capture of biological variation in RNA-seq data through
<italic>k</italic>
-mer decomposition</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Audoux</surname>
<given-names>Jérôme</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Philippe</surname>
<given-names>Nicolas</given-names>
</name>
<xref ref-type="aff" rid="Aff2">2</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chikhi</surname>
<given-names>Rayan</given-names>
</name>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Salson</surname>
<given-names>Mikaël</given-names>
</name>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gallopin</surname>
<given-names>Mélina</given-names>
</name>
<xref ref-type="aff" rid="Aff5">5</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gabriel</surname>
<given-names>Marc</given-names>
</name>
<xref ref-type="aff" rid="Aff5">5</xref>
<xref ref-type="aff" rid="Aff6">6</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Le Coz</surname>
<given-names>Jérémy</given-names>
</name>
<xref ref-type="aff" rid="Aff5">5</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Drouineau</surname>
<given-names>Emilie</given-names>
</name>
<xref ref-type="aff" rid="Aff5">5</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Commes</surname>
<given-names>Thérèse</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-1508-8469</contrib-id>
<name>
<surname>Gautheret</surname>
<given-names>Daniel</given-names>
</name>
<address>
<email>daniel.gautheret@u-psud.fr</email>
</address>
<xref ref-type="aff" rid="Aff5">5</xref>
<xref ref-type="aff" rid="Aff6">6</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>INSERM U1183 IRMB,</institution>
<institution>Université de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295 France</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2097 0141</institution-id>
<institution-id institution-id-type="GRID">grid.121334.6</institution-id>
<institution>Institut de Biologie Computationnelle,</institution>
<institution>Université Montpellier,</institution>
</institution-wrap>
Montpellier, France</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9961 060X</institution-id>
<institution-id institution-id-type="GRID">grid.157868.5</institution-id>
<institution>SeqOne, IRMB,</institution>
<institution>CHRU de Montpellier,</institution>
</institution-wrap>
Hopital St Eloi, Montpellier, France</aff>
<aff id="Aff4">
<label>4</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2112 9282</institution-id>
<institution-id institution-id-type="GRID">grid.4444.0</institution-id>
<institution>Univ. Lille, CNRS, Inria,</institution>
</institution-wrap>
UMR 9189 - CRIStAL - F-59000, Lille, France</aff>
<aff id="Aff5">
<label>5</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 2558</institution-id>
<institution-id institution-id-type="GRID">grid.5842.b</institution-id>
<institution>Institute for Integrative Biology of the Cell, CEA, CNRS,</institution>
<institution>Université Paris-Sud, Université Paris Saclay,</institution>
</institution-wrap>
Gif sur Yvette, France</aff>
<aff id="Aff6">
<label>6</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2284 9388</institution-id>
<institution-id institution-id-type="GRID">grid.14925.3b</institution-id>
<institution>Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655,</institution>
</institution-wrap>
Villejuif, France</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>28</day>
<month>12</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>28</day>
<month>12</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>18</volume>
<elocation-id>243</elocation-id>
<history>
<date date-type="received">
<day>1</day>
<month>6</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>5</day>
<month>12</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2017</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p>We introduce a
<italic>k</italic>
-mer-based computational protocol, DE-kupl, for capturing local RNA variation in a set of RNA-seq libraries, independently of a reference genome or transcriptome. DE-kupl extracts all
<italic>k</italic>
-mers with differential abundance directly from the raw data files. This enables the retrieval of virtually all variation present in an RNA-seq data set. This variation is subsequently assigned to biological events or entities such as differential long non-coding RNAs, splice and polyadenylation variants, introns, repeats, editing or mutation events, and exogenous RNA. Applying DE-kupl to human RNA-seq data sets identified multiple types of novel events, reproducibly across independent RNA-seq experiments.</p>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13059-017-1372-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<funding-group>
<award-group>
<funding-source>
<institution>Plan Cancer – Systems Biology</institution>
</funding-source>
<award-id>#bio2014-04</award-id>
</award-group>
<award-group>
<funding-source>
<institution>ANR (France)</institution>
</funding-source>
<award-id>ANR-10-INBS-0009 (France Génomique)</award-id>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>Successive generations of RNA-sequencing technologies have bolstered the notion that organisms produce a highly diverse and adaptable set of RNA molecules. Modern transcript catalogs, such as GENCODE [
<xref ref-type="bibr" rid="CR1">1</xref>
], now include hundreds of thousands of transcripts, reflecting pervasive transcription and widespread alternative RNA processing. However, despite years of high-throughput sequencing efforts and bioinformatics analysis, we contend that large amounts of transcriptomic information remain essentially disregarded.</p>
<p>Three major classes of biological events drive transcript diversity. Firstly, transcription initiation occurs at multiple alternative promoters in protein-coding and non-coding genes and at multiple antisense or inter/intragenic loci. Secondly, transcripts are processed by a large variety of mechanisms, including splicing and polyadenylation, editing [
<xref ref-type="bibr" rid="CR2">2</xref>
], circularization [
<xref ref-type="bibr" rid="CR3">3</xref>
], and cleavage/degradation by various nucleases [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR5">5</xref>
]. Thirdly, an essential, yet often overlooked source of transcript diversity is genomic variation. Polymorphism and structural variations within transcribed regions produce RNAs with single-nucleotide variations (SNVs), tandem duplications or deletions, transposon integrations, unstable microsatellites, or fusion events. These events are major sources of transcript variation that can strongly impact RNA processing, transport, and coding potential.</p>
<p>Current bioinformatics strategies for RNA-seq analysis do not fully account for this vast diversity of transcripts. A widely used approach consists of aligning or pseudo-aligning RNA-seq reads on a reference transcriptome to quantify transcripts [
<xref ref-type="bibr" rid="CR6">6</xref>
–
<xref ref-type="bibr" rid="CR8">8</xref>
]. Although it may be used in detecting isoform switching events, this analysis is by definition limited to transcripts present in the input reference [
<xref ref-type="bibr" rid="CR9">9</xref>
–
<xref ref-type="bibr" rid="CR12">12</xref>
]. Another approach attempts to reconstruct full-length transcripts, either reference-based [
<xref ref-type="bibr" rid="CR13">13</xref>
] or de novo [
<xref ref-type="bibr" rid="CR14">14</xref>
]. Although these protocols can identify novel transcripts, they do not account for true transcriptional diversity as they ignore small-scale variations, such as single-nucleotide polymorphisms, indels, and edited bases, and struggle with repeat-containing transcripts. Yet another class of protocols is devoted to the discovery of specific events, such as splicing events [
<xref ref-type="bibr" rid="CR15">15</xref>
–
<xref ref-type="bibr" rid="CR17">17</xref>
], alternative polyadenylation events [
<xref ref-type="bibr" rid="CR18">18</xref>
], intron retention events [
<xref ref-type="bibr" rid="CR19">19</xref>
], fusion transcripts [
<xref ref-type="bibr" rid="CR20">20</xref>
,
<xref ref-type="bibr" rid="CR21">21</xref>
], circular RNAs [
<xref ref-type="bibr" rid="CR22">22</xref>
], or allele-specific expression [
<xref ref-type="bibr" rid="CR23">23</xref>
]. Strategies combining multiple software tools for a comprehensive transcriptome analysis [
<xref ref-type="bibr" rid="CR24">24</xref>
] are difficult to implement and cannot be truly exhaustive.</p>
<p>Using public human RNA-seq data sets, we show that a large amount of captured RNA variation is not represented in existing transcript catalogs. We propose a new approach to RNA-seq analysis that facilitates the discovery of such events, independently of alignment or transcript assembly. Our approach relies on
<italic>k</italic>
-mer indexing of sequence files, a technique that recently gained momentum in next-generation sequencing data analysis [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
,
<xref ref-type="bibr" rid="CR25">25</xref>
–
<xref ref-type="bibr" rid="CR27">27</xref>
]. To identify biologically meaningful transcript variations, our method filters out
<italic>k</italic>
-mers present in a reference transcriptome and selects those with differential expression (DE) between two experimental conditions; hence its name, DE-kupl. When several
<italic>k</italic>
-mers represent the same variation, they are merged into a larger contig. As a proof of concept, we applied DE-kupl to RNA-seq data from an epithelial–mesenchymal transition (EMT) model and a variety of human tissues. DE-kupl identified significant numbers of novel events and was able to identify similar events reproducibly in independent RNA-seq experiments.</p>
</sec>
<sec id="Sec2" sec-type="results">
<title>Results</title>
<sec id="Sec3">
<title>Reference data sets are an incomplete representation of actual transcriptomes</title>
<p>We first analyzed
<italic>k</italic>
-mer diversity in different human references and high-throughput experimental sequences. To this end, we extracted all 31-nt
<italic>k</italic>
-mers from sequence files using the Jellyfish program [
<xref ref-type="bibr" rid="CR28">28</xref>
]. Figure 
<xref rid="Fig1" ref-type="fig">1</xref>
a, b compares
<italic>k</italic>
-mers from GENCODE transcripts and the human genome reference, with RNA-seq libraries from 18 different individuals [
<xref ref-type="bibr" rid="CR29">29</xref>
] corresponding to three primary tissues (six libraries/tissue). To minimize the risk of including
<italic>k</italic>
-mers containing sequencing errors, for each tissue we retained only the set of
<italic>k</italic>
-mers appearing in all six individuals.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>The diversity of 31-mers in RNA-seq libraries exceeds that of reference sequences.
<bold>a</bold>
Intersection of
<italic>k</italic>
-mers present in GENCODE transcripts and RNA-seq data from three tissues: bone marrow, skin, and colon. The set of
<italic>k</italic>
-mers for each tissue was defined as the set of
<italic>k</italic>
-mers shared by all six individuals.
<bold>b</bold>
Intersection of
<italic>k</italic>
-mers present in GENCODE transcripts, the reference human genome (GRCh38), and RNA-seq data (same as in
<bold>a</bold>
).
<bold>b1</bold>
Distribution of
<italic>k</italic>
-mer abundances for each tissue represented in
<bold>a</bold>
and
<bold>b</bold>
.
<italic>k</italic>
-mers shared with GENCODE are labeled as GENCODE. Among other
<italic>k</italic>
-mers, those shared with the human genome are labeled as GRCh38. The remaining
<italic>k</italic>
-mers are labeled as tissue-specific. The same procedure was applied in
<bold>b2</bold>
and
<bold>b3</bold>
.
<bold>b2</bold>
Distribution of
<italic>k</italic>
-mer diversity for each tissue.
<bold>b3</bold>
Mapping statistics of
<italic>k</italic>
-mers labeled as tissue-specific in
<bold>b2</bold>
. These
<italic>k</italic>
-mers were first mapped to GENCODE transcripts, and unmapped
<italic>k</italic>
-mers were then mapped to the GRCh38 reference using Bowtie1, with a tolerance of up to two mismatches in a 31-mer</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
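<p>The counting and recurrence step described above can be sketched in a few lines of Python. This is an illustration only: the paper uses Jellyfish for counting, whereas the helper names count_kmers and shared_kmers below are hypothetical, reads are assumed to be plain strings already held in memory, and no reverse-complement handling is attempted.</p>
<preformat>
```python
from collections import Counter

K = 31  # k-mer length used in the paper


def count_kmers(reads, k=K):
    """Count every k-mer found in an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k yield an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def shared_kmers(libraries, k=K):
    """Return the k-mers observed in every library (one library per individual)."""
    kmer_sets = [set(count_kmers(reads, k)) for reads in libraries]
    return set.intersection(*kmer_sets) if kmer_sets else set()
```
</preformat>
<p>Keeping only the intersection across individuals is a simple way to discard k-mers that arise from sequencing errors in a single library, at the cost of also discarding genuine but individual-specific variation.</p>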
<p>Measures of
<italic>k</italic>
-mer abundance show that
<italic>k</italic>
-mers are overwhelmingly associated with GENCODE transcripts (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">b1</xref>
). However, when considering
<italic>k</italic>
-mer diversity, a large proportion of
<italic>k</italic>
-mers are tissue-specific and not found in the GENCODE reference (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
a). These tissue-specific
<italic>k</italic>
-mers may result from sequencing errors, genetic variation in individuals, or novel or non-reference transcripts. The majority of RNA-seq
<italic>k</italic>
-mers that do not occur in GENCODE are found in the human genome reference (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b,
<xref rid="Fig1" ref-type="fig">b2</xref>
). This suggests that polymorphisms and errors represent a small fraction of tissue-specific
<italic>k</italic>
-mers and that many
<italic>k</italic>
-mers result from expressed genome regions that are not represented in GENCODE. Further scrutiny of tissue-specific
<italic>k</italic>
-mers shows that many can be mapped to the transcriptome with one substitution. However, for each tissue, there is an average of 1 million
<italic>k</italic>
-mers that cannot be mapped to either reference (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">b3</xref>
).</p>
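<p>The labeling scheme used for Fig. 1 (GENCODE first, then GRCh38, then tissue-specific) amounts to an ordered set-membership test. The sketch below assumes that the three k-mer collections fit in memory as Python sets and that k-mers are stored in a comparable orientation; the function name label_kmers is hypothetical.</p>
<preformat>
```python
def label_kmers(tissue_kmers, gencode_kmers, genome_kmers):
    """Assign each tissue k-mer to the first reference that contains it."""
    labels = {}
    for kmer in tissue_kmers:
        if kmer in gencode_kmers:
            labels[kmer] = "GENCODE"
        elif kmer in genome_kmers:
            # Not in any annotated transcript, but present in the genome.
            labels[kmer] = "GRCh38"
        else:
            labels[kmer] = "tissue-specific"
    return labels
```
</preformat>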
<p>Non-reference
<italic>k</italic>
-mers classify samples as accurately as reference transcripts. We performed a principal component analysis (PCA) of the human tissue samples described above using conventional transcript counts and
<italic>k</italic>
-mer counts. PCA based on 20,000 randomly selected unmapped
<italic>k</italic>
-mers was able to differentiate tissues as accurately as PCA based on estimated gene expression or transcript expression (Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
). This illustrates the biological relevance of non-reference transcriptome information that is not accounted for in standard analyses.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Principal component analysis for non-reference
<italic>k</italic>
-mers discriminates tissues. Samples are labeled according to their tissues (bone marrow, colon, and skin). PCs were produced with normalized log-transformed counts. For genes and transcripts, counts were generated with Kallisto based on GENCODE V25. Genomic
<italic>k</italic>
-mers correspond to 20k random
<italic>k</italic>
-mers from the RNA-seq libraries that did not map to GENCODE transcripts but successfully mapped to GRCh38. PC principal component</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p>When comparing RNA-seq and whole-genome sequence (WGS) data from the same individual [
<xref ref-type="bibr" rid="CR30">30</xref>
], library-specific
<italic>k</italic>
-mers are observed much more frequently in RNA-seq than in WGS data (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
). This shows that non-reference sequence diversity is larger in RNA-seq than in WGS. Altogether, these results suggest the existence of a significant amount of untapped biological information in RNA-seq data.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>The diversity of non-reference
<italic>k</italic>
-mers is greater for RNA-seq than for WGS. Intersection of
<italic>k</italic>
-mers between GENCODE transcripts, the human genome (GRCh38), RNA-seq, and WGS data. RNA-seq and WGS data originate from the same lymphoblastoid cell line (HCC1395). WGS whole-genome sequencing</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>Non-reference
<italic>k</italic>
-mers may result from the three aforementioned classes of biological events. Specifically, we expect that genetic polymorphism, intergenic expression (e.g., long intergenic non-coding RNA or lincRNA, antisense RNA, expressed repeats, or endogenous viral sequences) and alternative RNA processing (polyadenylation, splicing, and intron retention) are the predominant sources of non-reference
<italic>k</italic>
-mers. In combination, these genetic, transcriptional, and post-transcriptional events may have a profound impact on transcript function.</p>
</sec>
<sec id="Sec4">
<title>A new
<italic>k</italic>
-mer based protocol for deriving transcriptome variation from RNA-seq data</title>
<p>We designed the DE-kupl computational protocol with the aim of capturing all
<italic>k</italic>
-mer variation in an input set of RNA-seq libraries. This protocol has four main components (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
):
<list list-type="order">
<list-item>
<p>Indexing: index and count all
<italic>k</italic>
-mers (
<italic>k</italic>
=31) in the input libraries
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>The DE-kupl pipeline for the discovery and analysis of differentially expressed
<italic>k</italic>
-mers. First, Jellyfish is applied to count
<italic>k</italic>
-mers in all libraries.
<italic>k</italic>
-mer counts are then joined into a count matrix and filtered for low recurrence and for matches to the reference transcriptome. Normalization factors are computed from raw
<italic>k</italic>
-mer counts and the differential expression procedure is applied. Finally, overlapping differentially expressed
<italic>k</italic>
-mers are extended into contigs and annotated based on their alignment to the reference and overlap with annotated genes</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
</list-item>
<list-item>
<p>Filtering and masking: delete
<italic>k</italic>
-mers representing potential sequencing errors or perfectly matching reference transcripts</p>
</list-item>
<list-item>
<p>Differential expression (DE): select
<italic>k</italic>
-mers with significantly different abundances across conditions</p>
</list-item>
<list-item>
<p>Extending and annotating: build
<italic>k</italic>
-mer contigs and annotate contigs based on sequence alignment.</p>
</list-item>
</list>
</p>
<p>DE-kupl departs radically from existing RNA-seq analysis procedures in that it follows neither a map-first strategy (like the Tuxedo suite [
<xref ref-type="bibr" rid="CR31">31</xref>
]) nor an assemble-first strategy (like Trinity [
<xref ref-type="bibr" rid="CR32">32</xref>
]) but instead directly analyzes the contents of the raw FASTQ files, deferring mapping to the final stage of the procedure. In this way, DE-kupl guarantees that no variation in the input sequence (even at the level of a single nucleotide) is lost at the initial stage of the analysis. Even unmappable
<italic>k</italic>
-mers from repeats, low complexity regions, or exogenous organisms are retained until the final stage and can, thus, be analyzed.</p>
<p>The DE-kupl protocol is detailed in “
<xref rid="Sec20" ref-type="sec">Methods</xref>
”. We highlight here some of its key features. First, DE-kupl must accommodate the large size of the
<italic>k</italic>
-mer index. A single human RNA-seq library contains of the order of 10
<sup>7</sup>
to 10
<sup>8</sup>
distinct
<italic>k</italic>
-mers. We selected the Jellyfish tool for counting
<italic>k</italic>
-mers [
<xref ref-type="bibr" rid="CR28">28</xref>
] as it has very fast computing times and allows the storage of the full index on disk for further querying.</p>
<p>A central process in DE-kupl is
<italic>k</italic>
-mer filtering and masking. Filtering out unique or rare
<italic>k</italic>
-mers is relatively straightforward and considerably reduces
<italic>k</italic>
-mer diversity and the number of sequence errors. Masking entails the removal of
<italic>k</italic>
-mers matching a reference transcript collection. The rationale for this is that the bulk of
<italic>k</italic>
-mers in RNA-seq data comes from known exons, a form of canonical exon expression ignored in this study as it can be captured efficiently by conventional reference-based protocols [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. Discarding these
<italic>k</italic>
-mers enables us to ignore the strong signal caused by known transcripts, allowing us to focus better on expressed regions harboring differences from the reference transcriptome. Depending on the application, masking can be performed using a full annotation such as GENCODE or a simpler transcriptome limited to major transcripts, or skipped altogether.</p>
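<p>Filtering and masking can be illustrated on a small in-memory count matrix. The parameter names min_recurrence and min_recurrence_abundance are borrowed from Table 1, but their semantics here are an assumption: a k-mer is kept if at least min_recurrence libraries each contain it at least min_recurrence_abundance times. The function name filter_and_mask and the dictionary layout are hypothetical, and this is not the C implementation used by DE-kupl.</p>
<preformat>
```python
def filter_and_mask(count_matrix, reference_kmers,
                    min_recurrence=6, min_recurrence_abundance=5):
    """Drop rare k-mers, then drop k-mers found in the reference transcriptome.

    count_matrix maps each k-mer to a list of counts, one per library.
    reference_kmers is a set of k-mers extracted from reference transcripts.
    """
    kept = {}
    for kmer, counts in count_matrix.items():
        # Recurrence filter: number of libraries supporting this k-mer.
        support = sum(1 for c in counts if c >= min_recurrence_abundance)
        if support >= min_recurrence and kmer not in reference_kmers:
            kept[kmer] = counts
    return kept
```
</preformat>
<p>Passing an empty set as reference_kmers corresponds to skipping the masking step altogether.</p>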
<p>Two modes are available for the differential analysis of
<italic>k</italic>
-mers (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1 and “
<xref rid="Sec20" ref-type="sec">Methods</xref>
”). The
<italic>t</italic>
-test mode is fast and has low sensitivity, i.e., it retrieves only the most significantly DE
<italic>k</italic>
-mers. The DESeq2-based mode [
<xref ref-type="bibr" rid="CR33">33</xref>
] is slower, more sensitive, and is, therefore, recommended for small sample sizes (fewer than six samples per condition). Finally, a
<italic>k</italic>
-mer extension procedure merges overlapping
<italic>k</italic>
-mers into contigs and stops as soon as a fork is encountered (i.e., when a contig extremity is overlapped by two different
<italic>k</italic>
-mers). Rather than producing full-length transcripts, this procedure is intended to group
<italic>k</italic>
-mers overlapping a single event. Whenever possible, the key steps of the procedure (
<italic>k</italic>
-mer table merging,
<italic>t</italic>
-test, and
<italic>k</italic>
-mer extension) were written in C, enabling the whole procedure to run on a relatively standard computer in a reasonable amount of time.</p>
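<p>The extension step can be sketched as a greedy walk over k-mers that overlap by k-1 nucleotides, stopping whenever an extremity has zero or several possible neighbours. This is a simplified Python illustration, not the C code shipped with DE-kupl. As an added assumption, the sketch also refuses to merge into a k-mer that itself has another neighbour on the shared side, and the function name extend_kmers is hypothetical.</p>
<preformat>
```python
def extend_kmers(kmers):
    """Merge k-mers overlapping by k-1 nt into contigs, stopping at forks."""
    kmers = set(kmers)

    def successors(kmer):
        suffix = kmer[1:]
        return [suffix + b for b in "ACGT" if suffix + b in kmers]

    def predecessors(kmer):
        prefix = kmer[:-1]
        return [b + prefix for b in "ACGT" if b + prefix in kmers]

    used = set()
    contigs = []
    for seed in sorted(kmers):  # sorted only to make the output deterministic
        if seed in used:
            continue
        used.add(seed)
        contig, first, last = seed, seed, seed
        while True:  # extend to the right
            nxt = successors(last)
            if len(nxt) != 1 or nxt[0] in used or len(predecessors(nxt[0])) != 1:
                break  # dead end, fork, or k-mer already placed
            last = nxt[0]
            used.add(last)
            contig += last[-1]
        while True:  # extend to the left
            prev = predecessors(first)
            if len(prev) != 1 or prev[0] in used or len(successors(prev[0])) != 1:
                break
            first = prev[0]
            used.add(first)
            contig = first[0] + contig
        contigs.append(contig)
    return contigs
```
</preformat>
<p>Because extension halts at every fork, the output groups k-mers around a local event rather than reconstructing full-length transcripts, which matches the intent stated above.</p>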
</sec>
<sec id="Sec5">
<title>Discovery of differential RNA contigs with DE-kupl</title>
<p>To assess DE-kupl’s capacity to discover novel differential events, we applied it to 12 RNA-seq samples from an EMT cell-line model [
<xref ref-type="bibr" rid="CR34">34</xref>
], in which non-small cell lung cancer (NSCLC) cells were induced by ZEB1 expression over a 7-day time course. We compared six RNA-seq libraries from the epithelial stage of the time course (uninduced and day 1) with six libraries from the mesenchymal stage (days 6 and 7). The full DE-kupl procedure was completed in about 4 h in
<italic>t</italic>
-test mode (single threaded) and 6.5 h in DESeq2 mode (multi-threaded), using eight computing cores, 54 GB RAM, and 7 to 42 GB of hard disk space (Table 
<xref rid="Tab1" ref-type="table">1</xref>
). Recurrence filters efficiently reduced
<italic>k</italic>
-mer counts from 707M to 92.5M and GENCODE masking further reduced counts to 40.4M. Differential analysis in
<italic>t</italic>
-test mode eventually retained 3.8M
<italic>k</italic>
-mers that were assembled into 133,690 contigs (Table 
<xref rid="Tab2" ref-type="table">2</xref>
). The resulting contigs ranged in size from 31 bp (corresponding to an orphaned unextended
<italic>k</italic>
-mer) to 3.6 kbp, with a major peak of short 31–40 bp contigs and a minor peak of contigs around 61 bp (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a).
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>Specificity of differentially expressed contigs.
<bold>a</bold>
Density plot of contig lengths for mapped and unmapped contigs. The red line indicates contigs built from
<italic>k</italic>
(here, 31) overlapping
<italic>k</italic>
-mers and likely corresponding to SNVs.
<bold>b</bold>
Mismatch ratio (number of mismatches/contig size) as a function of contig length.
<bold>c</bold>
Number of hits in the reference genome as a function of contig length. The
<bold>b</bold>
and
<bold>c</bold>
curves were obtained using a smoothing function. SNV single-nucleotide variation</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig5_HTML" id="MO5"></graphic>
</fig>
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>DE-kupl parameters and resources used for analyzing epithelial–mesenchymal transition data (12 libraries) using the
<italic>t</italic>
-test or DESeq2 method (GENCODE masking)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Parameter/resources</th>
<th align="left" colspan="3">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">nb_threads</td>
<td align="center" colspan="3">8</td>
</tr>
<tr>
<td align="left">min_recurrence</td>
<td align="center" colspan="3">6</td>
</tr>
<tr>
<td align="left">min_recurrence_abundance</td>
<td align="center" colspan="3">5</td>
</tr>
<tr>
<td align="left">pvalue_threshold</td>
<td align="center" colspan="3">0.05</td>
</tr>
<tr>
<td align="left">lib_type</td>
<td align="center" colspan="3">Stranded</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>t</italic>
-test</td>
<td align="left" colspan="2">DESeq2</td>
</tr>
<tr>
<td align="left">Maximum memory usage</td>
<td align="left">54 GB</td>
<td align="left" colspan="2">53 GB</td>
</tr>
<tr>
<td align="left">Maximum disk used (1)</td>
<td align="left">7 GB</td>
<td align="left" colspan="2">42 GB</td>
</tr>
<tr>
<td align="left">Running time (1)</td>
<td align="left">4 h 2 m</td>
<td align="left" colspan="2">6 h 33 m</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>(1) excluding reference genome and transcriptome indexing for the annotation step</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>DE-kupl pipeline results for the epithelial–mesenchymal transition experiment</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Files</th>
<th align="left" rowspan="2">Description</th>
<th align="left" colspan="2">Number of
<italic>k</italic>
-mers or contigs</th>
<th align="left" colspan="2">Sizes</th>
</tr>
<tr>
<th align="left">
<italic>t</italic>
-test</th>
<th align="left">DESeq2</th>
<th align="left">
<italic>t</italic>
-test</th>
<th align="left">DESeq2</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">raw_counts (no filter)</td>
<td align="left">Matrix of
<italic>k</italic>
-mer counts from all libraries</td>
<td align="center" colspan="2">707,067,278</td>
<td align="center" colspan="2">(not generated)</td>
</tr>
<tr>
<td align="left">filtered_counts.tsv.gz</td>
<td align="left">Matrix of all
<italic>k</italic>
-mer counts from all libraries with recurrence filters</td>
<td align="center" colspan="2">92,525,450</td>
<td align="center" colspan="2">1.9 GB</td>
</tr>
<tr>
<td align="left">masked-counts.tsv.gz</td>
<td align="left">Matrix of counts after GENCODE masking</td>
<td align="center" colspan="2">40,398,848</td>
<td align="center" colspan="2">728 MB</td>
</tr>
<tr>
<td align="left">diff-counts.tsv.gz</td>
<td align="left">Counts after the differential expression test, filtered on adjusted
<italic>P</italic>
value</td>
<td align="left">3,813,418</td>
<td align="left">6,102,447</td>
<td align="left">186 MB</td>
<td align="left">510 MB</td>
</tr>
<tr>
<td align="left">merged-diff-counts.tsv.gz</td>
<td align="left">Differentially expressed
<italic>k</italic>
-mers assembled into contigs</td>
<td align="left">133,690</td>
<td align="left">169,613</td>
<td align="left">3.0 MB</td>
<td align="left">18 MB</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>This is a description of output files sequentially generated by DE-kupl. The numbers of
<italic>k</italic>
-mers and contigs correspond to the number of lines in each file</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Almost all (99.2 %) of the 133k DE contigs mapped to the human genome. Mapping revealed that most 61 bp contigs result from the assembly of 31 overlapping
<italic>k</italic>
-mers that each harbor the same SNV, at one of the 31 possible positions within the
<italic>k</italic>
-mer. This phenomenon also causes a higher mismatch ratio for contigs around 61 bp (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
b). Contigs that do not map to the human genome are generally shorter than mapped contigs (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a), indicating a lower signal-to-noise ratio in unmapped contigs. As expected, shorter mapped contigs tend to map at multiple loci more often than longer ones (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
c). However, 80 % of all contigs are uniquely mapped (not shown).</p>
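<p>The 61 bp peak follows from simple arithmetic: a single-nucleotide variant is covered by k distinct k-mers, and merging them spans 2k-1 nucleotides, i.e., 61 nt for k=31. The short Python sketch below makes this explicit on a toy sequence. The function names are hypothetical and the reference is assumed to contain no repeated k-mers around the variant.</p>
<preformat>
```python
def kmers(seq, k):
    """List all k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def snv_contig(reference, position, alt_base, k):
    """Return the novel k-mers created by a SNV and the contig they span."""
    variant = reference[:position] + alt_base + reference[position + 1:]
    ref_set = set(kmers(reference, k))
    # Only windows overlapping the substituted base can differ from the reference.
    novel = [km for km in kmers(variant, k) if km not in ref_set]
    # Those windows start between position-k+1 and position (clipped at the ends).
    start = max(0, position - k + 1)
    end = min(len(variant), position + k)
    return novel, variant[start:end]
```
</preformat>
<p>For a variant located at least k-1 nucleotides away from both ends of the transcript, the returned contig has length 2k-1. Variants closer to a transcript end yield shorter contigs.</p>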
<p>Analysis of contig locations reveals distinct contig classes. Most contigs are in annotated introns and exons (Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
). However, intronic contigs are predominantly exact matches while exonic contigs are predominantly mismatched. This is due to reference transcript masking: contigs with exact matches to introns are usually not masked, as they do not pertain to a reference transcript, while contigs that match exons are filtered out unless they differ from the reference. This difference might be in the form of SNVs, or through exons extending into flanking intergenic or intronic regions. By the same rationale, contigs mapping to intergenic and antisense regions are depleted in SNVs (Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
), consistent with their location in unannotated lncRNAs and antisense RNAs, while contigs overlapping exon–exon junctions behave like exonic contigs (with a high rate of SNV). However, a significant fraction of exon junction contigs are exact matches, indicating they may correspond to novel junctions.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Genomic location of differentially expressed contigs. Contigs are separated by genomic location, according to their overlap with exons, exon–exon junctions, introns, antisense regions of annotated genes, or intergenic regions. Right: Total number of contigs in each class. Left: Contig distribution according to their alignment status. Contigs with a single mapping location are labeled as a perfect match, one mismatch, or multi mismatches. Contigs with multiple mapping locations are labeled as multi-map. nb number of</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
</sec>
<sec id="Sec6">
<title>Assigning contigs to biological events</title>
<p>We assigned DE contigs generated from the EMT data set to 11 classes of potential biological events, using the rule set described in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. Since intragenic DE contigs may result from a mere over- or under-expression of their host gene and do not necessarily reflect a differential usage (DU) of transcript isoforms, we implemented a simple strategy to distinguish between the two situations based on the expression level of the host gene (see “
<xref rid="Sec20" ref-type="sec">Methods</xref>
”). We made this distinction for splicing, polyadenylation, SNVs, and intron retention (Table 
<xref rid="Tab3" ref-type="table">3</xref>
).
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Assignment rules for differentially expressed contigs</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="14">Conditions</th>
</tr>
<tr>
<th align="left">Event class</th>
<th align="left">DU
<italic>P</italic>
value</th>
<th align="left">Number of junctions</th>
<th align="left">Maps gene</th>
<th align="left">Maps antisense gene</th>
<th align="left">Clipped 3’</th>
<th align="left">Is mapped</th>
<th align="left">SNV</th>
<th align="left">Exonic</th>
<th align="left">Intronic</th>
<th align="left">Number of hits</th>
<th align="left">Contig length</th>
<th align="left">Other rules</th>
<th align="left">Contigs</th>
<th align="left">Loci</th>
</tr>
</thead>
<tbody>
<tr>
<td align="justify">Splicing</td>
<td align="justify"></td>
<td align="justify">>0</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">F</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1879</td>
<td align="left">1280</td>
</tr>
<tr>
<td align="justify">Splicing DU</td>
<td align="justify">&lt;0.01</td>
<td align="justify">>0</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">F</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">391</td>
<td align="left">345</td>
</tr>
<tr>
<td align="justify">PolyA</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">≥5</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify">105</td>
<td align="left">95</td>
</tr>
<tr>
<td align="justify">PolyA DU</td>
<td align="justify">&lt;0.01</td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify">≥5</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify">9</td>
<td align="left">8</td>
</tr>
<tr>
<td align="justify">lincRNA</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">F</td>
<td align="justify">F</td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify">>200</td>
<td align="justify"></td>
<td align="justify">1061</td>
<td align="left">329</td>
</tr>
<tr>
<td align="justify">asRNA</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">F</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify">>200</td>
<td align="justify"></td>
<td align="justify">479</td>
<td align="left">180</td>
</tr>
<tr>
<td align="justify">SNV DU</td>
<td align="justify">&lt;0.01</td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">F</td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">T</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify">2</td>
<td align="justify">929</td>
<td align="left">680</td>
</tr>
<tr>
<td align="justify">Intron</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">F</td>
<td align="justify">0</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify">3</td>
<td align="justify">49897</td>
<td align="left">6689</td>
</tr>
<tr>
<td align="justify">Intron DU</td>
<td align="justify"><0.01</td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">F</td>
<td align="justify">0</td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify">1</td>
<td align="justify"></td>
<td align="justify">3</td>
<td align="justify">10688</td>
<td align="left">3128</td>
</tr>
<tr>
<td align="justify">Repeats</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">T</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">≥5</td>
<td align="justify">>50</td>
<td align="justify">4</td>
<td align="justify">1136</td>
<td align="left">612</td>
</tr>
<tr>
<td align="justify">Unmapped</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">F</td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify"></td>
<td align="justify">>50</td>
<td align="justify"></td>
<td align="justify">112</td>
<td align="left"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Each class of event is defined by a set of rules applied to annotated contigs. Other rules refers to the following: (1) contig ends with AAAAA, (2) mean counts >20 in at least one condition and mapped region <10 kb, (3) mapped region <10 kb, and (4) the mapped gene is not differentially expressed. Contigs indicates the number of contigs of each class found in the epithelial–mesenchymal transition experiment. Loci is the number of loci implicated by these contigs (see “
<xref rid="Sec20" ref-type="sec">Methods</xref>
”)</p>
<p>
<italic>asRNA</italic>
antisense RNA,
<italic>DU</italic>
differential usage,
<italic>F</italic>
false,
<italic>lincRNA</italic>
long intergenic non-coding RNA,
<italic>PolyA</italic>
polyadenylated,
<italic>SNV</italic>
single-nucleotide variation,
<italic>T</italic>
true</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>From the total set of 133k DE contigs (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
), we extracted about 76,000 contigs matching our rule set for at least one event class (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). Note that certain events generate multiple contigs. We, thus, further grouped contigs into loci (defined as independent annotated genes or intergenic regions harboring one or more contigs) (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). We describe below the main classes of events identified.</p>
<sec id="Sec7">
<title>Differential splicing</title>
<p>An analysis of split-mapped contigs found evidence of potentially novel differential splice variants in 1879 contigs (Table 
<xref rid="Tab3" ref-type="table">3</xref>
, Fig. 
<xref rid="Fig7" ref-type="fig">7</xref>
a–c). Furthermore, 391 of these contigs were classified as DU, suggesting that differential splicing at these sites may not be a consequence of DE of the whole gene. Surprisingly, these novel events include a number of subtle variations at 5
<sup>′</sup>
and 3
<sup>′</sup>
splice sites with 3–15 bp difference from the annotated reference, which escaped prior annotation (see, e.g., Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2).
<fig id="Fig7">
<label>Fig. 7</label>
<caption>
<p>Examples of differentially expressed contigs. Sashimi plots generated from Integrative Genomic Viewer (IGV) using read alignments produced with STAR [
<xref ref-type="bibr" rid="CR52">52</xref>
]. Sample SRR2966453 from condition D0 is labeled with E (epithelial). Sample SRR2966474 from condition D7 is labeled with M (mesenchymal). Annotations from GENCODE and DE-kupl differentially expressed contigs are shown at the bottom of each frame.
<bold>a</bold>
New splicing variant involving an unannotated exon, overexpressed in condition E.
<bold>b</bold>
Tandem repeat at chr8:143,204,870-143,206,916 (red region) that is overexpressed in condition M vs E. Note that the overexpressed tandem repeat is part of a larger overexpressed unannotated locus.
<bold>c</bold>
A novel long intergenic non-coding RNA overexpressed in condition E.
<bold>d</bold>
A novel antisense RNA. RNA-seq reads are aligned in the forward orientation while the gene at this locus is in the reverse orientation. The annotated gene is not expressed. E epithelial, M mesenchymal</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig7_HTML" id="MO7"></graphic>
</fig>
</p>
</sec>
<sec id="Sec8">
<title>Differential polyadenylation</title>
<p>We extracted all contigs aligned with five or more clipped (i.e., non-reference) bases at their 3
<sup>′</sup>
end, and containing five or more trailing A’s. Out of 140 such polyA-terminated contigs, 105 (75 %) contained an AATAAA or variant polyadenylation signal (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1), indicating they result from actual polyadenylated transcripts (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). Note these are not necessarily novel polyadenylation sites since polyadenylated transcripts always create
<italic>k</italic>
-mers that differ from the reference transcriptome and are, hence, retained by DE-kupl. Indeed, only six of the 105 polyA contigs mapped to intergenic regions. Furthermore, nine polyA contigs were classified as differentially used between the two conditions (Table 
<xref rid="Tab3" ref-type="table">3</xref>
and Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1). Altogether this analysis demonstrates that DE-kupl can capture bona fide polyadenylated transcripts present in the sequencing reads and polyadenylation sites with possible DU.</p>
</sec>
<sec id="Sec9">
<title>LincRNA</title>
<p>We identified a subset of 1061 DE contigs (329 loci) corresponding to potential lincRNAs (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). The criteria for lincRNAs were contigs of size >200 nt mapped to an intergenic locus. Visual inspection revealed clear lincRNA-like patterns, with contigs clustered into well-defined transcription units with abundant read coverage and evidence of splicing (Fig. 
<xref rid="Fig7" ref-type="fig">7</xref>
c, Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3). DE-kupl is, thus, an effective tool for the identification of novel DE lincRNAs.</p>
</sec>
<sec id="Sec10">
<title>Antisense RNAs</title>
<p>When DE-kupl is applied to stranded RNA-seq libraries (as with the EMT libraries used in this study), the resulting contigs are strand-specific and can, thus, be used for identifying antisense RNAs and for disambiguating loci with intricate expression on both strands. We identified 479 contigs from 180 loci mapping to the reverse strand of an annotated gene (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). These antisense RNAs include very strong cases of DE (Fig. 
<xref rid="Fig7" ref-type="fig">7</xref>
d), sometimes combined with apparent repression of the sense gene (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4).</p>
</sec>
<sec id="Sec11">
<title>Allele-specific expression</title>
<p>As DE-kupl quantifies every SNV-containing
<italic>k</italic>
-mer, we set out to exploit this capacity to identify potential allele-specific expression events. We extracted all contigs including an SNV (either a base substitution or indel) and for which DU was predicted (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). This procedure was less than ideal, as we did not explicitly test for a switch in allelic balance between the two conditions. Yet, among the 929 contigs identified, some appeared to display strong apparent changes in allelic balance between the E and M conditions (e.g., Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S5). The ability of DE-kupl to capture differential SNV between data sets may be particularly relevant when looking for recurrent mutations in subpopulations.</p>
</sec>
<sec id="Sec12">
<title>Intron retention and other intronic events</title>
<p>As highly expressed transcripts often carry intronic by-products, we expected DE-kupl to identify many parasitic intronic contigs. Indeed, 49,897 contigs mapped to intronic loci (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). We, thus, focused on intronic
<italic>k</italic>
-mers for which DU was predicted, indicating intron retention events. This filter identified 10,688 intronic contigs from 3128 different genes. Inspection of the read mapping at these loci revealed clear instances of novel skipped or extended exons (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6), as well as cases where a specific short intronic region was DE, reminiscent of the pattern observed for intronic processed microRNAs and small nucleolar RNAs [
<xref ref-type="bibr" rid="CR35">35</xref>
] (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S7). DE-kupl can, therefore, be used for screening a wide variety of exon and intron processing events in addition to alternative splicing.</p>
</sec>
<sec id="Sec13">
<title>Expressed repeats</title>
<p>Assessing the expression of human repeats by conventional RNA-seq analysis protocols is difficult, as ambiguous alignments render repeat regions unmappable [
<xref ref-type="bibr" rid="CR36">36</xref>
]. Since DE-kupl first measures expression independently of mapping, we were able to collect and analyze differential contigs with multiple genome hits. We found that 7521 contigs larger than 50 nt have multiple hits (data not shown), and 1136 are repeated more than 5 times (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). RepeatMasker [
<xref ref-type="bibr" rid="CR37">37</xref>
] found 693 out of these 1136 sequences to match known repeats, mostly long interspersed nuclear elements, long terminal repeats, and short interspersed nuclear elements (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S8). Further inspection showed that most of the remaining multiple-hit contigs correspond to unannotated repeats or low-complexity regions. One of the most striking differential repeats is an unannotated 22×66 bp tandem repeat, located about 2 Mbp from the chromosome 8 telomere. This repeat is about 50-fold overexpressed in the mesenchymal condition (Fig. 
<xref rid="Fig7" ref-type="fig">7</xref>
b, Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S9). These results indicate DE-kupl can serve as a screen for DE or activation of endogenous viral sequences and other repeat-containing transcripts.</p>
</sec>
<sec id="Sec14">
<title>Unmapped contigs</title>
<p>Finally, we analyzed DE contigs that did not map to the human genome. Unmapped contigs may result from transcripts produced by rearranged genes or by exogenous viral genomes and could, thus, be highly relevant biologically. In principle, DE-kupl is able to detect such events when levels of RNA vary across samples. In this test set, where all samples come from an in vitro cell line, we did not expect to observe this phenomenon. Indeed, out of 112 unmapped contigs of size >50 bp (Table 
<xref rid="Tab3" ref-type="table">3</xref>
), the vast majority (76 %) correspond to vector sequences overexpressed in the M condition (data not shown), indicating that these contigs come from the expression vector used for EMT induction. The remaining unmapped contigs correspond to a GA tandem repeat and several non-human primate sequences.</p>
</sec>
</sec>
<sec id="Sec15">
<title>Impact of transcriptome masking</title>
<p>Using GENCODE as a reference transcriptome removed about half of the
<italic>k</italic>
-mers (Table 
<xref rid="Tab2" ref-type="table">2</xref>
). We analyzed the impact of using different reference transcriptomes on differential
<italic>k</italic>
-mer and contig calls. We ran DE-kupl on the EMT data set using a lightweight masking transcriptome limited to major transcripts (1 transcript/gene, see “
<xref rid="Sec20" ref-type="sec">Methods</xref>
”) and in the absence of masking (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S2). Masking with the lightweight transcriptome had a moderate impact on the number of DE
<italic>k</italic>
-mers and contigs (1.6- and 1.4-fold increase, respectively). However, a complete bypass of the masking procedure caused a large increase in DE
<italic>k</italic>
-mers and contigs (3.4- and 2.4-fold, respectively). Importantly, less stringent masking produced longer contigs (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S10) and a higher number of detected events, especially in the splicing and intron categories (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S3). These results indicate that, in a typical DE-kupl use case, lightweight masking may be the preferred option, returning a higher number of events for little additional computational cost.</p>
</sec>
<sec id="Sec16">
<title>Comparison with specialized tools</title>
<p>We compared DE-kupl events with predictions from two specialized tools. Since DE-kupl reports only events with DE, the protocols compared should involve an event-calling stage combined with a differential filter. IRFinder [
<xref ref-type="bibr" rid="CR19">19</xref>
] and KisSplice [
<xref ref-type="bibr" rid="CR15">15</xref>
] predict intron retention and de novo differential splicing events, respectively. Both pieces of software report changes in relative inclusion, i.e., variants whose proportions vary between conditions. Therefore, their results can be compared with differential (DU) introns and splice sites from DE-kupl. After running IRFinder on the EMT data set, we observed a strong enrichment in IRFinder predictions among the top DE-kupl intron retention events (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S11). Conversely, 68 % of IRFinder intron retention events were predicted by DE-kupl as intron DU (DU
<italic>p</italic>
value <0.05) and this fraction rose to 80 % among the 100 top ranking IRFinder predictions (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S4). A comparison with KisSplice showed a similar enrichment in KisSplice predictions among the top DE-kupl splice events (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S12). While only 36.4 % of all KisSplice predictions were present among the total DE-kupl splice events, DE-kupl predicted as splice DU (splice events with DU) 82 of the top 100 KisSplice predictions (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S4). These results suggest DE-kupl is able to recall the majority of top ranking predictions made by two specialized tools.</p>
</sec>
<sec id="Sec17">
<title>DE-kupl event detection reproduced across independent data sets</title>
<p>We sought independent validation of DE-kupl findings with two distinct human RNA-seq data sets, from the Genotype-Tissue Expression (GTEx) [
<xref ref-type="bibr" rid="CR38">38</xref>
] and the Human Protein Atlas (HPA) [
<xref ref-type="bibr" rid="CR29">29</xref>
]. DE contigs were first obtained by running DE-kupl on eight colon vs eight skin libraries from GTEx. Events were classified as above into intron retentions, lincRNAs, polyadenylation sites, repeats, splice sites, and unmapped. The 100 top events from each class (50 for class unmapped) were extracted and their
<italic>k</italic>
-mer labels saved as a sequence file. We then counted the occurrence of each
<italic>k</italic>
-mer in the colon and skin libraries from the HPA project and applied DESeq2 [
<xref ref-type="bibr" rid="CR33">33</xref>
] to evaluate the significance of the expression change between colon and skin (see “
<xref rid="Sec20" ref-type="sec">Methods</xref>
”). Altogether, 79 % of the 550 DE
<italic>k</italic>
-mers identified in the GTEx libraries were also significantly DE in the HPA data (Fig. 
<xref rid="Fig8" ref-type="fig">8</xref>
). Each event class showed clear reproducibility, with particularly strong effects for lincRNAs and splice variants. This demonstrates that novel events identified by DE-kupl are reproducible across independent data sets despite independent RNA extraction, library preparation, and sequencing protocols.
<fig id="Fig8">
<label>Fig. 8</label>
<caption>
<p>Validation of DE-kupl events across independent data sets. Altogether, 550 differentially expressed contigs from six different event classes (intron with differential usage, lincRNA, polyA site, repeat, splice site, and unmapped) were identified using DE-kupl on GTEx libraries from two human tissues (skin and colon). A representative
<italic>k</italic>
-mer from each contig was then tested for differential expression in the skin and colon libraries in the Human Protein Atlas. Box plots represent distributions of DESeq2 adjusted
<italic>p</italic>
values for all
<italic>k</italic>
-mers in the different classes. The red line shows the adjusted
<italic>p</italic>
value cutoff of 0.05. DU differential usage, lincRNA long intergenic non-coding RNA, padj adjusted
<italic>p</italic>
value, polyA polyadenylation</p>
</caption>
<graphic xlink:href="13059_2017_1372_Fig8_HTML" id="MO8"></graphic>
</fig>
</p>
</sec>
</sec>
<sec id="Sec18" sec-type="discussion">
<title>Discussion</title>
<p>In contrast to popular RNA-seq analysis software, DE-kupl does not attempt full-length transcript assignment or assembly but focuses on local transcript variations instead. Indeed, we do not consider full-length transcript analysis to be realistic when screening for unspecified RNA variation, since the combinatorial nature of genomic, transcriptomic, and post-transcriptomic events would require an indefinitely expanding transcript catalog. In this sense, DE-kupl is closer in spirit to methods analyzing local RNA-seq coverage such as RNAprof [
<xref ref-type="bibr" rid="CR39">39</xref>
] and DERfinder [
<xref ref-type="bibr" rid="CR40">40</xref>
], with the notable difference that DE-kupl does not involve mapping and, thus, avoids mapping-related pitfalls while considerably widening the range of detectable events. Another important benefit of the
<italic>k</italic>
-mer strategy is that
<italic>k</italic>
-mers representing events of interest can be used efficiently to assess the occurrence of similar events in the huge public compendium of RNA-seq data.</p>
<p>In this proof-of-concept study, we analyzed RNA-seq libraries from a small number of individuals and from a single cell line. We expect
<italic>k</italic>
-mer diversity to rise significantly with the number of individuals included in the analysis. However, preliminary tests with over 100 libraries from The Cancer Genome Atlas [
<xref ref-type="bibr" rid="CR41">41</xref>
] show a sublinear growth in the number of
<italic>k</italic>
-mers with the number of libraries (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S13), which suggests there is good scalability of the DE-kupl concept. Analysis of large-scale patient RNA-seq data opens exciting perspectives. For instance, the ability of DE-kupl to detect genetic variation and RNA expression/processing events simultaneously may serve as a basis for studying genotype/phenotype relations. Analysis of patient RNA-seq data may also reveal event classes not studied in this work, such as fusion transcripts and circular RNAs.</p>
</sec>
<sec id="Sec19" sec-type="conclusion">
<title>Conclusion</title>
<p>
<italic>k</italic>
-mer decomposition followed by filtering, masking, and DE analysis is a novel way of analyzing RNA-seq data. It can detect a wider spectrum of transcript variation than previous protocols. DE-kupl explores all
<italic>k</italic>
-mers in the input RNA-seq files (vs only
<italic>k</italic>
-mers from annotated transcripts in recent software [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]), which potentially requires substantial computational time and memory resources. Using the Jellyfish
<italic>k</italic>
-mer indexing software and C-programming code for the key table manipulation, we achieved time/memory requirements on par with popular mapping-based software for similarly sized data sets. A key aspect of our protocol that rendered a full
<italic>k</italic>
-mer analysis tractable was the application of successive filters for rare
<italic>k</italic>
-mers, reference transcripts, and DE, which altogether resulted in a 200-fold reduction in
<italic>k</italic>
-mer counts. These filters are not only useful for technical considerations (they reduce run times and enable us to get rid of most sequence errors), but they also allow the user to focus on
<italic>k</italic>
-mers that (i) vary significantly between the conditions under study and (ii) encompass events that would not be captured by conventional reference-based protocols.</p>
<p>We showed that DE-kupl is able to detect a wide range of differential transcription and RNA processing events. Although specialized software may perform better at assessing specific event classes, such as differential splicing, no method known to the authors provides such a comprehensive screen. As differential RNA-seq analysis is often conducted with an exploratory spirit, we argue that it is preferable to cast a wide net with no preconceptions for target events, using DE-kupl along with a conventional gene-by-gene DE analysis. Note that DE-kupl might also be an interesting option for exploring other types of next-generation sequencing data, such as small RNA-seq, ChIP-seq, or whole-exome/genome sequencing, after adjusting its parameters and event annotation rules.</p>
</sec>
<sec id="Sec20">
<title>Methods</title>
<sec id="Sec21">
<title>Characterization of
<italic>k</italic>
-mer diversity in human RNA-seq libraries</title>
<p>RNA-seq data for bone marrow, skin, and colon from 18 individuals (six replicates per tissue) were retrieved from the HPA project [
<xref ref-type="bibr" rid="CR29">29</xref>
] (E-MTAB-2836). We counted
<italic>k</italic>
-mers in each RNA-seq and reference sequence set using Jellyfish (2.2.6), with options
<italic>k</italic>
=32 and -C (canonical
<italic>k</italic>
-mers). The
<italic>k</italic>
-mer list for each tissue (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
a, b) was produced by merging counts for all six samples and conserving only those found in all replicates.</p>
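<p>The counting and merging step described above can be sketched as follows. This is a minimal Python illustration and not the Jellyfish implementation: the function names are hypothetical, reads are assumed to be plain strings, and a canonical k-mer is taken to be the lexicographically smaller of a k-mer and its reverse complement.</p>

```python
from collections import Counter

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads, k):
    """Count canonical k-mers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
            if set(kmer).issubset("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def merge_replicates(per_sample_counts):
    """Sum counts over samples, keeping only k-mers seen in every sample."""
    shared = set(per_sample_counts[0])
    for counts in per_sample_counts[1:]:
        shared.intersection_update(counts)
    return {kmer: sum(c[kmer] for c in per_sample_counts) for kmer in shared}


# Toy usage with k=3 instead of the k=32 used in the study.
sample_1 = count_canonical_kmers(["ACGTA"], 3)
sample_2 = count_canonical_kmers(["TACG"], 3)
tissue_kmers = merge_replicates([sample_1, sample_2])
```

<p>In this toy run, ACG and its reverse complement CGT are counted under the same canonical key, which is the effect the -C option is meant to achieve.</p>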
<p>For mapping statistics (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">b3</xref>
), we extracted
<italic>k</italic>
-mers specific to each tissue and mapped them to the Ensembl 86 transcript reference using Bowtie (version 1.1.2). Unmapped
<italic>k</italic>
-mers were mapped a second time with Bowtie to the GRCh38 genome reference. Reads with three or more mismatches are not mapped by Bowtie and, therefore, are considered as unmapped.</p>
<p>The intersection of
<italic>k</italic>
-mers between RNA-seq and WGS data (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
) is based on the transcriptome and genome of lymphoblastoid cell lines [
<xref ref-type="bibr" rid="CR30">30</xref>
].
<italic>k</italic>
-mers were counted in these libraries with the same procedure as above. To reduce noise from sequencing errors,
<italic>k</italic>
-mers with only one occurrence were filtered out.</p>
</sec>
<sec id="Sec22">
<title>DE-kupl implementation</title>
<p>The DE-kupl pipeline (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S14) is implemented using the Snakemake [
<xref ref-type="bibr" rid="CR42">42</xref>
] workflow manager (v3.10.1). A configuration file specifies the location of FASTQ files, the condition of each sample, and global parameters such as
<italic>k</italic>
-mer length, CPU number, maximum memory, and other parameters for each step of the pipeline, as described hereinafter.</p>
<sec id="Sec23">
<title>
<italic>k</italic>
-mer counting</title>
<p>Raw sequences (FASTQ files) are first processed with the jellyfish count command of the Jellyfish software, which produces one index (a disk representation of the Jellyfish hash table) for each sequence library. For stranded RNA-seq libraries, reads in the reverse direction relative to the transcript are reverse-complemented, ensuring the proper orientation of
<italic>k</italic>
-mers. At this point, for each library, only
<italic>k</italic>
-mers having at least two occurrences are recorded (a user-defined parameter). Once a Jellyfish index is built, we use the jellyfish dump command to output the raw counts in a two-column text file, which contains at each line a
<italic>k</italic>
-mer and its frequency of occurrence. Raw counts are then sorted alphabetically by
<italic>k</italic>
-mer sequence with the Unix sort command.</p>
</sec>
<sec id="Sec24">
<title>
<italic>k</italic>
-mer filtering and masking</title>
<p>All sample counts are joined together using the dekupl-joinCounts binary to produce a single matrix with all
<italic>k</italic>
-mers and their abundance in all samples. Given an integer
<italic>a</italic>
≥0, we define the
<italic>recurrence</italic>
of a
<italic>k</italic>
-mer
<italic>x</italic>
as the number of samples where
<italic>x</italic>
appears more than
<italic>a</italic>
times, i.e.,
<disp-formula id="Equa">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\text{recurrence}(x,a) = \sum_{i=1}^{n} \mathbbm{1}_{\{x_{i} > a \}}, $$ \end{document}</tex-math>
<mml:math id="M2">
<mml:mrow>
<mml:mtext>recurrence</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>1</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>></mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="13059_2017_1372_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>n</italic>
is the total number of samples and
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
is the number of times the
<italic>k</italic>
-mer
<italic>x</italic>
appears in sample
<italic>i</italic>
. The
<italic>k</italic>
-mer filtering step involves two user-defined parameters (an integer min_recurrence_abundance and an integer min_recurrence), such that a
<italic>k</italic>
-mer
<italic>x</italic>
is filtered out if
<disp-formula id="Equb">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${{} \begin{aligned} &\text{recurrence}(x,\texttt{min\_recurrence\_abundance})\\\qquad& < \texttt{min\_recurrence}, \end{aligned}} $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mtext>recurrence</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext mathvariant="monospace">min_recurrence_abundance</mml:mtext>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="2em"></mml:mspace>
</mml:mtd>
<mml:mtd>
<mml:mo><</mml:mo>
<mml:mtext mathvariant="monospace">min_recurrence</mml:mtext>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="13059_2017_1372_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
i.e., if the
<italic>k</italic>
-mer
<italic>x</italic>
appears more than min_recurrence_abundance times in fewer than min_recurrence of the samples. Usually min_recurrence is set to the number of replicates in each condition, and min_recurrence_abundance is set to 5.</p>
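<p>The recurrence filter defined above can be written in a few lines. The sketch below is an illustration in Python, not the dekupl-joinCounts code; the function names are hypothetical and each k-mer is assumed to be represented by a list of its counts, one per sample.</p>

```python
def recurrence(counts, a):
    """Number of samples in which the k-mer appears more than `a` times."""
    return sum(1 for c in counts if c > a)


def passes_recurrence_filter(counts, min_recurrence_abundance, min_recurrence):
    """Return True if the k-mer is kept, False if it is filtered out.

    A k-mer is filtered out when its recurrence at the abundance threshold
    is below min_recurrence, so it is kept when the recurrence reaches
    min_recurrence.
    """
    return recurrence(counts, min_recurrence_abundance) >= min_recurrence


# Toy usage: three vs three libraries, thresholds as suggested in the text
# (min_recurrence = replicates per condition, min_recurrence_abundance = 5).
kept = passes_recurrence_filter([10, 7, 6, 0, 0, 1], 5, 3)
dropped = passes_recurrence_filter([6, 6, 5, 5, 0, 0], 5, 3)
```

<p>The second k-mer is dropped because a count of exactly 5 does not exceed the abundance threshold, leaving only two qualifying samples.</p>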
<p>The masking process uses the same Jellyfish-based procedure to create the set of
<italic>k</italic>
-mers appearing in the reference transcriptome and to subtract this set from the experimental
<italic>k</italic>
-mers. Masking can be performed using any reference transcriptome. Here, we use either GENCODE V.24 or a simplified transcriptome containing one major transcript per gene, built as follows. Principal transcripts for protein-coding genes are extracted from the APPRIS database [
<xref ref-type="bibr" rid="CR43">43</xref>
]. When several isoforms have the same principal level, the longest one is selected. All non-coding RNA transcripts are extracted from GENCODE and the longest transcript is retained when isoforms are present. The lightweight transcriptome, referred to as 1 transcript/gene, is produced by merging the protein-coding and non-coding RNA transcript sets.</p>
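<p>Conceptually, masking is a set subtraction. The following Python sketch illustrates the idea under simplifying assumptions: the function names are hypothetical, transcripts are plain strings, and strand handling and canonical forms are ignored. The actual pipeline performs this step with Jellyfish indexes.</p>

```python
def kmers_of(sequence, k):
    """Return the set of all k-mers present in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def mask(kmer_counts, reference_transcripts, k):
    """Remove every k-mer that occurs in at least one reference transcript.

    kmer_counts maps each experimental k-mer to its count(s); the values
    are passed through unchanged.
    """
    reference_kmers = set()
    for transcript in reference_transcripts:
        reference_kmers.update(kmers_of(transcript, k))
    return {
        kmer: counts
        for kmer, counts in kmer_counts.items()
        if kmer not in reference_kmers
    }


# Toy usage with k=4: ACGT occurs in the reference and is masked.
masked = mask({"ACGT": 5, "CGTA": 3}, ["AACGTT"], 4)
```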
</sec>
<sec id="Sec25">
<title>Differential
<italic>k</italic>
-mer expression</title>
<p>Prior to differential analysis, we compute normalization factors (NFs) using the median ratio method [
<xref ref-type="bibr" rid="CR44">44</xref>
] with the table of
<italic>k</italic>
-mers after the recurrence filter. For each sample, the NF is the median of the ratios between sample counts and counts of a pseudo-reference obtained by taking the geometric mean of each
<italic>k</italic>
-mer across all samples. To avoid dealing with the complete table of
<italic>k</italic>
-mers, we extracted a random subset of 30
<italic>%</italic>
of the
<italic>k</italic>
-mers and computed NFs for this subset. Computing NFs for the complete table of
<italic>k</italic>
-mers, for the table of
<italic>k</italic>
-mers after the recurrence filters and reference masking, or for the table of transcript abundances produced by Kallisto v0.43.0 [
<xref ref-type="bibr" rid="CR7">7</xref>
] resulted in similar values (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S15).</p>
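<p>The median ratio computation described above can be sketched as follows. This Python illustration is not the DE-kupl code: the function name is hypothetical, the count table is assumed to be a list of rows (one per k-mer, one column per sample), and k-mers with a zero count in any sample are skipped because their geometric mean is zero, which is an assumption borrowed from common practice.</p>

```python
import math
from statistics import median


def normalization_factors(count_matrix):
    """Compute one normalization factor per sample by the median ratio method."""
    n_samples = len(count_matrix[0])

    # Pseudo-reference: log of the geometric mean of each k-mer across samples.
    usable_rows = []
    log_geometric_means = []
    for row in count_matrix:
        if all(c > 0 for c in row):
            usable_rows.append(row)
            log_geometric_means.append(sum(math.log(c) for c in row) / n_samples)

    # For each sample, take the median of the ratios to the pseudo-reference.
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - g
            for row, g in zip(usable_rows, log_geometric_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy usage: the second sample is sequenced twice as deeply as the first.
factors = normalization_factors([[10, 20], [100, 200], [5, 10], [0, 7]])
```

<p>Working on a random subset of the rows, as done in the pipeline, only changes which rows are passed to the function.</p>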
<p>Two options are implemented for the differential analysis (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1). The first option is to apply a
<italic>t</italic>
-test for each
<italic>k</italic>
-mer on the log-transformed counts, normalized with the previously computed NF. Transformation of raw counts in conjunction with linear model analysis has been successfully used for differential analysis of counts [
<xref ref-type="bibr" rid="CR45">45</xref>
]. We perform the
<italic>t</italic>
-test independently on each
<italic>k</italic>
-mer and avoid complex variance modeling strategies to reduce the execution time of the analysis. The
<italic>t</italic>
-test option has been implemented in C in the dekupl-TtestFilter binary. Note that this
<italic>t</italic>
-test option is not appropriate for small samples [
<xref ref-type="bibr" rid="CR46">46</xref>
]. To increase the power of the analysis, in particular for small samples (typically fewer than six vs six libraries), we strongly advise the use of the second option based on a generalized linear model, implemented in the R package DESeq2 [
<xref ref-type="bibr" rid="CR33">33</xref>
]. On top of modeling raw counts (normalization or prior log-transformation of the counts is not required), this approach shares information across
<italic>k</italic>
-mers to improve variance estimation and the differential analysis results. However, given the large number of
<italic>k</italic>
-mers, we do not apply this approach to the complete matrix of
<italic>k</italic>
-mer counts. We divide the matrix of
<italic>k</italic>
-mer counts into random chunks of approximately equal size (around 1 million
<italic>k</italic>
-mers) and apply the DESeq2 model independently to each chunk. The previously computed NFs are supplied as input for every chunk rather than being recomputed per chunk. Raw
<italic>p</italic>
values, unadjusted for multiple testing, are collected as an output for each chunk, and merged into one single vector containing the raw
<italic>p</italic>
values for all
<italic>k</italic>
-mers to test. Subsequently, raw
<italic>p</italic>
values obtained from either the
<italic>t</italic>
-test or the DESeq2 test are adjusted for multiple comparisons using the Benjamini–Hochberg procedure [
<xref ref-type="bibr" rid="CR47">47</xref>
] and
<italic>k</italic>
-mers with adjusted
<italic>p</italic>
values above a user-set cutoff are filtered out.</p>
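<p>The final correction step can be sketched as follows: raw <italic>p</italic> values from all chunks are merged first and the Benjamini-Hochberg adjustment is applied once to the merged vector. This is an illustrative Python sketch; the function names and the dictionary layout are assumptions and do not reflect the DE-kupl code.</p>
<preformat>
```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_kmers(chunks, cutoff=0.05):
    """Merge per-chunk raw p values, adjust once, drop k-mers above cutoff.

    chunks: list of dicts mapping k-mer -> raw p value (one dict per chunk).
    """
    merged = {}
    for chunk in chunks:
        merged.update(chunk)
    kmers = list(merged)
    adjusted = benjamini_hochberg([merged[k] for k in kmers])
    # Keep k-mers whose adjusted p value does not exceed the cutoff.
    return {k: q for k, q in zip(kmers, adjusted) if cutoff >= q}
```
</preformat>
<p>Adjusting after the merge matters because the correction depends on the total number of tests; adjusting each chunk separately would use the chunk size instead.</p>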
</sec>
<sec id="Sec26">
<title>
<italic>k</italic>
-mer extension</title>
<p>DE
<italic>k</italic>
-mers that potentially overlap the same event (i.e., all
<italic>k</italic>
-mers overlapping a splice junction or SNV) are joined together using a technique inspired by de novo assembly. The
<italic>k</italic>
-mer extension procedure, called mergeTags, works as follows. We first identify all exact
<italic>k</italic>
−1 prefix–suffix overlaps between
<italic>k</italic>
-mers. We consider only
<italic>k</italic>
-mers that overlap with exactly one other
<italic>k</italic>
-mer, and merge all pairs of
<italic>k</italic>
-mers involved in such overlaps into
<italic>contigs</italic>
. For example, given a set of
<italic>k</italic>
-mers {ATG, TGA, TGC, CAT}, the following contigs are produced: {CATG, TGA, TGC}. We repeatedly merge contigs that overlap exactly over
<italic>k</italic>
−1 bp with exactly one other contig. We then repeat this extension process with
<italic>k</italic>
−2 exact prefix–suffix overlaps, using as input the contigs produced at the previous step, and so forth for increasing values of
<italic>i</italic>
such that
<italic>k</italic>
−
<italic>i</italic>
>15 bp. The effect of varying
<italic>i</italic>
on the final number of contigs is presented in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S16. A minimal overlap
<italic>k</italic>
−
<italic>i</italic>
=15 was empirically selected. Finally, a set of DE contigs is produced, each contig being labeled by its constituent
<italic>k</italic>
-mer of lowest
<italic>p</italic>
value. This extension procedure is implemented in C in the dekupl-mergeTags binary.</p>
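<p>The extension procedure can be illustrated with the simplified Python sketch below. It is not the dekupl-mergeTags code: the function names are hypothetical, reverse complements are ignored, contig labeling by lowest <italic>p</italic> value is omitted, and "unambiguous" is interpreted as a suffix that matches exactly one prefix while that prefix matches exactly one suffix. The sketch rebuilds its indexes after every merge, which is simple but slow.</p>
<preformat>
```python
from collections import defaultdict


def merge_once(seqs, overlap):
    """Merge one unambiguous pair overlapping over exactly `overlap` bases.

    Returns the updated set of sequences, or None if nothing can be merged.
    """
    by_prefix = defaultdict(list)
    by_suffix = defaultdict(list)
    for s in seqs:
        if len(s) > overlap:
            by_prefix[s[:overlap]].append(s)
            by_suffix[s[-overlap:]].append(s)
    for key in sorted(by_suffix):
        lefts = by_suffix[key]
        rights = by_prefix.get(key, [])
        # Merge only when the overlap is unique on both sides.
        if len(lefts) == 1 and len(rights) == 1 and lefts[0] != rights[0]:
            left, right = lefts[0], rights[0]
            merged = set(seqs) - {left, right}
            merged.add(left + right[overlap:])
            return merged
    return None


def merge_tags(kmers, min_overlap):
    """Extend k-mers into contigs with overlaps from k-1 down to min_overlap."""
    contigs = set(kmers)
    k = len(next(iter(contigs)))  # assumes all k-mers share the same length
    for overlap in range(k - 1, min_overlap - 1, -1):
        while True:
            merged = merge_once(contigs, overlap)
            if merged is None:
                break
            contigs = merged
    return contigs
```
</preformat>
<p>With the example from the text, merge_tags({"ATG", "TGA", "TGC", "CAT"}, 2) joins CAT and ATG into CATG and leaves TGA and TGC untouched, because the suffix TG of CATG matches two prefixes.</p>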
</sec>
<sec id="Sec27">
<title>Contig annotation</title>
<p>Finally, DE contigs are annotated to facilitate biological event identification. Contigs are first aligned using BLAST [
<xref ref-type="bibr" rid="CR48">48</xref>
] against Illumina adapters. Contigs matching these adapters are discarded. Retained contigs are further mapped to the reference Hg38 human genome using the GSNAP short read aligner [
<xref ref-type="bibr" rid="CR49">49</xref>
] (v2017-01-14), which provided the best speed/sensitivity ratio for aligning both short and long contigs in internal tests (data not shown). GSNAP is used with option -N 1 to enable identification of new splice junctions. Contigs not mapped by GSNAP are collected and re-aligned using BLAST.</p>
<p>Alignment characteristics are extracted from GSNAP and BLAST outputs. Alignment coordinates are compared with Ensembl (v86) annotations (in GFF3 format) using BEDTools [
<xref ref-type="bibr" rid="CR50">50</xref>
] and a set of locus-related features is extracted. The final set of annotated features (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S5) is reported in a contig summary table. The annotation procedure generates two additional files: a per locus summary of contigs (one line per genic or intergenic locus), and a BED file of contig locations that can be used as a display track in genome browsers. In the per locus table, a locus is defined as an annotated gene, the genomic region located on the opposite strand of an annotated gene, or the genomic region separating two annotated genes. The table records the number of contigs overlapping each locus as well as the contig with the lowest false discovery rate for this genomic interval.</p>
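<p>The per locus table can be thought of as a simple aggregation over annotated contigs. The Python sketch below is illustrative only: the tuple layout and field names are hypothetical, and each contig is assumed to have been assigned to a single locus beforehand.</p>
<preformat>
```python
def per_locus_summary(contigs):
    """Summarize contigs per locus.

    contigs: iterable of (contig_id, locus_id, adjusted_p) tuples.
    Returns a dict: locus_id -> contig count and lowest-FDR contig.
    """
    summary = {}
    for contig_id, locus_id, adj_p in contigs:
        entry = summary.setdefault(
            locus_id,
            {"n_contigs": 0, "best_contig": contig_id, "best_adj_p": adj_p},
        )
        entry["n_contigs"] += 1
        # Keep the contig with the lowest adjusted p value for this locus.
        if entry["best_adj_p"] > adj_p:
            entry["best_contig"] = contig_id
            entry["best_adj_p"] = adj_p
    return summary
```
</preformat>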
<p>Parallel to
<italic>k</italic>
-mer counting, filtering, and masking, we analyze the RNA-seq data libraries using a conventional DE protocol. Reads are processed with Kallisto [
<xref ref-type="bibr" rid="CR7">7</xref>
] to estimate transcript abundances. Transcript-level counts are then collapsed to the gene level and processed with DESeq2 [
<xref ref-type="bibr" rid="CR33">33</xref>
] to produce a set of DE genes. This information is stored in the contig summary table and used later to define events with DU (Table 
<xref rid="Tab3" ref-type="table">3</xref>
).</p>
</sec>
<sec id="Sec28">
<title>DE-kupl run on EMT data</title>
<p>DE-kupl was run using RNA-seq libraries from reference [
<xref ref-type="bibr" rid="CR34">34</xref>
]. The DE-kupl parameters were kmer_length 31, min_recurrence 6, min_recurrence_abundance 5, pvalue_threshold 0.05, lib_type stranded, and diff_method Ttest, with the GENCODE reference. Output files are provided in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
. The DE-kupl contig summary table was analyzed interactively using R commands to extract lists of contigs based on the filtering rules described in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. Visualization of selected contigs was performed with IGV [
<xref ref-type="bibr" rid="CR51">51</xref>
], using the BED file produced by DE-kupl and read mapping files produced by STAR [
<xref ref-type="bibr" rid="CR52">52</xref>
].</p>
<p>For comparison with KisSplice and IRFinder, DE-kupl was used with the same parameters as above, except for diff_method DeSeq2 and the 1-transcript/gene reference. KisSplice scripts were run in the following order: kissplice (v2.4.0) > kisstar (v2.5.3a) > kiss2ref (v1.0.0) > kissDE (v1.5.0). The final kissDE step provides the list of splice variant pairs with a significant change in percentage inclusion across conditions. IRFinder (v1.2.3) was run with parameters IR ratio > 0.1 and intron coverage > 10. IRFinder outputs a list of introns with differential inclusion levels across conditions. The outputs of both IRFinder and KisSplice were filtered to retain only events matching annotated genes.</p>
</sec>
<sec id="Sec29">
<title>Validation in independent data sets</title>
<p>DE-kupl was applied to eight skin and eight colon libraries from GTEx [
<xref ref-type="bibr" rid="CR38">38</xref>
] using parameters kmer_length 31, min_recurrence 6, min_recurrence_abundance 5, pvalue_threshold 0.05, lib_type unstranded, diff_method Ttest, and reference_transcriptome Gencode. DE-kupl contigs were interactively classified using R commands, applying the same rules as in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. Classes antisense RNA and SNV-DU were excluded since identification of antisense RNA is not possible using the unstranded GTEx and HPA libraries, and we had no reason to expect common SNVs with DU in this data set. DE contigs were sorted by fold-change. The
<italic>k</italic>
-mer labels of the top 100 DE contigs in each class were extracted (50 for the unmapped class, which contained fewer events). GTEx
<italic>k</italic>
-mers were then sought in the six skin and six colon libraries from HPA described above [
<xref ref-type="bibr" rid="CR29">29</xref>
] (E-MTAB-2836). The
<italic>k</italic>
-mers were counted in each library using Jellyfish with options
<italic>k</italic>
=31 and -C (canonical
<italic>k</italic>
-mers) as GTEx data were unstranded. All
<italic>k</italic>
-mers selected from the GTEx analysis were queried against the Jellyfish databases using the jellyfish query command. Finally, the extracted
<italic>k</italic>
-mer counts were processed with DESeq2 [
<xref ref-type="bibr" rid="CR33">33</xref>
] and the resulting adjusted
<italic>p</italic>
values were plotted for each event class (Fig. 
<xref rid="Fig8" ref-type="fig">8</xref>
).</p>
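<p>Canonical counting treats a <italic>k</italic>-mer and its reverse complement as the same sequence, represented by the lexicographically smaller of the two, which is why it suits unstranded libraries. The Python sketch below illustrates the idea on in-memory reads; it is not a substitute for Jellyfish, and the function names are chosen for this example.</p>
<preformat>
```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads, k):
    """Count canonical k-mers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) - set("ACGT"):
                continue
            counts[canonical(kmer)] += 1
    return counts
```
</preformat>
<p>For the read ACGT and k=3, the two windows ACG and CGT are reverse complements of each other, so both are counted under ACG.</p>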
</sec>
</sec>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec30">
<title>Additional file</title>
<p>
<media position="anchor" xlink:href="13059_2017_1372_MOESM1_ESM.pdf" id="MOESM1">
<label>Additional file 1</label>
<caption>
<p>Supplementary
<bold>Tables S1–S5</bold>
, Supplementary
<bold>Figures S1–S16</bold>
. (PDF 3112 kb)</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<fn-group>
<fn>
<p>
<bold>Electronic supplementary material</bold>
</p>
<p>The online version of this article (doi:10.1186/s13059-017-1372-2) contains supplementary material, which is available to authorized users.</p>
</fn>
</fn-group>
<ack>
<p>We thank Damien Drubay for useful statistical discussions, William Ritchie and Lucile Broseus for running IRFinder, Haoliang Xue and Thibault Dayris for setting up the 1-transcript/gene reference transcriptome, and Jean-Marc Holder for English proofreading.</p>
<sec id="d29e2763">
<title>Funding</title>
<p>This project was supported by grants Plan Cancer – Systems Biology (bio2014-04) and Agence Nationale pour la Recherche “France Génomique” (ANR-10-INBS-0009) to DG, by Canceropole GSO to TC and by ANR “Investissement d’avenir en bioinformatique” to the Institute of Computational Biology. JA is a doctoral fellow of the Fondation pour la Recherche Medicale (FRM, 788BIOINFO2013 call, grant no. DBI20131228566).</p>
</sec>
<sec id="d29e2768">
<title>Availability of data and materials</title>
<p>HPA [
<xref ref-type="bibr" rid="CR29">29</xref>
] RNA-seq libraries were downloaded from the European Nucleotide Archive of the European Bioinformatics Institute [
<xref ref-type="bibr" rid="CR53">53</xref>
] (bone marrow: ERR315469, ERR315425, ERR315486, ERR315396, ERR315404, ERR315406; colon: ERR315348, ERR315403, ERR315357, ERR315484, ERR315400, ERR315462; skin: ERR315401, ERR315464, ERR315460, ERR315372, ERR315376, ERR315339). EMT RNA-seq libraries [
<xref ref-type="bibr" rid="CR34">34</xref>
] were retrieved from the Gene Expression Omnibus website [
<xref ref-type="bibr" rid="CR54">54</xref>
] under accession GSE75492 (libraries GSM1956974, GSM1956975, GSM1956976, GSM1956977, GSM1956978, GSM1956979 for stage E and GSM1956992, GSM1956993, GSM1956994, GSM1956995, GSM1956996, GSM1956997 for stage M). GTEx [
<xref ref-type="bibr" rid="CR38">38</xref>
] data were downloaded from the dbGaP website [
<xref ref-type="bibr" rid="CR55">55</xref>
] under authorization phs000178/GRU (skin library IDs: SRR1308800, SRR1309051, SRR1309767, SRR1310075, SRR1311040, SRR1351501, SRR1400467, SRR1479595; colon library IDs: SRR1316343, SRR1396146, SRR1397292, SRR1477732, SRR1488307, SRR807751, SRR812697, SRR819486). The reference GRCh38 genome and Ensembl 86 transcripts were downloaded from Ensembl. DE-kupl is distributed under the MIT license. The DE-kupl software, documentation, and supplemental material presented herein are available from
<ext-link ext-link-type="uri" xlink:href="https://transipedia.github.io/dekupl/">https://transipedia.github.io/dekupl/</ext-link>
. The DOI for the source version used in this article is
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.1065976">https://doi.org/10.5281/zenodo.1065976</ext-link>
.</p>
</sec>
<sec id="d29e2802">
<title>Ethical approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>JA, NP, RC, MS, MeG, TC, and DG designed the study and analyzed the results. JA, MaG, MeG, and JLC developed the code. ED performed the tests and produced the figures. DG and JA drafted the manuscript. All authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="d29e2813">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="d29e2818">
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<mixed-citation publication-type="other">Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012; 22(9):1760–74.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1101/gr.135350.111">https://doi.org/10.1101/gr.135350.111</ext-link>
.</mixed-citation>
</ref>
<ref id="CR2">
<label>2</label>
<mixed-citation publication-type="other">Nishikura K. Functions and regulation of RNA editing by ADAR deaminases. Ann Rev Biochem. 2010; 79:321–49.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1146/annurev-biochem-060208-105251">https://doi.org/10.1146/annurev-biochem-060208-105251</ext-link>
.</mixed-citation>
</ref>
<ref id="CR3">
<label>3</label>
<mixed-citation publication-type="other">Chen LL. The biogenesis and emerging roles of circular RNAs. Nat Rev Mol Cell Biol. 2016; 17(4):205–11.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nrm.2015.32">https://doi.org/10.1038/nrm.2015.32</ext-link>
.</mixed-citation>
</ref>
<ref id="CR4">
<label>4</label>
<mixed-citation publication-type="other">Kirchner S, Ignatova Z. Emerging roles of tRNA in adaptive translation, signalling dynamics and disease. Nat Rev Genet. 2015; 16(2):98–112.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nrg3861">https://doi.org/10.1038/nrg3861</ext-link>
.</mixed-citation>
</ref>
<ref id="CR5">
<label>5</label>
<mixed-citation publication-type="other">Dieci G, Preti M, Montanini B. Eukaryotic snoRNAs: a paradigm for gene expression flexibility. Genomics. 2009; 94(2):83–8.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.ygeno.2009.05.002">https://doi.org/10.1016/j.ygeno.2009.05.002</ext-link>
.</mixed-citation>
</ref>
<ref id="CR6">
<label>6</label>
<mixed-citation publication-type="other">Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinforma. 2011; 12:323.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/1471-2105-12-323">https://doi.org/10.1186/1471-2105-12-323</ext-link>
.</mixed-citation>
</ref>
<ref id="CR7">
<label>7</label>
<mixed-citation publication-type="other">Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016; 34(5):525–7.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nbt.3519">https://doi.org/10.1038/nbt.3519</ext-link>
.</mixed-citation>
</ref>
<ref id="CR8">
<label>8</label>
<mixed-citation publication-type="other">Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017; 14(4):417–19.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nmeth.4197">https://doi.org/10.1038/nmeth.4197</ext-link>
.</mixed-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>LL</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Evaluation and comparison of computational tools for RNA-seq isoform quantification</article-title>
<source>BMC Genomics</source>
<year>2017</year>
<volume>18</volume>
<issue>1</issue>
<fpage>583</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-017-4002-1</pub-id>
<pub-id pub-id-type="pmid">28784092</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soneson</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Matthes</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Nowicka</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Law</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Robinson</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage</article-title>
<source>Genome Biol</source>
<year>2016</year>
<volume>17</volume>
<issue>1</issue>
<fpage>12</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-015-0862-3</pub-id>
<pub-id pub-id-type="pmid">26813113</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Love</surname>
<given-names>MI</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Djebali</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Dobin</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Graveley</surname>
<given-names>BR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A benchmark for RNA-seq quantification pipelines</article-title>
<source>Genome Biol</source>
<year>2016</year>
<volume>17</volume>
<issue>1</issue>
<fpage>74</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-016-0940-1</pub-id>
<pub-id pub-id-type="pmid">27107712</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kanitz</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gypas</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Gruber</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Gruber</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zavolan</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data</article-title>
<source>Genome Biol</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>150</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-015-0702-5</pub-id>
<pub-id pub-id-type="pmid">26201343</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<mixed-citation publication-type="other">Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28(5):511–15.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nbt.1621">https://doi.org/10.1038/nbt.1621</ext-link>
.</mixed-citation>
</ref>
<ref id="CR14">
<label>14</label>
<mixed-citation publication-type="other">Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011; 29(7):644–52.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nbt.1883">https://doi.org/10.1038/nbt.1883</ext-link>
.</mixed-citation>
</ref>
<ref id="CR15">
<label>15</label>
<mixed-citation publication-type="other">Sacomoto GA, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot MF, et al. KisSplice: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinforma. 2012; 13(6):5.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/1471-2105-13-S6-S5">https://doi.org/10.1186/1471-2105-13-S6-S5</ext-link>
.</mixed-citation>
</ref>
<ref id="CR16">
<label>16</label>
<mixed-citation publication-type="other">Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, et al. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016; 33(24):4033–40.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btw575">https://doi.org/10.1093/bioinformatics/btw575</ext-link>
.</mixed-citation>
</ref>
<ref id="CR17">
<label>17</label>
<mixed-citation publication-type="other">Vitting-Seerup K, Sandelin A. The landscape of isoform switches in human cancers. Mol Cancer Res. 2017; 15(9):1206–20.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1158/1541-7786.MCR-16-0459">https://doi.org/10.1158/1541-7786.MCR-16-0459</ext-link>
.</mixed-citation>
</ref>
<ref id="CR18">
<label>18</label>
<mixed-citation publication-type="other">Birol I, Raymond A, Chiu R, Nip KM, Jackman SD, Kreitzman M, et al. Kleat: cleavage site analysis of transcriptomes. In: Pacific Symposium on Biocomputing: 2015. p. 347.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1142/9789814644730_0034">https://doi.org/10.1142/9789814644730_0034</ext-link>
.</mixed-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Middleton</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Au</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>JJ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>IRFinder: assessing the impact of intron retention on mammalian gene expression</article-title>
<source>Genome Biol</source>
<year>2017</year>
<volume>18</volume>
<issue>1</issue>
<fpage>51</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-017-1184-4</pub-id>
<pub-id pub-id-type="pmid">28298237</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pimentel</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions</article-title>
<source>Genome Biol</source>
<year>2013</year>
<volume>14</volume>
<issue>4</issue>
<fpage>36</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2013-14-4-r36</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benelli</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pescucci</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Marseglia</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Severgnini</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Torricelli</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Magi</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Discovering chimeric transcripts in paired-end RNA-seq data by using EricScript</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>24</issue>
<fpage>3232</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts617</pub-id>
<pub-id pub-id-type="pmid">23093608</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Memczak</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jens</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Elefsinioti</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Torti</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Krueger</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rybak</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Circular RNAs are a large class of animal RNAs with regulatory potency</article-title>
<source>Nature</source>
<year>2013</year>
<volume>495</volume>
<issue>7441</issue>
<fpage>333</fpage>
<pub-id pub-id-type="doi">10.1038/nature11928</pub-id>
<pub-id pub-id-type="pmid">23446348</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deelen</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Zhernakova</surname>
<given-names>DV</given-names>
</name>
<name>
<surname>de Haan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>van der Sijde</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bonder</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Karjalainen</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Calling genotypes from public RNA-sequencing data enables identification of genetic variants that affect gene-expression levels</article-title>
<source>Genome Med</source>
<year>2015</year>
<volume>7</volume>
<issue>1</issue>
<fpage>30</fpage>
<pub-id pub-id-type="doi">10.1186/s13073-015-0152-4</pub-id>
<pub-id pub-id-type="pmid">25954321</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<mixed-citation publication-type="other">Sahraeian SME, Mohiyuddin M, Sebra R, Tilgner H, Afshar PT, Au KF, et al. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat Commun. 2017; 8(1):59.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41467-017-00050-4">https://doi.org/10.1038/s41467-017-00050-4</ext-link>
.</mixed-citation>
</ref>
<ref id="CR25">
<label>25</label>
<mixed-citation publication-type="other">Nordström KJV, Albani MC, James GV, Gutjahr C, Hartwig B, Turck F, et al. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using
<italic>k</italic>
-mers. Nat Biotechnol. 2013; 31(4):325–30.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nbt.2515">https://doi.org/10.1038/nbt.2515</ext-link>
.</mixed-citation>
</ref>
<ref id="CR26">
<label>26</label>
<mixed-citation publication-type="other">Shajii AR, Yorukoglu D, Yu YW, Berger B. Fast genotyping of known SNPs through approximate
<italic>k</italic>
-mer matching. Bioinformatics. 2016; 32(17):i538–44.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btw460">https://doi.org/10.1093/bioinformatics/btw460</ext-link>
.</mixed-citation>
</ref>
<ref id="CR27">
<label>27</label>
<mixed-citation publication-type="other">Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016; 17:132.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/s13059-016-0997-x">https://doi.org/10.1186/s13059-016-0997-x</ext-link>
.</mixed-citation>
</ref>
<ref id="CR28">
<label>28</label>
<mixed-citation publication-type="other">Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of
<italic>k</italic>
-mers. Bioinformatics. 2011; 27(6):764–70.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btr011">https://doi.org/10.1093/bioinformatics/btr011</ext-link>
.</mixed-citation>
</ref>
<ref id="CR29">
<label>29</label>
<mixed-citation publication-type="other">Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Tissue-based map of the human proteome. Science. 2015; 347(6220):1260419.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1126/science.1260419">https://doi.org/10.1126/science.1260419</ext-link>
.</mixed-citation>
</ref>
<ref id="CR30">
<label>30</label>
<mixed-citation publication-type="other">Griffith M, Griffith OL, Smith SM, Ramu A, Callaway MB, Brummett AM, et al. Genome modeling system: a knowledge management platform for genomics. PLoS Comput Biol. 2015; 11(7):1004274.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pcbi.1004274">https://doi.org/10.1371/journal.pcbi.1004274</ext-link>
.</mixed-citation>
</ref>
<ref id="CR31">
<label>31</label>
<mixed-citation publication-type="other">Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012; 7(3):562–78.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nprot.2012.016">https://doi.org/10.1038/nprot.2012.016</ext-link>
.</mixed-citation>
</ref>
<ref id="CR32">
<label>32</label>
<mixed-citation publication-type="other">Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013; 8(8):1494–512.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nprot.2013.084">https://doi.org/10.1038/nprot.2013.084</ext-link>
.</mixed-citation>
</ref>
<ref id="CR33">
<label>33</label>
<mixed-citation publication-type="other">Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12):550.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/s13059-014-0550-8">https://doi.org/10.1186/s13059-014-0550-8</ext-link>
.</mixed-citation>
</ref>
<ref id="CR34">
<label>34</label>
<mixed-citation publication-type="other">Yang Y, Park JW, Bebee TW, Warzecha CC, Guo Y, Shang X, et al. Determination of a comprehensive alternative splicing regulatory network and combinatorial regulation by key factors during the epithelial-to-mesenchymal transition. Mol Cell Biol. 2016; 36(11):1704–19.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1128/MCB.00019-16">https://doi.org/10.1128/MCB.00019-16</ext-link>
.</mixed-citation>
</ref>
<ref id="CR35">
<label>35</label>
<mixed-citation publication-type="other">Miyoshi K, Miyoshi T, Siomi H. Many ways to generate microRNA-like small RNAs: non-canonical pathways for microRNA production. Mol Gen Genomics. 2010; 284(2):95–103.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s00438-010-0556-1">https://doi.org/10.1007/s00438-010-0556-1</ext-link>
.</mixed-citation>
</ref>
<ref id="CR36">
<label>36</label>
<mixed-citation publication-type="other">Derrien T, Estellé J, Sola SM, Knowles DG, Raineri E, Guigó R, et al. Fast computation and applications of genome mappability. PLoS One. 2012; 7(1):30377.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pone.0030377">https://doi.org/10.1371/journal.pone.0030377</ext-link>
.</mixed-citation>
</ref>
<ref id="CR37">
<label>37</label>
<mixed-citation publication-type="other">Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013.
<ext-link ext-link-type="uri" xlink:href="http://www.repeatmasker.org">http://www.repeatmasker.org</ext-link>
.</mixed-citation>
</ref>
<ref id="CR38">
<label>38</label>
<mixed-citation publication-type="other">Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013; 45(6):580–5.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/ng.2653">https://doi.org/10.1038/ng.2653</ext-link>
.</mixed-citation>
</ref>
<ref id="CR39">
<label>39</label>
<mixed-citation publication-type="other">Tran VDT, Souiai O, Romero-Barrios N, Crespi M, Gautheret D. Detection of generic differential RNA processing events from RNA-seq data. RNA Biol. 2016; 13(1):59–67.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/15476286.2015.1118604">https://doi.org/10.1080/15476286.2015.1118604</ext-link>
.</mixed-citation>
</ref>
<ref id="CR40">
<label>40</label>
<mixed-citation publication-type="other">Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT. Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics. 2014; 15(3):413–26.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/biostatistics/kxt053">https://doi.org/10.1093/biostatistics/kxt053</ext-link>
.</mixed-citation>
</ref>
<ref id="CR41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weinstein</surname>
<given-names>JN</given-names>
</name>
<name>
<surname>Collisson</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Mills</surname>
<given-names>GB</given-names>
</name>
<name>
<surname>Shaw</surname>
<given-names>KRM</given-names>
</name>
<name>
<surname>Ozenberger</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Ellrott</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Cancer Genome Atlas pan-cancer analysis project</article-title>
<source>Nat Genet</source>
<year>2013</year>
<volume>45</volume>
<issue>10</issue>
<fpage>1113</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="doi">10.1038/ng.2764</pub-id>
<pub-id pub-id-type="pmid">24071849</pub-id>
</element-citation>
</ref>
<ref id="CR42">
<label>42</label>
<mixed-citation publication-type="other">Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012; 28(19):2520–2.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/bts480">https://doi.org/10.1093/bioinformatics/bts480</ext-link>
.</mixed-citation>
</ref>
<ref id="CR43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodriguez</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Maietta</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ezkurdia</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Pietrelli</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wesselink</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Lopez</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>APPRIS: annotation of principal and alternative splice isoforms</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<volume>41</volume>
<issue>D1</issue>
<fpage>110</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gks1058</pub-id>
<pub-id pub-id-type="pmid">23093607</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<label>44</label>
<mixed-citation publication-type="other">Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:106.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/gb-2010-11-10-r106">https://doi.org/10.1186/gb-2010-11-10-r106</ext-link>
.</mixed-citation>
</ref>
<ref id="CR45">
<label>45</label>
<mixed-citation publication-type="other">Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15:29.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/gb-2014-15-2-r29">https://doi.org/10.1186/gb-2014-15-2-r29</ext-link>
.</mixed-citation>
</ref>
<ref id="CR46">
<label>46</label>
<mixed-citation publication-type="other">Jeanmougin M, de Reynies A, Marisa L, Paccard C, Nuel G, Guedj M. Should we abandon the
<italic>t</italic>
-test in the analysis of gene expression microarray data: a comparison of variance modeling strategies. PLoS One. 2010; 5(9):12336.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pone.0012336">https://doi.org/10.1371/journal.pone.0012336</ext-link>
.</mixed-citation>
</ref>
<ref id="CR47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benjamini</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hochberg</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Controlling the false discovery rate: a practical and powerful approach to multiple testing</article-title>
<source>J R Stat Soc Ser B Methodol</source>
<year>1995</year>
<volume>57</volume>
<issue>1</issue>
<fpage>289</fpage>
<lpage>300</lpage>
</element-citation>
</ref>
<ref id="CR48">
<label>48</label>
<mixed-citation publication-type="other">Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinforma. 2009; 10:421.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/1471-2105-10-421">https://doi.org/10.1186/1471-2105-10-421</ext-link>
.</mixed-citation>
</ref>
<ref id="CR49">
<label>49</label>
<mixed-citation publication-type="other">Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010; 26(7):873–81.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btq057">https://doi.org/10.1093/bioinformatics/btq057</ext-link>
.</mixed-citation>
</ref>
<ref id="CR50">
<label>50</label>
<mixed-citation publication-type="other">Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26(6):841–2.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btq033">https://doi.org/10.1093/bioinformatics/btq033</ext-link>
.</mixed-citation>
</ref>
<ref id="CR51">
<label>51</label>
<mixed-citation publication-type="other">Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011; 29(1):24–6.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nbt.1754">https://doi.org/10.1038/nbt.1754</ext-link>
.</mixed-citation>
</ref>
<ref id="CR52">
<label>52</label>
<mixed-citation publication-type="other">Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2012;635.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/bts635">https://doi.org/10.1093/bioinformatics/bts635</ext-link>
.</mixed-citation>
</ref>
<ref id="CR53">
<label>53</label>
<mixed-citation publication-type="other">Silvester N, Alako B, Amid C, Cerdeño-Tarrága A, Clarke L, Cleland I, et al. The European Nucleotide Archive in 2017. Nucleic Acids Res. 2017;1125.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/nar/gkx1125">https://doi.org/10.1093/nar/gkx1125</ext-link>
.</mixed-citation>
</ref>
<ref id="CR54">
<label>54</label>
<mixed-citation publication-type="other">Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013; 41(D1):991–5.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/nar/gks1193">https://doi.org/10.1093/nar/gks1193</ext-link>
.</mixed-citation>
</ref>
<ref id="CR55">
<label>55</label>
<mixed-citation publication-type="other">Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, et al. NCBI’s database of genotypes and phenotypes: dbGaP. Nucleic Acids Res. 2014; 42(D1):975–9.
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/nar/gkt1211">https://doi.org/10.1093/nar/gkt1211</ext-link>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000367 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000367 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5747171
   |texte=   DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:29284518" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021