MERS exploration server

Note: this site is under development.
Note: this site is generated automatically from raw corpora.
The information has therefore not been validated.

Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets

Internal identifier: 000282 (Pmc/Corpus)

Authors: Tobias Neumann; Veronika A. Herzog; Matthias Muhar; Arndt Von Haeseler; Johannes Zuber; Stefan L. Ameres; Philipp Rescheneder

Source:

RBID: PMC:6528199

Abstract

Background

Methods to read out naturally occurring or experimentally introduced nucleic acid modifications are emerging as powerful tools to study dynamic cellular processes. The recovery, quantification and interpretation of such events in high-throughput sequencing datasets demand specialized bioinformatics approaches.

Results

Here, we present Digital Unmasking of Nucleotide conversions in K-mers (DUNK), a data analysis pipeline enabling the quantification of nucleotide conversions in high-throughput sequencing datasets. Using experimentally generated and simulated datasets, we demonstrate that DUNK allows constant mapping rates irrespective of nucleotide-conversion rates, promotes the recovery of multimapping reads and employs Single Nucleotide Polymorphism (SNP) masking to uncouple true SNPs from nucleotide conversions, facilitating a robust and sensitive quantification of nucleotide conversions. As a first application, we implement this strategy as SLAM-DUNK for the analysis of SLAMseq profiles, in which 4-thiouridine-labeled transcripts are detected based on T > C conversions. SLAM-DUNK provides both raw counts of nucleotide-conversion-containing reads and a base-content- and read-coverage-normalized estimate of the fractions of labeled transcripts as readout.

Conclusion

Beyond providing a readily accessible tool for analyzing SLAMseq and related time-resolved RNA sequencing methods (TimeLapse-seq, TUC-seq), DUNK establishes a broadly applicable strategy for quantifying nucleotide conversions.

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2849-7) contains supplementary material, which is available to authorized users.


Url:
DOI: 10.1186/s12859-019-2849-7
PubMed: 31109287
PubMed Central: 6528199

Links to Exploration step

PMC:6528199

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets</title>
<author>
<name sortKey="Neumann, Tobias" sort="Neumann, Tobias" uniqKey="Neumann T" first="Tobias" last="Neumann">Tobias Neumann</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Herzog, Veronika A" sort="Herzog, Veronika A" uniqKey="Herzog V" first="Veronika A." last="Herzog">Veronika A. Herzog</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0008 2788</institution-id>
<institution-id institution-id-type="GRID">grid.417521.4</institution-id>
<institution>Institute of Molecular Biotechnology of the Austrian Academy of Sciences (IMBA),</institution>
</institution-wrap>
Dr. Bohr-Gasse 3, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Muhar, Matthias" sort="Muhar, Matthias" uniqKey="Muhar M" first="Matthias" last="Muhar">Matthias Muhar</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Von Haeseler, Arndt" sort="Von Haeseler, Arndt" uniqKey="Von Haeseler A" first="Arndt" last="Von Haeseler">Arndt Von Haeseler</name>
<affiliation>
<nlm:aff id="Aff3">Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2286 1424</institution-id>
<institution-id institution-id-type="GRID">grid.10420.37</institution-id>
<institution>Bioinformatics and Computational Biology, Faculty of Computer Science,</institution>
<institution>University of Vienna,</institution>
</institution-wrap>
Waehringerstrasse 17, A-1090 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zuber, Johannes" sort="Zuber, Johannes" uniqKey="Zuber J" first="Johannes" last="Zuber">Johannes Zuber</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9259 8492</institution-id>
<institution-id institution-id-type="GRID">grid.22937.3d</institution-id>
<institution>Medical University of Vienna,</institution>
</institution-wrap>
VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ameres, Stefan L" sort="Ameres, Stefan L" uniqKey="Ameres S" first="Stefan L." last="Ameres">Stefan L. Ameres</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0008 2788</institution-id>
<institution-id institution-id-type="GRID">grid.417521.4</institution-id>
<institution>Institute of Molecular Biotechnology of the Austrian Academy of Sciences (IMBA),</institution>
</institution-wrap>
Dr. Bohr-Gasse 3, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rescheneder, Philipp" sort="Rescheneder, Philipp" uniqKey="Rescheneder P" first="Philipp" last="Rescheneder">Philipp Rescheneder</name>
<affiliation>
<nlm:aff id="Aff3">Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31109287</idno>
<idno type="pmc">6528199</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6528199</idno>
<idno type="RBID">PMC:6528199</idno>
<idno type="doi">10.1186/s12859-019-2849-7</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000282</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000282</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets</title>
<author>
<name sortKey="Neumann, Tobias" sort="Neumann, Tobias" uniqKey="Neumann T" first="Tobias" last="Neumann">Tobias Neumann</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Herzog, Veronika A" sort="Herzog, Veronika A" uniqKey="Herzog V" first="Veronika A." last="Herzog">Veronika A. Herzog</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0008 2788</institution-id>
<institution-id institution-id-type="GRID">grid.417521.4</institution-id>
<institution>Institute of Molecular Biotechnology of the Austrian Academy of Sciences (IMBA),</institution>
</institution-wrap>
Dr. Bohr-Gasse 3, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Muhar, Matthias" sort="Muhar, Matthias" uniqKey="Muhar M" first="Matthias" last="Muhar">Matthias Muhar</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Von Haeseler, Arndt" sort="Von Haeseler, Arndt" uniqKey="Von Haeseler A" first="Arndt" last="Von Haeseler">Arndt Von Haeseler</name>
<affiliation>
<nlm:aff id="Aff3">Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2286 1424</institution-id>
<institution-id institution-id-type="GRID">grid.10420.37</institution-id>
<institution>Bioinformatics and Computational Biology, Faculty of Computer Science,</institution>
<institution>University of Vienna,</institution>
</institution-wrap>
Waehringerstrasse 17, A-1090 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zuber, Johannes" sort="Zuber, Johannes" uniqKey="Zuber J" first="Johannes" last="Zuber">Johannes Zuber</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff5">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9259 8492</institution-id>
<institution-id institution-id-type="GRID">grid.22937.3d</institution-id>
<institution>Medical University of Vienna,</institution>
</institution-wrap>
VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ameres, Stefan L" sort="Ameres, Stefan L" uniqKey="Ameres S" first="Stefan L." last="Ameres">Stefan L. Ameres</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0008 2788</institution-id>
<institution-id institution-id-type="GRID">grid.417521.4</institution-id>
<institution>Institute of Molecular Biotechnology of the Austrian Academy of Sciences (IMBA),</institution>
</institution-wrap>
Dr. Bohr-Gasse 3, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rescheneder, Philipp" sort="Rescheneder, Philipp" uniqKey="Rescheneder P" first="Philipp" last="Rescheneder">Philipp Rescheneder</name>
<affiliation>
<nlm:aff id="Aff3">Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">Methods to read out naturally occurring or experimentally introduced nucleic acid modifications are emerging as powerful tools to study dynamic cellular processes. The recovery, quantification and interpretation of such events in high-throughput sequencing datasets demands specialized bioinformatics approaches.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">Here, we present Digital Unmasking of Nucleotide conversions in K-mers (DUNK), a data analysis pipeline enabling the quantification of nucleotide conversions in high-throughput sequencing datasets. We demonstrate using experimentally generated and simulated datasets that DUNK allows constant mapping rates irrespective of nucleotide-conversion rates, promotes the recovery of multimapping reads and employs Single Nucleotide Polymorphism (SNP) masking to uncouple true SNPs from nucleotide conversions to facilitate a robust and sensitive quantification of nucleotide-conversions. As a first application, we implement this strategy as SLAM-DUNK for the analysis of SLAMseq profiles, in which 4-thiouridine-labeled transcripts are detected based on T > C conversions. SLAM-DUNK provides both raw counts of nucleotide-conversion containing reads as well as a base-content and read coverage normalized approach for estimating the fractions of labeled transcripts as readout.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par3">Beyond providing a readily accessible tool for analyzing SLAMseq and related time-resolved RNA sequencing methods (TimeLapse-seq, TUC-seq), DUNK establishes a broadly applicable strategy for quantifying nucleotide conversions.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-019-2849-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Frommer, M" uniqKey="Frommer M">M Frommer</name>
</author>
<author>
<name sortKey="Mcdonald, Le" uniqKey="Mcdonald L">LE McDonald</name>
</author>
<author>
<name sortKey="Millar, Ds" uniqKey="Millar D">DS Millar</name>
</author>
<author>
<name sortKey="Collis, Cm" uniqKey="Collis C">CM Collis</name>
</author>
<author>
<name sortKey="Watt, F" uniqKey="Watt F">F Watt</name>
</author>
<author>
<name sortKey="Grigg, Gw" uniqKey="Grigg G">GW Grigg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hafner, M" uniqKey="Hafner M">M Hafner</name>
</author>
<author>
<name sortKey="Landthaler, M" uniqKey="Landthaler M">M Landthaler</name>
</author>
<author>
<name sortKey="Burger, L" uniqKey="Burger L">L Burger</name>
</author>
<author>
<name sortKey="Khorshid, M" uniqKey="Khorshid M">M Khorshid</name>
</author>
<author>
<name sortKey="Hausser, J" uniqKey="Hausser J">J Hausser</name>
</author>
<author>
<name sortKey="Berninger, P" uniqKey="Berninger P">P Berninger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, X" uniqKey="Li X">X Li</name>
</author>
<author>
<name sortKey="Xiong, X" uniqKey="Xiong X">X Xiong</name>
</author>
<author>
<name sortKey="Yi, C" uniqKey="Yi C">C Yi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Herzog, Va" uniqKey="Herzog V">VA Herzog</name>
</author>
<author>
<name sortKey="Reichholf, B" uniqKey="Reichholf B">B Reichholf</name>
</author>
<author>
<name sortKey="Neumann, T" uniqKey="Neumann T">T Neumann</name>
</author>
<author>
<name sortKey="Rescheneder, P" uniqKey="Rescheneder P">P Rescheneder</name>
</author>
<author>
<name sortKey="Bhat, P" uniqKey="Bhat P">P Bhat</name>
</author>
<author>
<name sortKey="Burkard, Tr" uniqKey="Burkard T">TR Burkard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Muhar, M" uniqKey="Muhar M">M Muhar</name>
</author>
<author>
<name sortKey="Ebert, A" uniqKey="Ebert A">A Ebert</name>
</author>
<author>
<name sortKey="Neumann, T" uniqKey="Neumann T">T Neumann</name>
</author>
<author>
<name sortKey="Umkehrer, C" uniqKey="Umkehrer C">C Umkehrer</name>
</author>
<author>
<name sortKey="Jude, J" uniqKey="Jude J">J Jude</name>
</author>
<author>
<name sortKey="Wieshofer, C" uniqKey="Wieshofer C">C Wieshofer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sedlazeck, Fj" uniqKey="Sedlazeck F">FJ Sedlazeck</name>
</author>
<author>
<name sortKey="Rescheneder, P" uniqKey="Rescheneder P">P Rescheneder</name>
</author>
<author>
<name sortKey="Haeseler, V A" uniqKey="Haeseler V">v A Haeseler</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mignone, F" uniqKey="Mignone F">F Mignone</name>
</author>
<author>
<name sortKey="Gissi, C" uniqKey="Gissi C">C Gissi</name>
</author>
<author>
<name sortKey="Liuni, S" uniqKey="Liuni S">S Liuni</name>
</author>
<author>
<name sortKey="Pesole, G" uniqKey="Pesole G">G Pesole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Young, Ra" uniqKey="Young R">RA Young</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ewels, P" uniqKey="Ewels P">P Ewels</name>
</author>
<author>
<name sortKey="Magnusson, M" uniqKey="Magnusson M">M Magnusson</name>
</author>
<author>
<name sortKey="Lundin, S" uniqKey="Lundin S">S Lundin</name>
</author>
<author>
<name sortKey="K Ller, M" uniqKey="K Ller M">M Käller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jurges, C" uniqKey="Jurges C">C Jürges</name>
</author>
<author>
<name sortKey="Dolken, L" uniqKey="Dolken L">L Dölken</name>
</author>
<author>
<name sortKey="Erhard, F" uniqKey="Erhard F">F Erhard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koboldt, Dc" uniqKey="Koboldt D">DC Koboldt</name>
</author>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
<author>
<name sortKey="Larson, De" uniqKey="Larson D">DE Larson</name>
</author>
<author>
<name sortKey="Shen, D" uniqKey="Shen D">D Shen</name>
</author>
<author>
<name sortKey="Mclellan, Md" uniqKey="Mclellan M">MD McLellan</name>
</author>
<author>
<name sortKey="Lin, L" uniqKey="Lin L">L Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">31109287</article-id>
<article-id pub-id-type="pmc">6528199</article-id>
<article-id pub-id-type="publisher-id">2849</article-id>
<article-id pub-id-type="doi">10.1186/s12859-019-2849-7</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0003-3908-4224</contrib-id>
<name>
<surname>Neumann</surname>
<given-names>Tobias</given-names>
</name>
<address>
<email>tobias.neumann@imp.ac.at</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Herzog</surname>
<given-names>Veronika A.</given-names>
</name>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Muhar</surname>
<given-names>Matthias</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>von Haeseler</surname>
<given-names>Arndt</given-names>
</name>
<xref ref-type="aff" rid="Aff3">3</xref>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zuber</surname>
<given-names>Johannes</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff5">5</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ameres</surname>
<given-names>Stefan L.</given-names>
</name>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Rescheneder</surname>
<given-names>Philipp</given-names>
</name>
<address>
<email>philipp.rescheneder@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9799 657X</institution-id>
<institution-id institution-id-type="GRID">grid.14826.39</institution-id>
<institution>Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC),</institution>
</institution-wrap>
1030 Vienna, Austria</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0008 2788</institution-id>
<institution-id institution-id-type="GRID">grid.417521.4</institution-id>
<institution>Institute of Molecular Biotechnology of the Austrian Academy of Sciences (IMBA),</institution>
</institution-wrap>
Dr. Bohr-Gasse 3, VBC, 1030 Vienna, Austria</aff>
<aff id="Aff3">
<label>3</label>
Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria</aff>
<aff id="Aff4">
<label>4</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2286 1424</institution-id>
<institution-id institution-id-type="GRID">grid.10420.37</institution-id>
<institution>Bioinformatics and Computational Biology, Faculty of Computer Science,</institution>
<institution>University of Vienna,</institution>
</institution-wrap>
Waehringerstrasse 17, A-1090 Vienna, Austria</aff>
<aff id="Aff5">
<label>5</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9259 8492</institution-id>
<institution-id institution-id-type="GRID">grid.22937.3d</institution-id>
<institution>Medical University of Vienna,</institution>
</institution-wrap>
VBC, 1030 Vienna, Austria</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>20</day>
<month>5</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>20</day>
<month>5</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>20</volume>
<elocation-id>258</elocation-id>
<history>
<date date-type="received">
<day>4</day>
<month>10</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>4</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s). 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">Methods to read out naturally occurring or experimentally introduced nucleic acid modifications are emerging as powerful tools to study dynamic cellular processes. The recovery, quantification and interpretation of such events in high-throughput sequencing datasets demands specialized bioinformatics approaches.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">Here, we present Digital Unmasking of Nucleotide conversions in K-mers (DUNK), a data analysis pipeline enabling the quantification of nucleotide conversions in high-throughput sequencing datasets. We demonstrate using experimentally generated and simulated datasets that DUNK allows constant mapping rates irrespective of nucleotide-conversion rates, promotes the recovery of multimapping reads and employs Single Nucleotide Polymorphism (SNP) masking to uncouple true SNPs from nucleotide conversions to facilitate a robust and sensitive quantification of nucleotide-conversions. As a first application, we implement this strategy as SLAM-DUNK for the analysis of SLAMseq profiles, in which 4-thiouridine-labeled transcripts are detected based on T > C conversions. SLAM-DUNK provides both raw counts of nucleotide-conversion containing reads as well as a base-content and read coverage normalized approach for estimating the fractions of labeled transcripts as readout.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par3">Beyond providing a readily accessible tool for analyzing SLAMseq and related time-resolved RNA sequencing methods (TimeLapse-seq, TUC-seq), DUNK establishes a broadly applicable strategy for quantifying nucleotide conversions.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-019-2849-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Mapping</kwd>
<kwd>Epitranscriptomics</kwd>
<kwd>Next generation sequencing</kwd>
<kwd>High-throughput sequencing</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution>European Research Council</institution>
</funding-source>
<award-id>ERC-StG-338252</award-id>
<award-id>ERC-StG-336860</award-id>
<award-id>ERC-PoC-825710 SLAMseq</award-id>
<principal-award-recipient>
<name>
<surname>Zuber</surname>
<given-names>Johannes</given-names>
</name>
<name>
<surname>Ameres</surname>
<given-names>Stefan L.</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution>Austrian Science Fund</institution>
</funding-source>
<award-id>Y-733-B22 START</award-id>
<award-id>W-1207-B09</award-id>
<award-id>SFB F43-22</award-id>
<award-id>W-1207-B09</award-id>
<principal-award-recipient>
<name>
<surname>Ameres</surname>
<given-names>Stefan L.</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par14">Mismatches in reads yielded from standard sequencing protocols such as genome sequencing and RNA-Seq originate either from genetic variations or sequencing errors and are typically ignored by standard mapping approaches. Beyond these standard applications, a growing number of profiling techniques harnesses nucleotide conversions to monitor naturally occurring or experimentally introduced DNA or RNA modifications. For example, bisulfite-sequencing (BS-Seq) identifies non-methylated cytosines from cytosine-to-thymine (C > T) conversions [
<xref ref-type="bibr" rid="CR1">1</xref>
]. Similarly, photoactivatable ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) enables the identification of protein-RNA-interactions by qualitative assessment of thymine-to-cytosine (T > C) conversions [
<xref ref-type="bibr" rid="CR2">2</xref>
]. Most recently, emerging sequencing technologies further expanded the potential readout of nucleotide-conversions in high-throughput sequencing datasets by employing chemoselective modifications to modified nucleotides in RNA species, resulting in specific nucleotide conversions upon reverse transcription and sequencing [
<xref ref-type="bibr" rid="CR3">3</xref>
]. Among these, thiol (SH)-linked alkylation for the metabolic sequencing of RNA (SLAMseq) is a novel sequencing protocol enabling quantitative measurements of RNA kinetics within living cells, which can be applied to determine RNA stabilities [
<xref ref-type="bibr" rid="CR4">4</xref>
] and transcription-factor dependent transcriptional outputs [
<xref ref-type="bibr" rid="CR5">5</xref>
] in vitro, or, when combined with the cell-type-specific expression of uracil phosphoribosyltransferase, to assess cell-type-specific transcriptomes in vivo (SLAM-ITseq) [
<xref ref-type="bibr" rid="CR6">6</xref>
]. SLAMseq employs metabolic RNA labeling with 4-thiouridine (4SU), which is readily incorporated into newly synthesized transcripts. After RNA isolation, chemical nucleotide-analog derivatization specifically modifies thiol-containing residues, which leads to specific misincorporation of guanine (G) instead of adenine (A) when the reverse transcriptase encounters an alkylated 4SU residue during RNA to cDNA conversion. The resulting T > C conversion can be read out by high-throughput sequencing.</p>
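To make this readout concrete, the following is a minimal Python sketch (not the SLAM-DUNK implementation) of how T > C conversions can be counted for a single gap-free read aligned to a reference; the sequences and the helper name count_tc_conversions are hypothetical.

```python
# Minimal sketch: count T > C conversions in a gap-free alignment of a read to
# a reference. Illustrative only, not SLAM-DUNK code; sequences are made up.
from __future__ import annotations


def count_tc_conversions(reference: str, read: str) -> tuple[int, int]:
    """Return (number of T > C conversions, number of reference Ts covered)."""
    assert len(reference) == len(read), "sketch assumes a gap-free alignment"
    conversions = 0
    reference_ts = 0
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == "T":
            reference_ts += 1
            if read_base == "C":  # T in the reference read out as C: candidate 4SU site
                conversions += 1
    return conversions, reference_ts


if __name__ == "__main__":
    ref = "ACGTTTGATCGT"
    qry = "ACGTCTGATCGC"  # two of the five reference Ts are read as C
    n_conv, n_t = count_tc_conversions(ref, qry)
    print(f"{n_conv} T>C conversions over {n_t} reference Ts")
```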
<p id="Par15">Identifying nucleotide conversions in high-throughput sequencing data comes with two major challenges: First, depending on nucleotide conversion rates, reads will contain a high proportion of mismatches with respect to a reference genome, causing common aligners to misalign them to an incorrect genomic position or to fail aligning them at all [
<xref ref-type="bibr" rid="CR7">7</xref>
]. Second, Single Nucleotide Polymorphisms (SNPs) in the genome will lead to an overestimation of nucleotide conversions if not appropriately separated from experimentally introduced genuine nucleotide conversions. Moreover, depending on the nucleotide-conversion efficiency and the number of available conversion sites, high sequencing depth is required to reliably detect nucleotide conversions at lower frequencies. Therefore, selective amplification of transcript regions, as in 3′ end mRNA sequencing (QuantSeq [
<xref ref-type="bibr" rid="CR8">8</xref>
]) reduces library complexity, ensuring high local coverage and allowing increased multiplexing of samples. In addition, QuantSeq specifically recovers only mature (polyadenylated) mRNAs and allows the identification of transcript 3′ ends. However, the 3′ terminal regions of transcripts sequenced by QuantSeq (typically 250 bp; hereafter called 3′ intervals) largely overlap with 3′ untranslated regions (UTRs), which are generally of lower sequence complexity than coding sequences [
<xref ref-type="bibr" rid="CR9">9</xref>
], resulting in an increased number of multi-mapping reads, i.e. reads mapping equally well to several genomic regions. Finally, besides the exact positions of nucleotide conversions in the reads, SLAMseq downstream analysis requires quantifications of overall conversion rates that are robust against variation in coverage and base composition in genomic intervals, e.g. 3′ intervals.</p>
<p id="Par16">Here we introduce Digital Unmasking of Nucleotide-conversions in
<italic>k</italic>
-mers (DUNK), a data analysis method for the robust and reproducible recovery of nucleotide conversions in high-throughput sequencing datasets. DUNK solves the main challenges posed by nucleotide conversions in high-throughput sequencing experiments: it facilitates the accurate alignment of reads with many mismatches and the unbiased estimation of nucleotide-conversion rates, taking into account SNPs that may mimic nucleotide conversions. As an application of DUNK, we introduce SLAM-DUNK, a SLAMseq-specific pipeline that takes additional complications of the SLAMseq approach into account. SLAM-DUNK addresses the increased number of multi-mapping reads in low-complexity regions frequently occurring in 3′ end sequencing datasets and provides a robust and unbiased quantification of nucleotide conversions in genomic intervals such as 3′ intervals. SLAM-DUNK enables researchers to analyze SLAMseq data from raw reads to fully normalized nucleotide-conversion quantifications without expert bioinformatics knowledge. Moreover, SLAM-DUNK provides a comprehensive analysis of the input data, including visualization, summary statistics and other relevant information about the data processing. To allow scientists to assess the feasibility and accuracy of nucleotide-conversion-based measurements for genes and/or organisms of interest in silico, SLAM-DUNK comes with a SLAMseq simulation module enabling optimization of experimental parameters such as sequencing depth and sample numbers. We supply this fully encapsulated and easy-to-install software package via BioConda, the Python Package Index, Docker Hub and GitHub (see
<ext-link ext-link-type="uri" xlink:href="http://t-neumann.github.io/slamdunk">http://t-neumann.github.io/slamdunk</ext-link>
) as well as a MultiQC (
<ext-link ext-link-type="uri" xlink:href="http://multiqc.info">http://multiqc.info</ext-link>
) plugin to make SLAMseq data analysis and integration available to bench-scientists.</p>
</sec>
<sec id="Sec2">
<title>Results</title>
<sec id="Sec3">
<title>Digital unmasking of nucleotide-conversions in
<italic>k</italic>
-mers</title>
<p id="Par17">DUNK addresses the challenges of distinguishing nucleotide-conversions from sequencing error and genuine SNPs in high-throughput sequencing datasets by executing four main steps (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
): First, a nucleotide conversion-aware read mapping algorithm facilitates the alignment of reads (k-mers) with elevated numbers of mismatches (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
a). Second, to provide robust nucleotide-conversion readouts in repetitive or low-complexity regions such as 3′ UTRs, DUNK optionally employs a recovery strategy for multi-mapping reads. Instead of discarding all multi-mapping reads, DUNK only discards reads that map equally well to two different 3′ intervals. Reads with multiple alignments to the same 3′ interval or to a single 3′ interval and a region of the genome that is not part of a 3′ interval are kept (Fig
<xref rid="Fig1" ref-type="fig">1</xref>
b). Third, DUNK identifies Single-Nucleotide Polymorphisms (SNPs) to mask false-positive nucleotide-conversions at SNP positions (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
c). Finally, the high-quality nucleotide-conversion signal is deconvoluted from sequencing error and used to compute conversion frequencies for all 3′ intervals taking into account read coverage and base content of the interval (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
d).
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Digital Unmasking of Nucleotide-conversions in
<italic>k</italic>
-mers: Legend: Possible base outcomes for a given nucleotide-conversion: match with reference (white), nucleotide-conversion scored as mismatch (red), nucleotide-conversion scored with nucleotide-conversion aware scoring (blue), low-quality nucleotide conversion (black) and filtered nucleotide-conversion (opaque)
<bold>a</bold>
Naïve nucleotide-conversion processing and quantification vs DUNK: the naïve read mapper (left) maps 11 reads (grey) to the reference genome and discards five reads (light grey) that contain many converted nucleotides (red). The DUNK mapper (right) maps all 16 reads.
<bold>b</bold>
DUNK processes multi-mapping reads (R5, R6, R7, left) such that the ones (R3, R6) that can be unambiguously assigned to a 3′ interval are identified and assigned to that region; R5 and R7 cannot be assigned to a 3′ interval and are removed from downstream analyses. R2 is discarded due to generally low alignment quality.
<bold>c</bold>
False-positive nucleotide conversions originating from Single-Nucleotide Polymorphisms are masked.
<bold>d</bold>
High-quality nucleotide-conversions are quantified normalizing for coverage and base content</p>
</caption>
<graphic xlink:href="12859_2019_2849_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
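The multi-mapper recovery rule of step (b) can be sketched as follows. This is an illustrative simplification with hypothetical interval names and single-position, gap-free alignments, not the SLAM-DUNK code.

```python
# Sketch of the multi-mapper assignment rule described above (step b): a
# multi-mapping read is kept and assigned to a 3' interval only if all of its
# alignments that overlap any 3' interval fall into one and the same interval;
# alignments outside annotated 3' intervals are tolerated. Hypothetical data.
from __future__ import annotations

from typing import Optional


def overlapping_interval(pos: int, intervals: dict[str, range]) -> Optional[str]:
    """Return the name of the 3' interval containing pos, or None."""
    for name, region in intervals.items():
        if pos in region:
            return name
    return None


def assign_read(alignment_positions: list[int],
                intervals: dict[str, range]) -> Optional[str]:
    """Assign a (possibly multi-mapping) read to a single 3' interval.

    Returns the interval name, or None if the read hits no interval or maps
    equally well to two different intervals and must be discarded.
    """
    hits = {overlapping_interval(p, intervals) for p in alignment_positions}
    hits.discard(None)          # alignments outside 3' intervals are ignored
    if len(hits) == 1:          # unambiguous: keep and assign
        return hits.pop()
    return None                 # ambiguous (>1 interval) or no interval: drop


if __name__ == "__main__":
    utr3 = {"GeneA_3p": range(1000, 1250), "GeneB_3p": range(5000, 5250)}
    print(assign_read([1010, 7000], utr3))  # GeneA_3p: second hit outside any interval
    print(assign_read([1010, 5010], utr3))  # None: equally good hits in two intervals
```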
<p id="Par18">In the following, we demonstrate the performance and validity of each analysis step by applying DUNK to several published and simulated datasets.</p>
</sec>
<sec id="Sec4">
<title>Nucleotide-conversion aware mapping improves nucleotide-conversion quantification</title>
<p id="Par19">Correct alignment of reads to a reference genome is a central task of most high-throughput sequencing analyses. To identify the optimal alignment between a read and the reference genome, mapping algorithms employ a scoring function that includes penalties for mismatches and gaps. The penalties are aimed to reflect the probability to observe a mismatch or a gap. In standard high throughput sequencing experiments, one assumes one mismatch penalty independent of the type of nucleotide mismatch (standard scoring). In contrast, SLAMseq or similar protocols produce datasets where a specific nucleotide conversion occurs more frequently than all others. To account for this, DUNK uses a conversion-aware scoring scheme (see Table
<xref rid="Tab1" ref-type="table">1</xref>
). For example, SLAM-DUNK does not penalize a T > C mismatch (T in the reference, C in the read).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Columns represent the reference nucleotide, rows the read nucleotide. If a C occurs in the read and a T in the reference, the score is zero. All other mismatches receive a score of −15. A match receives a score of 10</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2" colspan="2"></th>
<th colspan="4">Reference genome</th>
</tr>
<tr>
<th>A</th>
<th>T</th>
<th>G</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4">Read position</td>
<td>A</td>
<td>10</td>
<td>−15</td>
<td>− 15</td>
<td>− 15</td>
</tr>
<tr>
<td>T</td>
<td>−15</td>
<td>10</td>
<td>−15</td>
<td>− 15</td>
</tr>
<tr>
<td>G</td>
<td>−15</td>
<td>−15</td>
<td>10</td>
<td>−15</td>
</tr>
<tr>
<td>C</td>
<td>−15</td>
<td>0</td>
<td>−15</td>
<td>10</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
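The scoring scheme of Table 1 can be written down as a simple lookup table. The sketch below is illustrative only (the actual alignment is performed by the read mapper); the function and variable names are hypothetical.

```python
# Sketch of the conversion-aware scoring scheme from Table 1 (match = 10,
# mismatch = -15, except reference T read as C, which is not penalized).

MATCH, MISMATCH, TC_CONVERSION = 10, -15, 0
BASES = "ATGC"

# SCORE[reference_base][read_base]
SCORE = {ref: {read: (MATCH if ref == read else MISMATCH) for read in BASES}
         for ref in BASES}
SCORE["T"]["C"] = TC_CONVERSION  # T in the reference, C in the read: score 0


def score_alignment(reference: str, read: str) -> int:
    """Score a gap-free alignment column by column."""
    return sum(SCORE[r][q] for r, q in zip(reference.upper(), read.upper()))


if __name__ == "__main__":
    # A T > C mismatch costs nothing, any other mismatch costs 25 points
    # relative to a match.
    print(score_alignment("ACGT", "ACGT"))  # 40
    print(score_alignment("ACGT", "ACGC"))  # 30 (T > C conversion)
    print(score_alignment("ACGT", "ACGA"))  # 15 (ordinary mismatch)
```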
<p id="Par20">We used simulated SLAMseq data with conversion rates of 0% (no conversions), 2.4 and 7% (conversion rates observed in mouse embryonic stem cell (mESC) SLAMseq data [
<xref ref-type="bibr" rid="CR4">4</xref>
] and HeLa SLAMseq data (unpublished) upon saturated 4SU-labeling conditions), and an excessive conversion rate of 15% (see Table
<xref rid="Tab2" ref-type="table">2</xref>
) to evaluate the scoring scheme displayed in Table
<xref rid="Tab1" ref-type="table">1</xref>
. For each simulated dataset, we compared the inferred nucleotide-conversion sites using either the standard scoring or the conversion-aware scoring scheme to the simulated “true” conversions and calculated the median of the relative errors [%] from the simulated truth (see
<xref rid="Sec11" ref-type="sec">Methods</xref>
). For a “conversion rate” of 0% both scoring schemes showed a median error of < 0.1% (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
a, Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1). Of note, the mean error of the standard scoring scheme is lower than that of the conversion-aware scoring scheme (0.288 vs 0.297 nucleotide conversions), thus favoring standard scoring for datasets without experimentally introduced nucleotide conversions. For a conversion rate of 2.4%, the standard and the conversion-aware scoring schemes showed errors of 4.5 and 2.3%, respectively. Increasing the conversion rate to 7% further increased the error of the standard scoring to 5%. In contrast, the error of the SLAM-DUNK scoring function stayed at 2.3%. Thus, conversion-aware scoring reduced the median conversion quantification error by 49–54% compared to the standard scoring scheme.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Simulated datasets and their corresponding analyses in this study</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>3′ intervals</th>
<th>Nucleotide-conversion rate [%]</th>
<th>Read length [bp]</th>
<th>Coverage</th>
<th>Labeled transcripts</th>
<th>Analysis</th>
<th>Figure</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 k mESC expressed (3′ intervals of 1000 randomly selected transcripts expressed in mESC)</td>
<td>0, 2.4, 7, 15, 30, 60</td>
<td>50, 100, 150</td>
<td>100x</td>
<td>100%</td>
<td>Nucleotide-conversion aware read mapping</td>
<td>2a,c, S1</td>
</tr>
<tr>
<td>1 k mESC expressed</td>
<td>0, 2.4, 7</td>
<td>50, 100, 150</td>
<td>100x</td>
<td>100%</td>
<td>Multimapper recovery strategy evaluation</td>
<td>3b</td>
</tr>
<tr>
<td>22,281 (mESC)</td>
<td>0, 2.4, 7</td>
<td>100</td>
<td>60x</td>
<td>100%</td>
<td>SNP masking evaluation</td>
<td>4b</td>
</tr>
<tr>
<td>1 k mESC expressed</td>
<td>2.4, 7</td>
<td>50, 100, 150</td>
<td>100x</td>
<td>50%</td>
<td>Evaluation of T > C read sensitivity / specificity</td>
<td>5a, S4</td>
</tr>
<tr>
<td>18 example genes (mESC)</td>
<td>2.4, 7</td>
<td>50, 100, 150</td>
<td>200x</td>
<td>50%</td>
<td>Comparison of labeled fraction of transcript estimation methods</td>
<td>5c</td>
</tr>
<tr>
<td>1 k mESC expressed</td>
<td>2.4, 7</td>
<td>50, 100, 150</td>
<td>25x–200x in 25x increments</td>
<td>0–100%</td>
<td>Evaluation of labeled fraction of transcript estimation</td>
<td>5d, S6</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Nucleotide-conversion aware read mapping:
<bold>a</bold>
Evaluation of nucleotide-conversion aware scoring vs naïve scoring during read mapping: Median error [%] of true vs recovered nucleotide-conversions for simulated data with 100 bp read length and increasing nucleotide-conversion rates at 100x coverage.
<bold>b</bold>
Number of reads correctly assigned to their 3′ interval of origin for typically encountered nucleotide-conversion rates of 0.0, 2.4 and 7.0% as well as excessive conversion rates of 15, 30 and 60%.
<bold>c</bold>
Percentages of retained reads and linear regression with 95% CI bands after mapping 21 mouse ES cell pulse-chase time course samples with increasing nucleotide-conversion content for standard mapping and DUNK</p>
</caption>
<graphic xlink:href="12859_2019_2849_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
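For clarity, the evaluation metric used above (the median relative error between simulated and recovered conversion counts per 3′ interval) might be computed as in the following sketch; the counts are made up and this is not the evaluation code used in the study.

```python
# Sketch of the evaluation metric: the median relative error [%] between the
# simulated ("true") and recovered number of nucleotide conversions per
# 3' interval. Hypothetical counts, not data from the paper.
from __future__ import annotations

from statistics import median


def median_relative_error(true_counts: list[int], recovered_counts: list[int]) -> float:
    """Median of |recovered - true| / true, in percent, over all intervals
    with at least one simulated conversion."""
    errors = [abs(rec - tru) / tru * 100.0
              for tru, rec in zip(true_counts, recovered_counts)
              if tru > 0]
    return median(errors)


if __name__ == "__main__":
    simulated = [40, 55, 12, 80, 23]
    recovered = [39, 54, 12, 76, 22]
    print(f"median relative error: {median_relative_error(simulated, recovered):.2f}%")
```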
</sec>
<sec id="Sec5">
<title>DUNK correctly maps reads independently of their nucleotide-conversion rate</title>
<p id="Par21">Mismatches due to SNPs or sequencing errors are one of the central challenges of read mapping tools. Typical RNA-Seq datasets show a SNP rate between 0.1 and 1.0% and a sequencing error of up to 1%. Protocols employing chemically induced nucleotide-conversions produce datasets with a broad range of mismatch frequencies. While nucleotide-conversion free (unlabeled) reads show the same number of mismatches as RNA-Seq reads, nucleotide-conversion containing (labeled) reads contain additional mismatches, depending on the nucleotide-conversion rate of the experiment and the number of nucleotides that can be converted in a read. To assess the effect of nucleotide-conversion rate on read mapping we randomly selected 1000 genomic 3′ intervals of expressed transcripts extracted from a published mESC 3′ end annotation and simulated two datasets of labeled reads with a nucleotide-conversion rate of 2.4 and 7% (see Table
<xref rid="Tab2" ref-type="table">2</xref>
). Next, SLAM-DUNK mapped the simulated data to the mouse genome and we computed the number of reads mapped to the correct 3′ interval per dataset. Figure
<xref rid="Fig2" ref-type="fig">2</xref>
b shows that for a read length of 50 bp and a nucleotide-conversion rate of 2.4%, the mapping rate (91%) is not significantly different from that of a dataset of unlabeled reads. Increasing the nucleotide-conversion rate to 7% caused a moderate drop in correctly mapped reads to 88%. This drop can be rectified by increasing the read length to 100 or 150 bp, where the mapping rates are at least 96% for nucleotide-conversion rates as high as 15% (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
b).</p>
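A minimal sketch of this mapping-accuracy metric, assuming we know each simulated read's 3′ interval of origin and the interval its alignment was assigned to, is shown below; interval names are hypothetical and this is not the evaluation script used here.

```python
# Sketch: percentage of simulated reads whose reported alignment falls into
# the 3' interval they were simulated from. Hypothetical data.
from __future__ import annotations

from typing import Optional


def correct_mapping_rate(true_intervals: list[str],
                         mapped_intervals: list[Optional[str]]) -> float:
    """Percentage of reads assigned to their 3' interval of origin.

    mapped_intervals holds the interval a read was assigned to, or None if
    the read was unmapped or discarded.
    """
    correct = sum(1 for tru, mapped in zip(true_intervals, mapped_intervals)
                  if mapped == tru)
    return 100.0 * correct / len(true_intervals)


if __name__ == "__main__":
    truth = ["GeneA_3p", "GeneA_3p", "GeneB_3p", "GeneC_3p"]
    mapped = ["GeneA_3p", None, "GeneB_3p", "GeneB_3p"]
    print(f"{correct_mapping_rate(truth, mapped):.0f}% correctly mapped")  # 50%
```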
<p id="Par22">While we observe a substantial drop in the percentage of correctly mapped reads for higher conversion rates (> 15%) for shorter reads (50 bp), SLAM-DUNK’s mapping rate for longer reads (100 and 150 bp) remained above 88% for datasets with up to 15 and 30% conversion rates, respectively, demonstrating that SLAM-DUNK maps reads with and without nucleotide-conversion equally well even for high conversion frequencies.</p>
<p id="Par23">To confirm this finding in real data, we used SLAM-DUNK to map 21 published (7 time points with three replicates each) SLAMseq datasets [
<xref ref-type="bibr" rid="CR4">4</xref>
] from a pulse-chase time course in mESCs (see Table
<xref rid="Tab3" ref-type="table">3</xref>
) with estimated conversion rates of 2.4%. Due to the biological nature of the experiment, we expect the SLAMseq data from the first time point (onset of 4SU wash-out/chase) to contain the highest number of labeled reads, while the data from the last time point contain virtually no labeled reads.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Real SLAMseq datasets and their corresponding analyses in this study</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Samples</th>
<th>Description</th>
<th>Analysis</th>
<th>Figure</th>
</tr>
</thead>
<tbody>
<tr>
<td>GSM2666819-GSM2666839</td>
<td>Chase time course samples at 0, 0.5, 1, 3, 6, 12 and 24 h with 3 replicates at each time point</td>
<td>Percentages of retained reads after mapping with standard and nucleotide-conversion aware scoring with DUNK.</td>
<td>2c</td>
</tr>
<tr>
<td>GSM2666816</td>
<td>Single no 4SU 0 h replicate</td>
<td>Multimapper recovery count scatter vs unique mappers</td>
<td>3c, S2b,c, S3</td>
</tr>
<tr>
<td>GSM2666816-GSM2666818</td>
<td>3 no 4SU 0 h replicates</td>
<td>Multimapper recovery correlation with RNAseq</td>
<td>3a</td>
</tr>
<tr>
<td>GSM2666816-GSM2666821</td>
<td>0 h no 4SU samples and 0 h chase samples with 3 replicates each</td>
<td>Evaluation of SNP calling and masking</td>
<td>4a, c-d</td>
</tr>
<tr>
<td>GSM2666816-GSM2666821, GSM2666828-GSM2666837</td>
<td>0 h no 4SU, 0, 3, 6, 12 and 24 h chase samples (3 replicates each)</td>
<td>Evaluation of QC diagnostics</td>
<td>6</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par24">Figure
<xref rid="Fig2" ref-type="fig">2</xref>
c shows the expected positive correlation (Spearman’s rho: 0.565,
<italic>p</italic>
-value: 0.004) between the fraction of mapped reads and the time points if a conversion-unaware mapper is used (NextGenMap with default values). Next, we repeated the analysis using SLAM-DUNK. Despite the varying number of labeled reads in these datasets, we observed a constant fraction of 60–70% mapped reads across all samples (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
c) and did not observe a significant correlation between the time point and the number of mapped reads (Spearman’s rho: 0.105,
<italic>p</italic>
-value: 0.625). Thus, DUNK maps reads independently of the nucleotide-conversion rate in experimentally generated data as well.</p>
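The correlation test reported here is a standard Spearman rank correlation; a minimal sketch using scipy.stats.spearmanr with made-up sample values (not the values from the study) is shown below.

```python
# Sketch of the correlation test reported above: Spearman correlation between
# chase time point and fraction of mapped reads across samples. The numbers
# below are hypothetical and only illustrate the call.
from scipy.stats import spearmanr

# chase time point (h) and fraction of reads retained after mapping,
# one entry per sample (hypothetical values)
time_points = [0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 1, 3, 3, 3]
mapped_fraction = [0.62, 0.65, 0.63, 0.64, 0.66, 0.61,
                   0.65, 0.63, 0.66, 0.64, 0.62, 0.65]

rho, p_value = spearmanr(time_points, mapped_fraction)
print(f"Spearman's rho = {rho:.3f}, p-value = {p_value:.3f}")
```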
</sec>
<sec id="Sec6">
<title>Multi-mapper recovery increases number of genes accessible for 3’ end sequencing analysis</title>
<p id="Par25">Genomic low-complexity regions and repeats pose major challenges for read aligners and are one of the main sources of error in sequencing data analysis. Therefore, multi-mapping reads are often discarded to reduce misleading signals originating from mismapped reads: As most transcripts are long enough to span sufficiently long unique regions of the genome, the overall effect of discarding all multi-mapping reads on expression analysis is tolerable (mean mouse (GRCm38) RefSeq transcript length: 4195 bp). By only sequencing the ~ 250 nucleotides at the 3′ end of a transcript, 3′ end sequencing increases throughput and avoids normalizations accounting for varying gene length. As a consequence, 3′ end sequencing typically only covers 3′ UTR regions which are generally of less complexity than the coding sequence of transcripts [
<xref ref-type="bibr" rid="CR9">9</xref>
] (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2a). Therefore, 3′ end sequencing produces a high percentage (up to 25% in 50 bp mESC samples) of multi-mapping reads. Excluding these reads can result in a massive loss of signal. The core pluripotency factor
<italic>Oct4</italic>
is an example [
<xref ref-type="bibr" rid="CR10">10</xref>
]: Although Oct4 is highly expressed in mESCs, it showed almost no mapped reads in the 3′ end sequencing mESC samples when discarding multi-mapping reads (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3a). The high fraction of multi-mapping reads is due to a sub-sequence of length 340 bp occurring in the
<italic>Oct4</italic>
3′ UTR and an intronic region of
<italic>Rfwd2.</italic>
</p>
<p id="Par26">To assess the influence of low complexity of 3′ UTRs on the read count in 3′ end sequencing, we computed the mappability scores [
<xref ref-type="bibr" rid="CR11">11</xref>
] for each 3′ UTR. A high mappability score (ranging from 0.0 to 1.0) of a
<italic>k</italic>
-mer in a 3′ UTR indicates uniqueness of that k-mer. Next, we computed for each 3′ UTR the %-uniqueness, i.e. the percentage of its sequence with a mappability score of 1. The 3′ UTRs were subsequently categorized into 5% bins according to their %-uniqueness. For each bin we then compared the read counts of the corresponding 3′ intervals (3 x no 4SU 0 h samples, see Table
<xref rid="Tab3" ref-type="table">3</xref>
] with the read counts of their corresponding gene from an RNA-Seq dataset [
<xref ref-type="bibr" rid="CR4">4</xref>
]. Figure
<xref rid="Fig3" ref-type="fig">3</xref>
a shows the increase in correlation as the %-uniqueness increases. If multi-mappers are included, the correlation is stronger than when counting only unique mappers. Thus, the multi-mapper recovery strategy described above efficiently and correctly recovers reads in low-complexity regions such as 3′ UTRs. Notably, the overall correlation was consistently above 0.7 for all 3′ intervals with more than 10% unique sequence.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Multimapper recovery strategy in low complexity regions:
<bold>a</bold>
Correlation of mESC -4SU SLAMseq vs mESC RNA-seq samples (3 replicates each) for unique-mapping reads vs multi-mapping recovery strategy. Spearman mean correlation of all vs all samples is shown for genes with RNAseq tpm > 0 on the y-axis for increasing cutoffs for percentage of unique bp in the corresponding 3′ UTR. Error bars are indicated in black.
<bold>b</bold>
Percentages of reads mapped to the correct (left panel) or a wrong (right panel) 3′ interval for nucleotide-conversion rates of 0, 2.4 and 7% and read lengths of 50, 100 and 150 bp, respectively, when recovering multimappers or using uniquely mapping reads only
<bold>c</bold>
Scatterplot of unique vs multi-mapping read counts (log2) of ~ 20,000 3′ intervals colored by relative error cutoff of 5% for genes with > 0 unique and multi-mapping read counts</p>
</caption>
<graphic xlink:href="12859_2019_2849_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
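A sketch of the %-uniqueness computation and 5% binning described above is shown below; the per-position mappability values are hypothetical, and the actual analysis uses mappability scores computed as described in the reference cited above.

```python
# Sketch of the %-uniqueness measure: the percentage of positions in a 3' UTR
# with a mappability score of 1, binned into 5% bins. Hypothetical values.
from __future__ import annotations


def percent_uniqueness(mappability: list[float]) -> float:
    """Percentage of positions with a mappability score of exactly 1.0."""
    unique = sum(1 for score in mappability if score == 1.0)
    return 100.0 * unique / len(mappability)


def uniqueness_bin(percent_unique: float, bin_width: float = 5.0) -> int:
    """Index of the bin (0-19 for 5% bins) a 3' UTR falls into."""
    return min(int(percent_unique // bin_width), int(100 // bin_width) - 1)


if __name__ == "__main__":
    utr_mappability = [1.0] * 180 + [0.5] * 70  # 250 bp 3' interval, 72% unique
    pct = percent_uniqueness(utr_mappability)
    print(f"{pct:.0f}% unique -> bin {uniqueness_bin(pct)}")
```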
<p id="Par27">To further evaluate the performance of the multi-mapper recovery approach, we resorted to simulated SLAMseq datasets: We quantified the percentages of reads mapped to their correct 3′ interval (as known from the simulation) and the number of reads mapped to a wrong 3′ interval, again using nucleotide-conversion rates of 0.0, 2.4 and 7.0% and read lengths of 50, 100 and 150 bp (see Table
<xref rid="Tab2" ref-type="table">2</xref>
): The multi-mapper recovery approach increases the number of correctly mapped reads by 1 to 7%, with only a minor increase (< 0.03%) in incorrectly mapped reads (Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
b).</p>
<p id="Par28">Next, we analysed experimentally generated 3′ end sequencing data (see Table
<xref rid="Tab3" ref-type="table">3</xref>
) in the nucleotide-conversion free mESC sample. For each 3′ interval, we compared read counts with and without multi-mapper recovery (Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
c). When including multimappers, 82% of the 19,592 3′ intervals changed their number of mapped reads by less than 5%. However, for many of the remaining 18% of 3′ intervals, the number of mapped reads increased substantially with the multi-mapper assignment strategy. We found that these intervals show a significantly lower associated 3′ UTR mappability score, confirming that our multi-mapper assignment strategy specifically targets intervals with low mappability (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2b,c).</p>
<p id="Par29">Figure
<xref rid="Fig3" ref-type="fig">3</xref>
c also shows the significant increase of the Oct4 read counts when multi-mappers are included (3 x no 4SU samples, mean unique mapper CPM 2.9 vs mean multimapper CPM 1841.1, mean RNA-seq TPM 1673.1, Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
, Figure S3b) and scores in the top 0.2% of the read count distribution. Simulation confirmed that these are indeed reads originating from the Oct4 locus: without multi-mapper assignment only 3% of simulated reads were correctly mapped to
<italic>Oct4</italic>
, while all reads were correctly mapped when applying multi-mapper recovery.</p>
</sec>
<sec id="Sec7">
<title>Masking single nucleotide polymorphisms improves nucleotide-conversion quantification</title>
<p id="Par30">Genuine SNPs influence nucleotide-conversion quantification as reads covering a T > C SNP are mis-interpreted as nucleotide-conversion containing reads. Therefore, DUNK performs SNP calling on the mapped reads to identify genuine SNPs and mask their respective positions in the genome. DUNK considers every position in the genome a genuine SNP position if the fraction of reads carrying an alternative base among all reads exceeds a certain threshold (hereafter called variant fraction).</p>
<p id="Par31">To identify an optimal threshold, we benchmarked variant fractions ranging from 0 to 1 in increments of 0.1 in three nucleotide-conversion-free mESC QuantSeq datasets (see Table
<xref rid="Tab3" ref-type="table">3</xref>
). As a ground truth for the benchmark we used a genuine SNP dataset that was generated by genome sequencing of the same cell line. We found that for variant fractions between 0 and 0.8 DUNK’s SNP calling identifies between 93 and 97% of the SNPs that are present in the truth set (sensitivity) (Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
a, −4SU). Note that the mESCs used in this study were derived from haploid mESCs [
<xref ref-type="bibr" rid="CR12">12</xref>
]. Therefore, SNPs are expected to be fully penetrant across the reads at the respective genomic position. For variant fractions higher than 0.8, sensitivity quickly drops below 85%, consistently for all samples. In contrast, the fraction of identified SNPs that are not present in the truth set (false-positive rate) rapidly decreases for increasing variant fractions and starts to level out around 0.8 for most samples. To assess the influence of nucleotide conversions on SNP calling, we repeated the experiment with three mESC samples containing high numbers of nucleotide conversions (24 h of 4SU treatment). While we did not observe a striking difference in sensitivity between unlabeled and highly labeled replicates, the false-positive rates were higher for low variant fractions, suggesting that nucleotide conversions might be misinterpreted as SNPs when using a low variant-fraction threshold. Judging from the ROC curves, we found a variant fraction of 0.8 to be a good tradeoff between sensitivity and false-positive rate, with an average sensitivity of 94.2% and a mean false-positive rate of 16.8%.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Single-Nucleotide Polymorphism masking:
<bold>a</bold>
ROC curves for three unlabeled mESC replicates (−4SU) vs three labeled replicates (+4SU) across variant fractions from 0 to 1 in steps of 0.1.
<bold>b</bold>
Log10 relative errors of simulated T > C vs recovered T > C conversions for naïve (red) and SNP-masked (blue) datasets for nucleotide-conversion rates of 2.4 and 7%.
<bold>c</bold>
Barcodeplot of 3′ intervals ranked by their T > C read count including SNP-induced T > C conversions. Black bars indicate 3′ intervals containing genuine SNPs.
<bold>d</bold>
Barcodeplot of 3′ intervals ranked by their T > C read count ignoring SNP masked T > C conversions</p>
</caption>
<graphic xlink:href="12859_2019_2849_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p id="Par32">To demonstrate the impact of masking SNPs before quantifying nucleotide-conversions, we simulated SLAMseq data (Table
<xref rid="Tab2" ref-type="table">2</xref>
): For each 3′ interval, we computed the difference between the number of simulated and detected nucleotide-conversions and normalized it by the number of simulated conversions (relative error), once with and once without SNP masking (Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
b). The relative error when applying SNP masking was significantly reduced compared to datasets without SNP masking: with a 2.4% conversion rate, the median relative error dropped from 53 to 0.07%, and with a 7% conversion rate from 17 to 0.002%.</p>
<p id="Par33">To investigate the effect of SNP masking in real data, we correlated the number of identified nucleotide conversions and the number of genuine T > C SNPs in 3′ intervals. To this end, we ranked all 3′ intervals from the three labeled mESC samples (24 h 4SU labeling) by their number of T > C containing reads and inspected the distribution of 3′ intervals that contain a genuine T > C SNP within that ranking (Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
c and d, one replicate shown). In all three replicates, we observed a strong enrichment (
<italic>p</italic>
-values < 0.01, 0.02 and 0.06) of SNPs in 3′ intervals with higher numbers of T > C reads (Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
c, one replicate shown). Since T > C SNPs are not assumed to be associated with T > C conversions, we expect them to be evenly distributed across all 3′ intervals if properly separated from nucleotide conversions. Indeed, applying SNP masking rendered the enrichment of SNPs in 3′ intervals with higher numbers of T > C containing reads non-significant (
<italic>p</italic>
-values 0.56, 0.6 and 0.92) in all replicates (Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
d, one replicate shown).</p>
</sec>
<sec id="Sec8">
<title>SLAM-DUNK: quantifying nucleotide conversions in SLAMseq datasets</title>
<p id="Par34">The main readout of a SLAMseq experiment is the number of 4SU-labeled transcripts, hereafter called labeled transcripts for a given gene in a given sample. However, labeled transcripts cannot be observed directly, but only by counting the number of reads showing converted nucleotides. To this end, SLAM-DUNK provides exact quantifications of T > C read counts for all 3′ intervals in a sample. To validate SLAM-DUNK’s ability to detect T > C reads, we applied SLAM-DUNK to simulated mESC datasets (for details see Table
<xref rid="Tab2" ref-type="table">2</xref>
) and quantified the percentage of correctly identified T > C reads, i.e. the fraction stemming from a labeled transcript (sensitivity). Moreover, we computed the percentage of reads stemming from unlabeled transcripts (specificity). For a perfect simulation, in which all reads originating from labeled transcripts contained a T > C conversion, SLAM-DUNK showed a sensitivity of > 95% and a specificity of > 99%, independent of read length and conversion rate (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4). However, in real datasets not all reads that stem from a labeled transcript contain T > C conversions. To showcase the effect of read length and conversion rate on the ability of SLAMseq to detect the presence of labeled transcripts, we performed a more realistic simulation where the number of T > C conversions per read follows a binomial distribution (allowing for 0 T > C conversions per read).</p>
<p id="Par35">As expected, specificity was unaffected by this change (Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
a). However, sensitivity changed drastically depending on the read length and T > C conversion rate. While we observed a sensitivity of 94% for 150 bp reads and a conversion rate of 7%, sensitivity dropped to 23% for a read length of 50 bp and a 2.4% conversion rate. Based on these findings, we next computed the probability of detecting at least one T > C read for a 3′ interval given the fraction of labeled and unlabeled transcripts for that gene (labeled transcript fraction) for different sequencing depths, read lengths and conversion rates (see
<xref rid="Sec11" ref-type="sec">Methods</xref>
) (Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
b, Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S5). Counterintuitively, shorter read lengths are superior to longer read lengths for detecting at least one read originating from a labeled transcript, especially for low fractions of labeled transcripts. While 26 X coverage is required for 150 bp reads to detect a read from a labeled transcript present at a fraction of 0.1 and a conversion rate of 2.4%, only 22 X coverage is required for 50 bp reads (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1). This suggests that the larger number of short reads at a given coverage contributes more to the probability of detecting reads from a labeled transcript than the higher per-read probability of observing a T > C conversion in longer reads. Increasing the conversion rate to 7% reduces the required coverage by ~ 50% across fractions of labeled transcripts, again with 50 bp read lengths profiting most from the increase. In general, for high labeled transcript fractions such as 1.0, the detection probability converges for all read lengths at a coverage of 2–3 X and 1 X for conversion rates of 2.4 and 7%, respectively (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S5). Although these results are a best-case approximation, they can serve as a guideline for how much coverage is required when designing a SLAMseq experiment that relies on T > C read counts to detect labeled transcripts.
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>Quantification of nucleotide-conversions:
<bold>a</bold>
Sensitivity and specificity of SLAM-DUNK on simulated labeled reads vs recovered T > C containing reads for read lengths of 50, 100 and 150 bp and nucleotide-conversion rates of 2.4 and 7%.
<bold>b</bold>
Heatmap of the probability of detecting at least one read originating from a labeled transcript from a given fraction of labeled transcripts and coverage for a conversion rate of 2.4% and a read length of 50 bp. White color code marks the 0.95 probability boundary.
<bold>c</bold>
Distribution of relative errors of read-based and SLAM-DUNK’s T-content normalized
<italic>fraction of labeled transcript</italic>
estimates for 18 genes with various T-content for 1000 simulated replicates each.
<bold>d</bold>
Distribution of relative errors of SLAM-DUNK’s T-content normalized
<italic>fraction of labeled transcript</italic>
estimates for 1000 genes with T > C conversion rates of 2.4 and 7% and sequencing depth from 25 to 200x</p>
</caption>
<graphic xlink:href="12859_2019_2849_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
<p id="Par36">While estimating the number of labeled transcripts from T > C read counts is sufficient for experiments comparing the same genes in different conditions and performing differential gene expression-like analyses, it does not account for different abundancies of total transcripts when comparing different genes. To address this problem, the number of labeled transcripts for a specific gene must be normalized by the total number of transcripts present for that gene. We will call this the
<italic>fraction of labeled transcripts</italic>
. A straightforward approach to estimate the
<italic>fraction of labeled transcripts</italic>
is to compare the number of labeled reads to the total number of sequenced reads for a given gene (see
<xref rid="Sec11" ref-type="sec">Methods</xref>
). However, this approach does not account for the number of uridines in the 3′ interval. Reads originating from a U-rich transcript, or from a T-rich part of the corresponding genomic 3′ interval, have a higher probability of showing a T > C conversion. Therefore, T > C read counts are influenced by the base composition of the transcript and the coverage pattern. Thus, the
<italic>fraction of labeled transcripts</italic>
will be overestimated for T-rich and underestimated for T-poor 3′ intervals. To normalize for the base composition, SLAM-DUNK implements a T-content and read coverage normalized approach for estimating the
<italic>fractions of labeled transcripts</italic>
(see
<xref rid="Sec11" ref-type="sec">Methods</xref>
). To evaluate both approaches, we picked 18 example genes with varying 3′ interval T-content, 3′ interval length and mappability (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S2 for full list), simulated 1000 SLAMseq datasets (see Table
<xref rid="Tab2" ref-type="table">2</xref>
) for each gene and compared the recovered
<italic>fraction of labeled transcripts</italic>
with the simulated truth (Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
c). The read-count based method showed a mean relative error of 15%. In contrast, SLAM-DUNK’s T-content normalized approach showed a mean relative error of only ~ 2%. Inspection of the 18 genes revealed high variability in the estimates of the read-count based method. While both methods perform equally well for
<italic>Tep1</italic>
, the median error of the other 17 genes varies between 6 and 39% for the read-based method and only between 1 and 4% for SLAM-DUNK. We observed a strong correlation between relative error and T-content using the read-count based method (Pearson’s r: 0.41) and only a very weak association when using SLAM-DUNK’s T-content normalized approach (Pearson’s r: − 0.04). Expanding the analysis from 18 to 1000 genes confirmed the result. For the T > C read-based approach, 23% of the 3′ intervals showed a relative error larger than 20%. For SLAM-DUNK’s T-content normalized approach it was only 8%.</p>
<p id="Par37">Important factors for how confidently we can assess the
<italic>fraction of labeled transcripts</italic>
of a given gene are the T > C conversion rate, read length and sequencing depth. To assess how much SLAMseq read coverage is required for a given read length, we computed the relative error in
<italic>fraction of labeled transcripts</italic>
using SLAM-DUNK’s T-content normalized approach for datasets with conversion rates of 2.4 and 7%, read lengths of 50, 100 and 150 bp and sequencing depths of 25 to 200X (Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
d). First, we looked at datasets with a T > C conversion rate of 2.4%. With a read length of 50 bp, SLAM-DUNK underestimated
<italic>the fractions of labeled transcripts</italic>
by about 10%. This is caused by multi-mapping reads that cannot be assigned to a single 3′ interval. Increasing the read length to 100 or 150 bp allows SLAM-DUNK to assign more reads uniquely to the genome. Therefore, the median relative error is reduced to 3% for these datasets. Sequencing depth showed no influence on the median relative error. However, it influences the variance of the estimates. With a read length of 100 bp and a coverage of 50X, 18% of the 3′ intervals show a relative error of > 20%. Increasing the coverage to 100X or 150X reduces this number to 6 and 0.8%, respectively.</p>
<p id="Par38">Increasing the T > C conversion rate to 7% improved overall
<italic>fraction of labeled transcripts</italic>
estimates noticeably. For 100 bp reads and coverages of 50X, 100X and 200X, the percentage of 3′ intervals with a relative error > 20% is reduced to 3, 0.2 and 0%, respectively. Independent of read length, coverage and T > C conversion rate, the T > C read based
<italic>fraction of labeled transcripts</italic>
estimates performed worse than the SLAM-DUNK estimates (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6).</p>
<p id="Par39">Both
<italic>fraction of labeled transcripts</italic>
estimates as well as raw T > C read counts are affected by sequencing error, especially when the T > C conversion rate is low. To mitigate the impact of sequencing error on the respective quantification measures, SLAM-DUNK optionally applies a base-quality filter on conversion calls. As shown in Fig.
<xref rid="Fig6" ref-type="fig">6</xref>
c, this strategy substantially reduces the signal from erroneous sequencing cycles. In addition, SLAM-DUNK allows the quantification of
<italic>fraction of labeled transcripts</italic>
estimates and raw T > C read counts to be restricted to reads that carry more than one nucleotide-conversion. Muhar et al. [
<xref ref-type="bibr" rid="CR5">5</xref>
] showed that with this strategy, the contribution of background signal from reads with a single T > C conversion was almost completely eradicated when requiring reads with 2 T > C conversions. Alternatively, the background signal of a no-4SU control can be subtracted to address sequencing error, as performed by Herzog et al. [
<xref ref-type="bibr" rid="CR4">4</xref>
].
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Integrated quality controls:
<bold>a</bold>
Nucleotide-conversion rates of read sets from 6 representative mESC time-course samples showing a decrease in T > C conversions proportional to their respective chase time.
<bold>b</bold>
T > C conversion containing read based PCA of 6 mESC timepoints (3 replicates each).
<bold>c</bold>
Distribution of non-T > C mismatches across read positions displays spikes in error rates (highlighted in yellow) for a low T > C conversion content (no 4SU) and a high T > C conversion content (12 h chase) sample, which are dampened or eradicated when applying base-quality filtering.
<bold>d</bold>
Nucleotide-conversion distribution along the last 250 bp of 3′ UTR ends for the mESC time course, showing characteristic curve shifts according to the presumed T > C conversion content (timepoint) and a strong conversion bias towards the 3′ end (highlighted in yellow) induced by the generally reduced T-content in the last bases of 3′ UTRs
</caption>
<graphic xlink:href="12859_2019_2849_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
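<p>Both filters can be illustrated with a minimal Python sketch; the mismatch representation (position, reference base, read base, base quality) and the quality cutoff of 27 are hypothetical simplifications, not SLAM-DUNK’s internal data structures or documented defaults.</p>
<preformat>
def count_tc_conversions(mismatches, snp_positions, min_base_quality=27):
    """Illustrative sketch: count T > C conversions in a single read while
    ignoring SNP-masked positions and low-quality base calls. 'mismatches'
    is assumed to be a list of (genomic_position, reference_base, read_base,
    base_quality) tuples; the quality cutoff of 27 is a placeholder, not
    SLAM-DUNK's documented default."""
    count = 0
    for pos, ref, alt, qual in mismatches:
        if pos in snp_positions:
            continue  # genuine SNP, masked and therefore not a conversion
        if qual >= min_base_quality and ref == "T" and alt == "C":
            count += 1
    return count


def is_labeled_read(mismatches, snp_positions, min_conversions=2):
    """Optionally require more than one T > C conversion per read to suppress
    background signal from sequencing errors (cf. Muhar et al.)."""
    return count_tc_conversions(mismatches, snp_positions) >= min_conversions
</preformat>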
</sec>
<sec id="Sec9">
<title>Quality control and interpretation of SLAMseq datasets</title>
<p id="Par40">To facilitate SLAMseq sample interpretation, we implemented several QC modules into SLAM-DUNK on a per-sample basis. To address the need for interpretation of samples in an experimental context, we provide MultiQC support [
<xref ref-type="bibr" rid="CR13">13</xref>
] for SLAM-DUNK. SLAM-DUNK’s MultiQC module allows inspection of conversion rates, identification of systematic biases and summary statistics across samples.</p>
<p id="Par41">To demonstrate SLAM-DUNK’s QA capabilities, we applied it to 6 representative mESC timecourse datasets with expected increasing nucleotide-conversion content (see Table
<xref rid="Tab3" ref-type="table">3</xref>
). First, we compared the overall nucleotide-conversion rates of all timepoints and observed the expected decrease of T > C nucleotide-conversions at later timepoints (Fig.
<xref rid="Fig6" ref-type="fig">6</xref>
a, one replicate shown). Next, we performed a PCA based on T > C conversion containing reads using all three replicates. We found that replicates cluster together as expected. Furthermore, the 24 h chase and no 4SU samples formed one larger cluster. This is expected since, after 24 h of chase, samples should be essentially free of T > C conversions (Fig.
<xref rid="Fig6" ref-type="fig">6</xref>
b).</p>
<p id="Par42">By inspecting mismatch rates along read positions for two representative samples, we could identify read cycles with increased error rates (Fig.
<xref rid="Fig6" ref-type="fig">6</xref>
c). To reduce read-cycle-dependent nucleotide mismatch noise, we implemented a base-quality cutoff for T > C conversion calling in SLAM-DUNK. Applying this base-quality cutoff significantly increased overall data quality, mitigating or even eradicating error-prone read positions. Finally, we visualized average T > C conversion rates across the last 250 nucleotides of each transcript to inspect positional T > C conversion biases across the 3′ intervals. We found no conversion bias across the static 250 bp windows except for a dip in T > C conversions ~ 20 nucleotides upstream of the 3′ end, which is most likely caused by lower genomic T-content, a characteristic feature of mRNA 3′ end sequences (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S7).</p>
</sec>
</sec>
<sec id="Sec10">
<title>Discussion and conclusions</title>
<p id="Par43">We present Digital Unmasking of Nucleotide-conversions in
<italic>k</italic>
-mers (DUNK) for mapping and quantifying nucleotide conversions in reads stemming from nucleotide-conversion based sequencing protocols. As a showcase application of DUNK, we applied it to T > C nucleotide-conversion containing datasets as produced by the novel SLAMseq protocol. Using real and simulated datasets, we establish DUNK as a method that allows nucleotide-conversion rate independent read mapping for nucleotide-conversion rates of up to 15% when analyzing 100 bp reads. Since the most informative proportion of the overall signal stems from nucleotide-conversion containing reads, the correct mapping of such reads is crucial and reduced the nucleotide-conversion quantification error by ~ 50%. By employing a multi-mapper recovery strategy, DUNK tackles the problem of low-complexity and repetitive sequence content, which is severely aggravated in 3′ end sequencing: We demonstrate that DUNK specifically recovers read mapping signal in 3′ intervals of low mappability that would otherwise be inaccessible to 3′ end sequencing approaches, enhancing the correlation with complementary RNAseq data. Globally, we recover an additional 1–7% of correctly mapped reads at a negligible cost of wrongly mapped reads. We used genome-sequencing datasets to establish optimized variant calling settings, a crucial step of DUNK to separate true nucleotide conversions from false positives stemming from SNPs. Applying these settings, we demonstrated the advantage of SNP masking over naïve nucleotide-conversion quantification: masking uncouples SNPs from nucleotide-conversion content and results in a more accurate nucleotide-conversion quantification, reducing the median quantification error for SNP-harboring 3′ intervals from 53 to 0.07% and from 17 to 0.002% for conversion rates of 2.4 and 7%, respectively.</p>
<p id="Par44">We provide the SLAM-DUNK package, an application of DUNK to SLAMseq datasets: SLAM-DUNK provides absolute read-counts of T > C conversion containing reads which can directly be used for comparing the same transcripts in different conditions and to perform differential gene expression-like analyses with a sensitivity of 95%. Since absolute-read counts between genes are dependent on T-content and sequencing depth, SLAM-DUNK implements a T-content and read coverage normalized approach for estimating the
<italic>fractions of labeled transcripts</italic>
. This quantification routine has clear advantages over read-count based
<italic>fraction of labeled transcripts</italic>
estimates, reducing the proportion of genes with relative errors > 20% from 23% with the read-based approach to only 8% with SLAM-DUNK’s T-content normalized approach. In addition to absolute T > C read counts and
<italic>fractions of labeled transcript</italic>
estimates, SLAM-DUNK’s modular design also allows plugging in statistical frameworks such as GRAND-SLAM [
<xref ref-type="bibr" rid="CR14">14</xref>
], which utilizes a binomial mixture model for estimating proportions of new and old RNA to directly estimate RNA half-lives.</p>
<p id="Par45">While SLAM-DUNK is a showcase application of DUNK to unpaired, stranded and unspliced (QuantSeq) data, NextGenMap – our mapper of choice – is a general-purpose alignment tool that also facilitates paired-end, unstranded datasets. Therefore, NextGenMap can be readily parametrized to process datasets produced by novel applications such as NASC-seq (
<ext-link ext-link-type="uri" xlink:href="https://www.biorxiv.org/content/early/2018/12/17/498667">https://www.biorxiv.org/content/early/2018/12/17/498667</ext-link>
) and scSLAM-seq (
<ext-link ext-link-type="uri" xlink:href="https://www.biorxiv.org/content/10.1101/486852v1">https://www.biorxiv.org/content/10.1101/486852v1</ext-link>
) with the downstream SLAM-DUNK pipeline. Due to SLAM-DUNK’s modular design, one can also entirely swap the alignment tool for another standard RNA-seq aligner, as long as it outputs the BAM tags we introduced for speedy conversion detection (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Supplementary information).</p>
<p id="Par46">SLAM-DUNK’s simulation framework allows assessing error rates for given parameters such as read length, conversion rate and coverage in silico prior to setting up experiments in vitro. This allows bench scientists to inspect simulation results to check whether they are able to reliably interpret nucleotide-conversion readouts for given genes using a certain experimental setup and annotation.</p>
<p id="Par47">Ensuring scalability as well as feasible resource consumptions is vital for processing large multisample experiments such as multi-replicate time courses. SLAM-DUNK achieves this with its modular design and efficient implementation enabling a 21-sample time course experiment to run in under 8 h hours with 10 CPU threads on a desktop machine with a peak memory consumption of 10 GB main memory (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Supplementary information).</p>
<p id="Par48">We demonstrated that SLAM-DUNK visualizations and sample-aggregation via MultiQC are valuable tools to unravel biases in and characteristics of SLAMseq datasets thus facilitating rapid and easy quality checks of samples and providing measures to correct for systemic biases in the data.</p>
<p id="Par49">Deployment of SLAM-DUNK on multiple software platforms and through Docker images ensures low effort installation on heterogeneous computing environments. Verbose and comprehensive output of SLAM-DUNK makes results reproducible, transparent and immediately available to bench-scientists and downstream analysis tools.</p>
</sec>
<sec id="Sec11">
<title>Methods</title>
<sec id="Sec12">
<title>Mapping reads with T > C conversions</title>
<p id="Par50">NextGenMap [
<xref ref-type="bibr" rid="CR7">7</xref>
] maps adapter- and poly(A)-trimmed SLAMseq reads to the user-specified reference genome. Briefly, NextGenMap searches for seed words (13-mers that match between a given read and the reference sequence) using an index data structure. All regions of the reference sequence that exceed a certain seed-word count threshold are candidate mapping regions (CMRs) for the read. Subsequently, NextGenMap identifies the CMR with the highest pairwise Smith-Waterman alignment score as the best mapping position of the read in the genome. If a read has more than one CMR, NextGenMap reports up to 100 locations in the genome. Reads with more than 100 mapping locations are discarded.</p>
<p id="Par51">We extended NextGenMap’s seed-word identification step to allow for a single T > C mismatch in a seed word. Finally, we changed the scoring function of the pairwise sequence alignment algorithms to assign neither a mismatch penalty nor a match score to T > C mismatches:</p>
<p id="Par52">Furthermore, we extended NextGenMap to output additional SAM tags containing all necessary information (e.g. list of all mismatches in a read) required by subsequent steps of SLAM-DUNK (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Supplementary information for details).</p>
</sec>
<sec id="Sec13">
<title>Filtering reads and multi-mapper assignment</title>
<p id="Par53">Only read alignments with a sequence identity of at least 95% and a minimum of 50% of the read bases aligned were kept for the subsequent analysis. Since 3′ end sequencing should generate fragments at 3′ end mRNAs, we discard all read mappings located outside of a user-defined set of 3′ UTR intervals. Still, remaining multi-mapper reads are processed as follows: For a read that maps to two or more locations of the same 3′ UTR, one location is randomly picked, the others are removed. All reads that map to two or more distinct 3′ UTRs are entirely discarded (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S8 for details).</p>
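<p>A minimal sketch of this assignment rule, assuming each alignment of a read is already annotated with the 3′ UTR it overlaps (the tuple representation is a hypothetical simplification of the BAM-based implementation):</p>
<preformat>
import random

def assign_multimapper(alignments):
    """Sketch of the multi-mapper rule. 'alignments' is assumed to be a list
    of (utr_id, genomic_position) tuples for one read, already restricted to
    alignments inside the user-defined 3' UTR intervals. If all alignments
    fall into the same 3' UTR, one of them is picked at random; if they fall
    into two or more distinct 3' UTRs, the read is discarded (None)."""
    utrs = {utr_id for utr_id, _ in alignments}
    if len(utrs) == 1:
        return random.choice(alignments)
    return None

# Two alignments within the same UTR: one is kept (picked at random).
print(assign_multimapper([("Oct4_3UTR", 100), ("Oct4_3UTR", 180)]))
# Alignments spread over two distinct UTRs: the read is discarded.
print(assign_multimapper([("GeneA_3UTR", 100), ("GeneB_3UTR", 55)]))
</preformat>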
</sec>
<sec id="Sec14">
<title>SNP masking</title>
<p id="Par54">SLAM-DUNK uses VarScan 2.4.1 [
<xref ref-type="bibr" rid="CR15">15</xref>
] to call SNPs in the set of filtered reads, requiring a minimum coverage of 10x and a minimum alternative allele frequency of 0.8 for all published and sequenced haploid samples. Thus, if VarScan identifies a C as a SNP at a genomic T position, reads showing a C at this position are not counted as T > C conversions in downstream analysis.</p>
<p id="Par55">For genome-sequencing data, SNPs were called using VarScan 2.4.1 default parameters only outputting homozygous variant positions. Only SNPs that exceed the minimum coverage of the respective Varscan 2.4.1 runs in the benchmarked 3′ intervals are considered for sensitivity and false-positive rate calculations.</p>
<p id="Par56">We used an adaption of the
<italic>barcodeplot</italic>
function of the limma package to visualize the distribution of SNPs along 3′ intervals ordered by their number of T > C reads: to make sure the SNP calls are not coverage-biased, we only use the upper quartile of 3′ intervals in terms of read coverage, excluding from this analysis 3′ intervals that do not meet the coverage cutoffs of the variant calling process. We produce one plot using unmasked T > C containing reads and a separate plot using SNP-masked T > C containing reads. In addition, we apply the Mann-Whitney U test to both sets to quantify how unevenly SNPs are distributed in unmasked vs SNP-masked 3′ intervals. Ideally, a strong association of T > C containing reads with SNPs in the unmasked data and no association in the masked data is expected, showing that the SNP calling actually uncoupled T > C conversions and SNPs. These plots allow a visual assessment of the quality of SNP calling on the data without the presence of actual controls.</p>
</sec>
<sec id="Sec15">
<title>Estimating the fraction of labeled transcripts</title>
<p id="Par57">Let
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
 be the unknown fraction of labeled transcripts for a 3′ interval. With 0 ≤ 
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
 ≤ 1 we denote the efficiency that a transcript contains a 4SU, that the 4SU residue is alkylated, and that the alkylated 4SU base-pairs with G during reverse transcription, which can then be identified as a T > C conversion in high-throughput sequencing. Note that we assume
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
is constant for a given experiment.</p>
</sec>
<sec id="Sec16">
<title>T > C read counts based estimator</title>
<p id="Par58">The probability that a read from a labeled transcript does not show a T > C conversion equals
<disp-formula id="Equa">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {p}_0={\left(1-{p}_e\right)}^t $$\end{document}</tex-math>
<mml:math id="M2" display="block">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mi>t</mml:mi>
</mml:msup>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>t</italic>
is the number of thymidines in the genomic sequence matched by the read. Accordingly, the probability of finding a read with at least one T > C conversion from a labeled transcript equals
<disp-formula id="Equb">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {p}_{T\to C}=1-{p}_0 $$\end{document}</tex-math>
<mml:math id="M4" display="block">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par59">In a sample of
<italic>n</italic>
 reads, given an unknown frequency
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
of labeled transcripts, we expect to find
<disp-formula id="Equc">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ E(SU)={p}_{SU}\ast {p}_{T\to C}\ast n $$\end{document}</tex-math>
<mml:math id="M6" display="block">
<mml:mi>E</mml:mi>
<mml:mfenced close=")" open="(">
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>n</mml:mi>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
reads with at least one conversion site. Note that we assume that
<italic>t</italic>
is the same for all reads. Based on the observed number of converted reads
<italic>R</italic>
<sub>
<italic>T→C</italic>
</sub>
and the number of reads
<italic>n</italic>
, we get
<disp-formula id="Equd">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \widehat{p_{SU}\ast {P}_{T\to C}}=\frac{R_{T\to C}}{n} $$\end{document}</tex-math>
<mml:math id="M8" display="block">
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>n</mml:mi>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par60">In a time course experiment, we can use a time-point with a labeling time long enough that all transcripts are labeled (
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
=1) for all 3' intervals to retrieve
<italic>P</italic>
<sub>
<italic>T→C</italic>
</sub>
. Since we assume that
<italic>P</italic>
<sub>
<italic>T→C</italic>
</sub>
is constant for a given experiment, we obtain
<disp-formula id="Eque">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \widehat{p_{SU}}=\frac{1}{P_{T\to C}}\ast \frac{R_{T\to C}}{n} $$\end{document}</tex-math>
<mml:math id="M10" display="block">
<mml:mover accent="true">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfrac>
<mml:mo></mml:mo>
<mml:mfrac>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>n</mml:mi>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Eque.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
for all other time points.</p>
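<p>As a purely hypothetical numerical example, assume that a fully labeled time point (<italic>p</italic><sub><italic>SU</italic></sub> = 1) yields <italic>P</italic><sub><italic>T→C</italic></sub> = 0.5, i.e. half of all reads from labeled transcripts contain at least one T > C conversion. If a later time point then produces <italic>R</italic><sub><italic>T→C</italic></sub> = 30 converted reads among <italic>n</italic> = 200 reads for a 3′ interval, the estimator above gives an estimated fraction of labeled transcripts of (1/0.5) ∗ (30/200) = 0.3.</p>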
</sec>
<sec id="Sec17">
<title>T-content and coverage normalized estimator</title>
<p id="Par61">Since assuming
<italic>t</italic>
to be the same for all reads is an oversimplification, we want to estimate
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
without using T > C read counts by looking at T-positions individually. For a specific T-position
<italic>i</italic>
in the 3' interval, let
<italic>X</italic>
<sub>
<italic>i</italic>
</sub>
denote the number of reads that show a conversion and let
<italic>c</italic>
<sub>
<italic>i</italic>
</sub>
define the number of reads that cover position
<italic>i</italic>
. Then
<disp-formula id="Equf">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \Pr \left({X}_i={k}_i\ \right|\ {p}_{SU},{p}_e\Big)=\left({c}_i,{k}_i\right){\left({p}_{SU}\ast {p}_e\right)}^k{\left(1-{p}_{SU}\ast {p}_e\right)}^{\left({c}_i-{k}_i\right)}, $$\end{document}</tex-math>
<mml:math id="M12" display="block">
<mml:mo>Pr</mml:mo>
<mml:mfenced close="|" open="(">
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mspace width="0.25em"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.25em"></mml:mspace>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced close=")" open="(">
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mfenced>
<mml:msup>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:msup>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equf.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par62">If
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
is unknown, we can compute the maximum likelihood estimate of the confounded probability
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
*
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
as
<disp-formula id="Equg">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \widehat{p_{SU}\ast {p}_e}=\frac{k_i}{c_i} $$\end{document}</tex-math>
<mml:math id="M14" display="block">
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equg.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par63">If the interval contains
<italic>n T</italic>
s then the maximum likelihood estimate of
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
*
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
equals
<disp-formula id="Equh">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \widehat{p_{SU}\ast {p}_e}=\frac{\sum_{i=1}^n{k}_i}{\sum_{i=1}^n{c}_i} $$\end{document}</tex-math>
<mml:math id="M16" display="block">
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equh.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par64">By retrieving
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
from an experiment with
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
=1 we obtain
<disp-formula id="Equi">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \widehat{p_{SU}}=\frac{1}{p_e}\ast \frac{\sum_{i=1}^n{k}_i}{\sum_{i=1}^n{c}_i} $$\end{document}</tex-math>
<mml:math id="M18" display="block">
<mml:mover accent="true">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mfrac>
<mml:mo></mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equi.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
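<p>A minimal Python sketch of this estimator, assuming the per-position conversion counts <italic>k</italic><sub><italic>i</italic></sub> and coverages <italic>c</italic><sub><italic>i</italic></sub> have already been extracted from the filtered, SNP-masked alignments:</p>
<preformat>
def fraction_labeled(conversions, coverages, p_e):
    """Sketch of the T-content and coverage normalized estimator:
    'conversions' holds k_i (reads showing a T > C conversion at T position i),
    'coverages' holds c_i (reads covering T position i) for all T positions of
    a 3' interval, and p_e is the conversion efficiency obtained from a fully
    labeled control (p_SU = 1). Returns the estimate of p_SU."""
    total_conversions = sum(conversions)
    total_coverage = sum(coverages)
    if total_coverage == 0 or p_e == 0:
        return float("nan")  # interval not covered or p_e unknown
    return (total_conversions / total_coverage) / p_e

# Hypothetical example: three T positions covered by 100, 80 and 120 reads
# with 6, 4 and 8 converted reads, respectively, and p_e = 0.07 (7%).
print(fraction_labeled([6, 4, 8], [100, 80, 120], 0.07))  # ~0.86
</preformat>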
</sec>
<sec id="Sec18">
<title>Computing the probability of detecting at least one T > C read for a 3’ interval</title>
<p id="Par65">Based on the notation developed above, we can compute the expected number of reads showing a T > C conversion given
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
and
<italic>p</italic>
<sub>
<italic>T→C</italic>
</sub>
for a 3' interval with
<italic>n</italic>
sequenced reads as
<disp-formula id="Equj">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ E(SU)={p}_{SU}\ast {p}_{T\to C}\ast n $$\end{document}</tex-math>
<mml:math id="M20" display="block">
<mml:mi>E</mml:mi>
<mml:mfenced close=")" open="(">
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo></mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>n</mml:mi>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equj.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par66">When taking into account empirically determined sensitivity of SLAMDUNK
<italic>S</italic>
, i.e. the probability of detecting a read with a T > C conversion as a labeled read, we compute the probability of detecting at least one labeled read for a 3′ interval given
<italic>p</italic>
<sub>
<italic>SU</italic>
</sub>
and
<italic>p</italic>
<sub>
<italic>T</italic>
 → 
<italic>C</italic>
</sub>
as:
<disp-formula id="Equk">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {p}_{DETECT}=1-B\left(0;E(SU),S\right) $$\end{document}</tex-math>
<mml:math id="M22" display="block">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mtext mathvariant="italic">DETECT</mml:mtext>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>B</mml:mi>
<mml:mfenced close=")" open="(" separators=";,">
<mml:mn>0</mml:mn>
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mfenced close=")" open="(">
<mml:mi mathvariant="italic">SU</mml:mi>
</mml:mfenced>
</mml:mrow>
<mml:mi>S</mml:mi>
</mml:mfenced>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equk.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
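<p>A minimal Python sketch of this computation; the parameter values in the usage example are illustrative only, and E(SU) is treated as a possibly non-integer number of trials, so that B(0; E(SU), S) reduces to (1 − S) raised to the power E(SU).</p>
<preformat>
def detection_probability(p_su, p_tc, n_reads, sensitivity):
    """Sketch of the detection probability defined above. The expected number
    of reads with a T > C conversion is E(SU) = p_su * p_tc * n_reads, and the
    probability that none of them is recovered, given SLAM-DUNK's empirical
    sensitivity S, is B(0; E(SU), S) = (1 - S) ** E(SU), treating E(SU) as a
    possibly non-integer number of trials."""
    expected_tc_reads = p_su * p_tc * n_reads
    return 1.0 - (1.0 - sensitivity) ** expected_tc_reads

# Illustrative example: labeled fraction 0.1, per-read conversion probability
# 0.3, 25x coverage and an empirical sensitivity of 0.95.
print(detection_probability(0.1, 0.3, 25, 0.95))  # ~0.89
</preformat>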
</sec>
<sec id="Sec19">
<title>Simulating SLAMseq datasets</title>
<p id="Par67">All datasets were simulated using SLAM-DUNK’s simulation module. The input parameters are a reference genome sequence and a BED file containing 3′ interval annotations. First, SLAM-DUNK removes overlapping 3′ intervals in the BED file and assigns a labeled transcript fraction between 0 and 1 (uniformly distributed) to each 3′ interval. Alternatively, the user can either supply a fixed fraction of labeled transcripts that is the same for all genes or add gene specific fraction to the initial BED file. Next, SLAM-DUNK extracts the 3′ interval sequences from the supplied FASTA file and randomly adds homozygous SNPs, including but not limited to T > C SNPs, based on a user specified probability (default: 0.1%). For each of the modified 3′ interval SLAM-DUNK simulates RNA reads using the RNASeqReadSimulator (
<ext-link ext-link-type="uri" xlink:href="https://github.com/davidliwei/RNASeqReadSimulator">https://github.com/davidliwei/RNASeqReadSimulator</ext-link>
) package. To mimic QuantSeq datasets, only the last 250 bp of the 3′ interval are used for the simulation. Finally, SLAM-DUNK adds T > C conversions to the reads to simulate transcript labeling. The number of T > C conversions for each labeled read is computed using a binomial distribution
<italic>B(t, p</italic>
<sub>
<italic>e</italic>
</sub>
<italic>)</italic>
with
<italic>t</italic>
the number of Ts in the read and
<italic>p</italic>
<sub>
<italic>e</italic>
</sub>
the conversion probability. All simulated reads are stored in a BAM file. The name of each read contains the name of the 3′ interval the read was simulated from and the number of T > C conversions added. Furthermore, SLAM-DUNK provides a T > C count file containing the number of simulated reads and the simulated fraction of labeled transcripts for all 3′ intervals.</p>
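<p>The labeling step can be sketched as follows; the plain string representation of a read and the use of Python’s random module are simplifications of SLAM-DUNK’s simulation code.</p>
<preformat>
import random

def add_tc_conversions(read_sequence, p_e, labeled):
    """Sketch of the labeling step: for a read drawn from a labeled transcript,
    every genomic T in the read is converted to a C independently with
    probability p_e, so the number of conversions per read follows the
    binomial distribution B(t, p_e) described above. Unlabeled reads are
    returned unchanged. Returns the converted sequence and the number of
    introduced conversions."""
    if not labeled:
        return read_sequence, 0
    converted = []
    n_conversions = 0
    for base in read_sequence:
        if base == "T" and random.random() > 1.0 - p_e:
            converted.append("C")
            n_conversions += 1
        else:
            converted.append(base)
    return "".join(converted), n_conversions

# Example with the 2.4% conversion rate used throughout the manuscript:
seq, n = add_tc_conversions("TTACGTTGCATTTG", p_e=0.024, labeled=True)
</preformat>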
</sec>
<sec id="Sec20">
<title>Relative error</title>
<p id="Par68">Let
<italic>N</italic>
<sub>
<italic>TRUE</italic>
</sub>
be the number of true events and
<italic>N</italic>
<sub>
<italic>DETECT</italic>
</sub>
the number of detected events. Then the relative error
<italic>E</italic>
<sub>
<italic>rel</italic>
</sub>
of a detected quantity compared to the known truth is calculated as follows:
<disp-formula id="Equl">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {E}_{rel}=\frac{N_{TRUE}-{N}_{DETECT}}{N_{TRUE}} $$\end{document}</tex-math>
<mml:math id="M24" display="block">
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi mathvariant="italic">rel</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mtext mathvariant="italic">TRUE</mml:mtext>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mtext mathvariant="italic">DETECT</mml:mtext>
</mml:msub>
</mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mtext mathvariant="italic">TRUE</mml:mtext>
</mml:msub>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2019_2849_Article_Equl.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
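<p>For reference, a one-line Python equivalent of this definition:</p>
<preformat>
def relative_error(n_true, n_detect):
    """Relative error of a detected quantity compared to the known truth,
    as defined above."""
    return (n_true - n_detect) / n_true
</preformat>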
</sec>
<sec id="Sec21">
<title>Mappability assessment</title>
<p id="Par69">We used the GEM library [
<xref ref-type="bibr" rid="CR11">11</xref>
] to calculate 50mer and 100mer mappability tracks for the GRCm38 mouse genome (−e 0). We then used BEDTools [
<xref ref-type="bibr" rid="CR16">16</xref>
] coverage to define mappable regions within the 3′ UTR set published by Herzog et al. [
<xref ref-type="bibr" rid="CR4">4</xref>
] to calculate mappability fractions for 3′ UTRs. The same procedure was applied to RefSeq exons (obtained on May 2, 2016) mapped to Entrez genes to obtain exon mappability fractions (note: these also include 3′ UTRs). Entrez genes were mapped to 3′ UTRs, and only 3′ UTRs with a mappability fraction below 90% were analyzed.</p>
</sec>
<sec id="Sec22">
<title>Datasets</title>
<p id="Par70">For in silico validations with simulated datasets, we used the set of mESC 3′ intervals by Herzog et al. [
<xref ref-type="bibr" rid="CR4">4</xref>
] as reference. The datasets simulated during this study are listed in Table
<xref rid="Tab2" ref-type="table">2</xref>
.</p>
<p id="Par71">For validation, we used real SLAMseq data generated by performing 4SU pulse-chase experiments in mESCs (GEO accession: GSE99970) [
<xref ref-type="bibr" rid="CR4">4</xref>
]. The subsets used during this study are listed in Table
<xref rid="Tab3" ref-type="table">3</xref>
.</p>
<p id="Par72">For validation of the multimapper recovery strategy, we used RNA-seq data from the same mESC line (GEO accession: GSE99970, samples: GSM2666840-GSM2666842) [
<xref ref-type="bibr" rid="CR4">4</xref>
].</p>
</sec>
<sec id="Sec23">
<title>Additional information</title>
<p id="Par73">Supplementary information, Figures and Tables referenced in this study are provided as .pdf file DUNK_SI.</p>
</sec>
<sec id="Sec24">
<title>Genomic DNA sequencing of AN3–12 cells</title>
<p id="Par74">AN3–12 mouse embryonic stem cells [
<xref ref-type="bibr" rid="CR12">12</xref>
] were lysed in lysis buffer (10 mM Tris, pH 7.5; 10 mM EDTA, pH 8; 10 mM NaCl; 0.5% N-lauroylsarcosine) and incubated at 60 °C overnight. DNA was ethanol-precipitated using 0.2 M NaCl and resuspended in 1x TE. Isolated gDNA was phenol-chloroform purified, followed by ethanol precipitation. 2 μg of purified gDNA was sheared in 130 μl of 1x TE in a microTUBE AFA Fiber Crimp-Cap (6 × 16 mm) using the E220 Focused-ultrasonicator (Covaris®) with the following settings: 140 W peak incident power, 10% Duty Factor, 200 cycles per burst, 80 s treatment time. The sheared DNA was bead-purified using Agencourt® AMPure® XP beads (Beckman Coulter) to select fragments between 250 and 500 nt. DNA library preparation was performed using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina® (NEB) and the library was sequenced in paired-end 50 mode on a HiSeq 2500 instrument (Illumina).</p>
</sec>
</sec>
<sec sec-type="supplementary-material">
<title>Additional file</title>
<sec id="Sec25">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12859_2019_2849_MOESM1_ESM.pdf">
<label>Additional file 1:</label>
<caption>
<p>Supplementary information, figures and tables. (PDF 6280 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>4SU</term>
<def>
<p id="Par4">4-thiouridine</p>
</def>
</def-item>
<def-item>
<term>CPM</term>
<def>
<p id="Par5">Counts-per-million</p>
</def>
</def-item>
<def-item>
<term>DUNK</term>
<def>
<p id="Par6">Digital Unmasking of Nucleotide conversions in K-mers</p>
</def>
</def-item>
<def-item>
<term>mESC</term>
<def>
<p id="Par7">mouse embryonic stem cell</p>
</def>
</def-item>
<def-item>
<term>PCA</term>
<def>
<p id="Par8">Principal component analysis</p>
</def>
</def-item>
<def-item>
<term>SLAMseq</term>
<def>
<p id="Par9">Thiol (SH)-linked alkylation for the metabolic sequencing of RNA</p>
</def>
</def-item>
<def-item>
<term>SNP</term>
<def>
<p id="Par10">Single Nucleotide Polymorphism</p>
</def>
</def-item>
<def-item>
<term>T > C</term>
<def>
<p id="Par11">Thymine-to-cytosine</p>
</def>
</def-item>
<def-item>
<term>TPM</term>
<def>
<p id="Par12">Transcripts-per-million</p>
</def>
</def-item>
<def-item>
<term>UTR</term>
<def>
<p id="Par13">Untranslated region</p>
</def>
</def-item>
</def-list>
</glossary>
<ack>
<title>Acknowledgements</title>
<p>The authors thank Pooja Bhat for software testing and bug discovery, Brian Reichholf for helpful discussions on T > C conversion-aware read mapping, Phil Ewels for helping with the MultiQC plugin implementation as well as Stefanie Detamble for graphics support.</p>
<sec id="FPar1">
<title>Funding</title>
<p id="Par75">This work was supported by the European Research Council to SLA (ERC-StG-338252, ERC-PoC-825710 SLAMseq) and JZ (ERC-StG-336860) and the Austrian Science Fund to SLA (Y-733-B22 START, W-1207-B09, and SFB F43–22), and AvH (W-1207-B09). MM is a recipient of a DOC Fellowship of the Austrian Academy of Sciences. The IMP is generously supported by Boehringer Ingelheim. All funding sources had no role in the design of this study, no role in collection, execution, analyses, interpretation of the data and no role in writing of the manuscript.</p>
</sec>
<sec id="FPar2" sec-type="data-availability">
<title>Availability of data and materials</title>
<p id="Par76">The datasets analysed during the current study are available in the Gene Expression Omnibus (
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo">https://www.ncbi.nlm.nih.gov/geo</ext-link>
) under accession GSE99978.</p>
<p id="Par77">The genome-sequencing dataset used and analysed during the current study is available in SRA under accession SRP154182.</p>
<p id="Par78">SLAM-DUNK is available from Bioconda, as Python package from PyPI, as Docker image from Docker hub and also from source (
<ext-link ext-link-type="uri" xlink:href="http://t-neumann.github.io/slamdunk">http://t-neumann.github.io/slamdunk</ext-link>
) under the GNU AGPL license.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>TN and PR developed the software and conducted the computational experiments. VAH provided biological samples, VAH and MM provided essential input for labeled fraction estimations and discussions. AvH, JZ and SLA provided essential input for method development. TN, PR and AvH wrote the manuscript with input from all authors. All of the authors have read and approved the final manuscript.</p>
</notes>
<notes>
<title>Ethics approval and consent to participate</title>
<p id="Par79">Not applicable.</p>
</notes>
<notes>
<title>Consent for publication</title>
<p id="Par80">Not applicable.</p>
</notes>
<notes notes-type="COI-statement">
<title>Competing interests</title>
<p id="Par81">Stefan Ludwig Ameres, Veronika Anna Herzog, Johannes Zuber and Matthias Muhar are inventors on patent application EU17166629.0–1403 submitted by the IMBA that covers methods for the modification and identification of nucleic acids, which have been licensed to Lexogen GmbH.</p>
</notes>
<notes>
<title>Publisher’s Note</title>
<p id="Par82">Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Frommer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>McDonald</surname>
<given-names>LE</given-names>
</name>
<name>
<surname>Millar</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Collis</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Watt</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Grigg</surname>
<given-names>GW</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>1992</year>
<volume>89</volume>
<fpage>1827</fpage>
<lpage>1831</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.89.5.1827</pub-id>
<pub-id pub-id-type="pmid">1542678</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hafner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Landthaler</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Burger</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Khorshid</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hausser</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Berninger</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP</article-title>
<source>Cell</source>
<year>2010</year>
<volume>141</volume>
<fpage>129</fpage>
<lpage>141</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2010.03.009</pub-id>
<pub-id pub-id-type="pmid">20371350</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Yi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Epitranscriptome sequencing technologies: decoding RNA modifications</article-title>
<source>Nat Methods</source>
<year>2016</year>
<volume>14</volume>
<fpage>23</fpage>
<lpage>31</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.4110</pub-id>
<pub-id pub-id-type="pmid">28032622</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Herzog</surname>
<given-names>VA</given-names>
</name>
<name>
<surname>Reichholf</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Neumann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Rescheneder</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bhat</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Burkard</surname>
<given-names>TR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Thiol-linked alkylation of RNA to assess expression dynamics</article-title>
<source>Nat Methods</source>
<year>2017</year>
<volume>14</volume>
<fpage>1198</fpage>
<lpage>1204</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.4435</pub-id>
<pub-id pub-id-type="pmid">28945705</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Muhar</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ebert</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Neumann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Umkehrer</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Jude</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wieshofer</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SLAM-seq defines direct gene-regulatory functions of the BRD4-MYC axis</article-title>
<source>Science</source>
<year>2018</year>
<volume>360</volume>
<fpage>800</fpage>
<lpage>805</lpage>
<pub-id pub-id-type="doi">10.1126/science.aao2793</pub-id>
<pub-id pub-id-type="pmid">29622725</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<mixed-citation publication-type="other">Matsushima W, Herzog VA, Neumann T, Gapp K, Zuber J, Ameres SL, et al. SLAM-ITseq: sequencing cell type-specific transcriptomes without cell sorting. Development. Oxford University Press for The Company of Biologists Limited. 2018;145:dev.164640.</mixed-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sedlazeck</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Rescheneder</surname>
<given-names>P</given-names>
</name>
<name>
<surname>von Haeseler</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>NextGenMap: fast and accurate read mapping in highly polymorphic genomes</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<fpage>2790</fpage>
<lpage>2791</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt468</pub-id>
<pub-id pub-id-type="pmid">23975764</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<mixed-citation publication-type="other">Moll P, Ante M, Seitz A, Reda T. QuantSeq 3` mRNA sequencing for RNA quantification. Nat Methods. 11. 2014;972 EP.</mixed-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mignone</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Gissi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Liuni</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pesole</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Untranslated regions of mRNAs</article-title>
<source>Genome Biol</source>
<year>2002</year>
<volume>3</volume>
<fpage>REVIEWS0004</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2002-3-3-reviews0004</pub-id>
<pub-id pub-id-type="pmid">11897027</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Young</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>Control of the embryonic stem cell state</article-title>
<source>Cell</source>
<year>2011</year>
<volume>144</volume>
<fpage>940</fpage>
<lpage>954</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2011.01.032</pub-id>
<pub-id pub-id-type="pmid">21414485</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<mixed-citation publication-type="other">Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigó R, et al. Fast computation and applications of genome mappability. PLoS One. 2012;7:e30377.</mixed-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<mixed-citation publication-type="other">Elling U, Wimmer RA, Leibbrandt A, Burkard T, Michlits G, Leopoldi A, et al. A reversible haploid mouse embryonic stem cell biobank resource for functional genomics. Nature. 2017;550:114–8.</mixed-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ewels</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Magnusson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lundin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Käller</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>MultiQC: summarize analysis results for multiple tools and samples in a single report</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<fpage>3047</fpage>
<lpage>3048</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw354</pub-id>
<pub-id pub-id-type="pmid">27312411</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jürges</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Dölken</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Erhard</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Dissecting newly transcribed and old RNA using GRAND-SLAM</article-title>
<source>Bioinformatics</source>
<year>2018</year>
<volume>34</volume>
<fpage>i218</fpage>
<lpage>i226</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bty256</pub-id>
<pub-id pub-id-type="pmid">29949974</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koboldt</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Larson</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>D</given-names>
</name>
<name>
<surname>McLellan</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing</article-title>
<source>Genome Res</source>
<year>2012</year>
<volume>22</volume>
<fpage>568</fpage>
<lpage>576</lpage>
<pub-id pub-id-type="doi">10.1101/gr.129684.111</pub-id>
<pub-id pub-id-type="pmid">22300766</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<mixed-citation publication-type="other">Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinform. 2014;27:11.12.1–11.12.34 Hoboken, NJ, USA: John Wiley & Sons, Inc.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000282 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000282 | SxmlIndent | more
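As a minimal sketch, reusing only the HfdSelect and SxmlIndent tools and the biblio.hfd layout shown above, the indented record can also be redirected to a standalone XML file instead of being paged through more:

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
# Extract record 000282, pretty-print it and save it for offline inspection
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000282 | SxmlIndent > 000282.xml
# Optional sanity check: count the reference entries in the exported record
grep -c '<ref id=' 000282.xml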

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:6528199
   |texte=   Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:31109287" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 
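The same pipeline can be wrapped in a small shell loop to generate wiki pages for several records at once. This is only a sketch that reuses the commands and flags shown above; the list of PubMed identifiers is a placeholder and should be taken from this corpus:

for PMID in 31109287; do   # add further PubMed IDs from this corpus here
    HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i -Sk "pubmed:$PMID" \
        | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd \
        | NlmPubMed2Wicri -a MersV1
done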

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021