Cyberinfrastructure exploration server


Internal identifier: 000227 (Pmc/Corpus); previous: 0002269; next: 0002280

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data</title>
<author>
<name sortKey="Dozmorov, Mikhail G" sort="Dozmorov, Mikhail G" uniqKey="Dozmorov M" first="Mikhail G" last="Dozmorov">Mikhail G. Dozmorov</name>
<affiliation>
<nlm:aff id="I1">Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Adrianto, Indra" sort="Adrianto, Indra" uniqKey="Adrianto I" first="Indra" last="Adrianto">Indra Adrianto</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Giles, Cory B" sort="Giles, Cory B" uniqKey="Giles C" first="Cory B" last="Giles">Cory B. Giles</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Glass, Edmund" sort="Glass, Edmund" uniqKey="Glass E" first="Edmund" last="Glass">Edmund Glass</name>
<affiliation>
<nlm:aff id="I1">Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Glenn, Stuart B" sort="Glenn, Stuart B" uniqKey="Glenn S" first="Stuart B" last="Glenn">Stuart B. Glenn</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Montgomery, Courtney" sort="Montgomery, Courtney" uniqKey="Montgomery C" first="Courtney" last="Montgomery">Courtney Montgomery</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sivils, Kathy L" sort="Sivils, Kathy L" uniqKey="Sivils K" first="Kathy L" last="Sivils">Kathy L. Sivils</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Olson, Lorin E" sort="Olson, Lorin E" uniqKey="Olson L" first="Lorin E" last="Olson">Lorin E. Olson</name>
<affiliation>
<nlm:aff id="I3">Immunobiology and Cancer Research Program, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Iwayama, Tomoaki" sort="Iwayama, Tomoaki" uniqKey="Iwayama T" first="Tomoaki" last="Iwayama">Tomoaki Iwayama</name>
<affiliation>
<nlm:aff id="I3">Immunobiology and Cancer Research Program, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Freeman, Willard M" sort="Freeman, Willard M" uniqKey="Freeman W" first="Willard M" last="Freeman">Willard M. Freeman</name>
<affiliation>
<nlm:aff id="I4">Reynolds Oklahoma Center on Aging, Donald W. Reynolds Department of Geriatric Medicine, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I5">Department of Physiology, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lessard, Christopher J" sort="Lessard, Christopher J" uniqKey="Lessard C" first="Christopher J" last="Lessard">Christopher J. Lessard</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wren, Jonathan D" sort="Wren, Jonathan D" uniqKey="Wren J" first="Jonathan D" last="Wren">Jonathan D. Wren</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I6">Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26423047</idno>
<idno type="pmc">4597324</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597324</idno>
<idno type="RBID">PMC:4597324</idno>
<idno type="doi">10.1186/1471-2105-16-S13-S10</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000227</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data</title>
<author>
<name sortKey="Dozmorov, Mikhail G" sort="Dozmorov, Mikhail G" uniqKey="Dozmorov M" first="Mikhail G" last="Dozmorov">Mikhail G. Dozmorov</name>
<affiliation>
<nlm:aff id="I1">Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Adrianto, Indra" sort="Adrianto, Indra" uniqKey="Adrianto I" first="Indra" last="Adrianto">Indra Adrianto</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Giles, Cory B" sort="Giles, Cory B" uniqKey="Giles C" first="Cory B" last="Giles">Cory B. Giles</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Glass, Edmund" sort="Glass, Edmund" uniqKey="Glass E" first="Edmund" last="Glass">Edmund Glass</name>
<affiliation>
<nlm:aff id="I1">Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Glenn, Stuart B" sort="Glenn, Stuart B" uniqKey="Glenn S" first="Stuart B" last="Glenn">Stuart B. Glenn</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Montgomery, Courtney" sort="Montgomery, Courtney" uniqKey="Montgomery C" first="Courtney" last="Montgomery">Courtney Montgomery</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sivils, Kathy L" sort="Sivils, Kathy L" uniqKey="Sivils K" first="Kathy L" last="Sivils">Kathy L. Sivils</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Olson, Lorin E" sort="Olson, Lorin E" uniqKey="Olson L" first="Lorin E" last="Olson">Lorin E. Olson</name>
<affiliation>
<nlm:aff id="I3">Immunobiology and Cancer Research Program, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Iwayama, Tomoaki" sort="Iwayama, Tomoaki" uniqKey="Iwayama T" first="Tomoaki" last="Iwayama">Tomoaki Iwayama</name>
<affiliation>
<nlm:aff id="I3">Immunobiology and Cancer Research Program, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Freeman, Willard M" sort="Freeman, Willard M" uniqKey="Freeman W" first="Willard M" last="Freeman">Willard M. Freeman</name>
<affiliation>
<nlm:aff id="I4">Reynolds Oklahoma Center on Aging, Donald W. Reynolds Department of Geriatric Medicine, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I5">Department of Physiology, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lessard, Christopher J" sort="Lessard, Christopher J" uniqKey="Lessard C" first="Christopher J" last="Lessard">Christopher J. Lessard</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wren, Jonathan D" sort="Wren, Jonathan D" uniqKey="Wren J" first="Jonathan D" last="Wren">Jonathan D. Wren</name>
<affiliation>
<nlm:aff id="I2">Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I6">Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments.</p>
</sec>
<sec>
<title>Methods</title>
<p>We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data on the same patient samples to determine whether changes in pre-processing improved correlation between the two. We have developed a simple tool to remove low complexity regions, RepeatSoaker, available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/mdozmorov/RepeatSoaker">https://github.com/mdozmorov/RepeatSoaker</ext-link>
, and tested its effect on the alignment statistics and the results of the enrichment analyses.</p>
</sec>
<sec>
<title>Results</title>
<p>Both adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping with low complexity regions, as defined by RepeatMasker, further improved the strength of biological signals, and the correlation between RNA-seq and microarray gene expression data.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Adapter trimming and duplicate removal, coupled with filtering out reads overlapping low complexity regions, are shown to increase the quality and reliability of detecting biological signals in RNA-seq and ChIP-seq data.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Huddleston, J" uniqKey="Huddleston J">J Huddleston</name>
</author>
<author>
<name sortKey="Ranade, S" uniqKey="Ranade S">S Ranade</name>
</author>
<author>
<name sortKey="Malig, M" uniqKey="Malig M">M Malig</name>
</author>
<author>
<name sortKey="Antonacci, F" uniqKey="Antonacci F">F Antonacci</name>
</author>
<author>
<name sortKey="Chaisson, M" uniqKey="Chaisson M">M Chaisson</name>
</author>
<author>
<name sortKey="Hon, L" uniqKey="Hon L">L Hon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Howorka, S" uniqKey="Howorka S">S Howorka</name>
</author>
<author>
<name sortKey="Cheley, S" uniqKey="Cheley S">S Cheley</name>
</author>
<author>
<name sortKey="Bayley, H" uniqKey="Bayley H">H Bayley</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Walsh, P" uniqKey="Walsh P">P Walsh</name>
</author>
<author>
<name sortKey="Lu, X" uniqKey="Lu X">X Lu</name>
</author>
<author>
<name sortKey="Carroll, J" uniqKey="Carroll J">J Carroll</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Del Fabbro, C" uniqKey="Del Fabbro C">C Del Fabbro</name>
</author>
<author>
<name sortKey="Scalabrin, S" uniqKey="Scalabrin S">S Scalabrin</name>
</author>
<author>
<name sortKey="Morgante, M" uniqKey="Morgante M">M Morgante</name>
</author>
<author>
<name sortKey="Giorgi, Fm" uniqKey="Giorgi F">FM Giorgi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Turner, S" uniqKey="Turner S">S Turner</name>
</author>
<author>
<name sortKey="Armstrong, Ll" uniqKey="Armstrong L">LL Armstrong</name>
</author>
<author>
<name sortKey="Bradford, Y" uniqKey="Bradford Y">Y Bradford</name>
</author>
<author>
<name sortKey="Carlson, Cs" uniqKey="Carlson C">CS Carlson</name>
</author>
<author>
<name sortKey="Crawford, Dc" uniqKey="Crawford D">DC Crawford</name>
</author>
<author>
<name sortKey="Crenshaw, At" uniqKey="Crenshaw A">AT Crenshaw</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
<author>
<name sortKey="Banks, E" uniqKey="Banks E">E Banks</name>
</author>
<author>
<name sortKey="Poplin, R" uniqKey="Poplin R">R Poplin</name>
</author>
<author>
<name sortKey="Garimella, Kv" uniqKey="Garimella K">KV Garimella</name>
</author>
<author>
<name sortKey="Maguire, Jr" uniqKey="Maguire J">JR Maguire</name>
</author>
<author>
<name sortKey="Hartl, C" uniqKey="Hartl C">C Hartl</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abel, Hj" uniqKey="Abel H">HJ Abel</name>
</author>
<author>
<name sortKey="Al Kateb, H" uniqKey="Al Kateb H">H Al-Kateb</name>
</author>
<author>
<name sortKey="Cottrell, Ce" uniqKey="Cottrell C">CE Cottrell</name>
</author>
<author>
<name sortKey="Bredemeyer, Aj" uniqKey="Bredemeyer A">AJ Bredemeyer</name>
</author>
<author>
<name sortKey="Pritchard, Cc" uniqKey="Pritchard C">CC Pritchard</name>
</author>
<author>
<name sortKey="Grossmann, Ah" uniqKey="Grossmann A">AH Grossmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
<author>
<name sortKey="Negre, N" uniqKey="Negre N">N Negre</name>
</author>
<author>
<name sortKey="Li, Q" uniqKey="Li Q">Q Li</name>
</author>
<author>
<name sortKey="Mieczkowska, Jo" uniqKey="Mieczkowska J">JO Mieczkowska</name>
</author>
<author>
<name sortKey="Slattery, M" uniqKey="Slattery M">M Slattery</name>
</author>
<author>
<name sortKey="Liu, T" uniqKey="Liu T">T Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Handsaker, B" uniqKey="Handsaker B">B Handsaker</name>
</author>
<author>
<name sortKey="Wysoker, A" uniqKey="Wysoker A">A Wysoker</name>
</author>
<author>
<name sortKey="Fennell, T" uniqKey="Fennell T">T Fennell</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Homer, N" uniqKey="Homer N">N Homer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Furey, Ts" uniqKey="Furey T">TS Furey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, W" uniqKey="Zhou W">W Zhou</name>
</author>
<author>
<name sortKey="Chen, T" uniqKey="Chen T">T Chen</name>
</author>
<author>
<name sortKey="Zhao, H" uniqKey="Zhao H">H Zhao</name>
</author>
<author>
<name sortKey="Eterovic, Ak" uniqKey="Eterovic A">AK Eterovic</name>
</author>
<author>
<name sortKey="Meric Bernstam, F" uniqKey="Meric Bernstam F">F Meric-Bernstam</name>
</author>
<author>
<name sortKey="Mills, Gb" uniqKey="Mills G">GB Mills</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Majewski, J" uniqKey="Majewski J">J Majewski</name>
</author>
<author>
<name sortKey="Ott, J" uniqKey="Ott J">J Ott</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
<author>
<name sortKey="Zilversmit, Mm" uniqKey="Zilversmit M">MM Zilversmit</name>
</author>
<author>
<name sortKey="Hartl, Dl" uniqKey="Hartl D">DL Hartl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kapitonov, Vv" uniqKey="Kapitonov V">VV Kapitonov</name>
</author>
<author>
<name sortKey="Jurka, J" uniqKey="Jurka J">J Jurka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Ruotti, V" uniqKey="Ruotti V">V Ruotti</name>
</author>
<author>
<name sortKey="Stewart, Rm" uniqKey="Stewart R">RM Stewart</name>
</author>
<author>
<name sortKey="Thomson, Ja" uniqKey="Thomson J">JA Thomson</name>
</author>
<author>
<name sortKey="Dewey, Cn" uniqKey="Dewey C">CN Dewey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chung, D" uniqKey="Chung D">D Chung</name>
</author>
<author>
<name sortKey="Kuan, Pf" uniqKey="Kuan P">PF Kuan</name>
</author>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Sanalkumar, R" uniqKey="Sanalkumar R">R Sanalkumar</name>
</author>
<author>
<name sortKey="Liang, K" uniqKey="Liang K">K Liang</name>
</author>
<author>
<name sortKey="Bresnick, Eh" uniqKey="Bresnick E">EH Bresnick</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lander, Es" uniqKey="Lander E">ES Lander</name>
</author>
<author>
<name sortKey="Linton, Lm" uniqKey="Linton L">LM Linton</name>
</author>
<author>
<name sortKey="Birren, B" uniqKey="Birren B">B Birren</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C Nusbaum</name>
</author>
<author>
<name sortKey="Zody, Mc" uniqKey="Zody M">MC Zody</name>
</author>
<author>
<name sortKey="Baldwin, J" uniqKey="Baldwin J">J Baldwin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abecasis, Gr" uniqKey="Abecasis G">GR Abecasis</name>
</author>
<author>
<name sortKey="Auton, A" uniqKey="Auton A">A Auton</name>
</author>
<author>
<name sortKey="Brooks, Ld" uniqKey="Brooks L">LD Brooks</name>
</author>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
<author>
<name sortKey="Durbin, Rm" uniqKey="Durbin R">RM Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hangauer, Mj" uniqKey="Hangauer M">MJ Hangauer</name>
</author>
<author>
<name sortKey="Vaughn, Iw" uniqKey="Vaughn I">IW Vaughn</name>
</author>
<author>
<name sortKey="Mcmanus, Mt" uniqKey="Mcmanus M">MT McManus</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pickrell, Jk" uniqKey="Pickrell J">JK Pickrell</name>
</author>
<author>
<name sortKey="Gaffney, Dj" uniqKey="Gaffney D">DJ Gaffney</name>
</author>
<author>
<name sortKey="Gilad, Y" uniqKey="Gilad Y">Y Gilad</name>
</author>
<author>
<name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Stanciu, M" uniqKey="Stanciu M">M Stanciu</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, H" uniqKey="Lee H">H Lee</name>
</author>
<author>
<name sortKey="Popodi, E" uniqKey="Popodi E">E Popodi</name>
</author>
<author>
<name sortKey="Foster, Pl" uniqKey="Foster P">PL Foster</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmieder, R" uniqKey="Schmieder R">R Schmieder</name>
</author>
<author>
<name sortKey="Edwards, R" uniqKey="Edwards R">R Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Olson, Le" uniqKey="Olson L">LE Olson</name>
</author>
<author>
<name sortKey="Soriano, P" uniqKey="Soriano P">P Soriano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bolger, Am" uniqKey="Bolger A">AM Bolger</name>
</author>
<author>
<name sortKey="Lohse, M" uniqKey="Lohse M">M Lohse</name>
</author>
<author>
<name sortKey="Usadel, B" uniqKey="Usadel B">B Usadel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, N" uniqKey="Chen N">N Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Roberts, A" uniqKey="Roberts A">A Roberts</name>
</author>
<author>
<name sortKey="Goff, L" uniqKey="Goff L">L Goff</name>
</author>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Team, Rdc" uniqKey="Team R">RDC Team</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feng, J" uniqKey="Feng J">J Feng</name>
</author>
<author>
<name sortKey="Liu, T" uniqKey="Liu T">T Liu</name>
</author>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Machanick, P" uniqKey="Machanick P">P Machanick</name>
</author>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosenbloom, Kr" uniqKey="Rosenbloom K">KR Rosenbloom</name>
</author>
<author>
<name sortKey="Sloan, Ca" uniqKey="Sloan C">CA Sloan</name>
</author>
<author>
<name sortKey="Malladi, Vs" uniqKey="Malladi V">VS Malladi</name>
</author>
<author>
<name sortKey="Dreszer, Tr" uniqKey="Dreszer T">TR Dreszer</name>
</author>
<author>
<name sortKey="Learned, K" uniqKey="Learned K">K Learned</name>
</author>
<author>
<name sortKey="Kirkup, Vm" uniqKey="Kirkup V">VM Kirkup</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="abstract" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26423047</article-id>
<article-id pub-id-type="pmc">4597324</article-id>
<article-id pub-id-type="publisher-id">1471-2105-16-S13-S10</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-16-S13-S10</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Dozmorov</surname>
<given-names>Mikhail G</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>MDozmorov@vcu.edu</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Adrianto</surname>
<given-names>Indra</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>Indra-Adrianto@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Giles</surname>
<given-names>Cory B</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>Cory-Giles@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>Glass</surname>
<given-names>Edmund</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>GlassER@vcu.edu</email>
</contrib>
<contrib contrib-type="author" id="A5">
<name>
<surname>Glenn</surname>
<given-names>Stuart B</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>Stuart-Glenn@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A6">
<name>
<surname>Montgomery</surname>
<given-names>Courtney</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>Courtney-Montgomery@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A7">
<name>
<surname>Sivils</surname>
<given-names>Kathy L</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>Kathy-Sivils@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A8">
<name>
<surname>Olson</surname>
<given-names>Lorin E</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>Lorin-Olson@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A9">
<name>
<surname>Iwayama</surname>
<given-names>Tomoaki</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>Tomoaki-Iwayama@omrf.org</email>
</contrib>
<contrib contrib-type="author" id="A10">
<name>
<surname>Freeman</surname>
<given-names>Willard M</given-names>
</name>
<xref ref-type="aff" rid="I4">4</xref>
<xref ref-type="aff" rid="I5">5</xref>
<email>Willard-Freeman@ouhsc.edu</email>
</contrib>
<contrib contrib-type="author" id="A11">
<name>
<surname>Lessard</surname>
<given-names>Christopher J</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>Chris-Lessard@omrf.org</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A12">
<name>
<surname>Wren</surname>
<given-names>Jonathan D</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I6">6</xref>
<email>Jonathan-Wren@omrf.org</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA</aff>
<aff id="I2">
<label>2</label>
Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</aff>
<aff id="I3">
<label>3</label>
Immunobiology and Cancer Research Program, Oklahoma Medical Research Foundation. Oklahoma City, OK, USA</aff>
<aff id="I4">
<label>4</label>
Reynolds Oklahoma Center on Aging, Donald W. Reynolds Department of Geriatric Medicine, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</aff>
<aff id="I5">
<label>5</label>
Department of Physiology, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</aff>
<aff id="I6">
<label>6</label>
Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center. Oklahoma City, OK, USA</aff>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>25</day>
<month>9</month>
<year>2015</year>
</pub-date>
<volume>16</volume>
<issue>Suppl 13</issue>
<supplement>
<named-content content-type="supplement-title">Proceedings of the 12th Annual MCBIOS Conference</named-content>
<named-content content-type="supplement-editor">Jonathan D Wren, Shraddha Thakkar, Ramin Homayouni, Donald J Johann and Mikhail G Dozmorov</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. Articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</named-content>
</supplement>
<fpage>S10</fpage>
<lpage>S10</lpage>
<permissions>
<copyright-statement>Copyright © 2015 Dozmorov et al.</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>Dozmorov et al.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/16/S13/S10"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments.</p>
</sec>
<sec>
<title>Methods</title>
<p>We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data on the same patient samples to determine whether changes in pre-processing improved correlation between the two. We have developed a simple tool to remove low complexity regions, RepeatSoaker, available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/mdozmorov/RepeatSoaker">https://github.com/mdozmorov/RepeatSoaker</ext-link>
, and tested its effect on the alignment statistics and the results of the enrichment analyses.</p>
</sec>
<sec>
<title>Results</title>
<p>Both adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping with low complexity regions, as defined by RepeatMasker, further improved the strength of biological signals, and the correlation between RNA-seq and microarray gene expression data.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Adapter trimming and duplicate removal, coupled with filtering out reads overlapping low complexity regions, are shown to increase the quality and reliability of detecting biological signals in RNA-seq and ChIP-seq data.</p>
</sec>
</abstract>
<conference>
<conf-date>13-14 March 2015</conf-date>
<conf-name>12th Annual MCBIOS Conference</conf-name>
<conf-loc>Little Rock, AR, USA</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>Next generation sequencing (NGS) technology is primarily based on massively parallel sequencing of millions of short reads from DNA/RNA samples, although the read length has been increasing [
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B2">2</xref>
]. The costs of NGS have rapidly dropped [
<xref ref-type="bibr" rid="B3">3</xref>
] and, consequently, there has been a relatively rapid shift from the use of microarrays to RNA-seq data to study transcription. This increased reliance on NGS necessitates examination of analysis steps that may affect the quality of the data and the interpretations drawn from it.</p>
<p>Although different types of NGS experiments and library preparation protocols dictate downstream processing steps, removing sequence adapters used to construct the short read library [
<xref ref-type="bibr" rid="B4">4</xref>
] as well as removing low quality bases from short reads [
<xref ref-type="bibr" rid="B5">5</xref>
] are the typical quality control steps. This is followed by removing duplicate reads, which can arise during library amplification by polymerase chain reaction [
<xref ref-type="bibr" rid="B6">6</xref>
]. The rationale behind this step is that such duplicate reads may lead to erroneous conclusions regarding the true level of biological signal, e.g., variant detection in DNA-seq data [
<xref ref-type="bibr" rid="B7">7</xref>
], gene expression in RNA-seq data [
<xref ref-type="bibr" rid="B8">8</xref>
], quantification of gene rearrangements [
<xref ref-type="bibr" rid="B9">9</xref>
], and in ChIP-seq data [
<xref ref-type="bibr" rid="B10">10</xref>
]. Two schools of thought have emerged in the field regarding how to best address this issue. The first is a widely used practice to remove all duplicate or low complexity reads from the dataset, presuming they are a source of potential bias [
<xref ref-type="bibr" rid="B7">7</xref>
,
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
-
<xref ref-type="bibr" rid="B12">12</xref>
]. The second holds that these duplicates may be true unique observations and that their removal introduces bias of its own [
<xref ref-type="bibr" rid="B7">7</xref>
]. Although the effect of duplicate reads has been investigated in DNA sequencing [
<xref ref-type="bibr" rid="B13">13</xref>
], the question of how duplicates affect biological signals from gene expression in RNA-seq experiments and motif detection in ChIP-seq experiments remains open.</p>
<p>The presence of low complexity [
<xref ref-type="bibr" rid="B14">14</xref>
,
<xref ref-type="bibr" rid="B15">15</xref>
] and repetitive elements [
<xref ref-type="bibr" rid="B16">16</xref>
] in the reference genome has received less attention with regard to how it may affect the conclusions of biological studies. Such regions complicate alignment because short reads originating from them can be mapped to multiple locations, making their interpretation challenging [
<xref ref-type="bibr" rid="B17">17</xref>
,
<xref ref-type="bibr" rid="B18">18</xref>
] or even impossible. This problem is not small, as eukaryotic genomes can be very rich in repeats; for example, some have estimated that the human genome contains ~47% repetitive regions [
<xref ref-type="bibr" rid="B19">19</xref>
]. Although recent findings from the ENCODE project suggest that the genome is pervasively transcribed [
<xref ref-type="bibr" rid="B20">20</xref>
,
<xref ref-type="bibr" rid="B21">21</xref>
], RNA molecules originating from low complexity and repetitive regions are a potential problem for analysis of RNA-seq data because they may promiscuously align throughout the genome [
<xref ref-type="bibr" rid="B17">17</xref>
]. This also causes a problem for motif detection within protein-DNA interaction regions, identified by ChIP-seq [
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B22">22</xref>
]. The use of paired-end sequencing, such as implemented by Illumina technologies, helps to control for proper mapping of reads by identifying the discordant read pairs whose mapped loci deviate from the expected orientation and insert size. The problem of multi-mapped reads has also been investigated in detection of structural variants, with the general consensus being to ignore them [
<xref ref-type="bibr" rid="B23">23</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
].</p>
<p>In this work, we systematically investigated the effect of adapter trimming, duplicate reads removal, and filtering out reads overlapping low complexity regions upon biological signal in RNA-seq and ChIP-seq experiments. At each processing step, we used an indirect measurement of the strength of the biological signals by performing pathway- and gene ontology enrichment analyses of genes detected from RNA-seq data, and transcription factor binding sites enrichment analyses in peaks detected from ChIP-seq data. Our rationale here is that, if a processing step leads to a more significant enrichment p-value, that processing step is likely to positively influence biological signal.</p>
<p>Removal of reads overlapping low complexity regions (referred to hereafter as "low complexity reads") has received less attention. For example, this step is included in the set of quality control steps of the PRINSEQ tool [
<xref ref-type="bibr" rid="B25">25</xref>
]. To allow isolated testing of the effect of low complexity read removal, we introduce a simple post-alignment filtering tool, RepeatSoaker, that filters out reads overlapping regions listed in a user-provided template file containing genomic coordinates of low complexity regions. Designed to be aligner-independent, RepeatSoaker processes aligned data in BAM format, removes low complexity reads, and outputs a cleaned BAM file and filtering statistics. RepeatSoaker is a straightforward method to remove alignment artifacts from NGS data, designed to eliminate potential false positive reads when quantifying transcript expression. Because it is extendable to any other sequencing technology where low complexity reads may introduce bias, such as ChIP-seq, we envision that RepeatSoaker could become a standard step that helps to structure reproducible NGS pipelines.</p>
<p>We applied a combination of adapter trimming, duplicate removal, and filtering out low complexity reads with RepeatSoaker to RNA-seq and ChIP-seq experiments, and investigated how each step affects the results of downstream enrichment analysis. Our results show that adapter trimming increases the significance of gene ontology and pathway enrichment analyses in RNA-seq data, and strengthens motif detection in ChIP-seq data. The duplicate removal step, despite decreasing the number of reads, further helps to increase the significance of biological signals, especially when coupled with adapter trimming. Filtering out low complexity reads with RepeatSoaker has a minor effect on the total number of reads, yet this step has a positive effect on the detection of biological signals. Our study suggests that adapter trimming and duplicate removal are important steps in detecting stronger biological signals within RNA-seq and ChIP-seq data, and that optional filtering out of reads overlapping low complexity regions will further increase the quality of conclusions.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Data source</title>
<p>The effect of adapter trimming, duplicate removal, and filtering out reads overlapping low complexity regions was investigated in two types of data: RNA-seq and ChIP-seq.</p>
<p>Human RNA-seq data were obtained from 60 Sjögren's syndrome patients and 30 healthy controls (unpublished data). In short, peripheral blood was collected in PaxGene tubes (BD Diagnostics; Franklin Lakes, NJ) and RNA was isolated using standard protocols (Qiagen, Inc., Valencia, CA). Globin transcripts were removed using GlobinClear (Life Technologies; Grand Island, NY) and samples were prepared for sequencing using the NuGen ENCORE complete kit (San Carlos, CA). Paired-end 100 bp sequencing was performed on the Illumina HiSeq 2000 using standard procedures. For the 54 samples used in the RNA-seq experiment, matching gene expression profiling was performed using Illumina HumanWG-6 v3.0 BeadChips. Pearson's correlation coefficient was used to test the agreement between human RNA expression estimates measured by RNA-seq (log2 FPKM counts) and microarrays (log2 intensities) in the cohort of patients diagnosed with Sjögren's syndrome.</p>
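<p>As a hedged illustration of the agreement measure described above, the following minimal Python sketch computes Pearson's correlation coefficient between log2-transformed RNA-seq and microarray values. The function name and the toy per-gene values are hypothetical and are not taken from the study; the sketch only shows the arithmetic, not the authors' actual R/Bioconductor workflow.</p>

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy per-gene values (hypothetical, strictly positive so log2 is defined).
fpkm = [1.0, 2.0, 4.0, 8.0, 16.0]
intensity = [64.0, 128.0, 256.0, 512.0, 1024.0]

log2_fpkm = [math.log2(v) for v in fpkm]
log2_intensity = [math.log2(v) for v in intensity]

r = pearson_r(log2_fpkm, log2_intensity)
r_squared = r ** 2
```

<p>Squaring the coefficient gives a value comparable to an R2 statistic, under the assumption that R2 denotes the squared Pearson coefficient.</p>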
<p>Mouse RNA-seq data were obtained from 3 wild type mice and 3 mice expressing the D842V mutant form of PDGFRα (platelet-derived growth factor receptor alpha), which was previously described [
<xref ref-type="bibr" rid="B26">26</xref>
]. Cell suspensions from neonatal dermis were prepared from 3-day-old mouse skin after separating the dermis from the epidermis and digesting the dermis with 0.35% collagenase type 1 for 60 minutes. Subsequently, Nestin-GFP+ singlets were sorted with a MoFlo XDP cell sorter (Beckman Coulter). RNA was isolated from 2-3 million cells using the RNeasy kit (Qiagen). cDNA libraries were prepared with the NEBNext Ultra Directional RNA Library Prep Kit for Illumina (New England Biolabs) according to the manufacturer's protocol. In short, mRNA was isolated from 1 µg of purified total RNA with oligo dT beads and fragmented. First and second strand cDNA were then synthesized, followed by purification using Agencourt AMPure XP beads (Beckman Coulter). The second strand cDNAs were end-repaired, A-tailed, and adaptor-ligated. DNA size-selected with Agencourt AMPure XP beads was enriched by 13 cycles of PCR with index primers and again purified using the beads. Each indexed library was analyzed on an Agilent 2200 TapeStation system (Agilent), and the RNA integrity number equivalent (RINe) values ranged from 9.2 to 9.6. The libraries were then quantified with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific) and pooled for sequencing.</p>
<p>ChIP-seq data were obtained from 10 systemic lupus erythematosus patients and 10 healthy controls of European descent (unpublished data). Briefly, all nucleated cells were isolated from human blood using PolyPrep (Sigma, Deisenhofen, Germany) density gradient medium. Proteins were cross-linked to the DNA using formaldehyde and protein-DNA complexes were immunoprecipitated using a polyclonal rabbit anti-PU.1 (Spi-1) antibody (sc-352, Santa Cruz Biotechnology, Santa Cruz, California, USA). A separate sequencing library was prepared for each individual using the Illumina ChIP-Seq DNA Sample Prep Kit (Illumina, San Diego, California, USA). Sequencing was done using the Illumina HiSeq 2000 platform with 5 samples per lane. Case and control samples were sequenced together on the same lane, e.g., 3 cases and 2 controls in one lane.</p>
</sec>
<sec>
<title>Data processing</title>
<p>The quality of raw sequence data was assessed using FASTQC. Adapter trimming was performed using the Trimmomatic v0.30 program [
<xref ref-type="bibr" rid="B27">27</xref>
]. The reads were cropped to a length of 70 bp, the adapters were trimmed, and bases with quality below 20 on the Phred+33 scale were also cut. Reads shorter than 10 bases were discarded. Only paired reads were used for subsequent analyses. Duplicate removal was performed using the PICARD MarkDuplicates tool. Summary statistics were obtained using the SAMTOOLS FLAGSTAT command [
<xref ref-type="bibr" rid="B11">11</xref>
].</p>
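<p>The cropping, quality trimming, and length filtering described above can be sketched in a few lines of Python. This is a simplified illustration and not Trimmomatic's algorithm: it assumes that low quality bases are removed from the 3' end only, it does not perform adapter matching, and the function names are hypothetical. The default values (70 bp crop, quality 20, minimum length 10, Phred+33 offset) are the ones stated in the text.</p>

```python
def trim_read(seq, qual, crop_len=70, min_quality=20, min_len=10, phred_offset=33):
    """Crop a read, trim low-quality 3' bases, and drop it if too short.

    Returns (seq, qual) for a surviving read, or None if the read is discarded.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Step 1: crop to a fixed maximum length.
    seq, qual = seq[:crop_len], qual[:crop_len]
    # Step 2: remove trailing bases whose Phred score is below the cutoff.
    end = len(seq)
    while end > 0 and min_quality > ord(qual[end - 1]) - phred_offset:
        end -= 1
    seq, qual = seq[:end], qual[:end]
    # Step 3: discard reads shorter than the minimum length.
    if min_len > len(seq):
        return None
    return seq, qual


def trim_pair(read1, read2, **kwargs):
    """Keep a read pair only if both mates survive trimming."""
    trimmed1 = trim_read(*read1, **kwargs)
    trimmed2 = trim_read(*read2, **kwargs)
    if trimmed1 is None or trimmed2 is None:
        return None
    return trimmed1, trimmed2
```

<p>For example, a 20-base read whose last five quality characters are "#" (Phred score 2) is shortened to 15 bases, while a read consisting only of such bases is discarded together with its mate.</p>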
<p>To remove reads overlapping any of the regions identified by RepeatMasker [
<xref ref-type="bibr" rid="B28">28</xref>
,
<xref ref-type="bibr" rid="B29">29</xref>
], we implemented RepeatSoaker (
<ext-link ext-link-type="uri" xlink:href="https://github.com/mdozmorov/RepeatSoaker">https://github.com/mdozmorov/RepeatSoaker</ext-link>
). RepeatSoaker utilizes a user-provided list of genomic coordinates of low complexity regions in BED format [
<xref ref-type="bibr" rid="B30">30</xref>
]. Note that the coordinates should correspond to the organism and genome assembly of the original BAM files. We provide a Makefile that automates generation of BED files containing genomic coordinates of all regions identified by RepeatMasker for GRCh37/hg19 and NCBI37/mm9 genomes.</p>
<p>RepeatSoaker provides the flexibility to set a threshold for deciding whether a read should be kept or filtered due to its overlap with a low complexity region. A user can set the percent overlap threshold, e.g., 75%, 50%, or 25%. A read whose overlap with a low complexity region exceeds the threshold, e.g., more than 75%, is filtered. We tested the effect of removing reads whose overlap with low complexity regions exceeded 75%, 50%, 25%, and 0%. A 0% threshold indicates that a read is filtered if it is immediately proximal to a repeat region.</p>
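<p>The thresholding logic described above can be illustrated with a minimal Python sketch. It is not RepeatSoaker's implementation, which operates on BAM files; it works on plain coordinates only. Several points are assumptions: the overlap is measured as a percentage of the read length, coordinates are 0-based and half-open as in BED, the comparison is strict ("more than" the threshold), and the function names are hypothetical.</p>

```python
def covered_bases(read_start, read_end, regions):
    """Number of read bases covered by low complexity regions.

    `regions` is a list of (start, end) tuples on the same chromosome as the
    read. Regions may overlap each other; each read base is counted once.
    """
    if not read_end > read_start:
        raise ValueError("read must have positive length")
    covered = 0
    cursor = read_start  # right edge of the part of the read already counted
    for start, end in sorted(regions):
        start = max(start, cursor)
        end = min(end, read_end)
        if end > start:
            covered += end - start
            cursor = end
    return covered


def should_filter(read_start, read_end, regions, threshold_percent):
    """True if the covered percentage of the read exceeds the threshold."""
    covered = covered_bases(read_start, read_end, regions)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 100 * covered > threshold_percent * (read_end - read_start)


# Hypothetical read spanning 100-200 with two partly overlapping repeats.
repeats = [(150, 260), (140, 160)]
decisions = {t: should_filter(100, 200, repeats, t) for t in (75, 50, 25, 0)}
```

<p>In this toy case 60 of the 100 read bases are covered, so the read is kept at the 75% threshold and filtered at 50%, 25%, and 0%. With a strict comparison, a read that merely abuts a repeat without sharing a base is kept at 0%; handling such adjacent reads would need an additional rule that this sketch does not attempt.</p>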
<p>For the human and mouse RNA-seq data, raw FASTQ files were aligned to the human and mouse genomes (hg19/mm9, respectively) using TOPHAT [
<xref ref-type="bibr" rid="B31">31</xref>
]. The read counts per gene or transcript were generated using HTSEQ-COUNT. Differentially expressed (DE) transcripts were determined using the DESeq R package with a false discovery rate (FDR) q-value below 0.05 and a fold change above 2 or below 0.5. All data manipulations were performed in the R/Bioconductor environment [
<xref ref-type="bibr" rid="B32">32</xref>
].</p>
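<p>The selection criteria for differentially expressed transcripts can be expressed compactly, as in the following Python sketch. It only applies the stated cutoffs to precomputed fold changes and q-values; it does not reproduce the DESeq analysis itself, and the function name and example values are hypothetical.</p>

```python
import math


def is_differentially_expressed(fold_change, q_value, q_cutoff=0.05, fc_cutoff=2.0):
    """Apply the FDR and fold-change cutoffs to one transcript.

    A fold change above 2 or below 0.5 is equivalent to an absolute log2 fold
    change above 1, which treats up- and down-regulation symmetrically.
    """
    if not fold_change > 0:
        # log2 is undefined here; this sketch simply skips such transcripts.
        return False
    passes_fdr = q_cutoff > q_value
    passes_fc = abs(math.log2(fold_change)) > math.log2(fc_cutoff)
    return passes_fdr and passes_fc


# Hypothetical results: transcript id -> (fold change, FDR q-value).
results = {
    "tx_up": (4.0, 0.001),
    "tx_down": (0.25, 0.01),
    "tx_small_change": (1.5, 0.001),
    "tx_not_significant": (8.0, 0.2),
}
de_transcripts = sorted(
    name for name, (fc, q) in results.items() if is_differentially_expressed(fc, q)
)
```

<p>Both comparisons are strict, so a transcript with a fold change of exactly 2 or 0.5, or a q-value of exactly 0.05, is not selected.</p>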
<p>For the human ChIP-seq data, raw FASTQ files were aligned to the human hg19 genome using bowtie2 [
<xref ref-type="bibr" rid="B33">33</xref>
]. The PU.1 binding peaks were called using MACS2 [
<xref ref-type="bibr" rid="B34">34</xref>
], and the consensus motifs were detected using the MEME-ChIP suite [
<xref ref-type="bibr" rid="B35">35</xref>
].</p>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<sec>
<title>Systematic testing of data preprocessing steps</title>
<p>To elucidate the effects of adapter removal, elimination of duplicates, and filtering of low complexity regions, we performed systematic testing of sequencing data with and without applying these three preprocessing steps (Figure
<xref ref-type="fig" rid="F1">1</xref>
). At each step, we compared the alignment statistics, the number of differentially expressed genes (RNA-seq) or identified transcription factor binding peaks (ChIP-seq), and the results of Gene Ontology, KEGG and Reactome pathway enrichment analyses (RNA-seq) and motif enrichment analyses (ChIP-seq). We also compared combinations of data preprocessing steps, e.g., how duplicate removal affected trimmed and untrimmed data. At each comparison, we evaluated how the data preprocessing steps affected biological signals as judged by the functional enrichment analysis.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>RepeatSoaker comparisons</bold>
. Overview of the various permutations of the three data processing steps compared.</p>
</caption>
<graphic xlink:href="1471-2105-16-S13-S10-1"></graphic>
</fig>
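<p>The set of compared processing combinations can be enumerated programmatically. The short Python sketch below is an illustration only, with hypothetical names; it assumes that adapter trimming and duplicate removal are each either applied or not, and that RepeatSoaker is either skipped or run at one of the four tested thresholds, which yields the 20 combinations listed as rows in Table 1.</p>

```python
from itertools import product

TRIM = (False, True)                   # adapter trimming applied?
DEDUP = (False, True)                  # duplicate removal applied?
RS_THRESHOLDS = (None, 75, 50, 25, 0)  # None means RepeatSoaker was not applied


def flag(applied):
    """Render a boolean step in the +/- notation used in the tables."""
    return "+" if applied else "-"


def label(trim, dedup, rs):
    """Human-readable label for one combination of processing steps."""
    rs_text = "-" if rs is None else str(rs)
    return f"Trim{flag(trim)} Dup{flag(dedup)} RS:{rs_text}"


combinations = list(product(TRIM, DEDUP, RS_THRESHOLDS))
labels = [label(*combo) for combo in combinations]
```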
</sec>
<sec>
<title>Adapter removal increases the data quality and the strength of biological signals</title>
<p>Before investigating the effects of low complexity region filtering, we assessed how adapter and duplicate removal affected the alignment quality of the sequencing data. Adapter trimming increased the number of total reads in RNA-seq data because Trimmomatic had cut low quality bases from the middle portion of some of the reads, thus splitting some of them into multiple shorter reads that still passed the minimum length (10 bp) threshold (Table
<xref ref-type="table" rid="T1">1</xref>
, Additional Files
<xref ref-type="supplementary-material" rid="S1">1</xref>
and
<xref ref-type="supplementary-material" rid="S2">2</xref>
). Such cuts resulted in more reads with mates mapped to a different chromosome (referred to hereafter as "mismapped reads"). Adapter trimming of ChIP-seq data slightly decreased the total number of reads, as compared with unprocessed data (Additional File
<xref ref-type="supplementary-material" rid="S3">3</xref>
). However, the percentage of properly paired reads increased while the percentages of singletons and mismapped reads decreased (Table
<xref ref-type="table" rid="T2">2</xref>
). In summary, the adapter-trimming step altered the properties of sequencing data depending on the quality of the unprocessed data.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>RNA-seq alignment statistics for different combinations of the sequencing data processing steps</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Trim</th>
<th align="center">Dup</th>
<th align="center">RS</th>
<th align="center">Total reads</th>
<th align="center">properly paired (%)</th>
<th align="center">singletons (%)</th>
<th align="center">with mate mapped to a different chr (%)</th>
<th align="center">Number of DEGs</th>
<th align="center">KEGG: ECM-receptor interaction</th>
<th align="center">GO: multicellular organismal process</th>
<th align="center">Reactome: Transmembrane transport of small molecules</th>
<th align="center">R
<sup>2</sup>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">41,505,942</td>
<td align="center">68.98+-3.71</td>
<td align="center">16.03+-3.20</td>
<td align="center">4.01+-3.48</td>
<td align="center">2189</td>
<td align="center">1.86E-07</td>
<td align="center">8.86E-16</td>
<td align="center">6.38E-13</td>
<td align="center">0.6687</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">51,984,539</td>
<td align="center">60.10+-3.68</td>
<td align="center">10.98+-2.44</td>
<td align="center">17.92+-3.89</td>
<td align="center">2139</td>
<td align="center">3.29E-08</td>
<td align="center">2.86E-13</td>
<td align="center">9.85E-11</td>
<td align="center">0.6614</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">15,429,501</td>
<td align="center">61.72+-5.45</td>
<td align="center">12.49+-2.37</td>
<td align="center">10.22+-4.68</td>
<td align="center">2487</td>
<td align="center">1.34E-07</td>
<td align="center">2.50E-22</td>
<td align="center">1.85E-11</td>
<td align="center">0.6672</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">25,738,167</td>
<td align="center">43.14+-5.77</td>
<td align="center">7.51+-1.33</td>
<td align="center">36.18+-5.70</td>
<td align="center">2391</td>
<td align="center">1.97E-07</td>
<td align="center">4.93E-17</td>
<td align="center">3.54E-09</td>
<td align="center">0.6575</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">75</td>
<td align="center">28,283,010</td>
<td align="center">70.55+-3.17</td>
<td align="center">16.74+-3.85</td>
<td align="center">0.69+-0.09</td>
<td align="center">2100</td>
<td align="center">7.62E-08</td>
<td align="center">8.85E-17</td>
<td align="center">5.71E-12</td>
<td align="center">0.6708</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">50</td>
<td align="center">26,450,592</td>
<td align="center">70.05+-3.24</td>
<td align="center">17.22+-3.95</td>
<td align="center">0.63+-0.08</td>
<td align="center">2068</td>
<td align="center">7.18E-08</td>
<td align="center">1.31E-16</td>
<td align="center">6.85E-14</td>
<td align="center">0.6712</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">25</td>
<td align="center">24,703,408</td>
<td align="center">69.63+-3.31</td>
<td align="center">17.66+-4.05</td>
<td align="center">0.62+-0.08</td>
<td align="center">2021</td>
<td align="center">8.92E-09</td>
<td align="center">6.33E-19</td>
<td align="center">2.94E-14</td>
<td align="center">0.6705</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">0</td>
<td align="center">21,413,178</td>
<td align="center">69.39+-3.47</td>
<td align="center">18.20+-4.26</td>
<td align="center">0.61+-0.09</td>
<td align="center">2087</td>
<td align="center">1.02E-08</td>
<td align="center">3.13E-19</td>
<td align="center">3.98E-15</td>
<td align="center">0.6643</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">75</td>
<td align="center">32,589,028</td>
<td align="center">64.70+-3.88</td>
<td align="center">12.29+-2.88</td>
<td align="center">10.46+-1.82</td>
<td align="center">2116</td>
<td align="center">4.88E-07</td>
<td align="center">1.12E-13</td>
<td align="center">2.89E-12</td>
<td align="center">0.6637</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">50</td>
<td align="center">30,174,345</td>
<td align="center">64.93+-3.92</td>
<td align="center">12.70+-2.97</td>
<td align="center">9.71+-1.82</td>
<td align="center">2066</td>
<td align="center">3.49E-07</td>
<td align="center">2.96E-14</td>
<td align="center">2.44E-12</td>
<td align="center">0.6642</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">25</td>
<td align="center">28,231,486</td>
<td align="center">64.55+-4.00</td>
<td align="center">13.04+-3.03</td>
<td align="center">9.76+-1.87</td>
<td align="center">2004</td>
<td align="center">3.15E-07</td>
<td align="center">5.05E-16</td>
<td align="center">1.13E-13</td>
<td align="center">0.6636</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">0</td>
<td align="center">24,546,936</td>
<td align="center">63.85+-4.25</td>
<td align="center">13.42+-3.12</td>
<td align="center">10.26+-2.06</td>
<td align="center">2028</td>
<td align="center">2.65E-07</td>
<td align="center">3.42E-16</td>
<td align="center">3.55E-14</td>
<td align="center">0.6583</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">75</td>
<td align="center">9,681,047</td>
<td align="center">68.59+-4.15</td>
<td align="center">12.22+-2.96</td>
<td align="center">1.45+-0.25</td>
<td align="center">2302</td>
<td align="center">1.21E-07</td>
<td align="center">4.18E-23</td>
<td align="center">5.71E-12</td>
<td align="center">0.6695</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">50</td>
<td align="center">8,987,150</td>
<td align="center">68.34+-4.18</td>
<td align="center">12.53+-3.06</td>
<td align="center">1.221+-0.20</td>
<td align="center">2256</td>
<td align="center">1.48E-07</td>
<td align="center">4.80E-22</td>
<td align="center">4.07E-14</td>
<td align="center">0.6700</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">25</td>
<td align="center">8,346,861</td>
<td align="center">68.09+-4.22</td>
<td align="center">12.83+-3.16</td>
<td align="center">1.21+-0.19</td>
<td align="center">2245</td>
<td align="center">1.48E-07</td>
<td align="center">2.70E-21</td>
<td align="center">1.44E-14</td>
<td align="center">0.6694</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">0</td>
<td align="center">7,151,500</td>
<td align="center">68.13+-4.26</td>
<td align="center">12.99+-3.12</td>
<td align="center">1.19+-0.20</td>
<td align="center">2326</td>
<td align="center">1.69E-07</td>
<td align="center">2.81E-24</td>
<td align="center">8.28E-16</td>
<td align="center">0.6628</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">75</td>
<td align="center">14,251,402</td>
<td align="center">52.88+-6.50</td>
<td align="center">7.54+-1.45</td>
<td align="center">23.48+-5.56</td>
<td align="center">2210</td>
<td align="center">1.18E-06</td>
<td align="center">4.34E-20</td>
<td align="center">3.95E-09</td>
<td align="center">0.6598</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">50</td>
<td align="center">12,985,873</td>
<td align="center">53.93+-6.45</td>
<td align="center">7.65+-1.48</td>
<td align="center">22.01+-5.48</td>
<td align="center">2180</td>
<td align="center">7.69E-07</td>
<td align="center">5.94E-19</td>
<td align="center">3.02E-11</td>
<td align="center">0.6604</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">25</td>
<td align="center">12,125,100</td>
<td align="center">53.69+-6.50</td>
<td align="center">7.81+-1.51</td>
<td align="center">22.11+-5.58</td>
<td align="center">2124</td>
<td align="center">4.40E-06</td>
<td align="center">2.98E-19</td>
<td align="center">6.28E-13</td>
<td align="center">0.6599</td>
</tr>
<tr>
<td colspan="12">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">0</td>
<td align="center">10,416,970</td>
<td align="center">52.34+-6.98</td>
<td align="center">7.74+-1.30</td>
<td align="center">23.54+-6.17</td>
<td align="center">2176</td>
<td align="center">4.22E-06</td>
<td align="center">3.54E-18</td>
<td align="center">3.00E-13</td>
<td align="center">0.6539</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>"Total reads" - average number of reads; "properly paired (%)" - average percent of properly paired reads; "singletons (%)" - average percent of singleton reads (reads whose mate did not align); "with mate mapped to a different chr (%)" - average percent of inter-chromosome mapped reads. "Number of DEGs" - number of differentially expressed genes. To allow direct comparisons of p-values among the processing steps, the "ECM-receptor interaction" KEGG pathway, the "multicellular organismal process" GO, and the "Transmembrane transport of small molecules" Reactome pathway were selected as the most representative and most enriched functional categories in each processing step, with the full enrichment analyses results shown in Additional Files 4 and 5. "+/-" indicates whether the step (Trim - adapter trimming, Dup - duplicate removal, RS - filtering out low complexity regions with RepeatSoaker) was applied/not applied, respectively. The number in the RepeatSoaker column reflects the threshold for removing reads overlapping with low complexity regions, e.g., 75 indicates that reads overlapping 75% or more with a low complexity region were removed.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>ChIP-seq alignment statistics for different combinations of sequencing data processing steps</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Trim</th>
<th align="center">Dup</th>
<th align="center">RS</th>
<th align="center">Total reads</th>
<th align="center">properly paired (%)</th>
<th align="center">singletons (%)</th>
<th align="center">with mate mapped to a different chr (%)</th>
<th align="center">SPI1 E-value</th>
<th align="center">Number of motifs</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">70,929,429</td>
<td align="center">96.81+-1.57</td>
<td align="center">0.40+-0.17</td>
<td align="center">0.80+-0.34</td>
<td align="center">8.2e-9446</td>
<td align="center">44</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">70,155,620</td>
<td align="center">97.49+-1.44</td>
<td align="center">0.15+-0.03</td>
<td align="center">0.39+-0.13</td>
<td align="center">7.6e-10075</td>
<td align="center">65</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">40,472,954</td>
<td align="center">95.60+-2.19</td>
<td align="center">0.44+-0.24</td>
<td align="center">1.53+-1.06</td>
<td align="center">1.5e-10726</td>
<td align="center">26</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">40,500,416</td>
<td align="center">96.83+-1.82</td>
<td align="center">0.16+-0.04</td>
<td align="center">0.65+-0.30</td>
<td align="center">4.0e-11010</td>
<td align="center">25</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">75</td>
<td align="center">68,856,152</td>
<td align="center">96.81+-1.57</td>
<td align="center">0.39+-0.16</td>
<td align="center">0.79+-0.34</td>
<td align="center">2.0e-9425</td>
<td align="center">43</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">50</td>
<td align="center">68,578,937</td>
<td align="center">96.81+-1.57</td>
<td align="center">0.39+-0.16</td>
<td align="center">0.79+-0.34</td>
<td align="center">2.0e-9425</td>
<td align="center">42</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">25</td>
<td align="center">68,405,768</td>
<td align="center">96.81+-1.57</td>
<td align="center">0.39+-0.16</td>
<td align="center">0.79+-0.34</td>
<td align="center">2.0e-9425</td>
<td align="center">43</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">0</td>
<td align="center">68,279,169</td>
<td align="center">96.81+-1.57</td>
<td align="center">0.39+-0.16</td>
<td align="center">0.79+-0.34</td>
<td align="center">2.0e-9425</td>
<td align="center">42</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">75</td>
<td align="center">68,004,984</td>
<td align="center">97.50+-1.45</td>
<td align="center">0.15+-0.03</td>
<td align="center">0.38+-0.13</td>
<td align="center">2.8e-9899</td>
<td align="center">64</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">50</td>
<td align="center">67,805,679</td>
<td align="center">97.50+-1.45</td>
<td align="center">0.15+-0.03</td>
<td align="center">0.38+-0.13</td>
<td align="center">2.8e-9899</td>
<td align="center">64</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">25</td>
<td align="center">67,679,663</td>
<td align="center">97.50+-1.45</td>
<td align="center">0.15+-0.03</td>
<td align="center">0.38+-0.13</td>
<td align="center">2.8e-9899</td>
<td align="center">67</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">0</td>
<td align="center">67,587,630</td>
<td align="center">97.50+-1.45</td>
<td align="center">0.15+-0.03</td>
<td align="center">0.38+-0.13</td>
<td align="center">2.8e-9899</td>
<td align="center">62</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">75</td>
<td align="center">39,242,893</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">7.6e-10575</td>
<td align="center">26</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">50</td>
<td align="center">39,080,973</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">7.6e-10575</td>
<td align="center">26</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">25</td>
<td align="center">38,981,300</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">7.6e-10575</td>
<td align="center">26</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">0</td>
<td align="center">38,908,929</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">7.6e-10575</td>
<td align="center">26</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">75</td>
<td align="center">39,242,893</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">4.7e-10731</td>
<td align="center">27</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">50</td>
<td align="center">39,080,973</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">4.7e-10731</td>
<td align="center">27</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">25</td>
<td align="center">38,981,300</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">4.7e-10731</td>
<td align="center">27</td>
</tr>
<tr>
<td colspan="9">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">0</td>
<td align="center">38,908,929</td>
<td align="center">95.61+-2.19</td>
<td align="center">0.43+-0.24</td>
<td align="center">1.51+-1.05</td>
<td align="center">4.7e-10731</td>
<td align="center">30</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>"+/-" indicates whether a step (Trim - adapter trimming, Dup - duplicate removal, RS - filtering out low complexity regions with RepeatSoaker) was applied/not applied, respectively. The RepeatSoaker % is the overlap threshold for removing reads overlapping low complexity regions, i.e., 75% indicates that reads overlapping low complexity regions over 75% or more of their length are removed. The SPI1 E-value is the equivalent of a p-value for the detection of the PU.1 motif.</p>
</table-wrap-foot>
</table-wrap>
<p>To identify how the adapter-trimming step affects biological signals in RNA-seq data, we evaluated the total number of differentially expressed genes before and after trimming, and their functional enrichment results. The total number of differentially expressed genes remained virtually unchanged, as well as the order of enrichment p-values for KEGG, GO, and Reactome pathway enrichment results, and Pearson's correlation coefficient with the microarray gene expression data (Table
<xref ref-type="table" rid="T1">1</xref>
, Additional Files
<xref ref-type="supplementary-material" rid="S4">4</xref>
and
<xref ref-type="supplementary-material" rid="S5">5</xref>
). In the case of ChIP-seq experiments, we evaluated the number and the significance of detected motifs within ChIP-seq binding peaks. As we have evaluated genome-wide binding of the PU.1 protein, the most significant motif enriched in the detected peaks was SPI1 (aka PU.1, Additional File
<xref ref-type="supplementary-material" rid="S6">6</xref>
). The total number and the significance of the detection of this motif increased following the adapter trimming step. These results support the notion that adapter trimming alone increases the strength of biological signals within the data [
<xref ref-type="bibr" rid="B5">5</xref>
].</p>
</sec>
<sec>
<title>Removing duplicates negatively affects alignment statistics but strongly improves the detection of biologically relevant signal</title>
<p>Removing duplicates had the greatest effect on the alignment statistics, decreasing the total number of reads by ~40% in RNA-seq and ChIP-seq data, as compared with the unprocessed data (Additional Files
<xref ref-type="supplementary-material" rid="S1">1</xref>
,
<xref ref-type="supplementary-material" rid="S2">2</xref>
,
<xref ref-type="supplementary-material" rid="S3">3</xref>
). This step also decreased the percentage of singletons (Tables
<xref ref-type="table" rid="T1">1</xref>
and
<xref ref-type="table" rid="T2">2</xref>
). Yet, this step slightly decreased the percentage of properly paired reads and increased the percentage of mismapped reads. Overall, the effect of duplicate removal on the alignment statistics was detrimental.</p>
<p>Despite the lower number of short reads remaining after duplicates removal, the number of differentially expressed genes increased (Table
<xref ref-type="table" rid="T1">1</xref>
). Although the lists of differentially expressed genes before and after duplicate removal did not show complete overlap (Additional Files
<xref ref-type="supplementary-material" rid="S4">4</xref>
and
<xref ref-type="supplementary-material" rid="S5">5</xref>
), the order and the significance of the enriched KEGG, GO, and Reactome pathways remained similar, indicating that the biological signal was retained; it also became stronger, as reflected by the more significant enrichment p-values (Table
<xref ref-type="table" rid="T1">1</xref>
). We did not observe any increase in correlation of RNA-seq and microarray gene expression data.</p>
<p>In contrast to a larger number of differentially expressed genes when removing duplicates from the data, the number of detected motifs in ChIP-seq data decreased. However, the significance of the detected PU.1 motif increased (Table
<xref ref-type="table" rid="T2">2</xref>
, Additional File
<xref ref-type="supplementary-material" rid="S6">6</xref>
). In summary, these results suggest that removal of duplicates, although decreasing the total number of reads, increases the significance of biological signals in both RNA-seq and ChIP-seq data.</p>
</sec>
<sec>
<title>Trimming adapters coupled with duplicates removal synergistically improves the quality of sequencing data</title>
<p>Adapter trimming coupled with duplicates removal led to an increase in the percentage of mismapped reads in RNA-seq data (Table
<xref ref-type="table" rid="T1">1</xref>
). Consequently, the percentage of properly paired reads decreased. However, the number of differentially expressed genes was larger than in the unprocessed data and their enrichment analysis results, except for the adapter-trimming step, were comparably significant.</p>
<p>Applied to ChIP-seq data, trimming the adapters coupled with removing duplicates improved overlap statistics, making them better than the unprocessed data (Table
<xref ref-type="table" rid="T2">2</xref>
). Notably, despite a lower total number of detected motifs, the detection of the PU.1 motif was the most significant (Additional File
<xref ref-type="supplementary-material" rid="S6">6</xref>
). Overall, our results suggest that both adapter trimming and duplicates removal steps help to emphasize biological signal in RNA-seq and ChIP-seq data.</p>
</sec>
<sec>
<title>Removing low complexity regions improves detection of true biological signal</title>
<p>We used different stringency thresholds for filtering reads overlapping low complexity regions. A threshold was defined as the percent of a read's length that overlaps low complexity regions. For example, a 75% threshold indicates that a read is filtered only if at least 75% of its total length overlaps a low complexity region. The special case of a 0% threshold indicates that a read only needs to be located side-by-side with a low complexity region to be considered for removal. Thus, lowering the percent overlap threshold corresponds to more stringent filtering of reads overlapping low complexity regions, with the 0% threshold indicating the most aggressive filtering.</p>
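<p>The threshold rule described in the preceding paragraph can be sketched in Python. This is a minimal illustration under stated assumptions and not the RepeatSoaker implementation: the function names, the half-open coordinate convention, and the requirement that the low complexity regions do not overlap each other are all assumptions made for the example.</p>
<preformat>
```python
from typing import List, Tuple

# Assumption: half-open [start, end) coordinates on a single chromosome.
Interval = Tuple[int, int]


def overlap_fraction(read: Interval, regions: List[Interval]) -> float:
    """Fraction of the read's length covered by the regions.

    Assumes the regions do not overlap each other (e.g. merged beforehand).
    """
    read_start, read_end = read
    covered = 0
    for reg_start, reg_end in regions:
        # Length of the intersection; zero when the intervals are disjoint.
        covered += max(0, min(read_end, reg_end) - max(read_start, reg_start))
    return covered / (read_end - read_start)


def touches(read: Interval, regions: List[Interval]) -> bool:
    """True if the read overlaps or lies directly next to any region."""
    read_start, read_end = read
    return any(reg_end >= read_start and read_end >= reg_start
               for reg_start, reg_end in regions)


def should_remove(read: Interval, regions: List[Interval],
                  threshold: float) -> bool:
    """Apply the percent-overlap rule; threshold is a fraction (0.75 = 75%)."""
    if threshold == 0:
        # Special case: touching a region is enough for removal.
        return touches(read, regions)
    return overlap_fraction(read, regions) >= threshold
```
</preformat>
<p>With these hypothetical coordinates, a read at (100, 200) and a region at (150, 400) overlap over half of the read, so the read would be kept at a 0.75 threshold and removed at 0.5, 0.25, and 0.</p>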
<p>Applying RepeatSoaker to RNA-seq and ChIP-seq data decreased the number of reads by ~3%, independently of the threshold used. However, the alignment statistics of RNA-seq and ChIP-seq data improved in all conditions, when compared with the corresponding data preprocessing steps. For example, comparing the alignment statistics for unprocessed and processed (filtered) data shows not only an increase in the percentage of properly paired reads, but also a decrease in mismapped reads (Additional Files
<xref ref-type="supplementary-material" rid="S1">1</xref>
,
<xref ref-type="supplementary-material" rid="S2">2</xref>
,
<xref ref-type="supplementary-material" rid="S3">3</xref>
). Increasing the stringency of filtering of the reads overlapping low complexity regions further improved the alignment statistics. This suggests that simply removing reads overlapping low complexity regions is an essential step in improving the quality of RNA- and ChIP-seq data.</p>
<p>Reflecting the improvement in the alignment statistics, the significance of the GO/KEGG/Reactome enrichments (RNA-seq) and the PU.1 motif enrichments (ChIP-seq) increased upon RepeatSoaking the data (Table
<xref ref-type="table" rid="T2">2</xref>
), while the order of the enrichments remained virtually unchanged. Moreover, more stringent removal of reads overlapping low complexity regions further increased the significance of the functional enrichments. Correlation of gene expression from RNA-seq and microarray data also increased. This observation suggests that aggressive removal of reads overlapping low complexity regions (RepeatSoaker threshold 0%) aids in emphasizing true biological signal within the data.</p>
</sec>
<sec>
<title>Filtering out low complexity reads affects low expressed metabolism-related genes</title>
<p>We investigated the effect of RepeatSoaker on the number of differentially expressed (DE) genes in RNA-seq data with and without duplicates (Figure
<xref ref-type="fig" rid="F2">2</xref>
). The total number of DE genes decreased with more stringent removal of reads with RepeatSoaker. Each condition, from no RepeatSoaker to the most stringent RepeatSoaker threshold 0%, has a unique set of genes not detected as differentially expressed in other conditions ("leaves" of the Venn diagram, Figure
<xref ref-type="fig" rid="F2">2</xref>
). These genes showed fold change distributions comparable to those of the "core" gene set detected as DE in all conditions (Figure
<xref ref-type="fig" rid="F3">3A</xref>
). However, the expression level of those genes was lower (Figure
<xref ref-type="fig" rid="F3">3B</xref>
), making them susceptible to losing their DE status upon removal of reads with RepeatSoaker. We further investigated the biological significance of those genes. These were predominantly predicted (as opposed to known) genes and metabolism-related genes, as can be seen from their enrichment analysis results (Additional File
<xref ref-type="supplementary-material" rid="S7">7</xref>
). This observation suggests that RepeatSoaker removes biological "noise" from the data while increasing the signal that reflects the underlying biology of the experiment.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Differential expression detection</bold>
. Number of differentially expressed genes detected after removing reads overlapping low complexity (LC) regions. Conditions for differential expression analysis: "all LC overlaps kept/removed" - reads touching/overlapping LC regions are either kept or removed, respectively; "25%/50%/75% LC overlaps removed" - reads overlapping LC regions at least 25%/50%/75%, respectively, are removed before differential expression analysis.</p>
</caption>
<graphic xlink:href="1471-2105-16-S13-S10-2"></graphic>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Gene comparison of distribution and expression levels</bold>
. Comparison of the log2 fold change (A) and expression (B) distributions among genes at different thresholds for removing reads overlapping low complexity (LC) regions. "All kept"/"Reads without LC overlaps" - metrics of all differentially expressed genes detected using all/no reads overlapping LC regions; "75%/50%/25% LC overlaps removed" - metrics of genes that became non-differentially expressed after removing reads overlapping LC regions at least 75%/50%/25%, respectively.</p>
</caption>
<graphic xlink:href="1471-2105-16-S13-S10-3"></graphic>
</fig>
</sec>
<sec>
<title>Removing duplicates decreases overall expression level but retains fold changes</title>
<p>Lastly, we investigated the effect of data preprocessing steps upon the expression and fold change levels in RNA-seq data. Removing duplicates decreased the overall level of expression (Figure
<xref ref-type="fig" rid="F4">4A</xref>
), as can be expected from losing ~40% of reads. The effect of other processing steps on gene expression level was negligible, which is reflected in the virtually unchanged Pearson's correlation coefficient of RNA-seq and microarray gene expression data.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Gene comparison of distribution and expression levels</bold>
. Effect of data processing on expression (A) and fold change (B) distribution of differentially expressed genes.</p>
</caption>
<graphic xlink:href="1471-2105-16-S13-S10-4"></graphic>
</fig>
<p>To compare the effect of preprocessing steps on fold changes, we compared the distribution of fold changes at each step (Figure
<xref ref-type="fig" rid="F4">4B</xref>
), and investigated gene-by-gene fold change differences. As pre-processing steps were applied to both healthy and diseased groups of samples, thus uniformly changing read counts of the differentially expressed genes in both groups, the fold changes remained stable (Figure
<xref ref-type="fig" rid="F4">4B</xref>
). As would be expected from removing ~40% of reads, removing duplicates, with or without adapter trimming, decreased the maximum, but not the overall, fold change. Similarly, filtering out reads overlapping and touching low complexity regions (RepeatSoaker threshold 0%) also slightly decreased the maximum fold change. Overall, our results suggest that data preprocessing steps, although affecting overall gene expression level, retain biological signal within the data, as reflected by relatively unchanged fold changes.</p>
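<p>The argument that fold changes remain stable when a preprocessing step changes read counts uniformly in both groups can be illustrated with a short Python sketch. The counts and the 0.6 retention factor (losing ~40% of reads) are hypothetical numbers chosen for the example, and the function name is an assumption; in real data the loss of reads is not perfectly uniform across genes or groups, so this shows only the idealized case.</p>
<preformat>
```python
import math


def log2_fold_change(mean_case: float, mean_control: float) -> float:
    """Log2 ratio of mean expression between two groups."""
    return math.log2(mean_case / mean_control)


# Hypothetical mean read counts for one gene in each group.
case_count, control_count = 800.0, 200.0

# Idealized preprocessing step that keeps the same fraction of reads
# in both groups (0.6 corresponds to losing ~40% of reads).
retained = 0.6

fc_before = log2_fold_change(case_count, control_count)
fc_after = log2_fold_change(case_count * retained, control_count * retained)

# The retention factor cancels in the ratio, so the two values agree
# up to floating-point rounding, although the absolute counts dropped.
print(fc_before, fc_after)
```
</preformat>
<p>The absolute expression level falls with the number of reads, while the ratio between groups, and therefore the fold change, is unaffected as long as both groups lose the same proportion of reads.</p>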
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>Our findings suggest some general guidelines for RNA-seq and ChIP-seq data processing. We show that each of the three steps, adapter trimming, duplicate removal, and filtering out reads overlapping low complexity regions, increases the strength of biological signals within the data, as shown by the more significant functional enrichment p-values. Our results emphasize the need to remove reads overlapping low complexity regions in order to improve biological signals in RNA-seq and ChIP-seq data. Overall, our analysis suggests that all three steps should be an integral part of NGS data processing pipelines in order to obtain better insights into the biology.</p>
<p>Although it is a general consensus that trimming the adapters improves the quality of sequencing data [
<xref ref-type="bibr" rid="B5">5</xref>
], our experience with applying adapter trimming to RNA-seq and ChIP-seq data has been mixed. Trimming the adapters did not notably improve the significance of biological signals in RNA-seq data (Table
<xref ref-type="table" rid="T1">1</xref>
), in contrast with what we observed for ChIP-seq data. This may be attributed to the different aligners, TopHat and Bowtie2, used for RNA-seq and ChIP-seq data, respectively. The former, TopHat, has been designed to deal with unmapped portions of the reads, such as adapters [
<xref ref-type="bibr" rid="B31">31</xref>
], and therefore the data processed with it may not be notably improved by the adapter trimming step. Our results warrant the testing of other adapter trimming tools, each reported to have a different effect on data quality [
<xref ref-type="bibr" rid="B5">5</xref>
].</p>
<p>Our motivation to investigate the effect of reads overlapping low complexity regions came from our own and others' [
<xref ref-type="bibr" rid="B22">22</xref>
] empirical observations that such reads have multiple alignments within the reference genome and tend to pile up within low complexity regions, such as centromeres and telomeres. We hypothesized that such pileups may negatively affect the detection of true gene expression level when summarizing read counts into FPKM measurements. Furthermore, such pileups were picked up as the strongest peaks in ChIP-seq experiments, ultimately affecting motif enrichment analysis. To this end we have developed the RepeatSoaker tool that filters out reads overlapping low complexity regions. In addition to using low complexity regions defined by the RepeatMasker program [
<xref ref-type="bibr" rid="B28">28</xref>
], a user has an option to provide his/her own list of genomic coordinates of any other regions, such as the Duke excluded regions or the DAC blacklisted regions, defined by the ENCODE project and obtainable from the UCSC genome database [
<xref ref-type="bibr" rid="B36">36</xref>
], or the high-depth coverage regions defined by Pickrell et al. by scanning the 1000 Genomes data [
<xref ref-type="bibr" rid="B22">22</xref>
]. This ability of RepeatSoaker to use any list of genomic regions as a "mask" further empowers a user to ignore reads not only in the low complexity regions, but in other uninteresting regions, such as ribosomal genes in RNA-seq data, or even completely mask out non-exome portions of the genome.</p>
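<p>Combining several region lists into a single "mask", as described above, can be sketched in Python as follows. This is an illustrative assumption about how such lists might be prepared, not a description of how RepeatSoaker reads its input: the function name is hypothetical, the input is assumed to be BED-style text (chromosome, start, end in the first three columns), and the coordinates in the example are made up.</p>
<preformat>
```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def merge_bed_intervals(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Merge overlapping or adjacent BED-style intervals per chromosome."""
    by_chrom = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if last_end >= start:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


# Hypothetical lines pooled from two different region lists.
example_lines = [
    "chr1\t100\t200",
    "chr1\t150\t300",
    "chr1\t400\t500",
    "chr2\t10\t20",
]
example_mask = merge_bed_intervals(example_lines)
```
</preformat>
<p>Merging first means each base of the genome is counted once when computing how much of a read falls inside the mask, regardless of how many source lists cover it.</p>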
<p>One limitation of our study is that we cannot judge the biological significance of our findings other than by indirect assessment of the number of differentially expressed genes (RNA-seq), the total number of motifs (ChIP-seq), and the results of the enrichment analyses. Our hypothesis was that, if a processing step improves the quality of the data, it should improve the significance of enriched gene ontologies, KEGG and Reactome pathways (for RNA-seq data) and detected motifs (for ChIP-seq data). Although we observed improvements in the enrichment analysis results after each data preprocessing step, future studies should investigate the expression level of genes removed by the data preprocessing steps, including RepeatSoaker, using other techniques, such as polymerase chain reaction, or by direct comparison with gene expression changes measured by microarray technology.</p>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>In conclusion, we recommend adapter trimming, duplicates removal, and filtering out reads overlapping low complexity regions as data preprocessing steps of RNA-seq and ChIP-seq data. Our comprehensive comparison suggests that these data preprocessing steps will help to emphasize true biological signals in sequencing studies.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>Conceived and designed the experiments: JDW, MGD. Provided the data: CM, KLS, LEO, TI, WMF, CJL. Analyzed the data: MGD, IA, CBG, SBG. Wrote the first draft of the manuscript: MGD. Made critical revisions and approved final version: JDW, IA, CL, CBG, EG. All authors reviewed and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional File 1</title>
<p>
<bold>Alignment statistics for the non-trimmed RNA-seq data in each data preprocessing step</bold>
. Each worksheet contains step-specific alignment statistics. "no_remDup"/"remDup" in the worksheet name indicate whether the data has duplicates kept/removed, respectively. "All"/"0.75" etc., indicate whether the reads overlapping low complexity regions were kept/removed at the specified threshold, respectively.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S1.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional File 2</title>
<p>
<bold>Alignment statistics for the trimmed RNA-seq data in each data preprocessing step</bold>
. Each worksheet contains step-specific alignment statistics. "no_remDup"/"remDup" in the worksheet name indicate whether the data has duplicates kept/removed, respectively. "All"/"0.75" etc., indicate whether the reads overlapping low complexity regions were kept/removed at the specified threshold, respectively.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S2.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S3">
<caption>
<title>Additional File 3</title>
<p>
<bold>Alignment statistics for the ChIP-seq data in each data preprocessing step</bold>
.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S3.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S4">
<caption>
<title>Additional File 4</title>
<p>
<bold>Differentially expressed genes in the non-trimmed RNA-seq data, and the results of KEGG/GO/Reactome pathway enrichment analyses identified in each data preprocessing step</bold>
. Each worksheet contains step-specific alignment statistics; abbreviations are the same as in Additional File 1.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S4.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S5">
<caption>
<title>Additional File 5</title>
<p>Differentially expressed genes in the trimmed RNA-seq data, and the results of KEGG/GO/Reactome pathway enrichment analyses identified in each data preprocessing step. Each worksheet contains step-specific alignment statistics; abbreviations are the same as in Additional File 2.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S5.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S6">
<caption>
<title>Additional File 6</title>
<p>
<bold>Motifs and their detection E-values identified in the ChIP-seq data in each data preprocessing step</bold>
.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S6.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S7">
<caption>
<title>Additional File 7</title>
<p>
<bold>Investigation of the biology of genes lost after filtering out low complexity reads with RepeatSoaker</bold>
.</p>
</caption>
<media xlink:href="1471-2105-16-S13-S10-S7.pdf">
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We thank the two anonymous reviewers for their comments and critique. The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of any agency involved in the funding of the work. The authors would like to acknowledge the National Institute of Arthritis and Musculoskeletal and Skin Diseases (a subaward from grant # P30 AR053483 to Dozmorov MG), an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences (a subaward from grant # P30GM103510 to Dozmorov MG), the National Center for Advancing Translational Sciences (CTSA award No. UL1TR000058), NIH/NIGMS P30GM110766, NIH/NIGMS P20GM103456 and NIH/NIAMS R01AR065953 to Adrianto I, NIH/NIAMS P50AR060804, NIH/NIAMS R01AR065953 and NIH/NIGMS P20GM103456 to Lessard CJ, NIH/NIGMS P20GM103636 to Wren JD and Olson LE, and the National Science Foundation division of Advanced CyberInfrastructure # ACI-1345426 to Wren JD.</p>
</sec>
<sec>
<title>Funding for publication</title>
<p>Publication fees were paid primarily from NIH #P20GM103636.</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics </italic>
Volume 16 Supplement 13, 2015: Proceedings of the 12th Annual MCBIOS Conference. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S13">http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S13</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Huddleston</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ranade</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Malig</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonacci</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Chaisson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>L</given-names>
</name>
<etal></etal>
<article-title>Reconstructing complex regions of genomes using long-read sequencing technology</article-title>
<source>Genome Res</source>
<year>2014</year>
<volume>24</volume>
<issue>4</issue>
<fpage>688</fpage>
<lpage>696</lpage>
<pub-id pub-id-type="doi">10.1101/gr.168450.113</pub-id>
<pub-id pub-id-type="pmid">24418700</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Howorka</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Cheley</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bayley</surname>
<given-names>H</given-names>
</name>
<article-title>Sequence-specific detection of individual DNA strands using engineered nanopores</article-title>
<source>Nat Biotechnol</source>
<year>2001</year>
<volume>19</volume>
<issue>7</issue>
<fpage>636</fpage>
<lpage>639</lpage>
<pub-id pub-id-type="doi">10.1038/90236</pub-id>
<pub-id pub-id-type="pmid">11433274</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="other">
<article-title>DNA sequencing costs</article-title>
<uri>http://www.genome.gov/sequencingcosts/</uri>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="other">
<name>
<surname>Walsh</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Carroll</surname>
<given-names>J</given-names>
</name>
<article-title>An Analysis of Next Generation Sequence Clipping Tools</article-title>
<source>Collaborative European Research Conference CERC 2013</source>
<year>2013</year>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Del Fabbro</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Scalabrin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Morgante</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Giorgi</surname>
<given-names>FM</given-names>
</name>
<article-title>An extensive evaluation of read trimming effects on Illumina NGS data analysis</article-title>
<source>PLoS One</source>
<year>2013</year>
<volume>8</volume>
<issue>12</issue>
<fpage>e85024</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0085024</pub-id>
<pub-id pub-id-type="pmid">24376861</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="book">
<name>
<surname>Turner</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Armstrong</surname>
<given-names>LL</given-names>
</name>
<name>
<surname>Bradford</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Carlson</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Crawford</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Crenshaw</surname>
<given-names>AT</given-names>
</name>
<etal></etal>
<person-group person-group-type="editor">Jonathan L Haines</person-group>
<article-title>Quality control procedures for genome-wide association studies</article-title>
<source>Current protocols in human genetics / editorial board</source>
<year>2011</year>
<volume>Chapter 1:Unit11.19</volume>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>DePristo</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Banks</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Poplin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Garimella</surname>
<given-names>KV</given-names>
</name>
<name>
<surname>Maguire</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Hartl</surname>
<given-names>C</given-names>
</name>
<etal></etal>
<article-title>A framework for variation discovery and genotyping using next-generation DNA sequencing data</article-title>
<source>Nat Genet</source>
<year>2011</year>
<volume>43</volume>
<issue>5</issue>
<fpage>491</fpage>
<lpage>498</lpage>
<pub-id pub-id-type="doi">10.1038/ng.806</pub-id>
<pub-id pub-id-type="pmid">21478889</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="other">
<article-title>How PCR Duplicates Arise in Next-Generation Sequencing</article-title>
<uri>http://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/</uri>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Abel</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Al-Kateb</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Cottrell</surname>
<given-names>CE</given-names>
</name>
<name>
<surname>Bredemeyer</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Pritchard</surname>
<given-names>CC</given-names>
</name>
<name>
<surname>Grossmann</surname>
<given-names>AH</given-names>
</name>
<etal></etal>
<article-title>Detection of Gene Rearrangements in Targeted Clinical Next-Generation Sequencing</article-title>
<source>J Mol Diagn</source>
<year>2014</year>
<volume>16</volume>
<issue>4</issue>
<fpage>405</fpage>
<lpage>417</lpage>
<pub-id pub-id-type="doi">10.1016/j.jmoldx.2014.03.006</pub-id>
<pub-id pub-id-type="pmid">24813172</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Negre</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Mieczkowska</surname>
<given-names>JO</given-names>
</name>
<name>
<surname>Slattery</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>T</given-names>
</name>
<etal></etal>
<article-title>Systematic evaluation of factors influencing ChIP-seq fidelity</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>6</issue>
<fpage>609</fpage>
<lpage>614</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1985</pub-id>
<pub-id pub-id-type="pmid">22522655</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Handsaker</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wysoker</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Fennell</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Homer</surname>
<given-names>N</given-names>
</name>
<etal></etal>
<article-title>The Sequence Alignment/Map format and SAMtools</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>16</issue>
<fpage>2078</fpage>
<lpage>2079</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp352</pub-id>
<pub-id pub-id-type="pmid">19505943</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Furey</surname>
<given-names>TS</given-names>
</name>
<article-title>ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions</article-title>
<source>Nat Rev Genet</source>
<year>2012</year>
<volume>13</volume>
<issue>12</issue>
<fpage>840</fpage>
<lpage>852</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3306</pub-id>
<pub-id pub-id-type="pmid">23090257</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="other">
<name>
<surname>Zhou</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Eterovic</surname>
<given-names>AK</given-names>
</name>
<name>
<surname>Meric-Bernstam</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Mills</surname>
<given-names>GB</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<article-title>Bias from removing read duplication in ultra-deep sequencing experiments</article-title>
<source>Bioinformatics</source>
<year>2014</year>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Majewski</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ott</surname>
<given-names>J</given-names>
</name>
<article-title>Distribution and characterization of regulatory elements in the human genome</article-title>
<source>Genome Res</source>
<year>2002</year>
<volume>12</volume>
<issue>12</issue>
<fpage>1827</fpage>
<lpage>1836</lpage>
<pub-id pub-id-type="doi">10.1101/gr.606402</pub-id>
<pub-id pub-id-type="pmid">12466286</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>DePristo</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Zilversmit</surname>
<given-names>MM</given-names>
</name>
<name>
<surname>Hartl</surname>
<given-names>DL</given-names>
</name>
<article-title>On the abundance, amino acid composition, and evolutionary dynamics of low-complexity regions in proteins</article-title>
<source>Gene</source>
<year>2006</year>
<volume>378</volume>
<fpage>19</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="pmid">16806741</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Kapitonov</surname>
<given-names>VV</given-names>
</name>
<name>
<surname>Jurka</surname>
<given-names>J</given-names>
</name>
<article-title>A universal classification of eukaryotic transposable elements implemented in Repbase</article-title>
<source>Nat Rev Genet</source>
<year>2008</year>
<volume>9</volume>
<issue>5</issue>
<fpage>411</fpage>
<lpage>412</lpage>
<pub-id pub-id-type="doi">10.1038/nrg2165-c1</pub-id>
<pub-id pub-id-type="pmid">18421312</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ruotti</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Thomson</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Dewey</surname>
<given-names>CN</given-names>
</name>
<article-title>RNA-Seq gene expression estimation with read mapping uncertainty</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>4</issue>
<fpage>493</fpage>
<lpage>500</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp692</pub-id>
<pub-id pub-id-type="pmid">20022975</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Chung</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kuan</surname>
<given-names>PF</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Sanalkumar</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Bresnick</surname>
<given-names>EH</given-names>
</name>
<etal></etal>
<article-title>Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data</article-title>
<source>PLoS Comput Biol</source>
<year>2011</year>
<volume>7</volume>
<issue>7</issue>
<fpage>e1002111</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1002111</pub-id>
<pub-id pub-id-type="pmid">21779159</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Linton</surname>
<given-names>LM</given-names>
</name>
<name>
<surname>Birren</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Zody</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Baldwin</surname>
<given-names>J</given-names>
</name>
<etal></etal>
<article-title>Initial sequencing and analysis of the human genome</article-title>
<source>Nature</source>
<year>2001</year>
<volume>409</volume>
<issue>6822</issue>
<fpage>860</fpage>
<lpage>921</lpage>
<pub-id pub-id-type="doi">10.1038/35057062</pub-id>
<pub-id pub-id-type="pmid">11237011</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<collab>1000 Genomes Project Consortium</collab>
<name>
<surname>Abecasis</surname>
<given-names>GR</given-names>
</name>
<name>
<surname>Auton</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Brooks</surname>
<given-names>LD</given-names>
</name>
<name>
<surname>DePristo</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>RM</given-names>
</name>
<etal></etal>
<article-title>An integrated map of genetic variation from 1,092 human genomes</article-title>
<source>Nature</source>
<year>2012</year>
<volume>491</volume>
<issue>7422</issue>
<fpage>56</fpage>
<lpage>65</lpage>
<pub-id pub-id-type="doi">10.1038/nature11632</pub-id>
<pub-id pub-id-type="pmid">23128226</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Hangauer</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Vaughn</surname>
<given-names>IW</given-names>
</name>
<name>
<surname>McManus</surname>
<given-names>MT</given-names>
</name>
<article-title>Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs</article-title>
<source>PLoS Genet</source>
<year>2013</year>
<volume>9</volume>
<issue>6</issue>
<fpage>e1003569</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pgen.1003569</pub-id>
<pub-id pub-id-type="pmid">23818866</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Pickrell</surname>
<given-names>JK</given-names>
</name>
<name>
<surname>Gaffney</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Gilad</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Pritchard</surname>
<given-names>JK</given-names>
</name>
<article-title>False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>15</issue>
<fpage>2144</fpage>
<lpage>2146</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr354</pub-id>
<pub-id pub-id-type="pmid">21690102</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Stanciu</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Brudno</surname>
<given-names>M</given-names>
</name>
<article-title>Computational methods for discovering structural variation with next-generation sequencing</article-title>
<source>Nat Methods</source>
<year>2009</year>
<volume>6</volume>
<issue>11</issue>
<fpage>S13</fpage>
<lpage>S20</lpage>
<pub-id pub-id-type="pmid">19844226</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Lee</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Popodi</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Foster</surname>
<given-names>PL</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<article-title>Detection of structural variants involving repetitive regions in the reference genome</article-title>
<source>J Comput Biol</source>
<year>2014</year>
<volume>21</volume>
<issue>3</issue>
<fpage>219</fpage>
<lpage>233</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2013.0129</pub-id>
<pub-id pub-id-type="pmid">24552580</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Schmieder</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>R</given-names>
</name>
<article-title>Quality control and preprocessing of metagenomic datasets</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>863</fpage>
<lpage>864</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr026</pub-id>
<pub-id pub-id-type="pmid">21278185</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Olson</surname>
<given-names>LE</given-names>
</name>
<name>
<surname>Soriano</surname>
<given-names>P</given-names>
</name>
<article-title>Increased PDGFRalpha activation disrupts connective tissue development and drives systemic fibrosis</article-title>
<source>Dev Cell</source>
<year>2009</year>
<volume>16</volume>
<issue>2</issue>
<fpage>303</fpage>
<lpage>313</lpage>
<pub-id pub-id-type="doi">10.1016/j.devcel.2008.12.003</pub-id>
<pub-id pub-id-type="pmid">19217431</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Bolger</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Lohse</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Usadel</surname>
<given-names>B</given-names>
</name>
<article-title>Trimmomatic: A flexible trimmer for Illumina Sequence Data</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>15</issue>
<fpage>2114</fpage>
<lpage>2120</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu170</pub-id>
<pub-id pub-id-type="pmid">24695404</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="book">
<name>
<surname>Chen</surname>
<given-names>N</given-names>
</name>
<person-group person-group-type="editor">Andreas D Baxevanis</person-group>
<article-title>Using RepeatMasker to identify repetitive elements in genomic sequences</article-title>
<source>Curr Protoc Bioinformatics</source>
<year>2004</year>
<volume>Chapter 4:Unit 4.10</volume>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="other">
<article-title>RepeatMasker Open-3.0</article-title>
<uri>http://www.repeatmasker.org</uri>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="other">
<collab>FAQ</collab>
<source>BED format</source>
<uri>http://genome.ucsc.edu/goldenPath/help/customTrack.html</uri>
<comment>BED</comment>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Goff</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<etal></etal>
<article-title>Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks</article-title>
<source>Nat Protoc</source>
<year>2012</year>
<volume>7</volume>
<issue>3</issue>
<fpage>562</fpage>
<lpage>578</lpage>
<pub-id pub-id-type="doi">10.1038/nprot.2012.016</pub-id>
<pub-id pub-id-type="pmid">22383036</pub-id>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="other">
<collab>R Development Core Team</collab>
<source>R: A Language and Environment for Statistical Computing</source>
<year>2013</year>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>3</issue>
<fpage>R25</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-3-r25</pub-id>
<pub-id pub-id-type="pmid">19261174</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Feng</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<article-title>Using MACS to identify peaks from ChIP-Seq data</article-title>
<source>Curr Protoc Bioinformatics</source>
<year>2011</year>
<volume>Chapter 2:Unit 2.14</volume>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Machanick</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<article-title>MEME-ChIP: motif analysis of large DNA datasets</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>12</issue>
<fpage>1696</fpage>
<lpage>1697</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr189</pub-id>
<pub-id pub-id-type="pmid">21486936</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Rosenbloom</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Sloan</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Malladi</surname>
<given-names>VS</given-names>
</name>
<name>
<surname>Dreszer</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Learned</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kirkup</surname>
<given-names>VM</given-names>
</name>
<etal></etal>
<article-title>ENCODE data in the UCSC Genome Browser: year 5 update</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
<volume>41</volume>
<issue>Database issue</issue>
<fpage>D56</fpage>
<lpage>D63</lpage>
<pub-id pub-id-type="pmid">23193274</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000227  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000227  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024