Cyberinfrastructure Exploration Server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information has therefore not been validated.

NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data

Internal identifier: 000081 (Pmc/Corpus); previous: 000080; next: 000082

Authors: Mohan A. V. S. K. Katta; Aamir W. Khan; Dadakhalandar Doddamani; Mahendar Thudi; Rajeev K. Varshney

RBID : PMC:4604202

Abstract

The rapid rise in popularity and adoption of next generation sequencing (NGS) approaches has generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in C using HTSlib (v1.2.1) (http://htslib.org), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data efficiently and in a high-throughput manner. The pipeline implements batch processing of jobs using Bpipe (https://github.com/ssadedin/bpipe) in parallel and, internally, fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, Raspberry tool and associated scripts are available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.


DOI: 10.1371/journal.pone.0139868
PubMed: 26460497
PubMed Central: 4604202

Links to Exploration step

PMC:4604202

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data</title>
<author>
<name sortKey="Katta, Mohan A V S K" sort="Katta, Mohan A V S K" uniqKey="Katta M" first="Mohan A. V. S. K." last="Katta">Mohan A. V. S. K. Katta</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Khan, Aamir W" sort="Khan, Aamir W" uniqKey="Khan A" first="Aamir W." last="Khan">Aamir W. Khan</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Doddamani, Dadakhalandar" sort="Doddamani, Dadakhalandar" uniqKey="Doddamani D" first="Dadakhalandar" last="Doddamani">Dadakhalandar Doddamani</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Thudi, Mahendar" sort="Thudi, Mahendar" uniqKey="Thudi M" first="Mahendar" last="Thudi">Mahendar Thudi</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Varshney, Rajeev K" sort="Varshney, Rajeev K" uniqKey="Varshney R" first="Rajeev K." last="Varshney">Rajeev K. Varshney</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff002">
<addr-line>School of Plant Biology and Institute of Agriculture, The University of Western Australia, Crawley, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26460497</idno>
<idno type="pmc">4604202</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4604202</idno>
<idno type="RBID">PMC:4604202</idno>
<idno type="doi">10.1371/journal.pone.0139868</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000081</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data</title>
<author>
<name sortKey="Katta, Mohan A V S K" sort="Katta, Mohan A V S K" uniqKey="Katta M" first="Mohan A. V. S. K." last="Katta">Mohan A. V. S. K. Katta</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Khan, Aamir W" sort="Khan, Aamir W" uniqKey="Khan A" first="Aamir W." last="Khan">Aamir W. Khan</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Doddamani, Dadakhalandar" sort="Doddamani, Dadakhalandar" uniqKey="Doddamani D" first="Dadakhalandar" last="Doddamani">Dadakhalandar Doddamani</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Thudi, Mahendar" sort="Thudi, Mahendar" uniqKey="Thudi M" first="Mahendar" last="Thudi">Mahendar Thudi</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Varshney, Rajeev K" sort="Varshney, Rajeev K" uniqKey="Varshney R" first="Rajeev K." last="Varshney">Rajeev K. Varshney</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff002">
<addr-line>School of Plant Biology and Institute of Agriculture, The University of Western Australia, Crawley, Australia</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>The rapid rise in popularity and adoption of next generation sequencing (NGS) approaches has generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in C using HTSlib (v1.2.1) (
<ext-link ext-link-type="uri" xlink:href="http://htslib.org/">http://htslib.org</ext-link>
), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data efficiently and in a high-throughput manner. The pipeline implements batch processing of jobs using Bpipe (
<ext-link ext-link-type="uri" xlink:href="https://github.com/ssadedin/bpipe">https://github.com/ssadedin/bpipe</ext-link>
) in parallel and, internally, fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either
<italic>quick</italic>
or
<italic>complete</italic>
mode. In addition, the pipeline in
<italic>quick</italic>
mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, Raspberry tool and associated scripts are available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox">https://github.com/CEG-ICRISAT/NGS-QCbox</ext-link>
and
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/Raspberry">https://github.com/CEG-ICRISAT/Raspberry</ext-link>
for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Thudi, M" uniqKey="Thudi M">M Thudi</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Jackson, Sa" uniqKey="Jackson S">SA Jackson</name>
</author>
<author>
<name sortKey="May, Gd" uniqKey="May G">GD May</name>
</author>
<author>
<name sortKey="Varshney, Rk" uniqKey="Varshney R">RK Varshney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mccouch, S" uniqKey="Mccouch S">S McCouch</name>
</author>
<author>
<name sortKey="Baute, Gj" uniqKey="Baute G">GJ Baute</name>
</author>
<author>
<name sortKey="Bradeen, J" uniqKey="Bradeen J">J Bradeen</name>
</author>
<author>
<name sortKey="Bramel, P" uniqKey="Bramel P">P Bramel</name>
</author>
<author>
<name sortKey="Bretting, Pk" uniqKey="Bretting P">PK Bretting</name>
</author>
<author>
<name sortKey="Buckler, E" uniqKey="Buckler E">E Buckler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jiao, Y" uniqKey="Jiao Y">Y Jiao</name>
</author>
<author>
<name sortKey="Zhao, H" uniqKey="Zhao H">H Zhao</name>
</author>
<author>
<name sortKey="Ren, L" uniqKey="Ren L">L Ren</name>
</author>
<author>
<name sortKey="Song, W" uniqKey="Song W">W Song</name>
</author>
<author>
<name sortKey="Zeng, B" uniqKey="Zeng B">B Zeng</name>
</author>
<author>
<name sortKey="Guo, J" uniqKey="Guo J">J Guo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mace, Es" uniqKey="Mace E">ES Mace</name>
</author>
<author>
<name sortKey="Tai, S" uniqKey="Tai S">S Tai</name>
</author>
<author>
<name sortKey="Gilding, Ek" uniqKey="Gilding E">EK Gilding</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Prentis, Pj" uniqKey="Prentis P">PJ Prentis</name>
</author>
<author>
<name sortKey="Bian, L" uniqKey="Bian L">L Bian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Varshney, Rk" uniqKey="Varshney R">RK Varshney</name>
</author>
<author>
<name sortKey="Song, C" uniqKey="Song C">C Song</name>
</author>
<author>
<name sortKey="Saxena, Rk" uniqKey="Saxena R">RK Saxena</name>
</author>
<author>
<name sortKey="Azam, S" uniqKey="Azam S">S Azam</name>
</author>
<author>
<name sortKey="Yu, S" uniqKey="Yu S">S Yu</name>
</author>
<author>
<name sortKey="Sharpe, Ag" uniqKey="Sharpe A">AG Sharpe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gardner, Sn" uniqKey="Gardner S">SN Gardner</name>
</author>
<author>
<name sortKey="Hall, Bg" uniqKey="Hall B">BG Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bertels, F" uniqKey="Bertels F">F Bertels</name>
</author>
<author>
<name sortKey="Silander, Ok" uniqKey="Silander O">OK Silander</name>
</author>
<author>
<name sortKey="Pachkov, M" uniqKey="Pachkov M">M Pachkov</name>
</author>
<author>
<name sortKey="Rainey, Pb" uniqKey="Rainey P">PB Rainey</name>
</author>
<author>
<name sortKey="Van Nimwegen, E" uniqKey="Van Nimwegen E">E van Nimwegen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Patel, Rk" uniqKey="Patel R">RK Patel</name>
</author>
<author>
<name sortKey="Jain, M" uniqKey="Jain M">M Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Anders, S" uniqKey="Anders S">S Anders</name>
</author>
<author>
<name sortKey="Pyl, Pt" uniqKey="Pyl P">PT Pyl</name>
</author>
<author>
<name sortKey="Huber, W" uniqKey="Huber W">W Huber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cock, Pj" uniqKey="Cock P">PJ Cock</name>
</author>
<author>
<name sortKey="Fields, Cj" uniqKey="Fields C">CJ Fields</name>
</author>
<author>
<name sortKey="Goto, N" uniqKey="Goto N">N Goto</name>
</author>
<author>
<name sortKey="Heuer, Ml" uniqKey="Heuer M">ML Heuer</name>
</author>
<author>
<name sortKey="Rice, Pm" uniqKey="Rice P">PM Rice</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sadedin, Sp" uniqKey="Sadedin S">SP Sadedin</name>
</author>
<author>
<name sortKey="Pope, B" uniqKey="Pope B">B Pope</name>
</author>
<author>
<name sortKey="Oshlack, A" uniqKey="Oshlack A">A Oshlack</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Handsaker, B" uniqKey="Handsaker B">B Handsaker</name>
</author>
<author>
<name sortKey="Wysoker, A" uniqKey="Wysoker A">A Wysoker</name>
</author>
<author>
<name sortKey="Fennell, T" uniqKey="Fennell T">T Fennell</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Homer, N" uniqKey="Homer N">N Homer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quinlan, Ar" uniqKey="Quinlan A">AR Quinlan</name>
</author>
<author>
<name sortKey="Hall, Im" uniqKey="Hall I">IM Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmieder, R" uniqKey="Schmieder R">R Schmieder</name>
</author>
<author>
<name sortKey="Edwards, R" uniqKey="Edwards R">R Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Myers, Jr" uniqKey="Myers J">JR Myers</name>
</author>
<author>
<name sortKey="Marth, Gt" uniqKey="Marth G">GT Marth</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goff, Sa" uniqKey="Goff S">SA Goff</name>
</author>
<author>
<name sortKey="Vaughn, M" uniqKey="Vaughn M">M Vaughn</name>
</author>
<author>
<name sortKey="Mckay, S" uniqKey="Mckay S">S McKay</name>
</author>
<author>
<name sortKey="Lyons, E" uniqKey="Lyons E">E Lyons</name>
</author>
<author>
<name sortKey="Stapleton, Ae" uniqKey="Stapleton A">AE Stapleton</name>
</author>
<author>
<name sortKey="Gessler, D" uniqKey="Gessler D">D Gessler</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26460497</article-id>
<article-id pub-id-type="pmc">4604202</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0139868</article-id>
<article-id pub-id-type="publisher-id">PONE-D-15-23733</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data</article-title>
<alt-title alt-title-type="running-head">NGS-QCbox for Quality Control of NGS Data</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Katta</surname>
<given-names>Mohan A. V. S. K.</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Khan</surname>
<given-names>Aamir W.</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Doddamani</surname>
<given-names>Dadakhalandar</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Thudi</surname>
<given-names>Mahendar</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Varshney</surname>
<given-names>Rajeev K.</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
<xref rid="cor001" ref-type="corresp">*</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>School of Plant Biology and Institute of Agriculture, The University of Western Australia, Crawley, Australia</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Wang</surname>
<given-names>Junwen</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>The University of Hong Kong, HONG KONG</addr-line>
</aff>
<author-notes>
<fn fn-type="conflict" id="coi001">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con" id="contrib001">
<p>Conceived and designed the experiments: RKV MAVSKK. Performed the experiments: MAVSKK AWK DD. Analyzed the data: MAVSKK AWK DD RKV. Contributed reagents/materials/analysis tools: RKV MAVSKK AWK DD MT. Wrote the paper: RKV MAVSKK AWK DD MT.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>r.k.varshney@cgiar.org</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>10</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>10</issue>
<elocation-id>e0139868</elocation-id>
<history>
<date date-type="received">
<day>1</day>
<month>6</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>9</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-year>2015</copyright-year>
<copyright-holder>Katta et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0139868.pdf"></self-uri>
<abstract>
<p>The rapid rise in popularity and adoption of next generation sequencing (NGS) approaches has generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in C using HTSlib (v1.2.1) (
<ext-link ext-link-type="uri" xlink:href="http://htslib.org/">http://htslib.org</ext-link>
), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data efficiently and in a high-throughput manner. The pipeline implements batch processing of jobs using Bpipe (
<ext-link ext-link-type="uri" xlink:href="https://github.com/ssadedin/bpipe">https://github.com/ssadedin/bpipe</ext-link>
) in parallel and, internally, fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either
<italic>quick</italic>
or
<italic>complete</italic>
mode. In addition, the pipeline in
<italic>quick</italic>
mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, Raspberry tool and associated scripts are available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox">https://github.com/CEG-ICRISAT/NGS-QCbox</ext-link>
and
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/Raspberry">https://github.com/CEG-ICRISAT/Raspberry</ext-link>
for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.</p>
</abstract>
<funding-group>
<funding-statement>Authors are thankful to the CGIAR Generation Challenge Program for financial support. This work has been undertaken as part of the CGIAR Research Program on Grain Legumes. ICRISAT is a member of the CGIAR Consortium.</funding-statement>
</funding-group>
<counts>
<fig-count count="2"></fig-count>
<table-count count="2"></table-count>
<page-count count="9"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>Raspberry, the in-house tool, is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/Raspberry">https://github.com/CEG-ICRISAT/Raspberry</ext-link>
. The NGS-QCbox pipeline is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox">https://github.com/CEG-ICRISAT/NGS-QCbox</ext-link>
. The simulated dataset used for benchmarking is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox/blob/master/README.md#datasets-used-for-testing">https://github.com/CEG-ICRISAT/NGS-QCbox/blob/master/README.md#datasets-used-for-testing</ext-link>
.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>Raspberry, the in-house tool, is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/Raspberry">https://github.com/CEG-ICRISAT/Raspberry</ext-link>
. The NGS-QCbox pipeline is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox">https://github.com/CEG-ICRISAT/NGS-QCbox</ext-link>
. The simulated dataset used for benchmarking is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox/blob/master/README.md#datasets-used-for-testing">https://github.com/CEG-ICRISAT/NGS-QCbox/blob/master/README.md#datasets-used-for-testing</ext-link>
.</p>
</notes>
</front>
<body>
<sec sec-type="intro" id="sec001">
<title>Introduction</title>
<p>Next generation sequencing (NGS) technologies generate large volumes of data and have proven cost-effective compared with conventional sequencing methods. The rapid decline in the cost of data generation in recent years has boosted the adoption of NGS-based applications for addressing biological questions [
<xref rid="pone.0139868.ref001" ref-type="bibr">1</xref>
]. The availability of genome-wide information for a species was a major constraint until NGS was introduced and adopted. The primary applications of NGS include
<italic>de novo</italic>
genome assembly, whole-genome re-sequencing and targeted studies, apart from other specialized analyses such as RNA-Seq. For example, several plant genomes have been sequenced [
<xref rid="pone.0139868.ref002" ref-type="bibr">2</xref>
] and efforts are now underway to harness this diversity for crop improvement through re-sequencing of thousands of germplasm lines, for instance rice (
<ext-link ext-link-type="uri" xlink:href="http://www.gigasciencejournal.com/content/3/1/7">http://www.gigasciencejournal.com/content/3/1/7</ext-link>
), maize [
<xref rid="pone.0139868.ref003" ref-type="bibr">3</xref>
], sorghum [
<xref rid="pone.0139868.ref004" ref-type="bibr">4</xref>
], chickpea [
<xref rid="pone.0139868.ref005" ref-type="bibr">5</xref>
]. NGS technologies typically generate gigabytes to terabytes of raw data, and over time the data in public archives accumulate to the scale of terabytes to petabytes. For example, as of May 2015, the European Nucleotide Archive (ENA) contained 13.7 trillion read sequences (1,757.3 trillion bases), with the number of reads deposited doubling every 22.9 months (
<ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena/about/statistics#sra_growth">http://www.ebi.ac.uk/ena/about/statistics#sra_growth</ext-link>
). Notably, between 2006 and 2010 the volume of data deposited in ENA increased significantly, reflecting the growth in the data generated. Beyond data storage, the challenge is to develop efficient tools that can process these huge datasets for downstream analysis within a limited time [
<xref rid="pone.0139868.ref006" ref-type="bibr">6</xref>
,
<xref rid="pone.0139868.ref007" ref-type="bibr">7</xref>
]. The data need to be analyzed and archived for later re-use. Hence, prior to downstream analysis, NGS data typically need to be filtered for quality to obtain high-quality reads. Several tools like NGS QC Toolkit [
<xref rid="pone.0139868.ref008" ref-type="bibr">8</xref>
], FastQC (
<ext-link ext-link-type="uri" xlink:href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">http://www.bioinformatics.babraham.ac.uk/projects/fastqc/</ext-link>
) and HTSeq [
<xref rid="pone.0139868.ref009" ref-type="bibr">9</xref>
] exist for extracting high-quality read data. However, the existing tools/pipelines can handle only a few to tens of samples at a time and do not address the problem of processing huge volumes of data in parallel. Hence, there is a pressing need for tools that can scale up to process thousands of samples simultaneously in a short time. In this context, quality control (QC) of raw, large-scale NGS data demands automation.</p>
<p>In the recent past, stand-alone quality control tools and pipelines have been developed to manage the overwhelming volume of data. For instance, quality control tools/pipelines like NGS QC Toolkit [
<xref rid="pone.0139868.ref008" ref-type="bibr">8</xref>
] (
<ext-link ext-link-type="uri" xlink:href="http://59.163.192.90:8080/ngsqctoolkit">http://59.163.192.90:8080/ngsqctoolkit</ext-link>
) and Python (
<ext-link ext-link-type="uri" xlink:href="http://www.python.org/">http://www.python.org</ext-link>
) based HTSeq [
<xref rid="pone.0139868.ref009" ref-type="bibr">9</xref>
] (
<ext-link ext-link-type="uri" xlink:href="http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html">http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html</ext-link>
) were developed to address these constraints but are slow. In general, these pipelines/tools process datasets serially, which can be daunting for the end user when dealing with large datasets. Nowadays, not only servers but also modern personal computers include multicore processors, and several NGS tools have therefore been developed to process data in parallel by multi-threading. In view of the need for an automated pipeline to analyze large-scale raw NGS data, we developed a menu-driven pipeline, NGS-QCbox, that integrates Raspberry, an in-house tool, with other open-source tools. The pipeline focuses on processing large datasets in parallel and provides informative, concise statistics. Typically, service providers or labs involved in generating NGS data develop in-house scripts for pre-processing of NGS data. NGS-QCbox aims to speed up and ease the processing of huge datasets within a reasonable time frame.</p>
<p>NGS-QCbox is meant to complement existing QC tools. It aims to be a decision-making tool that helps the scientist judge whether data of sufficient quality have been generated, with the coverage that the experiment demands.</p>
</sec>
<sec id="sec002">
<title>Results and Discussion</title>
<sec id="sec003">
<title>Raspberry: a tool for FASTQ data statistics</title>
<p>The data generated by Illumina sequencing machines are in a binary format (bcl). These are converted to FASTQ format, which we refer to as raw data. Depending on the number of samples or the amount of data generated, the raw data may be stored in compressed (gzip) or uncompressed (plain text) format. In FASTQ format [
<xref rid="pone.0139868.ref010" ref-type="bibr">10</xref>
], each read is represented as a four-line record comprising an identifier, the sequence and its base quality information. The base qualities are encoded as ASCII characters with an offset of either 33 (HiSeq/MiSeq) or 64 (GAIIx). To assess the quality of the data generated, we have developed an in-house tool called Raspberry (v0.3) (
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/Raspberry">https://github.com/CEG-ICRISAT/Raspberry</ext-link>
) in C language utilizing HTSlib (v1.2) (
<ext-link ext-link-type="uri" xlink:href="http://htslib.org/">http://htslib.org</ext-link>
) to compute read/base level metrics. The tool reports the total number of bases and reads, Q20 and Q30 bases, range of read length, average read length, range of quality, range of phred quality score, number of A/T/G/C and N characters and GC content. In addition, it writes a file of read lengths that can be plotted using a Python script, ‘read_length_distribution.py’, included in the ‘utils’ folder of the package. Raspberry can be used as a stand-alone tool. It accepts compressed and uncompressed FASTQ files as input. By default, Raspberry uses all the processors available on the machine. However, the user can change the number of processors used with the “-t” option. This can be beneficial if the server/workstation is under heavy load or the user has been allocated fewer processors than the total available. With this option the data are processed in batches whose size is the number of processors provided; that is, the number of processors selected equals the number of samples processed at a time (one batch). Datasets from legacy Illumina platforms that were encoded with a phred offset of 64 can be processed using the ‘-p 64’ option. By default the value is set to 33, the current standard phred offset used on HiSeq and MiSeq platforms. The manual available online lists these use cases.</p>
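<p>As a rough illustration of the read/base metrics described above, the following Python sketch tallies reads, bases, Q20/Q30 bases, read-length range, base composition and GC content from four-line FASTQ records. It is not Raspberry's C/HTSlib implementation: the function names (open_fastq, fastq_stats) and the result keys are hypothetical, and GC content is assumed here to be taken over all bases, N included.</p>
<preformat>
```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, whether gzip-compressed or plain."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def fastq_stats(handle, phred_offset=33):
    """Tally read/base metrics from four-line FASTQ records."""
    counts = {base: 0 for base in "ACGTN"}
    lengths = []
    q20 = q30 = 0
    while True:
        header = handle.readline()
        if not header.strip():  # end of input (or a trailing blank line)
            break
        seq = handle.readline().strip().upper()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        lengths.append(len(seq))
        for base in seq:
            if base in counts:
                counts[base] += 1
        for char in qual:
            score = ord(char) - phred_offset  # decode ASCII to phred score
            q20 += score >= 20
            q30 += score >= 30
    bases = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": bases,
        "q20_bases": q20,
        "q30_bases": q30,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": bases / len(lengths) if lengths else 0.0,
        "base_counts": counts,
        # Assumption: GC content is taken over all bases, N included.
        "gc_content": (counts["G"] + counts["C"]) / bases if bases else 0.0,
    }
```
</preformat>
<p>Under these assumptions, fastq_stats(open_fastq(path)) would handle both compressed and plain files, and passing phred_offset=64 would mirror the legacy offset described in the text. The sketch keeps every read length in memory, which a production tool would probably avoid for large files.</p>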
</sec>
<sec id="sec004">
<title>Integrated pipeline</title>
<p>A top-level Python script (NGSQCbox-v0.1.py) presents a menu-driven interface for the required input data and spawns tasks in parallel for each sample using built-in shell scripts and Bpipe [
<xref rid="pone.0139868.ref011" ref-type="bibr">11</xref>
] configuration files. Internally, Bpipe (v0.9.8.6_beta_1) was used to integrate NGS tools such as Raspberry (v0.3) (
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/Raspberry">https://github.com/CEG-ICRISAT/Raspberry</ext-link>
), Sickle (v1.200) (
<ext-link ext-link-type="uri" xlink:href="https://github.com/najoshi/sickle">https://github.com/najoshi/sickle</ext-link>
), Bowtie 2 (v2.1.0) [
<xref rid="pone.0139868.ref012" ref-type="bibr">12</xref>
], SAMtools (v0.1.19+) [
<xref rid="pone.0139868.ref013" ref-type="bibr">13</xref>
] and bedtools (v2.17.0) [
<xref rid="pone.0139868.ref014" ref-type="bibr">14</xref>
]. Within Bpipe, each component of the pipeline is represented as a re-usable task or block of code that may run in parallel to reduce computational run time (task-oriented parallelism).</p>
<p>The pipeline can be used in two modes:
<italic>quick</italic>
and
<italic>complete</italic>
(
<xref rid="pone.0139868.g001" ref-type="fig">Fig 1</xref>
). If the user is limited by time,
<italic>quick</italic>
mode can be used to get a general overview of the reads generated, including base level metrics and a quality trimming step. Alternatively, the pipeline can be run in
<italic>complete</italic>
mode to generate additional information such as coverage, alignment, mean read depths and variants. The
<italic>complete</italic>
mode QC is effectively a full-fledged pipeline, from processing raw reads to identifying variants.
<xref rid="pone.0139868.g002" ref-type="fig">Fig 2</xref>
depicts the interface for the two modes of the pipeline. In quick QC mode (panel a), the parameters are the cut-off phred quality score, the post-trimming read length cut-off, the data location and the number of cores to be used. In complete QC mode (panel b), information related to the genome (Bowtie 2 index, genome size, number of processors used by Bowtie 2) is required in addition to the quick QC mode parameters.</p>
<fig id="pone.0139868.g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0139868.g001</object-id>
<label>Fig 1</label>
<caption>
<title>Flowchart of NGS-QCbox pipeline illustrating the two modes of usage namely
<italic>quick</italic>
and
<italic>complete</italic>
.</title>
<p>NGS-QCbox comprises two workflow modes, namely
<italic>quick</italic>
and
<italic>complete</italic>
. In
<italic>quick</italic>
mode, read/base level metrics are computed in parallel using Raspberry, an in-house tool, both before and after quality trimming. On the other hand,
<italic>complete</italic>
mode is a full-fledged quality control and variant calling pipeline that integrates quick mode and additionally generates genome coverage information in parallel. The quality of the data generated can be assessed using this information.</p>
</caption>
<graphic xlink:href="pone.0139868.g001"></graphic>
</fig>
<fig id="pone.0139868.g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0139868.g002</object-id>
<label>Fig 2</label>
<caption>
<title>The menu-driven interface of NGS-QCbox in quick and complete modes.</title>
<p>(a) The prompt and the parameters for the quick QC mode: cut-off phred score, minimum read length after trimming, data source and number of processors to be used. (b) The complete QC mode adds genome-related parameters to those of the quick mode (Bowtie 2 index, genome size, number of processors used by Bowtie 2).</p>
</caption>
<graphic xlink:href="pone.0139868.g002"></graphic>
</fig>
<p>By default, quality trimming with Sickle uses the parameters ‘-q 30’, ‘-l 50’, ‘-n’ and ‘-t sanger’. Alignment with Bowtie 2 uses ‘--end-to-end’ mode with user-provided insert size information. The bedtools
<italic>genomecov</italic>
parameters include ‘-bga’, and the output is processed by an in-house Python script, ‘genome_cov.py’, to compute the genome coverage from the alignment. Variant calling with SAMtools passes the parameters ‘-uf’ to samtools
<italic>mpileup</italic>
and ‘-bvcg’ to bcftools respectively. In
<italic>quick</italic>
mode the user needs the sample information alone while in the
<italic>complete</italic>
mode one needs to provide additional information such as the insert size of each sample and the Bowtie 2 index of the genome (see online manual). Users can freely modify the steps and the parameters involved by editing the Bpipe scripts in the pipeline for both
<italic>quick</italic>
and
<italic>complete</italic>
mode QC.</p>
</sec>
<sec id="sec005">
<title>The workflow</title>
<p>The pipeline starts by processing paired-end reads using Raspberry. Base quality trimming of the paired-end reads is performed using Sickle to produce high quality reads. Raspberry generates the metrics both before and after the base quality trimming step to help compare and assess the number of reads/bases passing the QC filter. This is essentially the quick QC mode. The
<italic>complete</italic>
QC mode includes the
<italic>quick</italic>
QC mode as well as the following steps. The high quality reads are aligned to the reference genome using Bowtie 2 to generate alignments in SAM/BAM format (
<ext-link ext-link-type="uri" xlink:href="https://samtools.github.io/hts-specs/SAMv1.pdf">https://samtools.github.io/hts-specs/SAMv1.pdf</ext-link>
) followed by indexing with SAMtools. The singletons produced by Sickle are also used in alignment. Bedtools is then used to compute the genome coverage based on the read depths at each base position. An in-house Python script, genome_cov.py (included with the tool source code), summarizes the output in terms of X coverage (1X, 2X, 5X, 10X, 15X). The X coverage trace can be used to evaluate the drop in read depth that may affect variant calling downstream. SAMtools is simultaneously used to call variants in VCF format (
<ext-link ext-link-type="uri" xlink:href="http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41">http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41</ext-link>
) for each sample. The VCF files generated could be filtered and used for downstream applications such as GWAS, diversity analysis and development of markers for genotyping applications. The
<italic>complete</italic>
QC mode serves as a full-fledged pipeline for variant calling and is a unique feature of the pipeline. An example file containing information on the output is provided in
<xref rid="pone.0139868.s001" ref-type="supplementary-material">S1 Table</xref>
. Computation of genome coverage and variant calling steps use the same BAM files as input and hence these tasks have been parallelized internally to save time.</p>
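<p>The X coverage summary described above can be sketched in Python as follows. This is not the authors' genome_cov.py; it is a minimal illustration that assumes four-column input of the kind produced by bedtools genomecov with ‘-bga’ (chromosome, start, end, depth), and the function name <italic>x_coverage</italic> and the demo intervals are hypothetical.</p>

```python
import io


def x_coverage(bedgraph_lines, genome_size, thresholds=(1, 2, 5, 10, 15)):
    """Percentage of the genome covered at or above each depth threshold."""
    covered = {t: 0 for t in thresholds}
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        start, end, depth = int(fields[1]), int(fields[2]), int(fields[3])
        span = end - start  # interval length in bases
        for t in thresholds:
            if depth >= t:
                covered[t] += span
    return {t: 100.0 * covered[t] / genome_size for t in thresholds}


# Hypothetical 100-base genome: 10 bases uncovered, 50 at depth 3, 40 at depth 12.
demo = io.StringIO("chr1\t0\t10\t0\nchr1\t10\t60\t3\nchr1\t60\t100\t12\n")
coverage = x_coverage(demo, genome_size=100)
for threshold in sorted(coverage):
    print("%dX: %.1f%%" % (threshold, coverage[threshold]))
```

<p>A sharp drop between successive thresholds (here between 2X and 5X, and between 10X and 15X) is the kind of signal the X coverage trace is meant to expose before variant calling.</p>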
</sec>
<sec id="sec006">
<title>Salient features of the pipeline</title>
<list list-type="order">
<list-item>
<p>The pipeline accepts compressed/uncompressed paired-end Illumina FASTQ data.</p>
</list-item>
<list-item>
<p>Easy-to-use Python-based interface for fast (quick) and detailed (complete) processing of data.</p>
</list-item>
<list-item>
<p>Scalability–The pipeline is designed to process hundreds/thousands of samples in parallel (automated).</p>
</list-item>
<list-item>
<p>Batch processing of jobs–Even with a limited number of processors, all the samples can be processed in batch mode automatically.</p>
</list-item>
<list-item>
<p>Use of advanced shell features (process substitution), task-oriented parallelism, Python/multiprocessing and HTSlib to gain speed and save disk space.</p>
</list-item>
<list-item>
<p>The pipeline computes and summarizes genome coverage and variant detection in parallel which reduces the processing time.</p>
</list-item>
<list-item>
<p>Available as a Docker image to ease portability of the pipeline.</p>
</list-item>
</list>
</sec>
<sec id="sec007">
<title>Benchmarking</title>
<p>The features of NGS-QCbox were compared with five well known tools/pipelines namely Prinseq-lite [
<xref rid="pone.0139868.ref015" ref-type="bibr">15</xref>
], NGS QC Toolkit, HTSeq, FastQC and FastX Toolkit (fastx_quality_stats) (
<ext-link ext-link-type="uri" xlink:href="http://hannonlab.cshl.edu/fastx_toolkit/index.html">http://hannonlab.cshl.edu/fastx_toolkit/index.html</ext-link>
) (
<xref rid="pone.0139868.t001" ref-type="table">Table 1</xref>
). The
<italic>quick</italic>
QC mode of NGS-QCbox was used for benchmarking. The unique features of the pipeline are a simple menu interface, batch processing of jobs, task parallelization and information on genome coverage and variations.</p>
<table-wrap id="pone.0139868.t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0139868.t001</object-id>
<label>Table 1</label>
<caption>
<title>A comparative account of the features of NGS-QCbox pipeline with five similar pipeline/tools.</title>
</caption>
<alternatives>
<graphic id="pone.0139868.t001g" xlink:href="pone.0139868.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">NGS-QCbox</th>
<th align="left" rowspan="1" colspan="1">Prinseq-lite</th>
<th align="left" rowspan="1" colspan="1">NGS QC Toolkit</th>
<th align="left" rowspan="1" colspan="1">HTSeq</th>
<th align="left" rowspan="1" colspan="1">FastQC</th>
<th align="left" rowspan="1" colspan="1">FastX</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Compressed FASTQ (input)</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Batch job processing</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">Y/N</td>
<td align="left" rowspan="1" colspan="1">N</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Genome coverage</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">Y</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Variations (SNP/INDEL)</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Menu interface</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Feature richness</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Task parallelization</td>
<td align="left" rowspan="1" colspan="1">Y</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
<td align="left" rowspan="1" colspan="1">N</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="t001fn001">
<p>The symbols Y and N denote Yes and No respectively describing the presence or absence of the feature.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="sec008">
<title>Performance in processing large datasets</title>
<p>In order to test the scalability and the performance of NGS-QCbox, it was evaluated with data from 1, 100, 200 and 300 simulated paired-end samples (
<xref rid="pone.0139868.t002" ref-type="table">Table 2</xref>
)
<bold>.</bold>
The tools were evaluated by comparing the performance observed with 1 processor against that with 20 processors (parallel) on datasets of similar size. NGS-QCbox took 217 seconds to process one sample of size 4.38 Gb on one processor. This is a speedup of 2.76X over Prinseq-lite, 132X over NGS QC Toolkit, 2.9X over HTSeq, 1.65X over FastQC and 4.9X over FastX. The times taken to process 100, 200 and 300 samples were in corresponding proportion because the samples are processed serially; with one processor, the run time of all the programs scales linearly with the number of samples.</p>
<table-wrap id="pone.0139868.t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0139868.t002</object-id>
<label>Table 2</label>
<caption>
<title>Parallel performance comparison of NGS-QCbox with other tools/pipelines.</title>
</caption>
<alternatives>
<graphic id="pone.0139868.t002g" xlink:href="pone.0139868.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th colspan="8" align="center" rowspan="1">Quick QC mode</th>
</tr>
<tr>
<th rowspan="2" align="left" colspan="1">Samples</th>
<th rowspan="2" align="left" colspan="1">Processors</th>
<th align="left" rowspan="1" colspan="1">NGS-QCbox</th>
<th align="left" rowspan="1" colspan="1">Prinseq-lite</th>
<th align="left" rowspan="1" colspan="1">NGS QC Toolkit</th>
<th align="left" rowspan="1" colspan="1">HTSeq</th>
<th align="left" rowspan="1" colspan="1">FastQC</th>
<th align="left" rowspan="1" colspan="1">FastX</th>
</tr>
<tr>
<th align="left" rowspan="1" colspan="1">(seconds)</th>
<th align="left" rowspan="1" colspan="1">(seconds)</th>
<th align="left" rowspan="1" colspan="1">(seconds)</th>
<th align="left" rowspan="1" colspan="1">(seconds)</th>
<th align="left" rowspan="1" colspan="1">(seconds)</th>
<th align="left" rowspan="1" colspan="1">(seconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">217</td>
<td align="left" rowspan="1" colspan="1">600</td>
<td align="left" rowspan="1" colspan="1">28,618</td>
<td align="left" rowspan="1" colspan="1">630</td>
<td align="left" rowspan="1" colspan="1">360</td>
<td align="left" rowspan="1" colspan="1">1,073</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">21,849</td>
<td align="left" rowspan="1" colspan="1">60,116</td>
<td align="left" rowspan="1" colspan="1">2,861,800*</td>
<td align="left" rowspan="1" colspan="1">63,513</td>
<td align="left" rowspan="1" colspan="1">36,121</td>
<td align="left" rowspan="1" colspan="1">107,300*</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">200</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">43,741</td>
<td align="left" rowspan="1" colspan="1">120,232*</td>
<td align="left" rowspan="1" colspan="1">5,523,600*</td>
<td align="left" rowspan="1" colspan="1">127,026*</td>
<td align="left" rowspan="1" colspan="1">72,318</td>
<td align="left" rowspan="1" colspan="1">214,600*</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">300</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">65,319</td>
<td align="left" rowspan="1" colspan="1">180,348*</td>
<td align="left" rowspan="1" colspan="1">8,285,400*</td>
<td align="left" rowspan="1" colspan="1">190,539*</td>
<td align="left" rowspan="1" colspan="1">108,477*</td>
<td align="left" rowspan="1" colspan="1">321,900*</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td align="left" rowspan="1" colspan="1">217</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">10,020</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td align="left" rowspan="1" colspan="1">3,472</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">954,221</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">200</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td align="left" rowspan="1" colspan="1">4,636</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">1,908,442*</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">300</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td align="left" rowspan="1" colspan="1">7,189</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">2,862,663*</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td colspan="8" align="center" rowspan="1">
<bold>Complete QC mode</bold>
</td>
</tr>
<tr>
<td colspan="2" align="right" rowspan="1"></td>
<td colspan="2" align="left" rowspan="1">
<bold>NGS-QCbox (seconds)</bold>
</td>
<td colspan="4" align="left" rowspan="1">
<bold>Shell script (sequential processing time in seconds)</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td colspan="2" align="left" rowspan="1">2,800</td>
<td colspan="4" align="left" rowspan="1">2,838</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td colspan="2" align="left" rowspan="1">31,723</td>
<td colspan="4" align="left" rowspan="1">NA</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="t002fn001">
<p>The tools were evaluated by comparing the performance observed with 1 processor against that with 20 processors (parallel). To process one sample of size 4.38 Gb with one processor, NGS-QCbox takes 217 seconds. This is a speedup of 2.76X over Prinseq-lite, 132X over NGS QC Toolkit, 2.9X over HTSeq, 1.65X over FastQC and 4.9X over FastX. With one processor, the run time of all the programs scales linearly with the number of samples, so with 100, 200 and 300 samples the speedups are of the same order because the samples are processed serially. When processing 100 samples in parallel with 20 processors, the speedup obtained is 6.29X over the one-processor run. Speedups of 9.4X and 9X were observed for 200 and 300 samples. The runtime per sample is thus reduced to 23–34 seconds with parallelization, a large gain over running the samples serially. The “*” symbol indicates that the values are extrapolated assuming linear run time. Extrapolation is necessary because in such cases the run time extends to days or months. NA indicates that the program does not support parallelization. We also executed the commands used in the complete QC mode of the pipeline in sequential order, instead of in parallel, with one processor. Running the NGS-QCbox steps sequentially cost an additional 38 seconds per sample. When the complete QC mode was tested with 100 samples, parallel processing gave a speedup of 8.82X.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>For a single sample, the NGS-QCbox runtime with 20 processors is the same 217 seconds observed with 1 processor (
<xref rid="pone.0139868.t002" ref-type="table">Table 2</xref>
). This is because in both cases (1 processor or 20 processors) threads are opened based on the number of samples (one sample per thread), so a single sample uses only one thread from the pool. The performance gain becomes evident when the number of samples increases from 1 to 100, 200 and 300. When processing 100 samples in parallel with 20 processors, the speedup obtained is 6.29X over the one-processor run. Better speedups of 9.4X and 9X were observed for 200 and 300 samples. The runtime per sample is thus reduced to 23–34 seconds with parallelization, a large gain over running the samples serially. Programs such as Prinseq-lite, HTSeq and FastX run serially, can use only one processor and hence were not considered for the parallel comparison. NGS-QCbox is much faster than NGS QC Toolkit when 20 processors are used: in parallel mode, NGS-QCbox ran 46X, 275X, 412X and 398X faster than NGS QC Toolkit for 1, 100, 200 and 300 samples respectively. In complete QC mode, executing the steps sequentially for one sample with one processor took 38 seconds longer than the pipeline, underlining the efficiency of the pipeline in this mode. Scaling the number of samples to 100 with 20 processors gave a speedup of 8.82X compared to the run of one sample with one processor in complete QC mode (
<xref rid="pone.0139868.t002" ref-type="table">Table 2</xref>
). In summary, parallelization drastically reduced the time needed to process hundreds of samples. The approach would scale to thousands of samples, although disk I/O might then limit performance. Batch processing of samples in NGS-QCbox helps contain disk I/O: even when the number of samples exceeds the number of processors available, the limited number of threads in a batch limits disk reads/writes proportionately and keeps disk I/O from becoming a bottleneck while processing a large number of samples. This indirectly helps contain memory usage. NGS-QCbox consumes a negligible amount of memory because it reads only one line at a time from a sample file, and is therefore suitable for use on desktop machines.</p>
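<p>The speedups quoted above follow directly from the quick QC mode timings in Table 2. The short Python sketch below reproduces the arithmetic for NGS-QCbox; the variable and function names are ours and the timings are copied from the table.</p>

```python
# NGS-QCbox quick QC mode timings from Table 2, in seconds.
serial = {100: 21849, 200: 43741, 300: 65319}   # 1 processor
parallel = {100: 3472, 200: 4636, 300: 7189}    # 20 processors


def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel run time."""
    return serial_seconds / parallel_seconds


# Speedup of the 20-processor run over the 1-processor run.
speedups = {n: round(speedup(serial[n], parallel[n]), 2) for n in serial}
# Effective run time per sample in the 20-processor run.
per_sample = {n: round(parallel[n] / n, 1) for n in parallel}

for n in sorted(serial):
    print(n, "samples:", speedups[n], "X speedup,", per_sample[n], "s per sample")
```

<p>With 20 processors the ideal speedup would be 20X; the observed values of roughly 6X to 9X are consistent with the shared disk I/O discussed above limiting the gain.</p>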
<p>NGS-QCbox can be used to process any paired-end data from NGS experiments such as DNA-Seq, RAD-Seq, GBS and RNA-Seq. To process single-end read datasets, Raspberry may need to be used independently.</p>
<p>The NGS-QCbox pipeline and Raspberry are also available from Docker Hub (
<ext-link ext-link-type="uri" xlink:href="https://registry.hub.docker.com/">https://registry.hub.docker.com</ext-link>
) as a Docker image (dadu/ngsqcbox:v0.2.1 for Linux and dadu/ngsqcbox_win:v0.2 for Windows). Docker (docker.io) is a popular, portable and lightweight Linux container-based technology for hosting applications in a virtual environment. It eases distribution of the application and thereby helps resolve cross-platform compatibility issues. The Docker image resolves dependency issues across various Linux platforms and Windows.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="sec009">
<title>Conclusions</title>
<p>NGS-QCbox is a generic pipeline that integrates open-source NGS tools in order to process large datasets of any organism. It was designed and implemented to take advantage of parallelization. Parallelization enables quick analysis of datasets that would otherwise be a daunting task. In
<italic>quick</italic>
mode we observe a 6X to 9X speedup while scaling up to hundreds of datasets. Raspberry, an in-house tool developed for quality control of raw data, is integrated into the pipeline.</p>
</sec>
<sec sec-type="materials|methods" id="sec010">
<title>Materials and Methods</title>
<p>Raspberry, an in-house tool to process large datasets generated from Illumina next generation sequencers was developed in C language using htslib C library (
<ext-link ext-link-type="uri" xlink:href="http://htslib.org/">http://htslib.org</ext-link>
) API (Application Programming Interface). This library was chosen for its fast processing of datasets with low memory consumption. In addition, it supports reading compressed or uncompressed datasets. OpenMP (
<ext-link ext-link-type="uri" xlink:href="http://openmp.org/">http://openmp.org</ext-link>
) (v4.0) was used to process datasets in batches. CMake (
<ext-link ext-link-type="uri" xlink:href="http://www.cmake.org/">http://www.cmake.org</ext-link>
) (v2.8) build system was used to develop Raspberry as it supports cross-platform compilation. Static binaries compiled on x86_64 machine architecture are provided with the software to facilitate direct use of the application by naive users.</p>
<p>The NGS-QCbox pipeline was implemented by integrating NGS tools such as Bpipe, Sickle, bedtools, Bowtie 2 and SAMtools with the in-house software, Raspberry. Python, C and Bash shell were used extensively in building the application. Parallelization was designed at different levels. Python’s multiprocessing module was used to implement batch processing based on the number of processors requested by the user. Bpipe’s built-in parallelization of task blocks was used to compute genome coverage and variation information simultaneously.</p>
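<p>The batch processing idea can be sketched with a worker pool whose size equals the number of processors requested, so that at most that many samples are in flight at once. The sketch below is an assumption-laden illustration, not the pipeline's code: <italic>process_sample</italic> is a hypothetical placeholder for the per-sample work, and a thread pool from Python's multiprocessing package is used so the snippet runs in any context; multiprocessing.Pool offers the same map interface with separate processes.</p>

```python
from multiprocessing.pool import ThreadPool


def process_sample(sample):
    """Hypothetical placeholder for the per-sample QC work."""
    return sample, len(sample)


def run_in_batches(samples, processors):
    """Process samples with at most 'processors' running at the same time."""
    with ThreadPool(processes=processors) as pool:
        # map() hands out samples as workers become free and
        # returns the results in the original sample order.
        return pool.map(process_sample, samples)


# Seven hypothetical sample names processed three at a time.
samples = ["sample_%03d" % i for i in range(1, 8)]
results = run_in_batches(samples, processors=3)
print(results)
```

<p>Because the pool size caps the number of concurrent workers, the number of samples can exceed the number of processors without oversubscribing the machine.</p>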
<p>For benchmarking, a 4.38 Gb dataset comprising 100 bp paired-end reads was simulated from the chickpea genome [
<xref rid="pone.0139868.ref005" ref-type="bibr">5</xref>
] using ART simulator [
<xref rid="pone.0139868.ref016" ref-type="bibr">16</xref>
] (v2.1.8) (
<ext-link ext-link-type="uri" xlink:href="http://www.niehs.nih.gov/research/resources/software/biostatistics/art/">http://www.niehs.nih.gov/research/resources/software/biostatistics/art/</ext-link>
) at 10-fold genome coverage. The dataset is publicly available (
<ext-link ext-link-type="uri" xlink:href="https://github.com/CEG-ICRISAT/NGS-QCbox/blob/master/README.md#datasets-used-for-testing">https://github.com/CEG-ICRISAT/NGS-QCbox/blob/master/README.md#datasets-used-for-testing</ext-link>
) on iPlant resource [
<xref rid="pone.0139868.ref017" ref-type="bibr">17</xref>
] (
<ext-link ext-link-type="uri" xlink:href="http://www.iplantcollaborative.org/">www.iplantcollaborative.org</ext-link>
). The test was conducted in quick mode on an Ubuntu Linux server with x86_64 architecture.</p>
<p>This dataset was used to evaluate the performance of NGS-QCbox against five similar tools: Prinseq-lite, NGS QC Toolkit, HTSeq, FastQC and FastX. To establish a proof of concept for scalability, sample sets of 100, 200 and 300 samples were drawn from the same dataset to test the performance of the tools with one processor and with 20 processors (in parallel) independently. This enables comparison of runtime in serial versus parallel modes of NGS-QCbox. All the benchmarking tests were performed on a 64-bit server with no other load.</p>
</sec>
<sec sec-type="supplementary-material" id="sec011">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0139868.s001">
<label>S1 Table</label>
<caption>
<title>Sample output of the analysis of simulated NGS data using NGS-QCbox pipeline.</title>
<p>(XLS)</p>
</caption>
<media xlink:href="pone.0139868.s001.xls">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>We thank Manish Pandey for his invaluable suggestions towards improvement of the manuscript.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0139868.ref001">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Thudi</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Jackson</surname>
<given-names>SA</given-names>
</name>
,
<name>
<surname>May</surname>
<given-names>GD</given-names>
</name>
,
<name>
<surname>Varshney</surname>
<given-names>RK</given-names>
</name>
.
<article-title>Current state-of-art of sequencing technologies for plant genomics research</article-title>
.
<source>Briefings in Functional Genomics</source>
.
<year>2012</year>
;
<volume>11</volume>
(
<issue>1</issue>
):
<fpage>3</fpage>
–
<lpage>11</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bfgp/elr045">10.1093/bfgp/elr045</ext-link>
</comment>
<pub-id pub-id-type="pmid">22345601</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref002">
<label>2</label>
<mixed-citation publication-type="journal">
<name>
<surname>McCouch</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Baute</surname>
<given-names>GJ</given-names>
</name>
,
<name>
<surname>Bradeen</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Bramel</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Bretting</surname>
<given-names>PK</given-names>
</name>
,
<name>
<surname>Buckler</surname>
<given-names>E</given-names>
</name>
,
<etal>et al</etal>
<article-title>Agriculture: feeding the future</article-title>
.
<source>Nature</source>
.
<year>2013</year>
;
<volume>499</volume>
(
<issue>7456</issue>
):
<fpage>23</fpage>
–
<lpage>24</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/499023a">10.1038/499023a</ext-link>
</comment>
<pub-id pub-id-type="pmid">23823779</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref003">
<label>3</label>
<mixed-citation publication-type="journal">
<name>
<surname>Jiao</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Zhao</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Ren</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Song</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Zeng</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Guo</surname>
<given-names>J</given-names>
</name>
,
<etal>et al</etal>
<article-title>Genome-wide genetic changes during modern breeding of maize</article-title>
.
<source>Nature genetics</source>
.
<year>2012</year>
;
<volume>44</volume>
(
<issue>7</issue>
):
<fpage>812</fpage>
–
<lpage>815</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ng.2312">10.1038/ng.2312</ext-link>
</comment>
<pub-id pub-id-type="pmid">22660547</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref004">
<label>4</label>
<mixed-citation publication-type="journal">
<name>
<surname>Mace</surname>
<given-names>ES</given-names>
</name>
,
<name>
<surname>Tai</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Gilding</surname>
<given-names>EK</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Prentis</surname>
<given-names>PJ</given-names>
</name>
,
<name>
<surname>Bian</surname>
<given-names>L</given-names>
</name>
,
<etal>et al</etal>
<article-title>Whole-genome sequencing reveals untapped genetic potential in Africa's indigenous cereal crop sorghum</article-title>
.
<source>Nature communications</source>
.
<year>2013</year>
;
<volume>4</volume>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ncomms3320">10.1038/ncomms3320</ext-link>
</comment>
<pub-id pub-id-type="pmid">23982223</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref005">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Varshney</surname>
<given-names>RK</given-names>
</name>
,
<name>
<surname>Song</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Saxena</surname>
<given-names>RK</given-names>
</name>
,
<name>
<surname>Azam</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Yu</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Sharpe</surname>
<given-names>AG</given-names>
</name>
,
<etal>et al</etal>
<article-title>Draft genome sequence of chickpea (
<italic>Cicer arietinum</italic>
) provides a resource for trait improvement</article-title>
.
<source>Nature biotechnology</source>
.
<year>2013</year>
;
<volume>31</volume>
(
<issue>3</issue>
):
<fpage>240</fpage>
–
<lpage>246</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nbt.2491">10.1038/nbt.2491</ext-link>
</comment>
<pub-id pub-id-type="pmid">23354103</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref006">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Gardner</surname>
<given-names>SN</given-names>
</name>
,
<name>
<surname>Hall</surname>
<given-names>BG</given-names>
</name>
.
<article-title>When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes</article-title>
.
<source>PLoS One</source>
.
<year>2013</year>
;
<volume>8</volume>
(
<issue>12</issue>
):
<fpage>e81760</fpage>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0081760">10.1371/journal.pone.0081760</ext-link>
</comment>
<pub-id pub-id-type="pmid">24349125</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref007">
<label>7</label>
<mixed-citation publication-type="journal">
<name>
<surname>Bertels</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Silander</surname>
<given-names>OK</given-names>
</name>
,
<name>
<surname>Pachkov</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Rainey</surname>
<given-names>PB</given-names>
</name>
,
<name>
<surname>van Nimwegen</surname>
<given-names>E</given-names>
</name>
.
<article-title>Automated reconstruction of whole-genome phylogenies from short-sequence reads</article-title>
.
<source>Molecular biology and evolution</source>
.
<year>2014</year>
;
<volume>31</volume>
(
<issue>5</issue>
):
<fpage>1077</fpage>
–
<lpage>1088</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/molbev/msu088">10.1093/molbev/msu088</ext-link>
</comment>
<pub-id pub-id-type="pmid">24600054</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref008">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Patel</surname>
<given-names>RK</given-names>
</name>
,
<name>
<surname>Jain</surname>
<given-names>M</given-names>
</name>
.
<article-title>NGS QC Toolkit: a toolkit for quality control of next generation sequencing data</article-title>
.
<source>PLoS One</source>
.
<year>2012</year>
;
<volume>7</volume>
(
<issue>2</issue>
):
<fpage>e30619</fpage>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0030619">10.1371/journal.pone.0030619</ext-link>
</comment>
<pub-id pub-id-type="pmid">22312429</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref009">
<label>9</label>
<mixed-citation publication-type="journal">
<name>
<surname>Anders</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Pyl</surname>
<given-names>PT</given-names>
</name>
,
<name>
<surname>Huber</surname>
<given-names>W</given-names>
</name>
.
<article-title>HTSeq—A Python framework to work with high-throughput sequencing data</article-title>
.
<source>Bioinformatics</source>
.
<year>2014</year>
;
<fpage>btu638</fpage>
.</mixed-citation>
</ref>
<ref id="pone.0139868.ref010">
<label>10</label>
<mixed-citation publication-type="journal">
<name>
<surname>Cock</surname>
<given-names>PJ</given-names>
</name>
,
<name>
<surname>Fields</surname>
<given-names>CJ</given-names>
</name>
,
<name>
<surname>Goto</surname>
<given-names>N</given-names>
</name>
,
<name>
<surname>Heuer</surname>
<given-names>ML</given-names>
</name>
,
<name>
<surname>Rice</surname>
<given-names>PM</given-names>
</name>
.
<article-title>The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants</article-title>
.
<source>Nucleic acids research</source>
.
<year>2010</year>
;
<volume>38</volume>
(
<issue>6</issue>
):
<fpage>1767</fpage>
–
<lpage>1771</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/gkp1137">10.1093/nar/gkp1137</ext-link>
</comment>
<pub-id pub-id-type="pmid">20015970</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref011">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Sadedin</surname>
<given-names>SP</given-names>
</name>
,
<name>
<surname>Pope</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Oshlack</surname>
<given-names>A</given-names>
</name>
.
<article-title>Bpipe: a tool for running and managing bioinformatics pipelines</article-title>
.
<source>Bioinformatics</source>
.
<year>2012</year>
;
<volume>28</volume>
(
<issue>11</issue>
):
<fpage>1525</fpage>
–
<lpage>1526</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/bts167">10.1093/bioinformatics/bts167</ext-link>
</comment>
<pub-id pub-id-type="pmid">22500002</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref012">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
.
<article-title>Fast gapped-read alignment with Bowtie 2</article-title>
.
<source>Nature methods</source>
.
<year>2012</year>
;
<volume>9</volume>
(
<issue>4</issue>
):
<fpage>357</fpage>
–
<lpage>359</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nmeth.1923">10.1038/nmeth.1923</ext-link>
</comment>
<pub-id pub-id-type="pmid">22388286</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref013">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Handsaker</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Wysoker</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Fennell</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Homer</surname>
<given-names>N</given-names>
</name>
,
<etal>et al</etal>
<article-title>The sequence alignment/map format and SAMtools</article-title>
.
<source>Bioinformatics</source>
.
<year>2009</year>
;
<volume>25</volume>
(
<issue>16</issue>
):
<fpage>2078</fpage>
–
<lpage>2079</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btp352">10.1093/bioinformatics/btp352</ext-link>
</comment>
<pub-id pub-id-type="pmid">19505943</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref014">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Quinlan</surname>
<given-names>AR</given-names>
</name>
,
<name>
<surname>Hall</surname>
<given-names>IM</given-names>
</name>
.
<article-title>BEDTools: a flexible suite of utilities for comparing genomic features</article-title>
.
<source>Bioinformatics</source>
.
<year>2010</year>
;
<volume>26</volume>
(
<issue>6</issue>
):
<fpage>841</fpage>
–
<lpage>842</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btq033">10.1093/bioinformatics/btq033</ext-link>
</comment>
<pub-id pub-id-type="pmid">20110278</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref015">
<label>15</label>
<mixed-citation publication-type="journal">
<name>
<surname>Schmieder</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Edwards</surname>
<given-names>R</given-names>
</name>
.
<article-title>Quality control and preprocessing of metagenomic datasets</article-title>
.
<source>Bioinformatics</source>
.
<year>2011</year>
;
<volume>27</volume>
(
<issue>6</issue>
):
<fpage>863</fpage>
–
<lpage>864</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btr026">10.1093/bioinformatics/btr026</ext-link>
</comment>
<pub-id pub-id-type="pmid">21278185</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref016">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Huang</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Myers</surname>
<given-names>JR</given-names>
</name>
,
<name>
<surname>Marth</surname>
<given-names>GT</given-names>
</name>
.
<article-title>ART: a next-generation sequencing read simulator</article-title>
.
<source>Bioinformatics</source>
.
<year>2012</year>
;
<volume>28</volume>
(
<issue>4</issue>
):
<fpage>593</fpage>
–
<lpage>594</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btr708">10.1093/bioinformatics/btr708</ext-link>
</comment>
<pub-id pub-id-type="pmid">22199392</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0139868.ref017">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Goff</surname>
<given-names>SA</given-names>
</name>
,
<name>
<surname>Vaughn</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>McKay</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Lyons</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Stapleton</surname>
<given-names>AE</given-names>
</name>
,
<name>
<surname>Gessler</surname>
<given-names>D</given-names>
</name>
,
<etal>et al</etal>
<article-title>The iPlant collaborative: cyberinfrastructure for plant biology</article-title>
.
<source>Frontiers in plant science</source>
.
<year>2011</year>
;
<volume>2</volume>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.3389/fpls.2011.00034">10.3389/fpls.2011.00034</ext-link>
</comment>
<pub-id pub-id-type="pmid">22645531</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000081 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000081 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4604202
   |texte=   NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i -Sk "pubmed:26460497" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd \
       | NlmPubMed2Wicri -a CyberinfraV1

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024