MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Better quality score compression through sequence-based quality smoothing

Internal identifier: 000287 (Pmc/Curation); previous: 000286; next: 000288

Authors: Yoshihiro Shibuya [Italy, France]; Matteo Comin [Italy]

Source:

RBID : PMC:6873394

Abstract

Motivation

Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially, but storage capacity is not keeping pace, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diverse than necessary. For this reason, many tools, such as Quartz and GeneCodeq, alter (smooth) quality scores to improve compressibility without changing the important information they carry for downstream analyses such as SNP calling.

Results

We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a k-mer dictionary, together with an effective smoothing algorithm that maintains high precision in SNP calling pipelines while reducing the entropy of the quality scores.
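To make the role of the FM-Index concrete, here is a minimal Python sketch of backward search, the operation that lets an FM-Index answer "does this k-mer occur in the indexed text?". It is a toy under stated assumptions and not YALFF's implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and the names `build_fm_index` and `count` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy, uncompressed FM-index for `text` (illustration only)."""
    text += "$"  # sentinel, lexicographically smaller than A, C, G, T
    # Naive suffix array construction: fine for a toy, not for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count(fm, pattern):
    """Return how many times `pattern` occurs, using backward search."""
    first, occ, n = fm
    lo, hi = 0, n  # half-open interval of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


fm = build_fm_index("ACGTACGA")
print(count(fm, "ACG"), count(fm, "TTT"))
```

A k-mer is considered present in the dictionary when `count` returns a value greater than zero. A production index would replace the dense `occ` table with a succinct structure, which is where the memory savings come from.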

We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing them, which improves the compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources.
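The idea of dictionary-based quality smoothing can be illustrated with a deliberately simplified Python sketch. It is not YALFF's actual algorithm, whose details are not given in this record: it assumes a plain Python set of trusted k-mers in place of the FM-Index dictionary, a fixed replacement quality character `"I"`, and hypothetical function names `smooth_qualities` and `entropy`. Every base covered by a trusted k-mer gets the same high quality value, which lowers the entropy of the quality string.

```python
import math
from collections import Counter


def smooth_qualities(seq, qual, kmers, k, high="I"):
    """Replace the quality of every base covered by a trusted k-mer with `high`."""
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in kmers:  # stand-in for an FM-Index lookup
            for j in range(i, i + k):
                covered[j] = True
    # Bases not covered by any trusted k-mer keep their original quality.
    return "".join(high if c else q for c, q in zip(covered, qual))


def entropy(s):
    """Empirical Shannon entropy of a string, in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


seq = "ACGTTTGA"
qual = "!#5A?I+@"
smoothed = smooth_qualities(seq, qual, kmers={"ACGT"}, k=4)
print(smoothed, entropy(qual), entropy(smoothed))
```

In this toy read only the first four bases are covered by the trusted k-mer, so only their qualities are overwritten. The remaining bases keep their original scores, which is how a smoother of this kind tries to preserve the information needed at potential variant sites.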

Availability

https://github.com/yhhshb/yalff

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2883-5) contains supplementary material, which is available to authorized users.


DOI: 10.1186/s12859-019-2883-5
PubMed: 31757199
PubMed Central: 6873394

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Better quality score compression through sequence-based quality smoothing</title>
<author>
<name sortKey="Shibuya, Yoshihiro" sort="Shibuya, Yoshihiro" uniqKey="Shibuya Y" first="Yoshihiro" last="Shibuya">Yoshihiro Shibuya</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
via Gradenigo 6/A, Padova, Italy</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>via Gradenigo 6/A, Padova</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9103 9111</institution-id>
<institution-id institution-id-type="GRID">grid.462940.d</institution-id>
<institution>Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée,</institution>
</institution-wrap>
Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>Bâtiment Copernic - 5, bd Descartes, Champs sur Marne</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
via Gradenigo 6/A, Padova, Italy</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>via Gradenigo 6/A, Padova</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31757199</idno>
<idno type="pmc">6873394</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6873394</idno>
<idno type="RBID">PMC:6873394</idno>
<idno type="doi">10.1186/s12859-019-2883-5</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000287</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000287</idno>
<idno type="wicri:Area/Pmc/Curation">000287</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000287</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Better quality score compression through sequence-based quality smoothing</title>
<author>
<name sortKey="Shibuya, Yoshihiro" sort="Shibuya, Yoshihiro" uniqKey="Shibuya Y" first="Yoshihiro" last="Shibuya">Yoshihiro Shibuya</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
via Gradenigo 6/A, Padova, Italy</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>via Gradenigo 6/A, Padova</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9103 9111</institution-id>
<institution-id institution-id-type="GRID">grid.462940.d</institution-id>
<institution>Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée,</institution>
</institution-wrap>
Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>Bâtiment Copernic - 5, bd Descartes, Champs sur Marne</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
via Gradenigo 6/A, Padova, Italy</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>via Gradenigo 6/A, Padova</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Motivation</title>
<p>Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially, but storage capacity is not keeping pace, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diverse than necessary. For this reason, many tools, such as Quartz and GeneCodeq, alter (smooth) quality scores to improve compressibility without changing the important information they carry for downstream analyses such as SNP calling.</p>
</sec>
<sec>
<title>Results</title>
<p>We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a k-mer dictionary, together with an effective smoothing algorithm that maintains high precision in SNP calling pipelines while reducing the entropy of the quality scores.</p>
<p>We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing them, which improves the compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources.</p>
</sec>
<sec>
<title>Availability</title>
<p>
<ext-link ext-link-type="uri" xlink:href="https://github.com/yhhshb/yalff">https://github.com/yhhshb/yalff</ext-link>
</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-019-2883-5) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">31757199</article-id>
<article-id pub-id-type="pmc">6873394</article-id>
<article-id pub-id-type="publisher-id">2883</article-id>
<article-id pub-id-type="doi">10.1186/s12859-019-2883-5</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Better quality score compression through sequence-based quality smoothing</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Shibuya</surname>
<given-names>Yoshihiro</given-names>
</name>
<address>
<email>yoshihiro.shibuya@studenti.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Comin</surname>
<given-names>Matteo</given-names>
</name>
<address>
<email>comin@dei.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
via Gradenigo 6/A, Padova, Italy</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9103 9111</institution-id>
<institution-id institution-id-type="GRID">grid.462940.d</institution-id>
<institution>Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée,</institution>
</institution-wrap>
Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>22</day>
<month>11</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>22</day>
<month>11</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>20</volume>
<issue>Suppl 9</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</issue-sponsor>
<elocation-id>302</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>4</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>7</day>
<month>5</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Motivation</title>
<p>Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially, but storage capacity is not keeping pace, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diverse than necessary. For this reason, many tools, such as Quartz and GeneCodeq, alter (smooth) quality scores to improve compressibility without changing the important information they carry for downstream analyses such as SNP calling.</p>
</sec>
<sec>
<title>Results</title>
<p>We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a k-mer dictionary, together with an effective smoothing algorithm that maintains high precision in SNP calling pipelines while reducing the entropy of the quality scores.</p>
<p>We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing them, which improves the compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources.</p>
</sec>
<sec>
<title>Availability</title>
<p>
<ext-link ext-link-type="uri" xlink:href="https://github.com/yhhshb/yalff">https://github.com/yhhshb/yalff</ext-link>
</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-019-2883-5) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>FASTQ compression</kwd>
<kwd>BWT</kwd>
<kwd>FM-Index</kwd>
</kwd-group>
<conference xlink:href="http://bioinformatics.it/">
<conf-name>Annual Meeting of the Bioinformatics Italian Society (BITS 2018)</conf-name>
<conf-acronym>BITS 2018</conf-acronym>
<conf-loc>Turin, Italy</conf-loc>
<conf-date>27 - 29 June 2018</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000287 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000287 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:6873394
   |texte=   Better quality score compression through sequence-based quality smoothing
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:31757199" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021