Improved assembly of noisy long reads by k-mer validation.
Internal identifier: 001208 (Main/Exploration); previous: 001207; next: 001209
Authors: Antonio Bernardo Carvalho [Brazil]; Eduardo G. Dupim [Brazil]; Gabriel Goldstein [Brazil]
Source:
- Genome research [1549-5469]; 2016.
Abstract
Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.
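The validation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce the MHAP integration; the function names (`valid_kmers`, `seed_kmers`) and the `min_count`/`max_count` thresholds are hypothetical, and reverse complements are ignored for brevity. It only shows the principle: a k-mer from a noisy long read is kept as an alignment seed if it also occurs in the short reads, and an optional upper count limit drops repetitive k-mers.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k from seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def valid_kmers(short_reads, k, min_count=1, max_count=None):
    """Return the set of k-mers considered trustworthy.

    A k-mer is valid if it occurs at least min_count times in the
    accurate short reads. If max_count is given, k-mers seen more often
    than that are treated as repetitive and excluded (assumed thresholds).
    """
    counts = Counter()
    for read in short_reads:
        counts.update(kmers(read, k))
    return {
        kmer
        for kmer, n in counts.items()
        if n >= min_count and (max_count is None or n <= max_count)
    }


def seed_kmers(long_read, valid, k):
    """Return (position, k-mer) pairs of the long read that are valid.

    K-mers absent from the valid set are presumed sequencing errors
    (or repeats) and are skipped at the seeding step.
    """
    return [(i, kmer) for i, kmer in enumerate(kmers(long_read, k)) if kmer in valid]


if __name__ == "__main__":
    short_reads = ["ACGTACGT", "CGTACGTT"]  # toy stand-ins for accurate short reads
    noisy_read = "ACGTAGGT"                 # toy long read with one substitution

    # Validation only: k-mers spanning the error are dropped.
    print(seed_kmers(noisy_read, valid_kmers(short_reads, 4), 4))
    # [(0, 'ACGT'), (1, 'CGTA')]

    # Validation plus repeat exclusion: 'ACGT' occurs 3 times and is dropped.
    print(seed_kmers(noisy_read, valid_kmers(short_reads, 4, max_count=2), 4))
    # [(1, 'CGTA')]
```

In practice the k-mer counting would be done with a dedicated counter on the full read set, and the k-mer size and count thresholds would depend on the genome and coverage; the values above are chosen only to keep the toy example readable.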
DOI: 10.1101/gr.209247.116
PubMed: 27831497
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PubMed, to step Corpus: 000E98
- to stream PubMed, to step Curation: 000E98
- to stream PubMed, to step Checkpoint: 001057
- to stream Ncbi, to step Merge: 001838
- to stream Ncbi, to step Curation: 001838
- to stream Ncbi, to step Checkpoint: 001838
- to stream Main, to step Merge: 001212
- to stream Main, to step Curation: 001208
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Improved assembly of noisy long reads by k-mer validation.</title>
<author><name sortKey="Carvalho, Antonio Bernardo" sort="Carvalho, Antonio Bernardo" uniqKey="Carvalho A" first="Antonio Bernardo" last="Carvalho">Antonio Bernardo Carvalho</name>
<affiliation wicri:level="3"><nlm:affiliation>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.</nlm:affiliation>
<country xml:lang="fr">Brésil</country>
<wicri:regionArea>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro</wicri:regionArea>
<placeName><settlement type="city">Rio de Janeiro</settlement>
<region type="state">État de Rio de Janeiro</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Dupim, Eduardo G" sort="Dupim, Eduardo G" uniqKey="Dupim E" first="Eduardo G" last="Dupim">Eduardo G. Dupim</name>
<affiliation wicri:level="3"><nlm:affiliation>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.</nlm:affiliation>
<country xml:lang="fr">Brésil</country>
<wicri:regionArea>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro</wicri:regionArea>
<placeName><settlement type="city">Rio de Janeiro</settlement>
<region type="state">État de Rio de Janeiro</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Goldstein, Gabriel" sort="Goldstein, Gabriel" uniqKey="Goldstein G" first="Gabriel" last="Goldstein">Gabriel Goldstein</name>
<affiliation wicri:level="3"><nlm:affiliation>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.</nlm:affiliation>
<country xml:lang="fr">Brésil</country>
<wicri:regionArea>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro</wicri:regionArea>
<placeName><settlement type="city">Rio de Janeiro</settlement>
<region type="state">État de Rio de Janeiro</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2016">2016</date>
<idno type="RBID">pubmed:27831497</idno>
<idno type="pmid">27831497</idno>
<idno type="doi">10.1101/gr.209247.116</idno>
<idno type="wicri:Area/PubMed/Corpus">000E98</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000E98</idno>
<idno type="wicri:Area/PubMed/Curation">000E98</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000E98</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001057</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Checkpoint">001057</idno>
<idno type="wicri:Area/Ncbi/Merge">001838</idno>
<idno type="wicri:Area/Ncbi/Curation">001838</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">001838</idno>
<idno type="wicri:Area/Main/Merge">001212</idno>
<idno type="wicri:Area/Main/Curation">001208</idno>
<idno type="wicri:Area/Main/Exploration">001208</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Improved assembly of noisy long reads by k-mer validation.</title>
<author><name sortKey="Carvalho, Antonio Bernardo" sort="Carvalho, Antonio Bernardo" uniqKey="Carvalho A" first="Antonio Bernardo" last="Carvalho">Antonio Bernardo Carvalho</name>
<affiliation wicri:level="3"><nlm:affiliation>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.</nlm:affiliation>
<country xml:lang="fr">Brésil</country>
<wicri:regionArea>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro</wicri:regionArea>
<placeName><settlement type="city">Rio de Janeiro</settlement>
<region type="state">État de Rio de Janeiro</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Dupim, Eduardo G" sort="Dupim, Eduardo G" uniqKey="Dupim E" first="Eduardo G" last="Dupim">Eduardo G. Dupim</name>
<affiliation wicri:level="3"><nlm:affiliation>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.</nlm:affiliation>
<country xml:lang="fr">Brésil</country>
<wicri:regionArea>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro</wicri:regionArea>
<placeName><settlement type="city">Rio de Janeiro</settlement>
<region type="state">État de Rio de Janeiro</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Goldstein, Gabriel" sort="Goldstein, Gabriel" uniqKey="Goldstein G" first="Gabriel" last="Goldstein">Gabriel Goldstein</name>
<affiliation wicri:level="3"><nlm:affiliation>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil.</nlm:affiliation>
<country xml:lang="fr">Brésil</country>
<wicri:regionArea>Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro</wicri:regionArea>
<placeName><settlement type="city">Rio de Janeiro</settlement>
<region type="state">État de Rio de Janeiro</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j">Genome research</title>
<idno type="eISSN">1549-5469</idno>
<imprint><date when="2016" type="published">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Animals</term>
<term>Contig Mapping (methods)</term>
<term>Contig Mapping (standards)</term>
<term>High-Throughput Nucleotide Sequencing (methods)</term>
<term>High-Throughput Nucleotide Sequencing (standards)</term>
<term>Humans</term>
<term>Nanopores</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Sequence Analysis, DNA (standards)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Algorithmes</term>
<term>Analyse de séquence d'ADN (méthodes)</term>
<term>Analyse de séquence d'ADN (normes)</term>
<term>Animaux</term>
<term>Cartographie de contigs (méthodes)</term>
<term>Cartographie de contigs (normes)</term>
<term>Humains</term>
<term>Nanopores</term>
<term>Séquençage nucléotidique à haut débit (méthodes)</term>
<term>Séquençage nucléotidique à haut débit (normes)</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Contig Mapping</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" qualifier="normes" xml:lang="fr"><term>Analyse de séquence d'ADN</term>
<term>Cartographie de contigs</term>
<term>Séquençage nucléotidique à haut débit</term>
</keywords>
<keywords scheme="MESH" qualifier="standards" xml:lang="en"><term>Contig Mapping</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Animals</term>
<term>Humans</term>
<term>Nanopores</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Animaux</term>
<term>Cartographie de contigs</term>
<term>Humains</term>
<term>Nanopores</term>
<term>Séquençage nucléotidique à haut débit</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.</div>
</front>
</TEI>
<affiliations><list><country><li>Brésil</li>
</country>
<region><li>État de Rio de Janeiro</li>
</region>
<settlement><li>Rio de Janeiro</li>
</settlement>
</list>
<tree><country name="Brésil"><region name="État de Rio de Janeiro"><name sortKey="Carvalho, Antonio Bernardo" sort="Carvalho, Antonio Bernardo" uniqKey="Carvalho A" first="Antonio Bernardo" last="Carvalho">Antonio Bernardo Carvalho</name>
</region>
<name sortKey="Dupim, Eduardo G" sort="Dupim, Eduardo G" uniqKey="Dupim E" first="Eduardo G" last="Dupim">Eduardo G. Dupim</name>
<name sortKey="Goldstein, Gabriel" sort="Goldstein, Gabriel" uniqKey="Goldstein G" first="Gabriel" last="Goldstein">Gabriel Goldstein</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001208 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001208 | SxmlIndent | more
To link to this page in the Wicri network
{{Explor lien |wiki= Sante |area= MersV1 |flux= Main |étape= Exploration |type= RBID |clé= pubmed:27831497 |texte= Improved assembly of noisy long reads by k-mer validation. }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i -Sk "pubmed:27831497" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd \
       | NlmPubMed2Wicri -a MersV1
This area was generated with Dilib version V0.6.33.