MersV1, Main, Exploration, bibRecord, 002641

EDAR: an efficient error detection and removal algorithm for next generation sequencing data.

Internal identifier: 002641 (Main/Exploration); previous: 002640; next: 002642

Authors: Xiaohong Zhao [United States]; Lance E. Palmer; Randall Bolanos; Cristian Mircean; Dan Fasulo; Gayle M. Wittenberg

Source :

Journal of computational biology : a journal of computational molecular cell biology [ 1557-8666 ] ; 2010.

RBID : pubmed:20973743

French descriptors

KwdFr :
- Algorithmes, Alignement de séquences (), Analyse de séquence d'ADN (), Biologie informatique (), Génome.
MESH :
- Algorithmes, Alignement de séquences, Analyse de séquence d'ADN, Biologie informatique, Génome.

English descriptors

KwdEn :
- Algorithms, Computational Biology (methods), Genome, Sequence Alignment (methods), Sequence Analysis, DNA (methods).
MESH :
- methods : Computational Biology, Sequence Alignment, Sequence Analysis, DNA.
- Algorithms, Genome.

Abstract

Genomic sequencing techniques introduce experimental errors into reads which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length k substrings (k-mers) it contains are calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error-rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has comparable accuracy to methods that correct rather than remove errors and when the error rate is greater than 3% for simulated data sets, it performs better. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
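The k-count idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names (`kmers`, `kmer_counts`, `split_read`), the choice `k = 4`, and the fixed `min_count` threshold are all assumptions made for the example. In particular, a simple threshold on the k-count stands in for the variable-bandwidth mean-shift clustering and the error-location heuristic that EDAR actually uses.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(reads, k):
    """Count how often each k-mer occurs in the whole data set (its k-count)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def split_read(read, counts, k, min_count):
    """Split a read into fragments built only from well-supported k-mers.

    Assumption: a k-mer is "trusted" when its k-count is >= min_count.
    This fixed threshold replaces the mean-shift clustering step.
    """
    fragments = []
    start = None  # index of the first k-mer in the current trusted run
    for i, kmer in enumerate(kmers(read, k)):
        trusted = counts[kmer] >= min_count
        if trusted and start is None:
            start = i
        elif not trusted and start is not None:
            # The run spans k-mers start..i-1, i.e. bases start..i+k-2.
            fragments.append(read[start:i + k - 1])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments


# Hypothetical toy data: five correct reads and one with a substitution.
reads = ["ACGTACGTAC"] * 5 + ["ACGTAGGTAC"]
counts = kmer_counts(reads, k=4)
print(split_read("ACGTAGGTAC", counts, k=4, min_count=2))
```

On this toy input the erroneous read is split into the fragments `ACGTA` and `GTAC`, neither of which contains the substituted base, while an error-free read is returned as a single fragment. This mirrors the removal strategy in the abstract, including its stated drawback: for short reads the surviving fragments can become very small.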

DOI: 10.1089/cmb.2010.0127
PubMed: 20973743

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">EDAR: an efficient error detection and removal algorithm for next generation sequencing data.</title>
<author><name sortKey="Zhao, Xiaohong" sort="Zhao, Xiaohong" uniqKey="Zhao X" first="Xiaohong" last="Zhao">Xiaohong Zhao</name>
<affiliation wicri:level="2"><nlm:affiliation>Siemens Corporate Research , Princeton, New Jersey, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Siemens Corporate Research , Princeton, New Jersey</wicri:regionArea>
<placeName><region type="state">New Jersey</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Palmer, Lance E" sort="Palmer, Lance E" uniqKey="Palmer L" first="Lance E" last="Palmer">Lance E. Palmer</name>
</author>
<author><name sortKey="Bolanos, Randall" sort="Bolanos, Randall" uniqKey="Bolanos R" first="Randall" last="Bolanos">Randall Bolanos</name>
</author>
<author><name sortKey="Mircean, Cristian" sort="Mircean, Cristian" uniqKey="Mircean C" first="Cristian" last="Mircean">Cristian Mircean</name>
</author>
<author><name sortKey="Fasulo, Dan" sort="Fasulo, Dan" uniqKey="Fasulo D" first="Dan" last="Fasulo">Dan Fasulo</name>
</author>
<author><name sortKey="Wittenberg, Gayle M" sort="Wittenberg, Gayle M" uniqKey="Wittenberg G" first="Gayle M" last="Wittenberg">Gayle M. Wittenberg</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2010">2010</date>
<idno type="RBID">pubmed:20973743</idno>
<idno type="pmid">20973743</idno>
<idno type="doi">10.1089/cmb.2010.0127</idno>
<idno type="wicri:Area/PubMed/Corpus">001F28</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001F28</idno>
<idno type="wicri:Area/PubMed/Curation">001F28</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001F28</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001E51</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001E51</idno>
<idno type="wicri:Area/Ncbi/Merge">000784</idno>
<idno type="wicri:Area/Ncbi/Curation">000784</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000784</idno>
<idno type="wicri:Area/Main/Merge">002666</idno>
<idno type="wicri:Area/Main/Curation">002641</idno>
<idno type="wicri:Area/Main/Exploration">002641</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">EDAR: an efficient error detection and removal algorithm for next generation sequencing data.</title>
<author><name sortKey="Zhao, Xiaohong" sort="Zhao, Xiaohong" uniqKey="Zhao X" first="Xiaohong" last="Zhao">Xiaohong Zhao</name>
<affiliation wicri:level="2"><nlm:affiliation>Siemens Corporate Research , Princeton, New Jersey, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Siemens Corporate Research , Princeton, New Jersey</wicri:regionArea>
<placeName><region type="state">New Jersey</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Palmer, Lance E" sort="Palmer, Lance E" uniqKey="Palmer L" first="Lance E" last="Palmer">Lance E. Palmer</name>
</author>
<author><name sortKey="Bolanos, Randall" sort="Bolanos, Randall" uniqKey="Bolanos R" first="Randall" last="Bolanos">Randall Bolanos</name>
</author>
<author><name sortKey="Mircean, Cristian" sort="Mircean, Cristian" uniqKey="Mircean C" first="Cristian" last="Mircean">Cristian Mircean</name>
</author>
<author><name sortKey="Fasulo, Dan" sort="Fasulo, Dan" uniqKey="Fasulo D" first="Dan" last="Fasulo">Dan Fasulo</name>
</author>
<author><name sortKey="Wittenberg, Gayle M" sort="Wittenberg, Gayle M" uniqKey="Wittenberg G" first="Gayle M" last="Wittenberg">Gayle M. Wittenberg</name>
</author>
</analytic>
<series><title level="j">Journal of computational biology : a journal of computational molecular cell biology</title>
<idno type="eISSN">1557-8666</idno>
<imprint><date when="2010" type="published">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Computational Biology (methods)</term>
<term>Genome</term>
<term>Sequence Alignment (methods)</term>
<term>Sequence Analysis, DNA (methods)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Algorithmes</term>
<term>Alignement de séquences ()</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Biologie informatique ()</term>
<term>Génome</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Computational Biology</term>
<term>Sequence Alignment</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Genome</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Analyse de séquence d'ADN</term>
<term>Biologie informatique</term>
<term>Génome</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Genomic sequencing techniques introduce experimental errors into reads which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length k substrings (k-mers) it contains are calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error-rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has comparable accuracy to methods that correct rather than remove errors and when the error rate is greater than 3% for simulated data sets, it performs better. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>New Jersey</li>
</region>
</list>
<tree><noCountry><name sortKey="Bolanos, Randall" sort="Bolanos, Randall" uniqKey="Bolanos R" first="Randall" last="Bolanos">Randall Bolanos</name>
<name sortKey="Fasulo, Dan" sort="Fasulo, Dan" uniqKey="Fasulo D" first="Dan" last="Fasulo">Dan Fasulo</name>
<name sortKey="Mircean, Cristian" sort="Mircean, Cristian" uniqKey="Mircean C" first="Cristian" last="Mircean">Cristian Mircean</name>
<name sortKey="Palmer, Lance E" sort="Palmer, Lance E" uniqKey="Palmer L" first="Lance E" last="Palmer">Lance E. Palmer</name>
<name sortKey="Wittenberg, Gayle M" sort="Wittenberg, Gayle M" uniqKey="Wittenberg G" first="Gayle M" last="Wittenberg">Gayle M. Wittenberg</name>
</noCountry>
<country name="États-Unis"><region name="New Jersey"><name sortKey="Zhao, Xiaohong" sort="Zhao, Xiaohong" uniqKey="Zhao X" first="Xiaohong" last="Zhao">Xiaohong Zhao</name>
</region>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002641 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002641 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     pubmed:20973743
   |texte=   EDAR: an efficient error detection and removal algorithm for next generation sequencing data.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:20973743" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

MERS exploration server

Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.