MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.

Internal identifier: 000D17 (PubMed/Corpus); previous: 000D16; next: 000D18

Authors: Abedalrhman Alkhateeb; Luis Rueda

RBID : pubmed:28414515

English descriptors

Abstract

Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. The Zseq algorithm provides a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.

DOI: 10.1089/cmb.2017.0021
PubMed: 28414515
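
The scoring and filtering steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default `k`, the `max_gc` cutoff, and the rule of simply skipping k-mers that contain ambiguous bases or exceed the GC cutoff are all assumptions made for illustration. The abstract only states that unique k-mers are counted per read, that ambiguous nucleotides and high GC content are taken into account, and that reads with a z-score below a user-defined threshold are filtered out.

```python
from statistics import mean, pstdev


def kmer_score(read, k=4, max_gc=0.8):
    """Count distinct k-mers in a read.

    Assumed rules (not taken from the paper): k-mers containing any base
    other than A/C/G/T, or with a GC fraction above max_gc, are skipped.
    """
    seen = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k].upper()
        if any(base not in "ACGT" for base in kmer):
            continue  # ambiguous nucleotide such as N
        gc_fraction = sum(base in "GC" for base in kmer) / k
        if gc_fraction > max_gc:
            continue  # very GC-rich k-mer
        seen.add(kmer)
    return len(seen)


def zseq_like_filter(reads, k=4, threshold=-1.0, max_gc=0.8):
    """Keep reads whose score z-score is at least `threshold`.

    First pass: score every read. Second pass: convert scores to z-scores
    using the population mean and standard deviation, then filter.
    """
    if not reads:
        return []
    scores = [kmer_score(read, k, max_gc) for read in reads]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All reads score the same, so no z-score can separate them.
        return list(reads)
    return [
        read
        for read, score in zip(reads, scores)
        if (score - mu) / sigma >= threshold
    ]
```

Both passes visit each read once, which matches the abstract's description of a linear method that sweeps through the sequences a second time. In this sketch a homopolymer read such as `AAAAAAAAAAAA` scores 1 and a read dominated by `N` scores 0, so both fall below a threshold of 0 when mixed with more complex reads.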

Links to Exploration step

pubmed:28414515

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.</title>
<author>
<name sortKey="Alkhateeb, Abedalrhman" sort="Alkhateeb, Abedalrhman" uniqKey="Alkhateeb A" first="Abedalrhman" last="Alkhateeb">Abedalrhman Alkhateeb</name>
<affiliation>
<nlm:affiliation>School of Computer Science, University of Windsor, Windsor, Canada.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Rueda, Luis" sort="Rueda, Luis" uniqKey="Rueda L" first="Luis" last="Rueda">Luis Rueda</name>
<affiliation>
<nlm:affiliation>School of Computer Science, University of Windsor, Windsor, Canada.</nlm:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2017">2017</date>
<idno type="RBID">pubmed:28414515</idno>
<idno type="pmid">28414515</idno>
<idno type="doi">10.1089/cmb.2017.0021</idno>
<idno type="wicri:Area/PubMed/Corpus">000D17</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000D17</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.</title>
<author>
<name sortKey="Alkhateeb, Abedalrhman" sort="Alkhateeb, Abedalrhman" uniqKey="Alkhateeb A" first="Abedalrhman" last="Alkhateeb">Abedalrhman Alkhateeb</name>
<affiliation>
<nlm:affiliation>School of Computer Science, University of Windsor, Windsor, Canada.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Rueda, Luis" sort="Rueda, Luis" uniqKey="Rueda L" first="Luis" last="Rueda">Luis Rueda</name>
<affiliation>
<nlm:affiliation>School of Computer Science, University of Windsor, Windsor, Canada.</nlm:affiliation>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Journal of computational biology : a journal of computational molecular cell biology</title>
<idno type="eISSN">1557-8666</idno>
<imprint>
<date when="2017" type="published">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Genome, Human</term>
<term>Genomics (methods)</term>
<term>High-Throughput Nucleotide Sequencing (methods)</term>
<term>Humans</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Genomics</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Software</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into the account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">28414515</PMID>
<DateCompleted>
<Year>2018</Year>
<Month>03</Month>
<Day>12</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>11</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1557-8666</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>24</Volume>
<Issue>8</Issue>
<PubDate>
<Year>2017</Year>
<Month>Aug</Month>
</PubDate>
</JournalIssue>
<Title>Journal of computational biology : a journal of computational molecular cell biology</Title>
<ISOAbbreviation>J. Comput. Biol.</ISOAbbreviation>
</Journal>
<ArticleTitle>Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.</ArticleTitle>
<Pagination>
<MedlinePgn>746-755</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1089/cmb.2017.0021</ELocationID>
<Abstract>
<AbstractText>Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into the account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Alkhateeb</LastName>
<ForeName>Abedalrhman</ForeName>
<Initials>A</Initials>
<AffiliationInfo>
<Affiliation>School of Computer Science, University of Windsor, Windsor, Canada.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Rueda</LastName>
<ForeName>Luis</ForeName>
<Initials>L</Initials>
<AffiliationInfo>
<Affiliation>School of Computer Science, University of Windsor, Windsor, Canada.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2017</Year>
<Month>04</Month>
<Day>17</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>J Comput Biol</MedlineTA>
<NlmUniqueID>9433358</NlmUniqueID>
<ISSNLinking>1066-5277</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015894" MajorTopicYN="Y">Genome, Human</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D023281" MajorTopicYN="N">Genomics</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D059014" MajorTopicYN="N">High-Throughput Nucleotide Sequencing</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">RNA-SEQ analysis</Keyword>
<Keyword MajorTopicYN="N">machine learning</Keyword>
<Keyword MajorTopicYN="N">next-generation sequencing</Keyword>
<Keyword MajorTopicYN="N">preprocessing</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2017</Year>
<Month>4</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2018</Year>
<Month>3</Month>
<Day>13</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2017</Year>
<Month>4</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">28414515</ArticleId>
<ArticleId IdType="doi">10.1089/cmb.2017.0021</ArticleId>
<ArticleId IdType="pmc">PMC5563921</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>J Mol Diagn. 2003 May;5(2):73-81</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12707371</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Evol Biol. 2008 Mar 27;8:99</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18371205</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2011 May 31;108(22):9172-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21571633</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Methods. 2008 Dec;5(12):1005-10</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19034268</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2013 Apr 25;14(4):R36</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23618408</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>G3 (Bethesda). 2014 Mar 13;4(5):871-83</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24626288</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Diagn Mol Pathol. 1997 Aug;6(4):209-21</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">9360842</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS One. 2013 Apr 29;8(4):e62856</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23638157</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2011 May 15;29(7):644-52</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21572440</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Comput Biol. 2006 Jun;13(5):1028-40</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16796549</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jan 15;30(2):165-71</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24255646</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS Pathog. 2009 Oct;5(10):e1000644</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19898609</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Mar 15;27(6):863-4</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21278185</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2006 Jan 31;34(2):564-74</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16449200</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Int J Cancer. 2003 May 20;105(1):26-32</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12672026</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Protoc. 2012 Mar 01;7(3):562-78</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22383036</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Expert Rev Mol Diagn. 2016 Sep;16(9):1011-23</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">27453996</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2011 Jul;21(7):1028-41</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21724842</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Eukaryot Microbiol. 1999 May-Jun;46(3):239-47</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10377985</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2005 Sep 15;437(7057):376-80</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16056220</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 1997 Sep 1;25(17):3389-402</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">9254694</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>
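
As a hedged illustration of how the `<pubmed>` part of the record above could be read programmatically, the following Python sketch extracts a few fields with the standard library. The `summarize` function and the `SAMPLE` string are hypothetical; `SAMPLE` is a shortened copy of elements shown in the record. The sketch deliberately reads only the `<pubmed>` subtree: as displayed here, the TEI part uses the `nlm:` and `wicri:` prefixes without visible namespace declarations, so a strict XML parser would likely reject the full record unless those declarations are supplied.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <pubmed> element from the record (illustrative only).
SAMPLE = """<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">28414515</PMID>
<Article PubModel="Print-Electronic">
<ArticleTitle>Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.</ArticleTitle>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>Alkhateeb</LastName><ForeName>Abedalrhman</ForeName></Author>
<Author ValidYN="Y"><LastName>Rueda</LastName><ForeName>Luis</ForeName></Author>
</AuthorList>
</Article>
<MeshHeadingList>
<MeshHeading><DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName></MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="pubmed">28414515</ArticleId>
<ArticleId IdType="doi">10.1089/cmb.2017.0021</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>"""


def summarize(pubmed_xml):
    """Return PMID, title, DOI, authors, and major MeSH descriptors."""
    root = ET.fromstring(pubmed_xml)
    citation = root.find("MedlineCitation")
    authors = [
        f"{author.findtext('ForeName')} {author.findtext('LastName')}"
        for author in citation.iter("Author")
    ]
    major_topics = [
        descriptor.text
        for descriptor in citation.iter("DescriptorName")
        if descriptor.get("MajorTopicYN") == "Y"
    ]
    # Direct-child path, so ArticleIds nested under ReferenceList are ignored.
    doi = root.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']")
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "doi": doi,
        "authors": authors,
        "major_topics": major_topics,
    }
```

Calling `summarize(SAMPLE)` returns a dictionary keyed by `pmid`, `title`, `doi`, `authors`, and `major_topics`. The DOI lookup uses a direct-child path so that the `ArticleId` entries of cited references in `ReferenceList` are not picked up by mistake.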

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D17 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 000D17 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:28414515
   |texte=   Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:28414515" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021