MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

An efficient classification algorithm for NGS data based on text similarity.

Internal identifier: 000780 (PubMed/Corpus); previous: 000779; next: 000781

Authors: Xiangyu Liao; Xingyu Liao; Wufei Zhu; Lu Fang; Xing Chen

Source: Genetics research [1469-5073]; 2018.

RBID : pubmed:30221607

English descriptors

Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time consumption. Clustering can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data, based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers and forms a unique k-mer set by merging duplicated and reverse-complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until no text satisfies the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC clustered 100 million short reads within 2 hours and that it has excellent performance in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler are greatly reduced. HSC can also help the assembler achieve better assemblies in terms of N50, NA50 and genome fraction.
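
The five steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the similarity measure, so Jaccard similarity on read-ID sets is an assumption here, as are the function names, the default `k`, the default `threshold`, and the restriction to uppercase A/C/G/T reads. The merge loop is a naive quadratic scan chosen for readability, not for the performance the paper reports.

```python
from collections import defaultdict

# Complement table; assumes reads contain only uppercase A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: map each unique canonical k-mer (key) to the IDs of reads containing it (value)."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: shared read IDs divided by all read IDs."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=4, threshold=0.5):
    """Steps 3-5: turn hash units into texts, merge similar texts, return clusters."""
    # Step 3: each hash unit becomes a "text", here a set of read IDs (duplicates dropped).
    texts = list({frozenset(ids) for ids in build_kmer_table(reads, k).values()})

    # Step 4: merge any pair that meets the threshold; repeat until no pair qualifies.
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    texts[i] = texts[i] | texts[j]
                    del texts[j]
                    merged = True
                    break
            if merged:
                break

    # Step 5: each remaining long text is reported as a cluster of read IDs.
    return sorted(sorted(text) for text in texts)


if __name__ == "__main__":
    example_reads = ["AAAAC", "AAACC", "AACCC"]
    print(cluster_reads(example_reads, k=4, threshold=0.3))
```

In this sketch a read can appear in more than one cluster when two texts overlap but stay below the threshold; how HSC resolves such cases is not stated in the abstract.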

DOI: 10.1017/S0016672318000058
PubMed: 30221607

Links to Exploration step

pubmed:30221607

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">An efficient classification algorithm for NGS data based on text similarity.</title>
<author>
<name sortKey="Liao, Xiangyu" sort="Liao, Xiangyu" uniqKey="Liao X" first="Xiangyu" last="Liao">Xiangyu Liao</name>
<affiliation>
<nlm:affiliation>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Liao, Xingyu" sort="Liao, Xingyu" uniqKey="Liao X" first="Xingyu" last="Liao">Xingyu Liao</name>
<affiliation>
<nlm:affiliation>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Zhu, Wufei" sort="Zhu, Wufei" uniqKey="Zhu W" first="Wufei" last="Zhu">Wufei Zhu</name>
<affiliation>
<nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Fang, Lu" sort="Fang, Lu" uniqKey="Fang L" first="Lu" last="Fang">Lu Fang</name>
<affiliation>
<nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Chen, Xing" sort="Chen, Xing" uniqKey="Chen X" first="Xing" last="Chen">Xing Chen</name>
<affiliation>
<nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2018">2018</date>
<idno type="RBID">pubmed:30221607</idno>
<idno type="pmid">30221607</idno>
<idno type="doi">10.1017/S0016672318000058</idno>
<idno type="wicri:Area/PubMed/Corpus">000780</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000780</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">An efficient classification algorithm for NGS data based on text similarity.</title>
<author>
<name sortKey="Liao, Xiangyu" sort="Liao, Xiangyu" uniqKey="Liao X" first="Xiangyu" last="Liao">Xiangyu Liao</name>
<affiliation>
<nlm:affiliation>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Liao, Xingyu" sort="Liao, Xingyu" uniqKey="Liao X" first="Xingyu" last="Liao">Xingyu Liao</name>
<affiliation>
<nlm:affiliation>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Zhu, Wufei" sort="Zhu, Wufei" uniqKey="Zhu W" first="Wufei" last="Zhu">Wufei Zhu</name>
<affiliation>
<nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Fang, Lu" sort="Fang, Lu" uniqKey="Fang L" first="Lu" last="Fang">Lu Fang</name>
<affiliation>
<nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Chen, Xing" sort="Chen, Xing" uniqKey="Chen X" first="Xing" last="Chen">Xing Chen</name>
<affiliation>
<nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Genetics research</title>
<idno type="eISSN">1469-5073</idno>
<imprint>
<date when="2018" type="published">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Data Analysis</term>
<term>Datasets as Topic</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Humans</term>
<term>Mycobacterium abscessus (genetics)</term>
<term>Rhodobacter sphaeroides (genetics)</term>
<term>Vibrio cholerae (genetics)</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Mycobacterium abscessus</term>
<term>Rhodobacter sphaeroides</term>
<term>Vibrio cholerae</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Data Analysis</term>
<term>Datasets as Topic</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Humans</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has now begun to greatly challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time consumption. In addition, it also has good performance on reducing dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on efficient hash function and text similarity. First, HSC converts all reads into k-mers, then it forms a unique k-mer set by merging the duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field, and the ID of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text, the merge operation is executed iteratively until there is no text that satisfies the merge condition. Finally, the long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC cluster 100 million short reads within 2 hours, and it has excellent performance in reducing memory consumption. Compared to existing methods, HSC is much faster than other tools, it can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced. It can help the assembler to achieve better assemblies in terms of N50, NA50 and genome fraction.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">30221607</PMID>
<DateCompleted>
<Year>2018</Year>
<Month>12</Month>
<Day>27</Day>
</DateCompleted>
<DateRevised>
<Year>2019</Year>
<Month>12</Month>
<Day>10</Day>
</DateRevised>
<Article PubModel="Electronic">
<Journal>
<ISSN IssnType="Electronic">1469-5073</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>100</Volume>
<PubDate>
<Year>2018</Year>
<Month>09</Month>
<Day>17</Day>
</PubDate>
</JournalIssue>
<Title>Genetics research</Title>
<ISOAbbreviation>Genet Res (Camb)</ISOAbbreviation>
</Journal>
<ArticleTitle>An efficient classification algorithm for NGS data based on text similarity.</ArticleTitle>
<Pagination>
<MedlinePgn>e8</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1017/S0016672318000058</ELocationID>
<Abstract>
<AbstractText>With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has now begun to greatly challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time consumption. In addition, it also has good performance on reducing dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on efficient hash function and text similarity. First, HSC converts all reads into k-mers, then it forms a unique k-mer set by merging the duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field, and the ID of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text, the merge operation is executed iteratively until there is no text that satisfies the merge condition. Finally, the long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC cluster 100 million short reads within 2 hours, and it has excellent performance in reducing memory consumption. Compared to existing methods, HSC is much faster than other tools, it can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced. It can help the assembler to achieve better assemblies in terms of N50, NA50 and genome fraction.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Liao</LastName>
<ForeName>Xiangyu</ForeName>
<Initials>X</Initials>
<AffiliationInfo>
<Affiliation>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Liao</LastName>
<ForeName>Xingyu</ForeName>
<Initials>X</Initials>
<AffiliationInfo>
<Affiliation>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083,P.R. China.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Zhu</LastName>
<ForeName>Wufei</ForeName>
<Initials>W</Initials>
<AffiliationInfo>
<Affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Fang</LastName>
<ForeName>Lu</ForeName>
<Initials>L</Initials>
<AffiliationInfo>
<Affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Chen</LastName>
<ForeName>Xing</ForeName>
<Initials>X</Initials>
<AffiliationInfo>
<Affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D023362">Evaluation Study</PublicationType>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2018</Year>
<Month>09</Month>
<Day>17</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>Genet Res (Camb)</MedlineTA>
<NlmUniqueID>101550220</NlmUniqueID>
<ISSNLinking>0016-6723</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000078332" MajorTopicYN="N">Data Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D066264" MajorTopicYN="N">Datasets as Topic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D059014" MajorTopicYN="Y">High-Throughput Nucleotide Sequencing</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000073358" MajorTopicYN="N">Mycobacterium abscessus</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012242" MajorTopicYN="N">Rhodobacter sphaeroides</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014734" MajorTopicYN="N">Vibrio cholerae</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="Y">NGS sequences data</Keyword>
<Keyword MajorTopicYN="Y">clustering</Keyword>
<Keyword MajorTopicYN="Y">text similarity</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2018</Year>
<Month>9</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2018</Year>
<Month>9</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2018</Year>
<Month>12</Month>
<Day>28</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">30221607</ArticleId>
<ArticleId IdType="pii">S0016672318000058</ArticleId>
<ArticleId IdType="doi">10.1017/S0016672318000058</ArticleId>
<ArticleId IdType="pmc">PMC6865153</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2006 Jul 1;22(13):1658-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16731699</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Comput Biol Med. 2018 Feb 1;93:66-74</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">29288886</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>BMC Bioinformatics. 2011 Jun 30;12:271</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21718538</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Brief Bioinform. 2012 Nov;13(6):656-68</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22772836</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Brief Bioinform. 2018 Jan 1;19(1):23-40</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">27742661</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>IEEE Trans Image Process. 2017 Jan;26(1):107-118</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">27775517</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Gigascience. 2012 Dec 27;1(1):18</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23587118</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2011 Sep 15;27(18):2502-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21810899</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>J Bioinform Comput Biol. 2017 Dec;15(6):1740006</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">29113561</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2010 Oct 1;26(19):2460-1</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20709691</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Nat Rev Microbiol. 2008 Jun;6(6):419-30</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18475305</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2015 May 15;31(10):1569-76</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25609798</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2013 Mar 1;29(5):652-3</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23325618</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Nat Methods. 2012 Mar 04;9(4):357-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22388286</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2017 Mar 15;33(6):834-842</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">28025198</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2013 Apr 15;29(8):1072-5</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23422339</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2011 Jan 25;108(4):1513-8</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21187386</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Methods. 2015 Jun;79-80:52-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25448477</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2010 Mar 1;26(5):680-2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20053844</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2011 Mar 15;27(6):764-70</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21217122</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
<ReferenceList>
<Reference>
<Citation>Genome Biol. 2010;11(11):R116</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21114842</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000780 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 000780 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:30221607
   |texte=   An efficient classification algorithm for NGS data based on text similarity.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:30221607" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021