MersV1, PubMed, Checkpoint, bibRecord, 000495

Locality-sensitive hashing for the edit distance.

Identifieur interne : 000495 ( PubMed/Checkpoint ); précédent : 000494; suivant : 000496

Locality-sensitive hashing for the edit distance.

Auteurs : Guillaume Marçais [États-Unis] ; Dan Deblasio [États-Unis] ; Prashant Pandey [États-Unis] ; Carl Kingsford [États-Unis]

Source :

Bioinformatics (Oxford, England) [ 1367-4811 ] ; 2019.

RBID : pubmed:31510667

Abstract

Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.

DOI: 10.1093/bioinformatics/btz354
PubMed: 31510667

Affiliations:

Links toward previous steps (curation, corpus...)

to stream PubMed, to step Corpus: 000422
to stream PubMed, to step Curation: 000422

Links to Exploration step

pubmed:31510667

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Locality-sensitive hashing for the edit distance.</title>
<author><name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Deblasio, Dan" sort="Deblasio, Dan" uniqKey="Deblasio D" first="Dan" last="Deblasio">Dan Deblasio</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Pandey, Prashant" sort="Pandey, Prashant" uniqKey="Pandey P" first="Prashant" last="Pandey">Prashant Pandey</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2019">2019</date>
<idno type="RBID">pubmed:31510667</idno>
<idno type="pmid">31510667</idno>
<idno type="doi">10.1093/bioinformatics/btz354</idno>
<idno type="wicri:Area/PubMed/Corpus">000422</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000422</idno>
<idno type="wicri:Area/PubMed/Curation">000422</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000422</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000495</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000495</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Locality-sensitive hashing for the edit distance.</title>
<author><name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Deblasio, Dan" sort="Deblasio, Dan" uniqKey="Deblasio D" first="Dan" last="Deblasio">Dan Deblasio</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Pandey, Prashant" sort="Pandey, Prashant" uniqKey="Pandey P" first="Prashant" last="Pandey">Prashant Pandey</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="4"><nlm:affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j">Bioinformatics (Oxford, England)</title>
<idno type="eISSN">1367-4811</idno>
<imprint><date when="2019" type="published">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.</div>
</front>
</TEI>
<pubmed><MedlineCitation Status="In-Data-Review" Owner="NLM"><PMID Version="1">31510667</PMID>
<DateRevised><Year>2020</Year>
<Month>03</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Print"><Journal><ISSN IssnType="Electronic">1367-4811</ISSN>
<JournalIssue CitedMedium="Internet"><Volume>35</Volume>
<Issue>14</Issue>
<PubDate><Year>2019</Year>
<Month>Jul</Month>
<Day>15</Day>
</PubDate>
</JournalIssue>
<Title>Bioinformatics (Oxford, England)</Title>
<ISOAbbreviation>Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>Locality-sensitive hashing for the edit distance.</ArticleTitle>
<Pagination><MedlinePgn>i127-i135</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1093/bioinformatics/btz354</ELocationID>
<Abstract><AbstractText Label="MOTIVATION" NlmCategory="BACKGROUND">Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.</AbstractText>
<AbstractText Label="AVAILABILITY AND IMPLEMENTATION" NlmCategory="METHODS">The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019.</AbstractText>
<AbstractText Label="SUPPLEMENTARY INFORMATION" NlmCategory="BACKGROUND">Supplementary data are available at Bioinformatics online.</AbstractText>
<CopyrightInformation>© The Author(s) 2019. Published by Oxford University Press.</CopyrightInformation>
</Abstract>
<AuthorList CompleteYN="Y"><Author ValidYN="Y"><LastName>Marçais</LastName>
<ForeName>Guillaume</ForeName>
<Initials>G</Initials>
<AffiliationInfo><Affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>DeBlasio</LastName>
<ForeName>Dan</ForeName>
<Initials>D</Initials>
<AffiliationInfo><Affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Pandey</LastName>
<ForeName>Prashant</ForeName>
<Initials>P</Initials>
<AffiliationInfo><Affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Kingsford</LastName>
<ForeName>Carl</ForeName>
<Initials>C</Initials>
<AffiliationInfo><Affiliation>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<GrantList CompleteYN="Y"><Grant><GrantID>R01 GM122935</GrantID>
<Acronym>GM</Acronym>
<Agency>NIGMS NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
<PublicationTypeList><PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo><Country>England</Country>
<MedlineTA>Bioinformatics</MedlineTA>
<NlmUniqueID>9808944</NlmUniqueID>
<ISSNLinking>1367-4803</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
</MedlineCitation>
<PubmedData><History><PubMedPubDate PubStatus="entrez"><Year>2019</Year>
<Month>9</Month>
<Day>13</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed"><Year>2019</Year>
<Month>9</Month>
<Day>13</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline"><Year>2019</Year>
<Month>9</Month>
<Day>13</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList><ArticleId IdType="pubmed">31510667</ArticleId>
<ArticleId IdType="pii">5529166</ArticleId>
<ArticleId IdType="doi">10.1093/bioinformatics/btz354</ArticleId>
<ArticleId IdType="pmc">PMC6612865</ArticleId>
</ArticleIdList>
<ReferenceList><Reference><Citation>Science. 2000 Mar 24;287(5461):2196-204</Citation>
<ArticleIdList><ArticleId IdType="pubmed">10731133</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Genome Res. 2003 Jan;13(1):91-6</Citation>
<ArticleIdList><ArticleId IdType="pubmed">12529310</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Bioinformatics. 2010 Mar 1;26(5):589-95</Citation>
<ArticleIdList><ArticleId IdType="pubmed">20080505</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Bioinformatics. 2012 Mar 15;28(6):878-9</Citation>
<ArticleIdList><ArticleId IdType="pubmed">22285832</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Nat Methods. 2012 Mar 04;9(4):357-9</Citation>
<ArticleIdList><ArticleId IdType="pubmed">22388286</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>PLoS One. 2013 Dec 04;8(12):e82138</Citation>
<ArticleIdList><ArticleId IdType="pubmed">24324759</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Nat Biotechnol. 2015 Jun;33(6):623-30</Citation>
<ArticleIdList><ArticleId IdType="pubmed">26006009</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Genome Biol. 2016 Jun 20;17(1):132</Citation>
<ArticleIdList><ArticleId IdType="pubmed">27323842</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Nucleic Acids Res. 2016 Sep 6;44(15):7109-19</Citation>
<ArticleIdList><ArticleId IdType="pubmed">27431326</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>PLoS Comput Biol. 2018 Jan 26;14(1):e1005944</Citation>
<ArticleIdList><ArticleId IdType="pubmed">29373581</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Pennsylvanie</li>
</region>
<settlement><li>Pittsburgh</li>
</settlement>
<orgName><li>Université Carnegie-Mellon</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Pennsylvanie"><name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
</region>
<name sortKey="Deblasio, Dan" sort="Deblasio, Dan" uniqKey="Deblasio D" first="Dan" last="Deblasio">Dan Deblasio</name>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<name sortKey="Pandey, Prashant" sort="Pandey, Prashant" uniqKey="Pandey P" first="Prashant" last="Pandey">Prashant Pandey</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Checkpoint

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000495 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd -nk 000495 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Checkpoint
   |type=    RBID
   |clé=     pubmed:31510667
   |texte=   Locality-sensitive hashing for the edit distance.
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/RBID.i   -Sk "pubmed:31510667" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	Serveur d'exploration MERS
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration MERS

Locality-sensitive hashing for the edit distance.

Locality-sensitive hashing for the edit distance.

Source :

Abstract

Links toward previous steps (curation, corpus...)

Links to Exploration step

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri

Pour générer des pages wiki