MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Robust k-mer frequency estimation using gapped k-mers.

Internal identifier: 001C55 (PubMed/Curation); previous: 001C54; next: 001C56

Authors: Mahmoud Ghandi [United States]; Morteza Mohammad-Noori; Michael A. Beer

Source: Journal of mathematical biology, 2014

RBID : pubmed:23861010

English descriptors: Binding Sites; DNA (chemistry); Genome, Human; Humans; Transcription Factors (chemistry)

Abstract

Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
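
The sparsity problem described in the abstract can be illustrated with a small counting sketch. This is only an illustration under assumptions, not the authors' estimator: the function names, the `length`/`k` parameterization (windows of `length` bases with `k` kept positions), the `-` gap symbol, and the toy sequence are all hypothetical choices made here. It shows that ungapped 6-mers in a short sequence are each seen once, while some gapped patterns already recur.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_gapped_kmers(seq, length, k, gap="-"):
    """Count gapped k-mers: for each window of `length` bases, keep k
    positions and mask the remaining length - k positions with `gap`."""
    counts = Counter()
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for kept in combinations(range(length), k):
            kept_set = set(kept)
            masked = "".join(
                base if j in kept_set else gap for j, base in enumerate(window)
            )
            counts[masked] += 1
    return counts


seq = "ACGTACGTTGCA"  # toy sequence chosen for illustration, not real data
full = count_kmers(seq, 6)              # ungapped 6-mers
gapped = count_gapped_kmers(seq, 6, 4)  # 4 kept positions in windows of 6

print("distinct 6-mers:", len(full), "max count:", max(full.values()))
print("count of ACGT--:", gapped["ACGT--"])
```

In this toy case every ungapped 6-mer occurs once, whereas the gapped pattern `ACGT--` is observed twice. This is the kind of pooled evidence that gapped counts can provide when estimating ungapped frequencies.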

DOI: 10.1007/s00285-013-0705-3
PubMed: 23861010

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Robust k-mer frequency estimation using gapped k-mers.</title>
<author>
<name sortKey="Ghandi, Mahmoud" sort="Ghandi, Mahmoud" uniqKey="Ghandi M" first="Mahmoud" last="Ghandi">Mahmoud Ghandi</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA, ghandi@jhmi.edu.</nlm:affiliation>
<country wicri:rule="url">États-Unis</country>
<wicri:regionArea>Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Mohammad Noori, Morteza" sort="Mohammad Noori, Morteza" uniqKey="Mohammad Noori M" first="Morteza" last="Mohammad-Noori">Morteza Mohammad-Noori</name>
</author>
<author>
<name sortKey="Beer, Michael A" sort="Beer, Michael A" uniqKey="Beer M" first="Michael A" last="Beer">Michael A. Beer</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2014">2014</date>
<idno type="RBID">pubmed:23861010</idno>
<idno type="pmid">23861010</idno>
<idno type="doi">10.1007/s00285-013-0705-3</idno>
<idno type="wicri:Area/PubMed/Corpus">001C55</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001C55</idno>
<idno type="wicri:Area/PubMed/Curation">001C55</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001C55</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Robust k-mer frequency estimation using gapped k-mers.</title>
<author>
<name sortKey="Ghandi, Mahmoud" sort="Ghandi, Mahmoud" uniqKey="Ghandi M" first="Mahmoud" last="Ghandi">Mahmoud Ghandi</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA, ghandi@jhmi.edu.</nlm:affiliation>
<country wicri:rule="url">États-Unis</country>
<wicri:regionArea>Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Mohammad Noori, Morteza" sort="Mohammad Noori, Morteza" uniqKey="Mohammad Noori M" first="Morteza" last="Mohammad-Noori">Morteza Mohammad-Noori</name>
</author>
<author>
<name sortKey="Beer, Michael A" sort="Beer, Michael A" uniqKey="Beer M" first="Michael A" last="Beer">Michael A. Beer</name>
</author>
</analytic>
<series>
<title level="j">Journal of mathematical biology</title>
<idno type="eISSN">1432-1416</idno>
<imprint>
<date when="2014" type="published">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Binding Sites</term>
<term>DNA (chemistry)</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Transcription Factors (chemistry)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>ADN ()</term>
<term>Facteurs de transcription ()</term>
<term>Génome humain</term>
<term>Humains</term>
<term>Sites de fixation</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="chemistry" xml:lang="en">
<term>DNA</term>
<term>Transcription Factors</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Binding Sites</term>
<term>Genome, Human</term>
<term>Humans</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>ADN</term>
<term>Facteurs de transcription</term>
<term>Génome humain</term>
<term>Humains</term>
<term>Sites de fixation</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">23861010</PMID>
<DateCompleted>
<Year>2015</Year>
<Month>03</Month>
<Day>30</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>11</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1432-1416</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>69</Volume>
<Issue>2</Issue>
<PubDate>
<Year>2014</Year>
<Month>Aug</Month>
</PubDate>
</JournalIssue>
<Title>Journal of mathematical biology</Title>
<ISOAbbreviation>J Math Biol</ISOAbbreviation>
</Journal>
<ArticleTitle>Robust k-mer frequency estimation using gapped k-mers.</ArticleTitle>
<Pagination>
<MedlinePgn>469-500</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1007/s00285-013-0705-3</ELocationID>
<Abstract>
<AbstractText>Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Ghandi</LastName>
<ForeName>Mahmoud</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA, ghandi@jhmi.edu.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Mohammad-Noori</LastName>
<ForeName>Morteza</ForeName>
<Initials>M</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Beer</LastName>
<ForeName>Michael A</ForeName>
<Initials>MA</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<GrantList CompleteYN="Y">
<Grant>
<GrantID>R01 HG007348</GrantID>
<Acronym>HG</Acronym>
<Agency>NHGRI NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>R01 NS062972</GrantID>
<Acronym>NS</Acronym>
<Agency>NINDS NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>NS062972</GrantID>
<Acronym>NS</Acronym>
<Agency>NINDS NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D052061">Research Support, N.I.H., Extramural</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2013</Year>
<Month>07</Month>
<Day>17</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>Germany</Country>
<MedlineTA>J Math Biol</MedlineTA>
<NlmUniqueID>7502105</NlmUniqueID>
<ISSNLinking>0303-6812</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D014157">Transcription Factors</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>9007-49-2</RegistryNumber>
<NameOfSubstance UI="D004247">DNA</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D001665" MajorTopicYN="N">Binding Sites</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D004247" MajorTopicYN="N">DNA</DescriptorName>
<QualifierName UI="Q000737" MajorTopicYN="Y">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015894" MajorTopicYN="Y">Genome, Human</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014157" MajorTopicYN="N">Transcription Factors</DescriptorName>
<QualifierName UI="Q000737" MajorTopicYN="Y">chemistry</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2012</Year>
<Month>12</Month>
<Day>04</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="revised">
<Year>2013</Year>
<Month>06</Month>
<Day>09</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2013</Year>
<Month>7</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2013</Year>
<Month>7</Month>
<Day>19</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2015</Year>
<Month>3</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">23861010</ArticleId>
<ArticleId IdType="doi">10.1007/s00285-013-0705-3</ArticleId>
<ArticleId IdType="pmc">PMC3895138</ArticleId>
<ArticleId IdType="mid">NIHMS506740</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2000 Jan;16(1):16-23</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10812473</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2004 Feb 12;20(3):399-406</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">14764560</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2004 Mar 1;20(4):467-76</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">14990442</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Cell. 2004 Apr 16;117(2):185-98</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15084257</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2004 Oct 28;5:169</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15511290</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2005;6(2):R18</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15693947</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2012 Mar 1;28(5):656-63</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22247280</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2006 Jul 15;22(14):e472-80</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16873509</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2007;8 Suppl 10:S7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18269701</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS Comput Biol. 2008 Oct;4(10):e1000173</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18974822</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Dev Cell. 2009 Oct;17(4):568-79</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19853570</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2011 Mar;21(3):456-64</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21106903</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2011 Dec;21(12):2167-80</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21875935</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2005 Mar 17;434(7031):338-45</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15735639</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>
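
As a hedged illustration of reading this record programmatically, the sketch below extracts a few fields from the `<pubmed>` part using Python's standard `xml.etree.ElementTree`. The `summarize` helper and the trimmed `PUBMED_XML` excerpt are assumptions made for this example and are not part of Dilib. The TEI part is deliberately left out: as dumped here it uses the `wicri:` and `nlm:` prefixes without namespace declarations, so a strict XML parser would be expected to reject it unless those prefixes are declared first.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the <pubmed> element from the record above.
PUBMED_XML = """<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">23861010</PMID>
<Article PubModel="Print-Electronic">
<ArticleTitle>Robust k-mer frequency estimation using gapped k-mers.</ArticleTitle>
<ELocationID EIdType="doi" ValidYN="Y">10.1007/s00285-013-0705-3</ELocationID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>Ghandi</LastName><ForeName>Mahmoud</ForeName></Author>
<Author ValidYN="Y"><LastName>Mohammad-Noori</LastName><ForeName>Morteza</ForeName></Author>
<Author ValidYN="Y"><LastName>Beer</LastName><ForeName>Michael A</ForeName></Author>
</AuthorList>
</Article>
</MedlineCitation>
</pubmed>"""


def summarize(xml_text):
    """Return PMID, title, DOI and author names from a <pubmed> element."""
    root = ET.fromstring(xml_text)
    article = root.find("MedlineCitation/Article")
    authors = [
        f"{a.findtext('ForeName')} {a.findtext('LastName')}"
        for a in article.iterfind("AuthorList/Author")
    ]
    return {
        "pmid": root.findtext("MedlineCitation/PMID"),
        "title": article.findtext("ArticleTitle"),
        "doi": article.findtext("ELocationID[@EIdType='doi']"),
        "authors": authors,
    }


record = summarize(PUBMED_XML)
print(record["pmid"], record["doi"])
print("; ".join(record["authors"]))
```

The same element paths should apply to the full `<pubmed>` subtree, since the excerpt only omits sibling elements.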

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001C55 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 001C55 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:23861010
   |texte=   Robust k-mer frequency estimation using gapped k-mers.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:23861010" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021