MERS exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A k-mer grammar analysis to uncover maize regulatory architecture

Internal identifier: 000328 (Pmc/Corpus); previous: 000327; next: 000329

Authors: María Katherine Mejía-Guerra; Edward S. Buckler

Source: BMC Plant Biology (2019)

RBID : PMC:6419808

Abstract

Background

Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging problem that limits the use of the data from one line to the other. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified.

Results

We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features - k-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) “bag-of-words” which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-k-mers), that captures semantic and linguistic relationships between words. We built “bag-of-k-mers” and “vector-k-mers” models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our “bag-of-k-mers” achieved higher overall accuracy, while the “vector-k-mers” models were more useful in highlighting key groups of sequences within the regulatory regions.
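
As a rough illustration of the “bag-of-k-mers” idea described above, the following Python sketch splits a sequence into overlapping k-mers, counts them without regard to position, and applies a TF-IDF-style weight. The function names, the toy sequences, and the choice of TF-IDF are assumptions made for illustration; the abstract does not specify the authors' implementation or weighting scheme.

```python
from collections import Counter
import math


def kmers(sequence, k):
    """Return the overlapping k-mers of a DNA sequence, in order."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence, k):
    """Count k-mer occurrences, ignoring their positions (the 'bag')."""
    return Counter(kmers(sequence, k))


def tfidf(bags):
    """Weight each k-mer count by its inverse document frequency.

    Assumption: plain TF-IDF (count * log(N / document frequency)) stands in
    for the 'differential weighting' mentioned in the abstract.
    """
    n = len(bags)
    # Iterating over a Counter yields each distinct k-mer once per sequence.
    doc_freq = Counter(kmer for bag in bags for kmer in bag)
    return [
        {kmer: count * math.log(n / doc_freq[kmer]) for kmer, count in bag.items()}
        for bag in bags
    ]


# Toy sequences, invented for illustration only.
sequences = ["ACGTACGT", "TTTTACGA"]
bags = [bag_of_kmers(s, 3) for s in sequences]
weights = tfidf(bags)
```

In this toy example the 3-mer ACG occurs in both sequences, so its weight is zero, while CGT occurs only in the first sequence and keeps a positive weight. A classifier for regulatory versus non-regulatory regions would then be trained on such weighted vectors.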

Conclusions

These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.

Electronic supplementary material

The online version of this article (10.1186/s12870-019-1693-2) contains supplementary material, which is available to authorized users.


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419808
DOI: 10.1186/s12870-019-1693-2
PubMed: 30876396
PubMed Central: 6419808

Links to Exploration step

PMC:6419808

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A
<italic>k</italic>
-mer grammar analysis to uncover maize regulatory architecture</title>
<author>
<name sortKey="Mejia Guerra, Maria Katherine" sort="Mejia Guerra, Maria Katherine" uniqKey="Mejia Guerra M" first="María Katherine" last="Mejía-Guerra">María Katherine Mejía-Guerra</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Institute for Genomic Diversity, Cornell University,</institution>
</institution-wrap>
175 Biotechnology Building, Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Buckler, Edward S" sort="Buckler, Edward S" uniqKey="Buckler E" first="Edward S." last="Buckler">Edward S. Buckler</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Institute for Genomic Diversity, Cornell University,</institution>
</institution-wrap>
175 Biotechnology Building, Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0404 0958</institution-id>
<institution-id institution-id-type="GRID">grid.463419.d</institution-id>
<institution>USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center,</institution>
</institution-wrap>
Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Department of Plant Breeding and Genetics, Cornell University,</institution>
</institution-wrap>
159 Biotechnology Building, Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30876396</idno>
<idno type="pmc">6419808</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419808</idno>
<idno type="RBID">PMC:6419808</idno>
<idno type="doi">10.1186/s12870-019-1693-2</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000328</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000328</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A
<italic>k</italic>
-mer grammar analysis to uncover maize regulatory architecture</title>
<author>
<name sortKey="Mejia Guerra, Maria Katherine" sort="Mejia Guerra, Maria Katherine" uniqKey="Mejia Guerra M" first="María Katherine" last="Mejía-Guerra">María Katherine Mejía-Guerra</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Institute for Genomic Diversity, Cornell University,</institution>
</institution-wrap>
175 Biotechnology Building, Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Buckler, Edward S" sort="Buckler, Edward S" uniqKey="Buckler E" first="Edward S." last="Buckler">Edward S. Buckler</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Institute for Genomic Diversity, Cornell University,</institution>
</institution-wrap>
175 Biotechnology Building, Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0404 0958</institution-id>
<institution-id institution-id-type="GRID">grid.463419.d</institution-id>
<institution>USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center,</institution>
</institution-wrap>
Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Department of Plant Breeding and Genetics, Cornell University,</institution>
</institution-wrap>
159 Biotechnology Building, Ithaca, 14853 NY USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Plant Biology</title>
<idno type="eISSN">1471-2229</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging problem that limits the use of the data from one line to the other. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified.</p>
</sec>
<sec>
<title>Results</title>
<p>We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features -
<italic>k</italic>
-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) “bag-of-words” which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-
<italic>k</italic>
-mers), that captures semantic and linguistic relationships between words. We built “bag-of-
<italic>k</italic>
-mers” and “vector-
<italic>k</italic>
-mers” models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our “bag-of-
<italic>k</italic>
-mers” achieved higher overall accuracy, while the “vector-
<italic>k</italic>
-mers” models were more useful in highlighting key groups of sequences within the regulatory regions.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12870-019-1693-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Wallace, Jg" uniqKey="Wallace J">JG Wallace</name>
</author>
<author>
<name sortKey="Bradbury, Pj" uniqKey="Bradbury P">PJ Bradbury</name>
</author>
<author>
<name sortKey="Zhang, N" uniqKey="Zhang N">N Zhang</name>
</author>
<author>
<name sortKey="Gibon, Y" uniqKey="Gibon Y">Y Gibon</name>
</author>
<author>
<name sortKey="Stitt, M" uniqKey="Stitt M">M Stitt</name>
</author>
<author>
<name sortKey="Buckler, Es" uniqKey="Buckler E">ES Buckler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
<author>
<name sortKey="Luo, X" uniqKey="Luo X">X Luo</name>
</author>
<author>
<name sortKey="Niu, L" uniqKey="Niu L">L Niu</name>
</author>
<author>
<name sortKey="Xiao, Y" uniqKey="Xiao Y">Y Xiao</name>
</author>
<author>
<name sortKey="Chen, L" uniqKey="Chen L">L Chen</name>
</author>
<author>
<name sortKey="Liu, J" uniqKey="Liu J">J Liu</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Jin, M" uniqKey="Jin M">M Jin</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
<author>
<name sortKey="Yan, J" uniqKey="Yan J">J Yan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rodgers Melnick, E" uniqKey="Rodgers Melnick E">E Rodgers-Melnick</name>
</author>
<author>
<name sortKey="Vera, Dl" uniqKey="Vera D">DL Vera</name>
</author>
<author>
<name sortKey="Bass, Hw" uniqKey="Bass H">HW Bass</name>
</author>
<author>
<name sortKey="Buckler, Es" uniqKey="Buckler E">ES Buckler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lu, F" uniqKey="Lu F">F Lu</name>
</author>
<author>
<name sortKey="Romay, Mc" uniqKey="Romay M">MC Romay</name>
</author>
<author>
<name sortKey="Glaubitz, Jc" uniqKey="Glaubitz J">JC Glaubitz</name>
</author>
<author>
<name sortKey="Bradbury, Pj" uniqKey="Bradbury P">PJ Bradbury</name>
</author>
<author>
<name sortKey="Elshire, Rj" uniqKey="Elshire R">RJ Elshire</name>
</author>
<author>
<name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Semagn, K" uniqKey="Semagn K">K Semagn</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Hernandez, Ag" uniqKey="Hernandez A">AG Hernandez</name>
</author>
<author>
<name sortKey="Mikel, Ma" uniqKey="Mikel M">MA Mikel</name>
</author>
<author>
<name sortKey="Soifer, I" uniqKey="Soifer I">I Soifer</name>
</author>
<author>
<name sortKey="Barad, O" uniqKey="Barad O">O Barad</name>
</author>
<author>
<name sortKey="Buckler, Es" uniqKey="Buckler E">ES Buckler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ajmone Marsan, P" uniqKey="Ajmone Marsan P">P Ajmone-Marsan</name>
</author>
<author>
<name sortKey="Stella, A" uniqKey="Stella A">A Stella</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Poland, J" uniqKey="Poland J">J Poland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Franco Zorrilla, Jm" uniqKey="Franco Zorrilla J">JM Franco-Zorrilla</name>
</author>
<author>
<name sortKey="L Pez Vidriero, I" uniqKey="L Pez Vidriero I">I López-Vidriero</name>
</author>
<author>
<name sortKey="Carrasco, Jl" uniqKey="Carrasco J">JL Carrasco</name>
</author>
<author>
<name sortKey="Godoy, M" uniqKey="Godoy M">M Godoy</name>
</author>
<author>
<name sortKey="Vera, P" uniqKey="Vera P">P Vera</name>
</author>
<author>
<name sortKey="Solano, R" uniqKey="Solano R">R Solano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="O Alley, Rc" uniqKey="O Alley R">RC O’Malley</name>
</author>
<author>
<name sortKey="Huang, S Sc" uniqKey="Huang S">S-SC Huang</name>
</author>
<author>
<name sortKey="Song, L" uniqKey="Song L">L Song</name>
</author>
<author>
<name sortKey="Lewsey, Mg" uniqKey="Lewsey M">MG Lewsey</name>
</author>
<author>
<name sortKey="Bartlett, A" uniqKey="Bartlett A">A Bartlett</name>
</author>
<author>
<name sortKey="Nery, Jr" uniqKey="Nery J">JR Nery</name>
</author>
<author>
<name sortKey="Galli, M" uniqKey="Galli M">M Galli</name>
</author>
<author>
<name sortKey="Gallavotti, A" uniqKey="Gallavotti A">A Gallavotti</name>
</author>
<author>
<name sortKey="Ecker, Jr" uniqKey="Ecker J">JR Ecker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lescot, M" uniqKey="Lescot M">M Lescot</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Machanick, P" uniqKey="Machanick P">P Machanick</name>
</author>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zamanighomi, M" uniqKey="Zamanighomi M">M Zamanighomi</name>
</author>
<author>
<name sortKey="Lin, Z" uniqKey="Lin Z">Z Lin</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Jiang, R" uniqKey="Jiang R">R Jiang</name>
</author>
<author>
<name sortKey="Wong, Wh" uniqKey="Wong W">WH Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cuellar Partida, G" uniqKey="Cuellar Partida G">G Cuellar-Partida</name>
</author>
<author>
<name sortKey="Buske, Fa" uniqKey="Buske F">FA Buske</name>
</author>
<author>
<name sortKey="Mcleay, Rc" uniqKey="Mcleay R">RC McLeay</name>
</author>
<author>
<name sortKey="Whitington, T" uniqKey="Whitington T">T Whitington</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kleftogiannis, D" uniqKey="Kleftogiannis D">D Kleftogiannis</name>
</author>
<author>
<name sortKey="Kalnis, P" uniqKey="Kalnis P">P Kalnis</name>
</author>
<author>
<name sortKey="Bajic, Vb" uniqKey="Bajic V">VB Bajic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Natarajan, A" uniqKey="Natarajan A">A Natarajan</name>
</author>
<author>
<name sortKey="Yardimci, Gg" uniqKey="Yardimci G">GG Yardimci</name>
</author>
<author>
<name sortKey="Sheffield, Nc" uniqKey="Sheffield N">NC Sheffield</name>
</author>
<author>
<name sortKey="Crawford, Ge" uniqKey="Crawford G">GE Crawford</name>
</author>
<author>
<name sortKey="Ohler, U" uniqKey="Ohler U">U Ohler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huminiecki, L" uniqKey="Huminiecki L">Ł Huminiecki</name>
</author>
<author>
<name sortKey="Horba Czuk, J" uniqKey="Horba Czuk J">J Horbańczuk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stringham, Jl" uniqKey="Stringham J">JL Stringham</name>
</author>
<author>
<name sortKey="Brown, As" uniqKey="Brown A">AS Brown</name>
</author>
<author>
<name sortKey="Drewell, Ra" uniqKey="Drewell R">RA Drewell</name>
</author>
<author>
<name sortKey="Dresch, Jm" uniqKey="Dresch J">JM Dresch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stampfel, G" uniqKey="Stampfel G">G Stampfel</name>
</author>
<author>
<name sortKey="Kazmar, T" uniqKey="Kazmar T">T Kazmar</name>
</author>
<author>
<name sortKey="Frank, O" uniqKey="Frank O">O Frank</name>
</author>
<author>
<name sortKey="Wienerroither, S" uniqKey="Wienerroither S">S Wienerroither</name>
</author>
<author>
<name sortKey="Reiter, F" uniqKey="Reiter F">F Reiter</name>
</author>
<author>
<name sortKey="Stark, A" uniqKey="Stark A">A Stark</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Crocker, J" uniqKey="Crocker J">J Crocker</name>
</author>
<author>
<name sortKey="Abe, N" uniqKey="Abe N">N Abe</name>
</author>
<author>
<name sortKey="Rinaldi, L" uniqKey="Rinaldi L">L Rinaldi</name>
</author>
<author>
<name sortKey="Mcgregor, Ap" uniqKey="Mcgregor A">AP McGregor</name>
</author>
<author>
<name sortKey="Frankel, N" uniqKey="Frankel N">N Frankel</name>
</author>
<author>
<name sortKey="Wang, S" uniqKey="Wang S">S Wang</name>
</author>
<author>
<name sortKey="Alsawadi, A" uniqKey="Alsawadi A">A Alsawadi</name>
</author>
<author>
<name sortKey="Valenti, P" uniqKey="Valenti P">P Valenti</name>
</author>
<author>
<name sortKey="Plaza, S" uniqKey="Plaza S">S Plaza</name>
</author>
<author>
<name sortKey="Payre, F" uniqKey="Payre F">F Payre</name>
</author>
<author>
<name sortKey="Mann, Rs" uniqKey="Mann R">RS Mann</name>
</author>
<author>
<name sortKey="Stern, Dl" uniqKey="Stern D">DL Stern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raveh Sadka, T" uniqKey="Raveh Sadka T">T Raveh-Sadka</name>
</author>
<author>
<name sortKey="Levo, M" uniqKey="Levo M">M Levo</name>
</author>
<author>
<name sortKey="Shabi, U" uniqKey="Shabi U">U Shabi</name>
</author>
<author>
<name sortKey="Shany, B" uniqKey="Shany B">B Shany</name>
</author>
<author>
<name sortKey="Keren, L" uniqKey="Keren L">L Keren</name>
</author>
<author>
<name sortKey="Lotan Pompan, M" uniqKey="Lotan Pompan M">M Lotan-Pompan</name>
</author>
<author>
<name sortKey="Zeevi, D" uniqKey="Zeevi D">D Zeevi</name>
</author>
<author>
<name sortKey="Sharon, E" uniqKey="Sharon E">E Sharon</name>
</author>
<author>
<name sortKey="Weinberger, A" uniqKey="Weinberger A">A Weinberger</name>
</author>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Farley, Ek" uniqKey="Farley E">EK Farley</name>
</author>
<author>
<name sortKey="Olson, Km" uniqKey="Olson K">KM Olson</name>
</author>
<author>
<name sortKey="Zhang, W" uniqKey="Zhang W">W Zhang</name>
</author>
<author>
<name sortKey="Rokhsar, Ds" uniqKey="Rokhsar D">DS Rokhsar</name>
</author>
<author>
<name sortKey="Levine, Ms" uniqKey="Levine M">MS Levine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ya Ez Cuna, Jo" uniqKey="Ya Ez Cuna J">JO Yáñez-Cuna</name>
</author>
<author>
<name sortKey="Kvon, Ez" uniqKey="Kvon E">EZ Kvon</name>
</author>
<author>
<name sortKey="Stark, A" uniqKey="Stark A">A Stark</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Karchin, R" uniqKey="Karchin R">R Karchin</name>
</author>
<author>
<name sortKey="Beer, Ma" uniqKey="Beer M">MA Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Gorkin, Du" uniqKey="Gorkin D">DU Gorkin</name>
</author>
<author>
<name sortKey="Baker, M" uniqKey="Baker M">M Baker</name>
</author>
<author>
<name sortKey="Strober, Bj" uniqKey="Strober B">BJ Strober</name>
</author>
<author>
<name sortKey="Asoni, Al" uniqKey="Asoni A">AL Asoni</name>
</author>
<author>
<name sortKey="Mccallion, As" uniqKey="Mccallion A">AS McCallion</name>
</author>
<author>
<name sortKey="Beer, Ma" uniqKey="Beer M">MA Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ghandi, M" uniqKey="Ghandi M">M Ghandi</name>
</author>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Mohammad Noori, M" uniqKey="Mohammad Noori M">M Mohammad-Noori</name>
</author>
<author>
<name sortKey="Beer, Ma" uniqKey="Beer M">MA Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alipanahi, B" uniqKey="Alipanahi B">B Alipanahi</name>
</author>
<author>
<name sortKey="Delong, A" uniqKey="Delong A">A Delong</name>
</author>
<author>
<name sortKey="Weirauch, Mt" uniqKey="Weirauch M">MT Weirauch</name>
</author>
<author>
<name sortKey="Frey, Bj" uniqKey="Frey B">BJ Frey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, J" uniqKey="Zhou J">J Zhou</name>
</author>
<author>
<name sortKey="Troyanskaya, Og" uniqKey="Troyanskaya O">OG Troyanskaya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Snoek, J" uniqKey="Snoek J">J Snoek</name>
</author>
<author>
<name sortKey="Rinn, Jl" uniqKey="Rinn J">JL Rinn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, D" uniqKey="Zhang D">D Zhang</name>
</author>
<author>
<name sortKey="Wang, D" uniqKey="Wang D">D Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manning, Cd" uniqKey="Manning C">CD Manning</name>
</author>
<author>
<name sortKey="Schutze, H" uniqKey="Schutze H">H Schütze</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mikolov, T" uniqKey="Mikolov T">T Mikolov</name>
</author>
<author>
<name sortKey="Sutskever, I" uniqKey="Sutskever I">I Sutskever</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Corrado, G" uniqKey="Corrado G">G Corrado</name>
</author>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Taddy, M" uniqKey="Taddy M">M Taddy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bolduc, N" uniqKey="Bolduc N">N Bolduc</name>
</author>
<author>
<name sortKey="Yilmaz, A" uniqKey="Yilmaz A">A Yilmaz</name>
</author>
<author>
<name sortKey="Mejia Guerra, Mk" uniqKey="Mejia Guerra M">MK Mejía-Guerra</name>
</author>
<author>
<name sortKey="Morohashi, K" uniqKey="Morohashi K">K Morohashi</name>
</author>
<author>
<name sortKey="O Onnor, D" uniqKey="O Onnor D">D O’Connor</name>
</author>
<author>
<name sortKey="Grotewold, E" uniqKey="Grotewold E">E Grotewold</name>
</author>
<author>
<name sortKey="Hake, S" uniqKey="Hake S">S Hake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pautler, M" uniqKey="Pautler M">M Pautler</name>
</author>
<author>
<name sortKey="Eveland, Al" uniqKey="Eveland A">AL Eveland</name>
</author>
<author>
<name sortKey="Larue, T" uniqKey="Larue T">T LaRue</name>
</author>
<author>
<name sortKey="Yang, F" uniqKey="Yang F">F Yang</name>
</author>
<author>
<name sortKey="Weeks, R" uniqKey="Weeks R">R Weeks</name>
</author>
<author>
<name sortKey="Lunde, C" uniqKey="Lunde C">C Lunde</name>
</author>
<author>
<name sortKey="Je, Bi" uniqKey="Je B">BI Je</name>
</author>
<author>
<name sortKey="Meeley, R" uniqKey="Meeley R">R Meeley</name>
</author>
<author>
<name sortKey="Komatsu, M" uniqKey="Komatsu M">M Komatsu</name>
</author>
<author>
<name sortKey="Vollbrecht, E" uniqKey="Vollbrecht E">E Vollbrecht</name>
</author>
<author>
<name sortKey="Sakai, H" uniqKey="Sakai H">H Sakai</name>
</author>
<author>
<name sortKey="Jackson, D" uniqKey="Jackson D">D Jackson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alexandrov, Nn" uniqKey="Alexandrov N">NN Alexandrov</name>
</author>
<author>
<name sortKey="Brover, Vv" uniqKey="Brover V">VV Brover</name>
</author>
<author>
<name sortKey="Freidin, S" uniqKey="Freidin S">S Freidin</name>
</author>
<author>
<name sortKey="Troukhan, Me" uniqKey="Troukhan M">ME Troukhan</name>
</author>
<author>
<name sortKey="Tatarinova, Tv" uniqKey="Tatarinova T">TV Tatarinova</name>
</author>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
<author>
<name sortKey="Swaller, Tj" uniqKey="Swaller T">TJ Swaller</name>
</author>
<author>
<name sortKey="Lu, Y P" uniqKey="Lu Y">Y-P Lu</name>
</author>
<author>
<name sortKey="Bouck, J" uniqKey="Bouck J">J Bouck</name>
</author>
<author>
<name sortKey="Flavell, Rb" uniqKey="Flavell R">RB Flavell</name>
</author>
<author>
<name sortKey="Feldmann, Ka" uniqKey="Feldmann K">KA Feldmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Soderlund, C" uniqKey="Soderlund C">C Soderlund</name>
</author>
<author>
<name sortKey="Descour, A" uniqKey="Descour A">A Descour</name>
</author>
<author>
<name sortKey="Kudrna, D" uniqKey="Kudrna D">D Kudrna</name>
</author>
<author>
<name sortKey="Bomhoff, M" uniqKey="Bomhoff M">M Bomhoff</name>
</author>
<author>
<name sortKey="Boyd, L" uniqKey="Boyd L">L Boyd</name>
</author>
<author>
<name sortKey="Currie, J" uniqKey="Currie J">J Currie</name>
</author>
<author>
<name sortKey="Angelova, A" uniqKey="Angelova A">A Angelova</name>
</author>
<author>
<name sortKey="Collura, K" uniqKey="Collura K">K Collura</name>
</author>
<author>
<name sortKey="Wissotski, M" uniqKey="Wissotski M">M Wissotski</name>
</author>
<author>
<name sortKey="Ashley, E" uniqKey="Ashley E">E Ashley</name>
</author>
<author>
<name sortKey="Morrow, D" uniqKey="Morrow D">D Morrow</name>
</author>
<author>
<name sortKey="Fernandes, J" uniqKey="Fernandes J">J Fernandes</name>
</author>
<author>
<name sortKey="Walbot, V" uniqKey="Walbot V">V Walbot</name>
</author>
<author>
<name sortKey="Yu, Y" uniqKey="Yu Y">Y Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mejia Guerra, Mk" uniqKey="Mejia Guerra M">MK Mejía-Guerra</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Galeano, Nf" uniqKey="Galeano N">NF Galeano</name>
</author>
<author>
<name sortKey="Vidal, M" uniqKey="Vidal M">M Vidal</name>
</author>
<author>
<name sortKey="Gray, J" uniqKey="Gray J">J Gray</name>
</author>
<author>
<name sortKey="Doseff, Ai" uniqKey="Doseff A">AI Doseff</name>
</author>
<author>
<name sortKey="Grotewold, E" uniqKey="Grotewold E">E Grotewold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Q" uniqKey="Liu Q">Q Liu</name>
</author>
<author>
<name sortKey="Gan, M" uniqKey="Gan M">M Gan</name>
</author>
<author>
<name sortKey="Jiang, R" uniqKey="Jiang R">R Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bolduc, N" uniqKey="Bolduc N">N Bolduc</name>
</author>
<author>
<name sortKey="Hake, S" uniqKey="Hake S">S Hake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Springer, Nm" uniqKey="Springer N">NM Springer</name>
</author>
<author>
<name sortKey="Anderson, Sn" uniqKey="Anderson S">SN Anderson</name>
</author>
<author>
<name sortKey="Andorf, Cm" uniqKey="Andorf C">CM Andorf</name>
</author>
<author>
<name sortKey="Ahern, Kr" uniqKey="Ahern K">KR Ahern</name>
</author>
<author>
<name sortKey="Bai, F" uniqKey="Bai F">F Bai</name>
</author>
<author>
<name sortKey="Barad, O" uniqKey="Barad O">O Barad</name>
</author>
<author>
<name sortKey="Barbazuk, Wb" uniqKey="Barbazuk W">WB Barbazuk</name>
</author>
<author>
<name sortKey="Bass, Hw" uniqKey="Bass H">HW Bass</name>
</author>
<author>
<name sortKey="Baruch, K" uniqKey="Baruch K">K Baruch</name>
</author>
<author>
<name sortKey="Ben Zvi, G" uniqKey="Ben Zvi G">G Ben-Zvi</name>
</author>
<author>
<name sortKey="Buckler, Es" uniqKey="Buckler E">ES Buckler</name>
</author>
<author>
<name sortKey="Bukowski, R" uniqKey="Bukowski R">R Bukowski</name>
</author>
<author>
<name sortKey="Campbell, Ms" uniqKey="Campbell M">MS Campbell</name>
</author>
<author>
<name sortKey="Cannon, Eks" uniqKey="Cannon E">EKS Cannon</name>
</author>
<author>
<name sortKey="Chomet, P" uniqKey="Chomet P">P Chomet</name>
</author>
<author>
<name sortKey="Dawe, Rk" uniqKey="Dawe R">RK Dawe</name>
</author>
<author>
<name sortKey="Davenport, R" uniqKey="Davenport R">R Davenport</name>
</author>
<author>
<name sortKey="Dooner, Hk" uniqKey="Dooner H">HK Dooner</name>
</author>
<author>
<name sortKey="Du, Lh" uniqKey="Du L">LH Du</name>
</author>
<author>
<name sortKey="Du, C" uniqKey="Du C">C Du</name>
</author>
<author>
<name sortKey="Easterling, Ka" uniqKey="Easterling K">KA Easterling</name>
</author>
<author>
<name sortKey="Gault, C" uniqKey="Gault C">C Gault</name>
</author>
<author>
<name sortKey="Guan, J C" uniqKey="Guan J">J-C Guan</name>
</author>
<author>
<name sortKey="Hunter, Ct" uniqKey="Hunter C">CT Hunter</name>
</author>
<author>
<name sortKey="Jander, G" uniqKey="Jander G">G Jander</name>
</author>
<author>
<name sortKey="Jiao, Y" uniqKey="Jiao Y">Y Jiao</name>
</author>
<author>
<name sortKey="Koch, Ke" uniqKey="Koch K">KE Koch</name>
</author>
<author>
<name sortKey="Kol, G" uniqKey="Kol G">G Kol</name>
</author>
<author>
<name sortKey="Kollner, Tg" uniqKey="Kollner T">TG Köllner</name>
</author>
<author>
<name sortKey="Kudo, T" uniqKey="Kudo T">T Kudo</name>
</author>
<author>
<name sortKey="Li, Q" uniqKey="Li Q">Q Li</name>
</author>
<author>
<name sortKey="Lu, F" uniqKey="Lu F">F Lu</name>
</author>
<author>
<name sortKey="Mayfield Jones, D" uniqKey="Mayfield Jones D">D Mayfield-Jones</name>
</author>
<author>
<name sortKey="Mei, W" uniqKey="Mei W">W Mei</name>
</author>
<author>
<name sortKey="Mccarty, Dr" uniqKey="Mccarty D">DR McCarty</name>
</author>
<author>
<name sortKey="Noshay, Jm" uniqKey="Noshay J">JM Noshay</name>
</author>
<author>
<name sortKey="Portwood, Jl" uniqKey="Portwood J">JL Portwood</name>
</author>
<author>
<name sortKey="Ronen, G" uniqKey="Ronen G">G Ronen</name>
</author>
<author>
<name sortKey="Settles, Am" uniqKey="Settles A">AM Settles</name>
</author>
<author>
<name sortKey="Shem Tov, D" uniqKey="Shem Tov D">D Shem-Tov</name>
</author>
<author>
<name sortKey="Shi, J" uniqKey="Shi J">J Shi</name>
</author>
<author>
<name sortKey="Soifer, I" uniqKey="Soifer I">I Soifer</name>
</author>
<author>
<name sortKey="Stein, Jc" uniqKey="Stein J">JC Stein</name>
</author>
<author>
<name sortKey="Stitzer, Mc" uniqKey="Stitzer M">MC Stitzer</name>
</author>
<author>
<name sortKey="Suzuki, M" uniqKey="Suzuki M">M Suzuki</name>
</author>
<author>
<name sortKey="Vera, Dl" uniqKey="Vera D">DL Vera</name>
</author>
<author>
<name sortKey="Vollbrecht, E" uniqKey="Vollbrecht E">E Vollbrecht</name>
</author>
<author>
<name sortKey="Vrebalov, Jt" uniqKey="Vrebalov J">JT Vrebalov</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
<author>
<name sortKey="Wei, S" uniqKey="Wei S">S Wei</name>
</author>
<author>
<name sortKey="Wimalanathan, K" uniqKey="Wimalanathan K">K Wimalanathan</name>
</author>
<author>
<name sortKey="Woodhouse, Mr" uniqKey="Woodhouse M">MR Woodhouse</name>
</author>
<author>
<name sortKey="Xiong, W" uniqKey="Xiong W">W Xiong</name>
</author>
<author>
<name sortKey="Brutnell, Tp" uniqKey="Brutnell T">TP Brutnell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsuda, K" uniqKey="Tsuda K">K Tsuda</name>
</author>
<author>
<name sortKey="Kurata, N" uniqKey="Kurata N">N Kurata</name>
</author>
<author>
<name sortKey="Ohyanagi, H" uniqKey="Ohyanagi H">H Ohyanagi</name>
</author>
<author>
<name sortKey="Hake, S" uniqKey="Hake S">S Hake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author>
<name sortKey="Zhuang, J" uniqKey="Zhuang J">J Zhuang</name>
</author>
<author>
<name sortKey="Iyer, S" uniqKey="Iyer S">S Iyer</name>
</author>
<author>
<name sortKey="Lin, X" uniqKey="Lin X">X Lin</name>
</author>
<author>
<name sortKey="Whitfield, Tw" uniqKey="Whitfield T">TW Whitfield</name>
</author>
<author>
<name sortKey="Greven, Mc" uniqKey="Greven M">MC Greven</name>
</author>
<author>
<name sortKey="Pierce, Bg" uniqKey="Pierce B">BG Pierce</name>
</author>
<author>
<name sortKey="Dong, X" uniqKey="Dong X">X Dong</name>
</author>
<author>
<name sortKey="Kundaje, A" uniqKey="Kundaje A">A Kundaje</name>
</author>
<author>
<name sortKey="Cheng, Y" uniqKey="Cheng Y">Y Cheng</name>
</author>
<author>
<name sortKey="Rando, Oj" uniqKey="Rando O">OJ Rando</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
<author>
<name sortKey="Myers, Rm" uniqKey="Myers R">RM Myers</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
<author>
<name sortKey="Snyder, M" uniqKey="Snyder M">M Snyder</name>
</author>
<author>
<name sortKey="Weng, Z" uniqKey="Weng Z">Z Weng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dror, I" uniqKey="Dror I">I Dror</name>
</author>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
<author>
<name sortKey="Mandel Gutfreund, Y" uniqKey="Mandel Gutfreund Y">Y Mandel-Gutfreund</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Levy, O" uniqKey="Levy O">O Levy</name>
</author>
<author>
<name sortKey="Goldberg, Y" uniqKey="Goldberg Y">Y Goldberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jiao, Y" uniqKey="Jiao Y">Y Jiao</name>
</author>
<author>
<name sortKey="Peluso, P" uniqKey="Peluso P">P Peluso</name>
</author>
<author>
<name sortKey="Shi, J" uniqKey="Shi J">J Shi</name>
</author>
<author>
<name sortKey="Liang, T" uniqKey="Liang T">T Liang</name>
</author>
<author>
<name sortKey="Stitzer, Mc" uniqKey="Stitzer M">MC Stitzer</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Campbell, Ms" uniqKey="Campbell M">MS Campbell</name>
</author>
<author>
<name sortKey="Stein, Jc" uniqKey="Stein J">JC Stein</name>
</author>
<author>
<name sortKey="Wei, X" uniqKey="Wei X">X Wei</name>
</author>
<author>
<name sortKey="Chin, C S" uniqKey="Chin C">C-S Chin</name>
</author>
<author>
<name sortKey="Guill, K" uniqKey="Guill K">K Guill</name>
</author>
<author>
<name sortKey="Regulski, M" uniqKey="Regulski M">M Regulski</name>
</author>
<author>
<name sortKey="Kumari, S" uniqKey="Kumari S">S Kumari</name>
</author>
<author>
<name sortKey="Olson, A" uniqKey="Olson A">A Olson</name>
</author>
<author>
<name sortKey="Gent, J" uniqKey="Gent J">J Gent</name>
</author>
<author>
<name sortKey="Schneider, Kl" uniqKey="Schneider K">KL Schneider</name>
</author>
<author>
<name sortKey="Wolfgruber, Tk" uniqKey="Wolfgruber T">TK Wolfgruber</name>
</author>
<author>
<name sortKey="May, Mr" uniqKey="May M">MR May</name>
</author>
<author>
<name sortKey="Springer, Nm" uniqKey="Springer N">NM Springer</name>
</author>
<author>
<name sortKey="Antoniou, E" uniqKey="Antoniou E">E Antoniou</name>
</author>
<author>
<name sortKey="Mccombie, Wr" uniqKey="Mccombie W">WR McCombie</name>
</author>
<author>
<name sortKey="Presting, Gg" uniqKey="Presting G">GG Presting</name>
</author>
<author>
<name sortKey="Mcmullen, M" uniqKey="Mcmullen M">M McMullen</name>
</author>
<author>
<name sortKey="Ross Ibarra, J" uniqKey="Ross Ibarra J">J Ross-Ibarra</name>
</author>
<author>
<name sortKey="Dawe, Rk" uniqKey="Dawe R">RK Dawe</name>
</author>
<author>
<name sortKey="Hastie, A" uniqKey="Hastie A">A Hastie</name>
</author>
<author>
<name sortKey="Rank, Dr" uniqKey="Rank D">DR Rank</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alexander, Rp" uniqKey="Alexander R">RP Alexander</name>
</author>
<author>
<name sortKey="Fang, G" uniqKey="Fang G">G Fang</name>
</author>
<author>
<name sortKey="Rozowsky, J" uniqKey="Rozowsky J">J Rozowsky</name>
</author>
<author>
<name sortKey="Snyder, M" uniqKey="Snyder M">M Snyder</name>
</author>
<author>
<name sortKey="Gerstein, Mb" uniqKey="Gerstein M">MB Gerstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buckler, Es" uniqKey="Buckler E">ES Buckler</name>
</author>
<author>
<name sortKey="Gaut, Bs" uniqKey="Gaut B">BS Gaut</name>
</author>
<author>
<name sortKey="Mcmullen, Md" uniqKey="Mcmullen M">MD McMullen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Asgari, E" uniqKey="Asgari E">E Asgari</name>
</author>
<author>
<name sortKey="Mofrad, Mrk" uniqKey="Mofrad M">MRK Mofrad</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schnable, Ps" uniqKey="Schnable P">PS Schnable</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
<author>
<name sortKey="Fulton, Rs" uniqKey="Fulton R">RS Fulton</name>
</author>
<author>
<name sortKey="Stein, Jc" uniqKey="Stein J">JC Stein</name>
</author>
<author>
<name sortKey="Wei, F" uniqKey="Wei F">F Wei</name>
</author>
<author>
<name sortKey="Pasternak, S" uniqKey="Pasternak S">S Pasternak</name>
</author>
<author>
<name sortKey="Liang, C" uniqKey="Liang C">C Liang</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Fulton, L" uniqKey="Fulton L">L Fulton</name>
</author>
<author>
<name sortKey="Graves, Ta" uniqKey="Graves T">TA Graves</name>
</author>
<author>
<name sortKey="Minx, P" uniqKey="Minx P">P Minx</name>
</author>
<author>
<name sortKey="Reily, Ad" uniqKey="Reily A">AD Reily</name>
</author>
<author>
<name sortKey="Courtney, L" uniqKey="Courtney L">L Courtney</name>
</author>
<author>
<name sortKey="Kruchowski, Ss" uniqKey="Kruchowski S">SS Kruchowski</name>
</author>
<author>
<name sortKey="Tomlinson, C" uniqKey="Tomlinson C">C Tomlinson</name>
</author>
<author>
<name sortKey="Strong, C" uniqKey="Strong C">C Strong</name>
</author>
<author>
<name sortKey="Delehaunty, K" uniqKey="Delehaunty K">K Delehaunty</name>
</author>
<author>
<name sortKey="Fronick, C" uniqKey="Fronick C">C Fronick</name>
</author>
<author>
<name sortKey="Courtney, B" uniqKey="Courtney B">B Courtney</name>
</author>
<author>
<name sortKey="Rock, Sm" uniqKey="Rock S">SM Rock</name>
</author>
<author>
<name sortKey="Belter, E" uniqKey="Belter E">E Belter</name>
</author>
<author>
<name sortKey="Du, F" uniqKey="Du F">F Du</name>
</author>
<author>
<name sortKey="Kim, K" uniqKey="Kim K">K Kim</name>
</author>
<author>
<name sortKey="Abbott, Rm" uniqKey="Abbott R">RM Abbott</name>
</author>
<author>
<name sortKey="Cotton, M" uniqKey="Cotton M">M Cotton</name>
</author>
<author>
<name sortKey="Levy, A" uniqKey="Levy A">A Levy</name>
</author>
<author>
<name sortKey="Marchetto, P" uniqKey="Marchetto P">P Marchetto</name>
</author>
<author>
<name sortKey="Ochoa, K" uniqKey="Ochoa K">K Ochoa</name>
</author>
<author>
<name sortKey="Jackson, Sm" uniqKey="Jackson S">SM Jackson</name>
</author>
<author>
<name sortKey="Gillam, B" uniqKey="Gillam B">B Gillam</name>
</author>
<author>
<name sortKey="Chen, W" uniqKey="Chen W">W Chen</name>
</author>
<author>
<name sortKey="Yan, L" uniqKey="Yan L">L Yan</name>
</author>
<author>
<name sortKey="Higginbotham, J" uniqKey="Higginbotham J">J Higginbotham</name>
</author>
<author>
<name sortKey="Cardenas, M" uniqKey="Cardenas M">M Cardenas</name>
</author>
<author>
<name sortKey="Waligorski, J" uniqKey="Waligorski J">J Waligorski</name>
</author>
<author>
<name sortKey="Applebaum, E" uniqKey="Applebaum E">E Applebaum</name>
</author>
<author>
<name sortKey="Phelps, L" uniqKey="Phelps L">L Phelps</name>
</author>
<author>
<name sortKey="Falcone, J" uniqKey="Falcone J">J Falcone</name>
</author>
<author>
<name sortKey="Kanchi, K" uniqKey="Kanchi K">K Kanchi</name>
</author>
<author>
<name sortKey="Thane, T" uniqKey="Thane T">T Thane</name>
</author>
<author>
<name sortKey="Scimone, A" uniqKey="Scimone A">A Scimone</name>
</author>
<author>
<name sortKey="Thane, N" uniqKey="Thane N">N Thane</name>
</author>
<author>
<name sortKey="Henke, J" uniqKey="Henke J">J Henke</name>
</author>
<author>
<name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
<author>
<name sortKey="Ruppert, J" uniqKey="Ruppert J">J Ruppert</name>
</author>
<author>
<name sortKey="Shah, N" uniqKey="Shah N">N Shah</name>
</author>
<author>
<name sortKey="Rotter, K" uniqKey="Rotter K">K Rotter</name>
</author>
<author>
<name sortKey="Hodges, J" uniqKey="Hodges J">J Hodges</name>
</author>
<author>
<name sortKey="Ingenthron, E" uniqKey="Ingenthron E">E Ingenthron</name>
</author>
<author>
<name sortKey="Cordes, M" uniqKey="Cordes M">M Cordes</name>
</author>
<author>
<name sortKey="Kohlberg, S" uniqKey="Kohlberg S">S Kohlberg</name>
</author>
<author>
<name sortKey="Sgro, J" uniqKey="Sgro J">J Sgro</name>
</author>
<author>
<name sortKey="Delgado, B" uniqKey="Delgado B">B Delgado</name>
</author>
<author>
<name sortKey="Mead, K" uniqKey="Mead K">K Mead</name>
</author>
<author>
<name sortKey="Chinwalla, A" uniqKey="Chinwalla A">A Chinwalla</name>
</author>
<author>
<name sortKey="Leonard, S" uniqKey="Leonard S">S Leonard</name>
</author>
<author>
<name sortKey="Crouse, K" uniqKey="Crouse K">K Crouse</name>
</author>
<author>
<name sortKey="Collura, K" uniqKey="Collura K">K Collura</name>
</author>
<author>
<name sortKey="Kudrna, D" uniqKey="Kudrna D">D Kudrna</name>
</author>
<author>
<name sortKey="Currie, J" uniqKey="Currie J">J Currie</name>
</author>
<author>
<name sortKey="He, R" uniqKey="He R">R He</name>
</author>
<author>
<name sortKey="Angelova, A" uniqKey="Angelova A">A Angelova</name>
</author>
<author>
<name sortKey="Rajasekar, S" uniqKey="Rajasekar S">S Rajasekar</name>
</author>
<author>
<name sortKey="Mueller, T" uniqKey="Mueller T">T Mueller</name>
</author>
<author>
<name sortKey="Lomeli, R" uniqKey="Lomeli R">R Lomeli</name>
</author>
<author>
<name sortKey="Scara, G" uniqKey="Scara G">G Scara</name>
</author>
<author>
<name sortKey="Ko, A" uniqKey="Ko A">A Ko</name>
</author>
<author>
<name sortKey="Delaney, K" uniqKey="Delaney K">K Delaney</name>
</author>
<author>
<name sortKey="Wissotski, M" uniqKey="Wissotski M">M Wissotski</name>
</author>
<author>
<name sortKey="Lopez, G" uniqKey="Lopez G">G Lopez</name>
</author>
<author>
<name sortKey="Campos, D" uniqKey="Campos D">D Campos</name>
</author>
<author>
<name sortKey="Braidotti, M" uniqKey="Braidotti M">M Braidotti</name>
</author>
<author>
<name sortKey="Ashley, E" uniqKey="Ashley E">E Ashley</name>
</author>
<author>
<name sortKey="Golser, W" uniqKey="Golser W">W Golser</name>
</author>
<author>
<name sortKey="Kim, H" uniqKey="Kim H">H Kim</name>
</author>
<author>
<name sortKey="Lee, S" uniqKey="Lee S">S Lee</name>
</author>
<author>
<name sortKey="Lin, J" uniqKey="Lin J">J Lin</name>
</author>
<author>
<name sortKey="Dujmic, Z" uniqKey="Dujmic Z">Z Dujmic</name>
</author>
<author>
<name sortKey="Kim, W" uniqKey="Kim W">W Kim</name>
</author>
<author>
<name sortKey="Talag, J" uniqKey="Talag J">J Talag</name>
</author>
<author>
<name sortKey="Zuccolo, A" uniqKey="Zuccolo A">A Zuccolo</name>
</author>
<author>
<name sortKey="Fan, C" uniqKey="Fan C">C Fan</name>
</author>
<author>
<name sortKey="Sebastian, A" uniqKey="Sebastian A">A Sebastian</name>
</author>
<author>
<name sortKey="Kramer, M" uniqKey="Kramer M">M Kramer</name>
</author>
<author>
<name sortKey="Spiegel, L" uniqKey="Spiegel L">L Spiegel</name>
</author>
<author>
<name sortKey="Nascimento, L" uniqKey="Nascimento L">L Nascimento</name>
</author>
<author>
<name sortKey="Zutavern, T" uniqKey="Zutavern T">T Zutavern</name>
</author>
<author>
<name sortKey="Miller, B" uniqKey="Miller B">B Miller</name>
</author>
<author>
<name sortKey="Ambroise, C" uniqKey="Ambroise C">C Ambroise</name>
</author>
<author>
<name sortKey="Muller, S" uniqKey="Muller S">S Muller</name>
</author>
<author>
<name sortKey="Spooner, W" uniqKey="Spooner W">W Spooner</name>
</author>
<author>
<name sortKey="Narechania, A" uniqKey="Narechania A">A Narechania</name>
</author>
<author>
<name sortKey="Ren, L" uniqKey="Ren L">L Ren</name>
</author>
<author>
<name sortKey="Wei, S" uniqKey="Wei S">S Wei</name>
</author>
<author>
<name sortKey="Kumari, S" uniqKey="Kumari S">S Kumari</name>
</author>
<author>
<name sortKey="Faga, B" uniqKey="Faga B">B Faga</name>
</author>
<author>
<name sortKey="Levy, Mj" uniqKey="Levy M">MJ Levy</name>
</author>
<author>
<name sortKey="Mcmahan, L" uniqKey="Mcmahan L">L McMahan</name>
</author>
<author>
<name sortKey="Van Buren, P" uniqKey="Van Buren P">P Van Buren</name>
</author>
<author>
<name sortKey="Vaughn, Mw" uniqKey="Vaughn M">MW Vaughn</name>
</author>
<author>
<name sortKey="Ying, K" uniqKey="Ying K">K Ying</name>
</author>
<author>
<name sortKey="Yeh, C T" uniqKey="Yeh C">C-T Yeh</name>
</author>
<author>
<name sortKey="Emrich, Sj" uniqKey="Emrich S">SJ Emrich</name>
</author>
<author>
<name sortKey="Jia, Y" uniqKey="Jia Y">Y Jia</name>
</author>
<author>
<name sortKey="Kalyanaraman, A" uniqKey="Kalyanaraman A">A Kalyanaraman</name>
</author>
<author>
<name sortKey="Hsia, A P" uniqKey="Hsia A">A-P Hsia</name>
</author>
<author>
<name sortKey="Barbazuk, Wb" uniqKey="Barbazuk W">WB Barbazuk</name>
</author>
<author>
<name sortKey="Baucom, Rs" uniqKey="Baucom R">RS Baucom</name>
</author>
<author>
<name sortKey="Brutnell, Tp" uniqKey="Brutnell T">TP Brutnell</name>
</author>
<author>
<name sortKey="Carpita, Nc" uniqKey="Carpita N">NC Carpita</name>
</author>
<author>
<name sortKey="Chaparro, C" uniqKey="Chaparro C">C Chaparro</name>
</author>
<author>
<name sortKey="Chia, J M" uniqKey="Chia J">J-M Chia</name>
</author>
<author>
<name sortKey="Deragon, J M" uniqKey="Deragon J">J-M Deragon</name>
</author>
<author>
<name sortKey="Estill, Jc" uniqKey="Estill J">JC Estill</name>
</author>
<author>
<name sortKey="Fu, Y" uniqKey="Fu Y">Y Fu</name>
</author>
<author>
<name sortKey="Jeddeloh, Ja" uniqKey="Jeddeloh J">JA Jeddeloh</name>
</author>
<author>
<name sortKey="Han, Y" uniqKey="Han Y">Y Han</name>
</author>
<author>
<name sortKey="Lee, H" uniqKey="Lee H">H Lee</name>
</author>
<author>
<name sortKey="Li, P" uniqKey="Li P">P Li</name>
</author>
<author>
<name sortKey="Lisch, Dr" uniqKey="Lisch D">DR Lisch</name>
</author>
<author>
<name sortKey="Liu, S" uniqKey="Liu S">S Liu</name>
</author>
<author>
<name sortKey="Liu, Z" uniqKey="Liu Z">Z Liu</name>
</author>
<author>
<name sortKey="Nagel, Dh" uniqKey="Nagel D">DH Nagel</name>
</author>
<author>
<name sortKey="Mccann, Mc" uniqKey="Mccann M">MC McCann</name>
</author>
<author>
<name sortKey="Sanmiguel, P" uniqKey="Sanmiguel P">P SanMiguel</name>
</author>
<author>
<name sortKey="Myers, Am" uniqKey="Myers A">AM Myers</name>
</author>
<author>
<name sortKey="Nettleton, D" uniqKey="Nettleton D">D Nettleton</name>
</author>
<author>
<name sortKey="Nguyen, J" uniqKey="Nguyen J">J Nguyen</name>
</author>
<author>
<name sortKey="Penning, Bw" uniqKey="Penning B">BW Penning</name>
</author>
<author>
<name sortKey="Ponnala, L" uniqKey="Ponnala L">L Ponnala</name>
</author>
<author>
<name sortKey="Schneider, Kl" uniqKey="Schneider K">KL Schneider</name>
</author>
<author>
<name sortKey="Schwartz, Dc" uniqKey="Schwartz D">DC Schwartz</name>
</author>
<author>
<name sortKey="Sharma, A" uniqKey="Sharma A">A Sharma</name>
</author>
<author>
<name sortKey="Soderlund, C" uniqKey="Soderlund C">C Soderlund</name>
</author>
<author>
<name sortKey="Springer, Nm" uniqKey="Springer N">NM Springer</name>
</author>
<author>
<name sortKey="Sun, Q" uniqKey="Sun Q">Q Sun</name>
</author>
<author>
<name sortKey="Wang, H" uniqKey="Wang H">H Wang</name>
</author>
<author>
<name sortKey="Waterman, M" uniqKey="Waterman M">M Waterman</name>
</author>
<author>
<name sortKey="Westerman, R" uniqKey="Westerman R">R Westerman</name>
</author>
<author>
<name sortKey="Wolfgruber, Tk" uniqKey="Wolfgruber T">TK Wolfgruber</name>
</author>
<author>
<name sortKey="Yang, L" uniqKey="Yang L">L Yang</name>
</author>
<author>
<name sortKey="Yu, Y" uniqKey="Yu Y">Y Yu</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Zhou, S" uniqKey="Zhou S">S Zhou</name>
</author>
<author>
<name sortKey="Zhu, Q" uniqKey="Zhu Q">Q Zhu</name>
</author>
<author>
<name sortKey="Bennetzen, Jl" uniqKey="Bennetzen J">JL Bennetzen</name>
</author>
<author>
<name sortKey="Dawe, Rk" uniqKey="Dawe R">RK Dawe</name>
</author>
<author>
<name sortKey="Jiang, J" uniqKey="Jiang J">J Jiang</name>
</author>
<author>
<name sortKey="Jiang, N" uniqKey="Jiang N">N Jiang</name>
</author>
<author>
<name sortKey="Presting, Gg" uniqKey="Presting G">GG Presting</name>
</author>
<author>
<name sortKey="Wessler, Sr" uniqKey="Wessler S">SR Wessler</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
<author>
<name sortKey="Martienssen, Ra" uniqKey="Martienssen R">RA Martienssen</name>
</author>
<author>
<name sortKey="Clifton, Sw" uniqKey="Clifton S">SW Clifton</name>
</author>
<author>
<name sortKey="Mccombie, Wr" uniqKey="Mccombie W">WR McCombie</name>
</author>
<author>
<name sortKey="Wing, Ra" uniqKey="Wing R">RA Wing</name>
</author>
<author>
<name sortKey="Wilson, Rk" uniqKey="Wilson R">RK Wilson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Paterson, Ah" uniqKey="Paterson A">AH Paterson</name>
</author>
<author>
<name sortKey="Bowers, Je" uniqKey="Bowers J">JE Bowers</name>
</author>
<author>
<name sortKey="Bruggmann, R" uniqKey="Bruggmann R">R Bruggmann</name>
</author>
<author>
<name sortKey="Dubchak, I" uniqKey="Dubchak I">I Dubchak</name>
</author>
<author>
<name sortKey="Grimwood, J" uniqKey="Grimwood J">J Grimwood</name>
</author>
<author>
<name sortKey="Gundlach, H" uniqKey="Gundlach H">H Gundlach</name>
</author>
<author>
<name sortKey="Haberer, G" uniqKey="Haberer G">G Haberer</name>
</author>
<author>
<name sortKey="Hellsten, U" uniqKey="Hellsten U">U Hellsten</name>
</author>
<author>
<name sortKey="Mitros, T" uniqKey="Mitros T">T Mitros</name>
</author>
<author>
<name sortKey="Poliakov, A" uniqKey="Poliakov A">A Poliakov</name>
</author>
<author>
<name sortKey="Schmutz, J" uniqKey="Schmutz J">J Schmutz</name>
</author>
<author>
<name sortKey="Spannagl, M" uniqKey="Spannagl M">M Spannagl</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Wicker, T" uniqKey="Wicker T">T Wicker</name>
</author>
<author>
<name sortKey="Bharti, Ak" uniqKey="Bharti A">AK Bharti</name>
</author>
<author>
<name sortKey="Chapman, J" uniqKey="Chapman J">J Chapman</name>
</author>
<author>
<name sortKey="Feltus, Fa" uniqKey="Feltus F">FA Feltus</name>
</author>
<author>
<name sortKey="Gowik, U" uniqKey="Gowik U">U Gowik</name>
</author>
<author>
<name sortKey="Grigoriev, Iv" uniqKey="Grigoriev I">IV Grigoriev</name>
</author>
<author>
<name sortKey="Lyons, E" uniqKey="Lyons E">E Lyons</name>
</author>
<author>
<name sortKey="Maher, Ca" uniqKey="Maher C">CA Maher</name>
</author>
<author>
<name sortKey="Martis, M" uniqKey="Martis M">M Martis</name>
</author>
<author>
<name sortKey="Narechania, A" uniqKey="Narechania A">A Narechania</name>
</author>
<author>
<name sortKey="Otillar, Rp" uniqKey="Otillar R">RP Otillar</name>
</author>
<author>
<name sortKey="Penning, Bw" uniqKey="Penning B">BW Penning</name>
</author>
<author>
<name sortKey="Salamov, Aa" uniqKey="Salamov A">AA Salamov</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Carpita, Nc" uniqKey="Carpita N">NC Carpita</name>
</author>
<author>
<name sortKey="Freeling, M" uniqKey="Freeling M">M Freeling</name>
</author>
<author>
<name sortKey="Gingle, Ar" uniqKey="Gingle A">AR Gingle</name>
</author>
<author>
<name sortKey="Hash, Ct" uniqKey="Hash C">CT Hash</name>
</author>
<author>
<name sortKey="Keller, B" uniqKey="Keller B">B Keller</name>
</author>
<author>
<name sortKey="Klein, P" uniqKey="Klein P">P Klein</name>
</author>
<author>
<name sortKey="Kresovich, S" uniqKey="Kresovich S">S Kresovich</name>
</author>
<author>
<name sortKey="Mccann, Mc" uniqKey="Mccann M">MC McCann</name>
</author>
<author>
<name sortKey="Ming, R" uniqKey="Ming R">R Ming</name>
</author>
<author>
<name sortKey="Peterson, Dg" uniqKey="Peterson D">DG Peterson</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
<author>
<name sortKey="Westhoff, P" uniqKey="Westhoff P">P Westhoff</name>
</author>
<author>
<name sortKey="Mayer, Kfx" uniqKey="Mayer K">KFX Mayer</name>
</author>
<author>
<name sortKey="Messing, J" uniqKey="Messing J">J Messing</name>
</author>
<author>
<name sortKey="Rokhsar, Ds" uniqKey="Rokhsar D">DS Rokhsar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Handsaker, B" uniqKey="Handsaker B">B Handsaker</name>
</author>
<author>
<name sortKey="Wysoker, A" uniqKey="Wysoker A">A Wysoker</name>
</author>
<author>
<name sortKey="Fennell, T" uniqKey="Fennell T">T Fennell</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Homer, N" uniqKey="Homer N">N Homer</name>
</author>
<author>
<name sortKey="Marth, G" uniqKey="Marth G">G Marth</name>
</author>
<author>
<name sortKey="Abecasis, G" uniqKey="Abecasis G">G Abecasis</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Liu, T" uniqKey="Liu T">T Liu</name>
</author>
<author>
<name sortKey="Meyer, Ca" uniqKey="Meyer C">CA Meyer</name>
</author>
<author>
<name sortKey="Eeckhoute, J" uniqKey="Eeckhoute J">J Eeckhoute</name>
</author>
<author>
<name sortKey="Johnson, Ds" uniqKey="Johnson D">DS Johnson</name>
</author>
<author>
<name sortKey="Bernstein, Be" uniqKey="Bernstein B">BE Bernstein</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C Nusbaum</name>
</author>
<author>
<name sortKey="Myers, Rm" uniqKey="Myers R">RM Myers</name>
</author>
<author>
<name sortKey="Brown, M" uniqKey="Brown M">M Brown</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Liu, Xs" uniqKey="Liu X">XS Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pedregosa, F" uniqKey="Pedregosa F">F Pedregosa</name>
</author>
<author>
<name sortKey="Varoquaux, G" uniqKey="Varoquaux G">G Varoquaux</name>
</author>
<author>
<name sortKey="Gramfort, A" uniqKey="Gramfort A">A Gramfort</name>
</author>
<author>
<name sortKey="Michel, V" uniqKey="Michel V">V Michel</name>
</author>
<author>
<name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
<author>
<name sortKey="Grisel, O" uniqKey="Grisel O">O Grisel</name>
</author>
<author>
<name sortKey="Blondel, M" uniqKey="Blondel M">M Blondel</name>
</author>
<author>
<name sortKey="Prettenhofer, P" uniqKey="Prettenhofer P">P Prettenhofer</name>
</author>
<author>
<name sortKey="Weiss, R" uniqKey="Weiss R">R Weiss</name>
</author>
<author>
<name sortKey="Dubourg, V" uniqKey="Dubourg V">V Dubourg</name>
</author>
<author>
<name sortKey="Vanderplas, J" uniqKey="Vanderplas J">J Vanderplas</name>
</author>
<author>
<name sortKey="Passos, A" uniqKey="Passos A">A Passos</name>
</author>
<author>
<name sortKey="Cournapeau, D" uniqKey="Cournapeau D">D Cournapeau</name>
</author>
<author>
<name sortKey="Brucher, M" uniqKey="Brucher M">M Brucher</name>
</author>
<author>
<name sortKey="Perrot, M" uniqKey="Perrot M">M Perrot</name>
</author>
<author>
<name sortKey="Duchesnay, E" uniqKey="Duchesnay E">É Duchesnay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rehurek, R" uniqKey="Rehurek R">R Rehurek</name>
</author>
<author>
<name sortKey="Sojka, P" uniqKey="Sojka P">P Sojka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hunter, Jd" uniqKey="Hunter J">JD Hunter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mara Ais, G" uniqKey="Mara Ais G">G Marçais</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Coston, R" uniqKey="Coston R">R Coston</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Zimin, A" uniqKey="Zimin A">A Zimin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kulakovskiy, Iv" uniqKey="Kulakovskiy I">IV Kulakovskiy</name>
</author>
<author>
<name sortKey="Vorontsov, Ie" uniqKey="Vorontsov I">IE Vorontsov</name>
</author>
<author>
<name sortKey="Yevshin, Is" uniqKey="Yevshin I">IS Yevshin</name>
</author>
<author>
<name sortKey="Soboleva, Av" uniqKey="Soboleva A">AV Soboleva</name>
</author>
<author>
<name sortKey="Kasianov, As" uniqKey="Kasianov A">AS Kasianov</name>
</author>
<author>
<name sortKey="Ashoor, H" uniqKey="Ashoor H">H Ashoor</name>
</author>
<author>
<name sortKey="Ba Alawi, W" uniqKey="Ba Alawi W">W Ba-Alawi</name>
</author>
<author>
<name sortKey="Bajic, Vb" uniqKey="Bajic V">VB Bajic</name>
</author>
<author>
<name sortKey="Medvedeva, Ya" uniqKey="Medvedeva Y">YA Medvedeva</name>
</author>
<author>
<name sortKey="Kolpakov, Fa" uniqKey="Kolpakov F">FA Kolpakov</name>
</author>
<author>
<name sortKey="Makeev, Vj" uniqKey="Makeev V">VJ Makeev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gupta, S" uniqKey="Gupta S">S Gupta</name>
</author>
<author>
<name sortKey="Stamatoyannopoulos, Ja" uniqKey="Stamatoyannopoulos J">JA Stamatoyannopoulos</name>
</author>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Plant Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Plant Biol</journal-id>
<journal-title-group>
<journal-title>BMC Plant Biology</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2229</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30876396</article-id>
<article-id pub-id-type="pmc">6419808</article-id>
<article-id pub-id-type="publisher-id">1693</article-id>
<article-id pub-id-type="doi">10.1186/s12870-019-1693-2</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A
<italic>k</italic>
-mer grammar analysis to uncover maize regulatory architecture</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-2791-182X</contrib-id>
<name>
<surname>Mejía-Guerra</surname>
<given-names>María Katherine</given-names>
</name>
<address>
<email>mm2842@cornell.edu</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Buckler</surname>
<given-names>Edward S.</given-names>
</name>
<address>
<email>esb33@cornell.edu</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Institute for Genomic Diversity, Cornell University,</institution>
</institution-wrap>
175 Biotechnology Building, Ithaca, 14853 NY USA</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0404 0958</institution-id>
<institution-id institution-id-type="GRID">grid.463419.d</institution-id>
<institution>USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center,</institution>
</institution-wrap>
Ithaca, 14853 NY USA</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">000000041936877X</institution-id>
<institution-id institution-id-type="GRID">grid.5386.8</institution-id>
<institution>Department of Plant Breeding and Genetics, Cornell University,</institution>
</institution-wrap>
159 Biotechnology Building, Ithaca, 14853 NY USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>15</day>
<month>3</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>15</day>
<month>3</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>19</volume>
<elocation-id>103</elocation-id>
<history>
<date date-type="received">
<day>25</day>
<month>4</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>2</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious. In species like maize, with diverse intergenic regions and many repetitive elements, this is an especially challenging problem that limits the transfer of data from one line to another. While regulatory regions are rare, they have characteristic chromatin contexts and sequence organization (the grammar) by which they can be identified.</p>
</sec>
<sec>
<title>Results</title>
<p>We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features -
<italic>k</italic>
-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) “bag-of-words”, which is commonly used for differentially weighting key words in tasks like sentiment analysis, and (2) a vector-space model using word2vec (vector-
<italic>k</italic>
-mers) that captures semantic and linguistic relationships between words. We built “bag-of-
<italic>k</italic>
-mers” and “vector-
<italic>k</italic>
-mers” models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our “bag-of-
<italic>k</italic>
-mers” achieved higher overall accuracy, while the “vector-
<italic>k</italic>
-mers” models were more useful in highlighting key groups of sequences within the regulatory regions.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12870-019-1693-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Gene regulatory regions</kwd>
<kwd>Machine learning models</kwd>
<kwd>Crops genomics</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000001</institution-id>
<institution>National Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>IOS #1238014</award-id>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>The majority of sequence polymorphisms that are statistically associated with phenotypic variation (GWAS) lie in the non-genic portion of the genome, where they might play regulatory roles [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
]. Recently, biochemical characterization of the open chromatin space in B73 (the maize reference line) revealed that as much as 40% of the significant sequence polymorphisms, as identified through variance components analyses, overlap with regions in which regulatory elements are expected [
<xref ref-type="bibr" rid="CR3">3</xref>
]. These biochemical assays are prohibitively expensive and time-consuming at the scale of breeding programs for any crop species. This is even more true for species such as maize, with high genomic diversity and a high rate of polymorphism. As in other crops, less than half of the maize genome sequence is expected to be shared between inbred lines [
<xref ref-type="bibr" rid="CR4">4</xref>
]. Building accurate models from expensive data derived from reference line(s) will enable breeders to project that information to other genotypes for use in genomic selection models and to prioritize regions of the genome to edit using strategies such as CRISPR technology [
<xref ref-type="bibr" rid="CR5">5</xref>
,
<xref ref-type="bibr" rid="CR6">6</xref>
].</p>
<p>The most common approach to annotating a non-coding sequence with a regulatory role is the use of collections of transcription factor binding sites (TFBSs), or “motifs”, usually in the form of Position Weight Matrices (PWMs). Collections of PWMs are usually derived from large-scale experiments (in vivo or in vitro) capable of biochemically characterizing the interactions between proteins and DNA. In plants, large collections of PWMs describing TF:DNA interactions are available only for Arabidopsis (Franco-Zorrilla JM et al. and O’Malley RC et al.) [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. For plant regulatory regions, a number of convenient tools that identify “motifs” from sets of sequences, or that identify candidate regulatory regions based on the presence of PWMs, are routinely used in molecular biology, relying on Arabidopsis annotations across species [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR10">10</xref>
]. One shortcoming is that “motifs” are elusive: it is common to have experimental data on TF:DNA interactions from which a PWM cannot be obtained [
<xref ref-type="bibr" rid="CR11">11</xref>
]. When available, PWMs are limited in their application to identify candidate regulatory regions, frequently achieving poor recognition performance [
<xref ref-type="bibr" rid="CR12">12</xref>
,
<xref ref-type="bibr" rid="CR13">13</xref>
].</p>
<p>Most of the experimental and computational approaches used to annotate functional non-coding regions focus on the regulatory role of TFBSs [
<xref ref-type="bibr" rid="CR14">14</xref>
,
<xref ref-type="bibr" rid="CR15">15</xref>
]. However, it has been observed that patterns of sequence organization (the grammar) and the chromatin context in which TFBSs are located contribute to the regulatory message [
<xref ref-type="bibr" rid="CR16">16</xref>
–
<xref ref-type="bibr" rid="CR18">18</xref>
]. For instance, the spatial arrangement of poly(dA:dT) tracts within yeast promoter regions has been identified as a causal driver of transcriptional patterns at levels comparable to TFBSs [
<xref ref-type="bibr" rid="CR19">19</xref>
]. More recently, it was shown that developmental enhancers in
<italic>Ciona</italic>
rely on the positioning, arrangement, and spacing of TFBSs to counterbalance low TFBS affinity [
<xref ref-type="bibr" rid="CR20">20</xref>
]. From this emerging view, it appears that regulatory regions have distinctive features that can be exploited for prediction by identifying enriched key sequences and sequence organization.</p>
<p>The frequency of oligomers of length
<italic>k</italic>
(i.e., short
<italic>k</italic>
-mers in the size range of TFBSs) has been exploited to build supervised models capable of discriminating regulatory regions from random genomic regions, as well as to score sequence variation with few or no assumptions regarding the role that a given
<italic>k</italic>
-mer might play [
<xref ref-type="bibr" rid="CR21">21</xref>
–
<xref ref-type="bibr" rid="CR23">23</xref>
]. The early
<italic>k</italic>
-mer count-based classifiers have been improved to count gapped
<italic>k</italic>
-mers, allowing exploration of short and long
<italic>k</italic>
values without losing power as the total number of
<italic>k</italic>
-mers increases [
<xref ref-type="bibr" rid="CR24">24</xref>
]. Some limitations of
<italic>k</italic>
-mer frequency-based methods include: (1) they make poor or no use of the
<italic>k</italic>
-mer positional relationships in their models, and (2) they perform poorly in the presence of repetitive regions, where the frequencies of short
<italic>k</italic>
-mers are misleading, which might hamper the performance of these methods for genomes with high repeat content.</p>
<p>Recently, however, a growing set of computational tools using Neural Networks (NNs) has shown success in learning to recognize simple sequence patterns, similar to PWMs. These approaches have been able to further integrate those patterns into more complex features to discriminate regulatory regions [
<xref ref-type="bibr" rid="CR25">25</xref>
–
<xref ref-type="bibr" rid="CR27">27</xref>
]. Generally, the NNs implemented for genomic data are Convolutional Neural Networks (CNNs), a type of architecture that shows state-of-the-art performance for key phrase recognition tasks in Natural Language Processing (NLP), rather than Recurrent Neural Networks (RNNs), which are preferred for comprehension of whole-sentence semantics given their power in modeling long-span relations [
<xref ref-type="bibr" rid="CR28">28</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
]. Despite their power, CNNs are often implemented in a black-box context and interpretation of their output is challenging; thus it remains unclear how much of their performance is derived from recognizing key motifs, motif relationships, and the general sequence context. For these reasons, we chose to implement
<italic>k</italic>
-mer approaches rather than CNNs or RNNs.</p>
<p>To define sequence arrangements with putative regulatory roles, we analyzed the architecture of regulatory regions at the
<italic>k</italic>
-mer level, focusing on weighted individual frequencies and co-occurrences, while considering a genome environment with high repeat content. The core of the analysis builds on machine learning approaches commonly applied in the natural language processing (NLP) community. These methods are easily interpretable and rely on word statistics to recover semantic and syntactic cues [
<xref ref-type="bibr" rid="CR30">30</xref>
–
<xref ref-type="bibr" rid="CR33">33</xref>
]. We evaluated the accuracy and precision of these approaches with a diverse set of functional genomics experiments to provide a comprehensive description of the regulatory landscape of the maize genome. The software implementation, which supports selecting control regions and training and testing models, is open source and available in a public Bitbucket repository.</p>
</sec>
<sec id="Sec2" sec-type="results">
<title>Results</title>
<sec id="Sec3">
<title>Weighted frequencies and co-occurrences of short sequences can accurately discriminate regulatory from random genomic regions</title>
<p>To build accurate classifiers, we collected a comprehensive set of regions enriched in regulatory function (hereafter, ‘regulatory regions’), as identified in B73 (maize reference genome) through different biochemical assays. We included open chromatin regions identified by MNA-seq in two tissues [
<xref ref-type="bibr" rid="CR3">3</xref>
], binding loci from ChIP-seq peaks of two TFs (i.e., Homeobox KNOTTED 1 – KN1, bZIP FASCIATED EAR4 – FEA4) [
<xref ref-type="bibr" rid="CR34">34</xref>
,
<xref ref-type="bibr" rid="CR35">35</xref>
], and core promoter regions around TSSs [
<xref ref-type="bibr" rid="CR36">36</xref>
–
<xref ref-type="bibr" rid="CR38">38</xref>
] (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1). Because the specific background signals from each individual experiment are not available, regulatory regions were paired with randomly chosen regions controlling for G+C content and genomic distribution. Each group of sequences (regulatory regions and their controls) was separated into training and holdout sets for model evaluation. In total, we analyzed 52,292,705 base pairs of regulatory regions, corresponding to ∼2.5% of the effective genome size of the B73 genome.</p>
<p>The first part of the analysis involved the training of “bag-of-
<italic>k</italic>
-mers” and “vector-
<italic>k</italic>
-mers” models (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
). The “bag-of-
<italic>k</italic>
-mers” captures information from the
<italic>k</italic>
-mer individual frequencies and fits a logistic regression to a matrix filled with the TF*IDF (i.e., the term frequency–inverse document frequency) transformation of the raw counts per sequence [
<xref ref-type="bibr" rid="CR30">30</xref>
]. Thus, the
<italic>β</italic>
coefficients of the logistic regression can be interpreted as weights of the contribution of each
<italic>k</italic>
-mer to the classifier decision and of its enrichment in regulatory and random regions. By contrast, the “vector-
<italic>k</italic>
-mers” captures information from the
<italic>k</italic>
-mer co-occurrences by training a shallow NN that learns the probability for each
<italic>k</italic>
-mer given its context (window = 5). The output is n-dimensional vectors
<italic>v</italic>
<sub>
<italic>k</italic>
-mer</sub>
– one per
<italic>k</italic>
-mer - independently generated for regulatory regions and their respective control (
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
) to denote different geometric spaces containing
<italic>v</italic>
<sub>
<italic>k</italic>
-mer</sub>
. Next,
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
are utilized to determine the likelihood of groups of
<italic>k</italic>
-mers being observed in regulatory or control regions [
<xref ref-type="bibr" rid="CR32">32</xref>
,
<xref ref-type="bibr" rid="CR33">33</xref>
]. Put together, these two models aim to learn the importance of key sequence features and sequence feature relationships as descriptors of regulatory architecture.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Schematic of the steps to generate “bag-of-
<italic>k</italic>
-mers” and “vector-
<italic>k</italic>
-mers” models. The workflow shows the steps from data preprocessing to model output. We fitted “bag-of-
<italic>k</italic>
-mers” and “vector-
<italic>k</italic>
-mers” models for
<italic>k</italic>
values between 5 and 10 bp (within the common range in which regulatory elements have been observed). Training and evaluation of both methods were performed on the same portions of the data to facilitate comparisons. The common pre-processing step involved collapsing complementary
<italic>k</italic>
-mers as the same token to reduce the noise of
<italic>k</italic>
-mer counts and the effective vocabulary for feature selection. The final outputs are both the classifiers and learned features</p>
</caption>
<graphic xlink:href="12870_2019_1693_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
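<p>The pre-processing and weighting steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors’ implementation: the helper names (canonical, tokenize, tfidf) are hypothetical, “complementary” is assumed to mean the reverse complement, and the plain tf × log(N/df) weighting is one common TF*IDF variant that may differ from the one used in the paper. The resulting per-sequence weights are the kind of features a logistic regression could then be fitted on.</p>

```python
import math
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Collapse a k-mer and its reverse complement into a single token
    (assumption: the lexicographically smaller of the two is kept)."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def tokenize(seq, k):
    """Slide a 1-bp window over seq and emit canonical k-mers,
    skipping any window that contains a non-ACGT character."""
    seq = seq.upper()
    tokens = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer).issubset("ACGT"):
            tokens.append(canonical(kmer))
    return tokens


def tfidf(sequences, k):
    """Return one {k-mer: weight} dict per sequence, where
    weight = term frequency * log(N / document frequency)."""
    counts = [Counter(tokenize(seq, k)) for seq in sequences]
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            kmer: (n / total) * math.log(n_docs / doc_freq[kmer])
            for kmer, n in c.items()
        })
    return weights
```

<p>With this weighting, a <italic>k</italic>-mer that occurs in every sequence receives a weight of zero, which is how the transformation down-weights ubiquitous tokens relative to raw counts.</p>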
<p>We chose to compare our models against a “motif” collection approach. For this, we used the MEME-ChIP pipeline [
<xref ref-type="bibr" rid="CR10">10</xref>
]. In brief, MEME-ChIP combines several of the most popular algorithms of the MEME suite to generate PWMs (
<italic>de novo</italic>
) in a discriminative mode using the sequences in the training set. MEME-ChIP also scan sequences against a motif database from Arabidopsis [
<xref ref-type="bibr" rid="CR8">8</xref>
]. The goal of this analysis was to obtain PWMs capable of differentiating between regulatory regions and controls, as a contrast to our models. We obtained five collections of PWMs, one for each type of regulatory region, and used them to scan the corresponding holdout sets.</p>
<p>Model performance was measured with several metrics: (1) accuracy, precision, and recall (See “
<xref rid="Sec10" ref-type="sec">Methods</xref>
” section and Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S2); in addition, (2) the receiver operating characteristic curve and the precision-recall curve were plotted, and (3) the area under each curve was computed (auROC, auPRC) (Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
a-b and Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figures S1). First, models were evaluated on balanced holdout sets (i.e., the same number of regulatory and random sequences). The two models perform similarly well, with average accuracy ∼90% and an average difference in accuracy of ∼3% between the two models. Overall, the “bag-of-
<italic>k</italic>
-mers” model performs better in most cases, with the “vector-k-mers” models slightly outperforming it when
<italic>k</italic>
is small (
<italic>k</italic>
=5 and
<italic>k</italic>
=6) and training datasets are large (e.g., MNA-seq - shoot, root) (Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S2). The collection of PWMs as an alternative classifier underperformed against all the models, in all the combinations of
<italic>k</italic>
-size and regulatory regions. Overall, PWMs appear to work better for the identification of TFBSs from TF ChIP-seq data, and for core promoter, than for the open chromatin regions (MNA-seq data) (Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S2), which is expected given that enrichment of a single or a few motifs is usually the hallmark of TFs. The performance of the “bag-of-
<italic>k</italic>
-mers” models was reliable even at
<italic>k</italic>
≥8, as opposed to similar approaches that rely on raw
<italic>k</italic>
-mer counts as features to train machine learning classifiers [
<xref ref-type="bibr" rid="CR22">22</xref>
,
<xref ref-type="bibr" rid="CR39">39</xref>
]. The above suggests that the TF*IDF transformation is effective in alleviating some of the noise inherent in the matrix sparsity that increases with
<italic>k</italic>
.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Comparison of precision-recall curves between models. Model performance under balanced (
<bold>a</bold>
-
<bold>b</bold>
) and unbalanced holdout sets (
<bold>c</bold>
-
<bold>d</bold>
). For each model (
<italic>k</italic>
=8), the precision-recall (PR) curves for all the regulatory datasets are shown, along with the corresponding curves for classification of the same holdout sets with a collection of PWMs (dotted lines). The PR curve shows the trade-off between precision and recall for different decision thresholds. A high area under the curve represents both high recall (low false negative rate) and high precision (low false positive rate)</p>
</caption>
<graphic xlink:href="12870_2019_1693_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p>To increase the stringency of our evaluation criteria, we measured each model’s performance with unbalanced holdout sets in which regulatory regions are outnumbered by random regions at a ratio of 1 to 10 (Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
c-d and Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figures S1C-D). Scaling up the number of random regions did not appreciably change accuracy and auROC values, but the auPRC showed a drop in model performance as the number of false positives increased. At
<italic>k</italic>
=8, both models have a desirable precision, ∼70–80%, recovering ∼60% of the relevant regions (i.e., recall rate) for open chromatin and core promoter datasets. The “bag-of-
<italic>k</italic>
-mers” model works better for prediction of TF binding loci than the “vector-
<italic>k</italic>
-mers”, with the latter displaying an excess of false positives at our target recall rate (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figures S2). Under this more stringent test, the PWM collections underperformed against all the other models at any given
<italic>k</italic>
, as a consequence of an increase in the number of false positives. The performance measurement under an unbalanced set suggests that applying extra stringency to the predicted probability, thereby allowing the recovery of ∼60% of the relevant sequences, would result in an acceptable trade-off between sensitivity and specificity for most of the models when non-regulatory regions are in large numbers.</p>
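<p>A short worked example may help explain why auROC is insensitive to class imbalance while precision is not. The sketch below is illustrative only: the function name and the false positive rate of 0.02 are assumptions chosen for the arithmetic, not values reported in this study.</p>

```python
def precision_at(recall, false_positive_rate, n_regulatory, n_random):
    """Precision implied by a fixed recall and false positive rate
    for a holdout set with the given class sizes."""
    true_positives = recall * n_regulatory
    false_positives = false_positive_rate * n_random
    return true_positives / (true_positives + false_positives)


# Same classifier behaviour (recall 0.60, hypothetical FPR 0.02),
# evaluated on a balanced and on a 1:10 unbalanced holdout set.
balanced = precision_at(0.60, 0.02, 1000, 1000)
unbalanced = precision_at(0.60, 0.02, 1000, 10000)
print(round(balanced, 3), round(unbalanced, 3))
```

<p>Recall and false positive rate, which define the ROC curve, are computed within each class and therefore do not change with the class ratio. Precision mixes both classes, so under these assumed rates it falls from roughly 0.97 to 0.75 when random regions outnumber regulatory regions ten to one.</p>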
<p>Highly repetitive genomes include an abundance of low-complexity regions. These repetitive regions are expected to carry little information for regulation, and because of their high frequency, they represent an obstacle to identifying the key elements from raw
<italic>k</italic>
-mer counts. To empirically determine a complexity threshold for
<italic>k</italic>
-mers unlikely to have a regulatory role, we examined a collection of regulatory motifs and calculated complexity (as measured with Shannon entropy) for the consensus sequences (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S3). Using this threshold,
<italic>k</italic>
-mers with low complexity were filtered out to build “bag-of-
<italic>k</italic>
-mers” models with a reduced vocabulary (filtered), and contrasted against models using the whole vocabulary (full). The difference between the two models at a base pair level is illustrated for the
<italic>ga2ox1</italic>
first intron recognized by KN1 [
<xref ref-type="bibr" rid="CR34">34</xref>
,
<xref ref-type="bibr" rid="CR40">40</xref>
]. We observed that low complexity regions overlapped with
<italic>k</italic>
-mers that have a high score from the model trained on the full
<italic>k</italic>
-mer vocabulary (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
a). This is different from the filtered model, which appears to be in agreement with the ChIP-seq data (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S4). To evaluate the importance of these repetitive sequences in recognizing the regulatory regions, we compared the models with and without low complexity
<italic>k</italic>
-mers using an unbalanced holdout set and found that both models show almost identical performance for the auROC and non-significant differences for the auPRC (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
b-c, Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S5). This suggests that in general, low complexity
<italic>k</italic>
-mers in maize do not contribute substantially to the regulatory message. However, for scaling across the genome, controlling for repetitive sequences would be critical for prediction performance and for the extraction of key k-mers that are not frequency-biased.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Low complexity regions do not provide relevant information to discriminate regulatory regions.
<bold>a</bold>
Annotation at a base pair level of the first 1000 base pairs of the long intron in the maize gene
<italic>ga2ox1</italic>
using sequence complexity (Entropy), scores from “bag-of-
<italic>k</italic>
-mers” models (Full and Filtered), and regulatory probabilities (Probability) from the “vector-
<italic>k</italic>
-mers” model. Sequence complexity and “bag-of-
<italic>k</italic>
-mers” scores were calculated using a 1bp sliding window of size
<italic>k</italic>
. Regulatory probabilities were calculated using a 1bp sliding window of 3*
<italic>k</italic>
to evaluate co-occurrence of groups of 3
<italic>k</italic>
-mers. The evaluated region includes the KN1 ChIP-seq peaks as identified from two biological replicates in developing ears (the center of the peak for each replicate is indicated with a vertical dotted line).
<bold>b</bold>
Across all the models tested with the KN1 unbalanced holdout set, the area under the PR curve is highest for the “bag-of-k-mers” models at
<italic>k</italic>
=8.
<bold>c</bold>
The PR curve for (
<italic>k</italic>
=8) “bag-of-
<italic>k</italic>
-mers” filtering low-complexity k-mers shows performance similar to that of the full 8-mer vocabulary across all the regulatory regions for different decision thresholds</p>
</caption>
<graphic xlink:href="12870_2019_1693_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
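<p>The complexity filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: complexity is taken to be the Shannon entropy (in bits) of the base composition of each <italic>k</italic>-mer, and the function names and the min_entropy parameter are hypothetical. The threshold actually used in the study was derived empirically from known regulatory motifs and is not reproduced here.</p>

```python
import math
from collections import Counter


def shannon_entropy(kmer):
    """Shannon entropy (bits) of the base composition of a k-mer."""
    n = len(kmer)
    counts = Counter(kmer)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def filter_vocabulary(kmers, min_entropy):
    """Keep only the k-mers whose entropy reaches the chosen threshold
    (min_entropy is a placeholder, not the value used in the paper)."""
    return [kmer for kmer in kmers if shannon_entropy(kmer) >= min_entropy]
```

<p>Under this definition a homopolymer such as AAAAAAAA has an entropy of 0 bits, a dinucleotide repeat such as ATATATAT has 1 bit, and a <italic>k</italic>-mer using all four bases equally has 2 bits, so a threshold between these values removes the simplest repeats from the vocabulary.</p>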
</sec>
<sec id="Sec4">
<title>Models to predict regulatory regions are scalable to the genome-wide space</title>
<p>Under the assumption that annotation of non-coding regions would be part of general pipelines, in which ∼85% of the genome should be recognized as repeats and ∼5% as coding sequences, our models for annotating regulatory regions should be limited to ∼10% of the space. Still, it is a challenge to accurately predict a regulatory region, using a model that was trained on artificially balanced data, in a context that might harbor similar sequence composition while being surrounded by repetitive elements. To gain insights into the behavior of the models at a genome-wide scale, the sequence of chromosome 10 was partitioned into 1,943,698 regions (300 base pairs long), and 115,149 regions that were neither repeats nor coding sequences were selected to be annotated. We used models derived from MNA-seq shoot data, applying different levels of stringency for the predicted probabilities (Additional file 
<xref rid="MOESM4" ref-type="media">4</xref>
: Table S3). According to the results obtained with the unbalanced holdout set, and in order to balance sensitivity and specificity, we determined that the ideal predicted probability cut-off was the one that captures ∼60% of the regions that overlap with the annotated regulatory regions. Under this criterion, the “bag-of-
<italic>k</italic>
-mers” (
<italic>k</italic>
=8, filtered, probability ≥0.85) and the “vector-
<italic>k</italic>
-mers” models (probability ≥0.95) predicted 38,945 and 41,932 regulatory regions, respectively. The high-confidence regions classified as regulatory correspond to ∼2.2–2.3% of the total regions from chromosome 10, in line with the expected portion of the genome with a regulatory function.</p>
<p>Next, we aimed to annotate the genome of ZmW22, a maize inbred line whose genome was recently made public [
<xref ref-type="bibr" rid="CR41">41</xref>
]. To do so, we chose to annotate the ZmW22 genome using the MNA-seq shoot models, as open chromatin regions are usually a collection of all the regulatory regions in the genome, including promoters and TFBSs. To get a set of “ground truths” to evaluate our results, we aligned ZmB73 MNA-seq regions to the ZmW22 genome and scored windows around the alignment hits with our models. This test allowed us to determine how frequently the models were able to recognize a “candidate regulatory region” in its local context, without masking the genome. This analysis evaluated regulatory versus non-regulatory regions at a ratio of 3:20, more than twice that of the previously presented analysis with the unbalanced holdout set.</p>
<p>Following the observations made on chromosome 10 of ZmB73, we first used the “bag-of-k-mers” model (filtered, probability ≥0.85) to obtain the “candidate regulatory regions”. We then used the “vector-k-mers” model to compute similarity distances between the candidate regulatory regions and the ZmB73 MNA-seq regions, summarizing each region by its vector centroid. The combined top prediction around each of the “ground truths” intersected the alignment hit in ∼70% of the cases. Allowing up to three top predictions around each hit increased this to ∼77% of the cases.</p>
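<p>One way to picture the centroid comparison is sketched below. The text does not state which distance was used, so the cosine distance here is an assumption, as are the function names; each region is represented as a list of k-mer embedding vectors of equal length.</p>

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors
    (assumed metric; the paper does not name one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def region_distance(candidate_vectors, reference_vectors):
    """Distance between two regions, each summarized by the centroid
    of the embedding vectors of its k-mers."""
    return cosine_distance(centroid(candidate_vectors),
                           centroid(reference_vectors))
```

<p>Candidate windows around an alignment hit could then be ranked by this distance to the corresponding reference region, with smaller values indicating closer agreement.</p>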
</sec>
<sec id="Sec5">
<title>Models trained in maize can be used to inspect the regulatory space in related species</title>
<p>Transference of functional genomic annotations across diverse maize lines requires models that can preferentially capture conserved features (those common between lines or related species). Consistently, we expect that models that are accurate in related species should also perform well in different maize lines. To gain insights into this, we evaluated models trained on TF binding loci and core promoters in two species (sorghum and rice). In order to determine positional preferences among binding loci, we built peak meta-profiles that summarized KN1 models’ performance in maize and rice at the base-pair level (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
a-b). The “bag-of-
<italic>k</italic>
-mers” model can differentiate between regulatory regions and their control in maize, and in addition can distinguish rice KN1-like (i.e., OSH1) binding sites (i.e., peaks from rice OSH1 ChIP-seq data [
<xref ref-type="bibr" rid="CR42">42</xref>
]). On the other hand, the “vector-
<italic>k</italic>
-mers” cannot differentiate between random regions and regulatory regions in rice, predicting random as regulatory (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S6A). Interestingly, the distributions of regulatory probabilities for random and regulatory regions are noticeably different (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S6B), suggesting that the “vector-
<italic>k</italic>
-mers” model distinguishes between OSH1 peaks and control regions, but not enough to assign greater non-regulatory probability to random regions. In maize, the “bag-of-
<italic>k</italic>
-mers” model (filtered) shows a slight preference towards the midpoint region versus the edges, while the “vector-
<italic>k</italic>
-mers” recognizes the whole region without preference for the middle (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
a). In rice, the “bag-of-
<italic>k</italic>
-mers” shows a marked preference near or at the peak midpoint over the flanking regions (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
b). This suggests that the “bag-of-
<italic>k</italic>
-mers” captures a diverse array of features which are enriched at the center of the peak and beyond in maize. However, only the key features that are enriched at the center of the peak appear to be conserved between the two species.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Prediction of Core Promoter Regions and TF Binding Loci Across Species.
<bold>a</bold>
<italic>k</italic>
-mer weights derived from a “bag-of-
<italic>k</italic>
-mers” model trained in maize (Zm) KN1 regions were used for the base-pair annotation of KN1 binding loci (blue) and control regions (red).
<bold>b</bold>
The same weights were used to annotate regions from rice, and to differentiate regions targeted by OSH1 (the functional orthologue of KN1) [
<xref ref-type="bibr" rid="CR42">42</xref>
] (blue) from control regions (red).
<bold>c</bold>
A “bag-of-
<italic>k</italic>
-mers” model trained in maize core promoters (Zm) and tested in maize.
<bold>d</bold>
The model trained in maize core promoters was tested in sorghum (Sb) for a set of 1000 randomly chosen core promoter regions and their respective controls</p>
</caption>
<graphic xlink:href="12870_2019_1693_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>For the evaluation of models trained on core promoters we used a balanced holdout set derived from a random sample of sorghum annotated gene models. The positional preferences in core promoters in maize are evident from average
<italic>k</italic>
-mers weights around the +30 region, in which a TATA-box is expected (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
c). The same is not observed in Sorghum (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
d). This likely results from the biased sample of TSSs in maize that have a high proportion of TATA+ promoters, even when TATA-less promoters are the majority [
<xref ref-type="bibr" rid="CR38">38</xref>
]. A positional analysis using the “vector-
<italic>k</italic>
-mers” models did not reveal local enrichment along the sorghum promoter sequences. Yet, the probability scores are again different between control sequences and core promoter sequences. The difficulty of the model in identifying control regions might be a consequence of the strong differences in the repeat landscape of the non-coding regions between sorghum and maize, which are not captured in the maize training set, rather than a lack of similarities between the regulatory regions of the two species. Taken together, we have shown that classifiers trained in maize can be useful to predict regulatory regions in sorghum and rice, and that features enriched in maize regulatory regions and in the random genomic space (as captured by the models) are of two general types: (1) maize specific and (2) conserved across related species.</p>
</sec>
<sec id="Sec6">
<title>Scored vocabularies highlight signatures of regulatory function</title>
<p>The methods proposed here were chosen because of the interpretability of the learned features, aiming to better understand the patterns in sequence that characterize regulatory regions. Thus, we focused on scored
<italic>k</italic>
-mer vocabularies (
<italic>k</italic>
=8, filtered) as the easiest to interpret, and systematically analyzed the tails of the score distribution, as they concentrate the most informative sequences. In these vocabularies, the largest positive coefficient values (top scored
<italic>k</italic>
-mers) are indicative of enrichment and the largest negative values (bottom scored
<italic>k</italic>
-mers) of depletion in regulatory regions. The absolute values on the two sides of the score distribution differ, with positive scores being larger than negative ones, meaning that the models’ predictions result mainly from identifying those
<italic>k</italic>
-mers that are enriched in regulatory regions rather than depleted ones (or enriched in random regions). We found that properties of the scored
<italic>k</italic>
-mers obtained from applying an out-of-the-box NLP technique [
<xref ref-type="bibr" rid="CR32">32</xref>
] are similar to those previously described with sequence kernels developed to analyze vertebrate genomic data [
<xref ref-type="bibr" rid="CR22">22</xref>
–
<xref ref-type="bibr" rid="CR24">24</xref>
].</p>
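<p>As an illustration of how the tails of a scored vocabulary can be isolated, the following minimal Python sketch splits a mapping of k-mers to coefficients into the top 1%, the bottom 1% and the remaining 98%. The function name score_tails and the toy scores are hypothetical; they are not taken from the code used in this study.</p>

```python
def score_tails(scores, percent=1):
    """Split scored k-mers into (top tail, bottom tail, remainder).

    `scores` maps each k-mer to its model coefficient (hypothetical input).
    """
    # Rank k-mers from the most positive to the most negative score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Integer arithmetic for the tail size; keep at least one k-mer per tail.
    n_tail = max(1, len(ranked) * percent // 100)
    top = ranked[:n_tail]
    bottom = ranked[-n_tail:]
    rest = ranked[n_tail:-n_tail]
    return top, bottom, rest


# Toy vocabulary of 200 made-up tokens with scores from -100 to 99.
toy_scores = {"kmer%03d" % i: i - 100 for i in range(200)}
top, bottom, rest = score_tails(toy_scores)
```

<p>In this sketch the top tail holds the most positive coefficients (candidate enriched k-mers) and the bottom tail the most negative ones (candidate depleted k-mers).</p>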
<p>We observed a bias in the G+C content at the extremes of the score distribution for core promoters (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a) and to a lesser extent for open chromatin regions (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
b-c). The top 1% shows a bimodal distribution, in which a subpopulation of
<italic>k</italic>
-mers exhibits low G+C content, in contrast to the bottom 1% and the remaining 98%. Conversely, the score distribution for TF binding loci shows a general shift of the top and bottom tails towards higher G+C content in comparison to the remaining 98% (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S7). These results are in agreement with known roles for high A+T sequences within core promoters related to the TATA elements and high G+C sequences as TF binding sites [
<xref ref-type="bibr" rid="CR38">38</xref>
,
<xref ref-type="bibr" rid="CR43">43</xref>
]. Indeed, when investigated, individual
<italic>k</italic>
-mers with high A+T content were positionally restricted upstream of the TSS and preferentially on the region defined for the TATA element in maize (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
d).
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>G+C Content Bias is Related to Positional Constraints within Promoters and Open Chromatin Regions. Comparison of the distribution of G+C content across top 1%, bottom 1% and remaining 98% of scored
<italic>k</italic>
-mer vocabularies (
<italic>k</italic>
=8, filtered) for
<bold>a</bold>
core promoters, MNA-seq
<bold>b</bold>
shoot and
<bold>c</bold>
root model’s results. The positional constraints of
<italic>k</italic>
-mers with high A+T content in the top 1% are visualized as k-mer density with respect to a reference point:
<bold>d</bold>
TSS for core promoters and MNA-seq hotspot middle point for open chromatin regions
<bold>e</bold>
shoot and
<bold>f</bold>
root (solid blue lines). Contrasting density plots are shown for corresponding random regions (dotted gray lines)</p>
</caption>
<graphic xlink:href="12870_2019_1693_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
<p>The enrichment of MNA-seq regions for
<italic>k</italic>
-mers with high A+T content (rich A+T
<italic>k</italic>
-mers) might be derived from signal co-localization between open chromatin regions and core promoters [
<xref ref-type="bibr" rid="CR3">3</xref>
]. If signal co-localization were sufficient to explain the similarities between open chromatin and core promoter regions, then controlling for distance to annotated genes should remove the signal from rich A+T
<italic>k</italic>
-mers in distal regions. Yet, after controlling for gene proximity (2 kb), the positional constraints remain in both proximal and distal regions (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
e-f). These rich A+T
<italic>k</italic>
-mers might be part of poly(dA:dT) tracts which can provide an increase in DNA rigidity and are known to be in proximity to regions that are enriched in TFBSs [
<xref ref-type="bibr" rid="CR44">44</xref>
]. In agreement with the positional restriction, rich A+T
<italic>k</italic>
-mers flank the midpoints where G+C content is high, as expected for the regions that are bound by TFs [
<xref ref-type="bibr" rid="CR43">43</xref>
], and where the signal for open chromatin regions is concentrated.</p>
<p>In addition to key structural tracts,
<italic>k</italic>
-mers with the largest positive values for each regulatory category are expected to be enriched for TF motifs. Because the number of experimentally verified maize motifs is limited, we contrasted the top 1% of positive scored
<italic>k</italic>
-mers against two large collections of TF motifs as identified from large scale experiments in the reference plant
<italic>Arabidopsis thaliana</italic>
(TOMTOM, p-value &lt;0.001) [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
] (Additional file 
<xref rid="MOESM5" ref-type="media">5</xref>
: Table S4). For the evaluated experiments we found that the top 1% of positive
<italic>k</italic>
-mers are ∼threefold more enriched for significant hits against the motif database than expected by chance for all the
<italic>k</italic>
-mers in the population. The enrichment for the top
<italic>k</italic>
-mers was statistically significant (hypergeometric test, p-value &lt;0.001). Further analyses revealed that
<italic>k</italic>
-mer scoring is consistent within families of TF binding sites. In particular, motifs preferentially hit by the top 1% of positive
<italic>k</italic>
-mers from FEA4 binding loci (a bZIP transcription factor) correspond to the bZIP/TGA-class, and motifs preferentially hit by
<italic>k</italic>
-mers enriched in KN1 (a Homeobox transcription factor) correspond to the Homeobox family (Additional file 
<xref rid="MOESM5" ref-type="media">5</xref>
: Table S4). Thus, the scored vocabularies produced a comprehensive catalog of
<italic>k</italic>
-mers with putative structural roles and a collection of
<italic>k</italic>
-mers similar to TFBSs that constitute signatures of the maize regulatory architecture.</p>
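<p>The hypergeometric enrichment test mentioned above can be sketched with the Python standard library alone. This is a minimal illustration, not the code used in this study; the function name hypergeom_sf, its parameter names and the counts in the example call are all hypothetical.</p>

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N k-mers are drawn without replacement from a
    vocabulary of M k-mers, n of which have a significant motif hit."""
    total = comb(M, N)
    upper = min(n, N)
    # Sum the probability mass from k hits up to the largest possible count.
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / total


# Made-up counts: a vocabulary of 1000 k-mers, 50 with motif hits,
# and a top tail of 10 k-mers that contains 3 of those hits.
p_value = hypergeom_sf(3, 1000, 50, 10)
```

<p>A small p-value indicates that the top-scored k-mers contain more motif-like sequences than expected if they were drawn at random from the vocabulary.</p>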
</sec>
<sec id="Sec7">
<title>Sequence similarity in the geometric space reveals a prevalent distinctive
<italic>k</italic>
-mer organization within regulatory regions</title>
<p>The set of highly enriched individually scored sequences, as output from “bag-of-
<italic>k</italic>
-mers” models, is likely to include groups of
<italic>k</italic>
-mers that correspond to the same motif, given the degeneracy of TF binding sites. However, the question arises of how to group
<italic>k</italic>
-mers that likely share functional roles and constitute single motifs. In NLP, word2vec is an effective method to extract linguistic regularities between words by considering the local context in which they occur (e.g., “apple” and “orange” might share local contexts as they are words with similar meanings) [
<xref ref-type="bibr" rid="CR45">45</xref>
]. Because vector position in each geometric space is determined from the composition of the local word/
<italic>k</italic>
-mer context (i.e., neighboring
<italic>k</italic>
-mers), we can assume that two k-mers that are close (i.e., close in cosine distance) to each other in a geometric space share local sequence similarity (Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
a). Therefore, we used the geometric spaces obtained from the “vector-
<italic>k</italic>
-mers” models, to extract
<italic>k</italic>
-mer regularities or
<italic>k</italic>
-mer organizational ’rules’ that differentially arise between regulatory and random regions. Because the positions of
<italic>k</italic>
-mers between geometric spaces cannot be directly contrasted, we compared the lists of closest
<italic>k</italic>
-mers for any given
<italic>k</italic>
-mer in the vocabulary as obtained from the geometric spaces for regulatory and random regions (respectively,
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
).
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Local Sequence Context Defines Distinctive Groups of
<italic>k</italic>
-mers Between Regulatory and Random Regions.
<bold>a</bold>
Schematic of aligning flanking contexts versus contrasting local sequence composition, as implemented in the “vector-
<italic>k</italic>
-mer” models in which
<italic>k</italic>
-mers that share a similar context would be represented by close vectors (
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
) in a geometric space.
<bold>b</bold>
The vector space obtained from core promoters (
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
) and their corresponding control (
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
) define two different groups of closest
<italic>k</italic>
-mers (cosine similarity) to the ’CTATATA’ vector (
<italic>v</italic>
<sub>
<italic>CTATATA</italic>
</sub>
). The group of closest
<italic>k</italic>
-mers in
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
, when compared to the group formed in
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
, are more similar in sequence (shorter edit distance), and have on average more positive
<italic>k</italic>
-mer scores from the equivalent “bag-of-
<italic>k</italic>
-mers” model. This implies a semantic-like relationship between those
<italic>k</italic>
-mers in regulatory sequences versus random regions.
<bold>c</bold>
The group of
<italic>k</italic>
-mers closest in the
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
space have similar positional preferences (blue solid lines) to CTATATA (black dotted line) in the region expected for the TATA element.
<bold>d</bold>
In addition, the group of
<italic>k</italic>
-mers closest in the
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
(red solid lines) does not show positional constraints similar to those of CTATATA (black dotted line) or positional preferences relative to the TSSs</p>
</caption>
<graphic xlink:href="12870_2019_1693_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
<p>To illustrate, we compared the representative vector of the
<italic>7</italic>
-mer
<italic>CTATATA</italic>
in
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
(i.e., set of
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
learned from core promoter regions) and in
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
(i.e., set of
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
learned from random regions used as controls for core promoters). Using
<italic>v</italic>
<sub>
<italic>CTATATA</italic>
</sub>
we obtained the set of top five closest
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
in
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and in
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
and found that
<italic>k</italic>
-mers from
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
share more sequence similarity (average edit distance 1.8 vs 4.2 respectively) and have, on average, more positive scores from the respective “bag-of-
<italic>k</italic>
-mers” model (1.49 vs 0.01) (Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
b). In addition,
<italic>k</italic>
-mers close to
<italic>v</italic>
<sub>
<italic>CTATATA</italic>
</sub>
in
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
share positional constraints that are not recovered from those related in
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
(Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
c-d). This example shows how the output of the geometric spaces can be exploited to determine groups of similar
<italic>k</italic>
-mers according to their context.</p>
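<p>The two quantities compared in this example, cosine similarity between vectors and edit distance between sequences, can be sketched as follows. This is a minimal standard-library illustration that assumes the Levenshtein definition of edit distance, which the text does not specify; the function names and the neighbor list in the example are hypothetical.</p>

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def mean_edit_distance(query, neighbors):
    """Average edit distance between a query k-mer and its neighbors."""
    return sum(edit_distance(query, n) for n in neighbors) / len(neighbors)


# Made-up neighbor list, for illustration only.
example = mean_edit_distance("CTATATA", ["CTATAAA", "TATATAA"])
```

<p>A lower mean edit distance among the nearest neighbors of a k-mer indicates that the vector space groups sequences that are also similar at the nucleotide level.</p>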
<p>To obtain a global view of how many
<italic>k</italic>
-mers are embedded in different local sequences between regulatory and random regions, we collected for any given
<italic>k</italic>
-mer (
<italic>k</italic>
=8) in the vocabulary, the list of the closest similar
<italic>k</italic>
-mers ranked by cosine similarity from
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
. Next, we contrasted the two ranked lists and determined which
<italic>k</italic>
-mers show the greatest dissimilarity between regulatory and random regions [
<xref ref-type="bibr" rid="CR46">46</xref>
]. In general, we found that low complexity
<italic>k</italic>
-mers do not show distinctive organizational ’rules’ between regulatory and random regions, reinforcing our view that short repetitive sequences are not important to define the identity of a sequence. We found that, in terms of the number of
<italic>k</italic>
-mers with different relationships between
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
, “vector-
<italic>k</italic>
-mers” models derived from TF binding loci (∼45%) and core promoter regions (∼30%) result in notably more differentially represented
<italic>k</italic>
-mers than models derived from open chromatin regions (∼5%). In all the cases, we observed a similar proportion of
<italic>k</italic>
-mers enriched and depleted in regulatory regions (as established from the “bag-of-
<italic>k</italic>
-mers” scores). The results from models trained on open chromatin regions might reflect the heterogeneity of these regions, which prevents the model from learning many specific
<italic>k</italic>
-mer vectors. However, the fact that the classifiers work with great accuracy indicates that even when the differences are less pronounced than for TF binding loci and core promoter regions, they are large enough to distinguish between an open chromatin region and its control.</p>
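<p>One simple way to contrast the neighbor lists of a k-mer in two vector spaces is sketched below. It uses the Jaccard overlap of the top-n neighbor sets, which ignores rank order; this is a stand-in chosen for brevity, not the ranked-list measure cited above. The function names and the toy two-dimensional vectors are hypothetical.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def nearest(space, query, n=5):
    """The n k-mers closest to `query` by cosine similarity in one space."""
    sims = {k: cosine(space[query], vec) for k, vec in space.items() if k != query}
    return sorted(sims, key=sims.get, reverse=True)[:n]


def neighbor_overlap(space_a, space_b, query, n=5):
    """Jaccard overlap of the two neighbor sets (1.0 means identical sets)."""
    a = set(nearest(space_a, query, n))
    b = set(nearest(space_b, query, n))
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0


# Toy 2-dimensional spaces with made-up vectors.
v_regulatory = {"AAAC": [1.0, 0.0], "AAAG": [0.9, 0.1], "CCCG": [0.0, 1.0]}
v_random = {"AAAC": [1.0, 0.0], "AAAG": [0.0, 1.0], "CCCG": [0.9, 0.1]}
overlap = neighbor_overlap(v_regulatory, v_random, "AAAC", n=1)
```

<p>Under this simplification, k-mers with low overlap between the two spaces are candidates for having a distinct local context in regulatory regions.</p>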
<p>We integrated the information obtained from the “bag-of-
<italic>k</italic>
-mers” and the “vector-
<italic>k</italic>
-mers” models and found that for the top 1% of the
<italic>k</italic>
-mers that are enriched in frequency in regulatory regions there is little overlap between
<italic>k</italic>
-mers that resemble motifs and
<italic>k</italic>
-mers that show differential relationships between regulatory regions and random regions. For instance, from the FEA4 models, only 10 out of 103
<italic>k</italic>
-mers, that are statistically similar to Arabidopsis motifs, show differential
<italic>k</italic>
-mer relationships between regulatory and random regions. Such a difference might derive from the proportion of TFBSs that are not similar between maize and Arabidopsis
<italic>cis</italic>
-regulatory elements. In summary, we have compiled a regulatory vocabulary that includes a proportion of key k-mers that are enriched in regulatory regions and (1) resemble known motifs, and (2) are embedded in a specific regulatory context.</p>
</sec>
</sec>
<sec id="Sec8" sec-type="discussion">
<title>Discussion</title>
<p>The decreased cost of large-scale genotyping and genome assemblies for crops such as maize and related species has already shown potential to accelerate the breeding process by linking sequence and structural variation to phenotype [
<xref ref-type="bibr" rid="CR47">47</xref>
]. A majority of functional genetic variation that is important to phenotype is located in the non-coding regions of the genome. This variation is largely untapped because recognizing functional alleles in the non-coding regions of the genome is both expensive and laborious. In humans and other metazoan models, non-coding annotation that allows identification of functional genetic variation has been accelerated over the last decade using two types of analyses: (1) functional analysis from large collections of biochemical assays; and (2) comparative sequence analysis between reference genomes of closely related species [
<xref ref-type="bibr" rid="CR48">48</xref>
]. Yet, in maize, these two types of analyses are particularly challenging. Large collections of biochemical assays remain prohibitive at the scale necessary to cover maize diversity, which is 20 times more than the diversity found in humans [
<xref ref-type="bibr" rid="CR49">49</xref>
]. In addition, comparative sequence analysis requires genome alignment between closely related species, which for maize and its relatives is complicated by the presence of a large number of repetitive sequences in the genome.</p>
<p>In this study, we introduce a computational framework consisting of two types of machine learning models that can accurately distinguish regulatory regions obtained from functional genomic experiments from random genomic regions. These approaches were borrowed from the fields of natural language processing and information retrieval, and were explicitly chosen to overcome the challenges of annotating intergenic regions in maize. To address highly repetitive sequences and the role of low-complexity regions in maize non-coding regions, the “bag-of-
<italic>k</italic>
-mers” model relies on first filtering out
<italic>k</italic>
-mers with low complexity, and then using a sub-linear function to transform raw
<italic>k</italic>
-mer frequencies to down-weight
<italic>k</italic>
-mers that are too frequently observed in a group of sequences and in consequence have less power to discriminate between regulatory and non-regulatory regions. In parallel, the “vector-
<italic>k</italic>
-mers” model learns local
<italic>k</italic>
-mer organization from
<italic>k</italic>
-mer co-occurrence frequencies, which in practice results in a geometric space that allows alignment-free comparisons between sequences [
<xref ref-type="bibr" rid="CR50">50</xref>
]. The simultaneous use of two different approaches adds robustness to the predicted annotations, allowing researchers to contrast or to combine the results of the two types of models.</p>
<p>In most functional genomics experiments, the expectation is to identify rare instances of a biochemical event (e.g., the locations in the genome in which the chromatin is accessible for enzymatic digestion) versus thousands of instances that represent noise. Learning from imbalanced data occurs frequently in machine learning applications. However, in machine learning, rare instances (in our case regulatory regions) tend to be treated as noise. Thus, training with the true genomic ratio of regulatory:non-regulatory regions will cause the models to learn non-regulatory features over regulatory ones. In the maize genome, non-regulatory features will be the ones that characterize the most abundant class of repeats. On the other hand, training on re-sampled data (balancing the ratio of regulatory:non-regulatory regions) generates models that expect a distribution of instances that strongly differs from the genomic distribution of events. We decided to pose the problem in a way that the models could learn features from regulatory regions. Next, we used a series of evaluations with “real-world” constraints to adjust the probability cut-offs at which the models’ predictions are still reliable while taking care of the excess of false positives. We show that the adjustment of the probabilities
<italic>a posteriori</italic>
and the combined use of the two models allow us to “transfer” annotations from ZmB73 to ZmW22 with reasonable precision.</p>
<p>Because both models are amenable to interpretation, examination of the learned features offers novel insights about key sequence characteristics that can help to build mechanistic hypotheses to be tested at the molecular level, and allows comparison of regulatory programs under the same framework. For instance, both types of models suggest that low complexity
<italic>k</italic>
-mers are not important for regulatory regions in maize. The comparative use of the models shows that TFBSs (i.e., FEA4 and KN1) are better predicted with the bag-of-
<italic>k</italic>
-mers. Also, through modeling MNA-seq data we found that open chromatin regions in maize are characteristically organized within poly(dA:dT) tracts flanking G+C rich
<italic>k</italic>
-mers resembling motifs (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a-b). Likewise, from modeling maize KN1 ChIP-seq data and further annotation of regions bound by OSH1, we determined conservation at the center of binding loci for the key individual
<italic>k</italic>
-mers (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
b) and poor conservation in the pattern of
<italic>k</italic>
-mer co-occurrences (Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S6A). These results suggest that, though the non-coding regions change rapidly across species, the use of sequence models allows alignment-free comparisons to determine regulatory features that are conserved across millions of years of evolution.</p>
</sec>
<sec id="Sec9" sec-type="conclusion">
<title>Conclusions</title>
<p>Taken together, our framework can be used beyond the transfer of regional annotations, as it can easily be extended to evaluate
<italic>in silico</italic>
, the putative effect of sequence variation (i.e., SNPs, single nucleotide polymorphisms) in regulatory function from the differences in k-mer scores and regulatory probabilities for small groups of
<italic>k</italic>
-mers.</p>
<p>This work opens many avenues for improving models by adding relevant layers of information. Possible layers to add include: predictions of the 3D structure of regulatory regions, joint modeling of functional genomic data spanning the range of maize diversity to identify general patterns for relevant phenotypes, or even extended across species to build more generalizable models that capture the most conserved features. Furthermore, we expect these annotations to be useful as priors to improve marker assisted technologies such as genomic selection to purge deleterious non-coding sequence variation and to identify targets for genome editing contributing to gene expression dysregulation.</p>
</sec>
<sec id="Sec10">
<title>Methods</title>
<sec id="Sec11">
<title>Definition of maize regulatory regions</title>
<p>In the analyses presented throughout this study, we used data sets derived from different functional genomic experiments and obtained from the reference genome (ZmB73 AGPv3, chromosomes 1 to 10) [
<xref ref-type="bibr" rid="CR51">51</xref>
]. We included in the analysis open chromatin regions in shoot and roots derived from MNA-seq data [
<xref ref-type="bibr" rid="CR3">3</xref>
]; binding loci for KNOTTED 1 (KN1) and FASCIATED EAR 4 (FEA4) transcription factors from ChIP-seq data [
<xref ref-type="bibr" rid="CR34">34</xref>
,
<xref ref-type="bibr" rid="CR35">35</xref>
], and promoter regions [
<xref ref-type="bibr" rid="CR36">36</xref>
–
<xref ref-type="bibr" rid="CR38">38</xref>
] from the intersection of TSSs obtained with CAGE and FLcDNAs (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1). For MNA-seq hotspots and ChIP-seq binding loci, we collected sequences of 300 base pairs in length symmetrically surrounding the midpoints of the originally defined regions. Similarly, for core promoters, we selected the region from -250 to +50 base pairs around the TSSs. Each group of regulatory regions was randomly divided into training and holdout sets, with the holdout sets reserved for further analyses. Training and testing were performed independently for each type of regulatory region.</p>
<p>To randomly select control regions, we searched in the vicinity (at most a 100 kb window) of a given regulatory region for a control region that has matching G+C content and does not overlap any other regulatory region; if no match was found, we removed the vicinity criterion and searched for a G+C-matching region on the same chromosome. For the holdout sets, we built balanced and unbalanced sets by randomly selecting one and ten control regions, respectively, for each regulatory region.</p>
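<p>The control selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure as implemented for this study: candidate regions are taken at fixed, non-overlapping offsets, the G+C tolerance tol is an assumed parameter, and all function and argument names are hypothetical.</p>

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_control(chrom, start, length, regulatory, window=100_000, tol=0.02, seed=0):
    """Return the start of a G+C-matched control region, or None.

    `regulatory` lists (start, end) intervals that controls must avoid;
    `tol` (allowed G+C difference) is an assumed parameter.
    """
    target = gc_content(chrom[start:start + length])
    rng = random.Random(seed)
    # First search the vicinity, then fall back to the whole chromosome.
    for span in (window, len(chrom)):
        lo = max(0, start - span)
        hi = min(len(chrom) - length, start + span)
        candidates = list(range(lo, hi + 1, length))
        rng.shuffle(candidates)
        for pos in candidates:
            # Two intervals overlap when the later start precedes the earlier end.
            overlaps = any(
                min(pos + length, r_end) > max(pos, r_start)
                for r_start, r_end in regulatory
            )
            if not overlaps and tol >= abs(gc_content(chrom[pos:pos + length]) - target):
                return pos
    return None


# Toy chromosome with uniform G+C content; the regulatory region is at 0-20.
toy_chrom = "ACGT" * 50
control_start = pick_control(toy_chrom, 0, 20, [(0, 20)])
```

<p>Repeating the call with different seeds, and adding each selected control to the list of intervals to avoid, would give the ten controls per regulatory region used for the unbalanced holdout sets.</p>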
</sec>
<sec id="Sec12">
<title>Definition of grasses regulatory regions</title>
<p>Sorghum (
<italic>Sorghum bicolor</italic>
) core promoter regions were obtained from the reference genome (v2.1) [
<xref ref-type="bibr" rid="CR52">52</xref>
] for the coordinates between -250;+50 base pairs surrounding the start position of genes with annotated 5’UTR and a subset of 1000 sequences randomly selected for further analyses. Rice (
<italic>Oryza sativa Nipponbare</italic>
) KNOTTED 1-like (i.e., OSH1) binding regions were obtained from re-analyzing ChIP-seq experiment starting with the download of raw data from DDBJ (
<ext-link ext-link-type="uri" xlink:href="http://www.ddbj.nig.ac.jp/">http://www.ddbj.nig.ac.jp/</ext-link>
) (accession numbers DRA000206 and DR000313) corresponding to two biological replicates of immunoprecipitation with
<italic>α</italic>
-OSH1 and IgG antibodies [
<xref ref-type="bibr" rid="CR42">42</xref>
]. Raw reads were mapped against the rice reference genome (IRGSP-1.0 [
<xref ref-type="bibr" rid="CR52">52</xref>
]), using bowtie v1.1.2 (options -n 2, -l 60, -X 500, --best, --strata, -m 1) [
<xref ref-type="bibr" rid="CR53">53</xref>
] and low quality and duplicated reads were removed using picard (
<ext-link ext-link-type="uri" xlink:href="http://broadinstitute.github.io/picard/">http://broadinstitute.github.io/picard/</ext-link>
) (MarkDuplicates) and samtools (options -F 780, -F 1024, -f 2) [
<xref ref-type="bibr" rid="CR54">54</xref>
]. MACS v2.1.0 [
<xref ref-type="bibr" rid="CR55">55</xref>
] was used for peak calling (options -g 3.73e8, -q 0.01) for each of the replicates, and 42 peaks with a reproducible absolute summit were retained and further extended to 300 base pairs for downstream analyses. Corresponding control regions were obtained as explained above for maize. Briefly, each reference genome was divided into windows and, after removal of sequences overlapping the putative regulatory regions, we randomly selected sequences matching G+C content and, when possible, in the vicinity (∼10 kb) of each of the regulatory sequences.</p>
</sec>
<sec id="Sec13">
<title>Preprocessing of sequences</title>
<p>Sequences were preprocessed before fitting models. The preprocessing for the “bag-of-
<italic>k</italic>
-mers” model involves dividing each sequence into overlapping windows, sliding by 1 base pair, of a given size
<italic>k</italic>
(
<italic>k</italic>
-mers), which yields, for a sequence of length L, (L-
<italic>k</italic>
)+1
<italic>k</italic>
-mers. Next,
<italic>k</italic>
-mers were converted into tokens (
<italic>t</italic>
) that correspond to collapsed pairs of
<italic>k</italic>
-mers and their respective reverse complements. For the “vector-
<italic>k</italic>
-mers” models, each sequence is described as a collection of “sentences” resulting from walking
<italic>k</italic>
times, sliding by 1 base pair each time. Each sentence is broken into ordered non-overlapping new tokens. For testing, sentences are divided into neighborhoods to obtain regulatory and non-regulatory likelihoods for groups of
<italic>k</italic>
-mers.</p>
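<p>The two preprocessing schemes can be sketched in Python as follows. This is a minimal illustration with two assumptions that the text does not state: the representative of each k-mer and reverse-complement pair is taken to be the lexicographically smaller of the two, and the “sentences” are built from plain (uncollapsed) k-mers. Function names are hypothetical.</p>

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_tokens(seq, k):
    """Overlapping k-mers (step of 1 bp), each collapsed with its reverse
    complement; a sequence of length L yields (L - k) + 1 tokens."""
    tokens = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: the lexicographically smaller string represents the pair.
        tokens.append(min(kmer, reverse_complement(kmer)))
    return tokens


def sentences(seq, k):
    """k 'sentences', one per starting offset, each an ordered list of
    non-overlapping k-mers."""
    out = []
    for offset in range(k):
        out.append([seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)])
    return out


bag_tokens = canonical_tokens("ACGTT", 3)
frames = sentences("ACGTACGT", 3)
```

<p>The first function feeds the “bag-of-k-mers” representation, where token order is discarded; the second preserves order within each sentence, which is what a context-based model such as word2vec requires.</p>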
</sec>
<sec id="Sec14">
<title>Calculation of TF*IDF and implementation of the “bag-of-
<italic>k</italic>
-mers” model</title>
<p>Let’s define all the sequences in a given set from a functional genomics experiment and its corresponding control regions as a collection
<italic>S</italic>
={
<italic>s</italic>
<sub>1</sub>
,
<italic>s</italic>
<sub>2</sub>
,…
<italic>s</italic>
<sub>
<italic>n</italic>
</sub>
} of individual sequences. Next, for each individual sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
let’s define a set of tokens
<italic>T</italic>
<sub>
<italic>i</italic>
</sub>
={
<italic>t</italic>
<sub>1</sub>
,
<italic>t</italic>
<sub>2</sub>
,…,
<italic>t</italic>
<sub>
<italic>n</italic>
</sub>
}. All the possible tokens for a given
<italic>k</italic>
belong to the vocabulary,
<italic>Y</italic>
. Each
<italic>T</italic>
<sub>
<italic>i</italic>
</sub>
is mapped to a list of token weights -
<italic>W</italic>
<sub>
<italic>s</italic>
</sub>
- of size |
<italic>Y</italic>
| that contains “weights” for each token that occurs in
<italic>T</italic>
<sub>
<italic>i</italic>
</sub>
, where the “weight” (Eq.
<xref rid="Equ1" ref-type="">1</xref>
) is defined as the product of the token frequency -
<italic>f(t)</italic>
- in
<italic>s</italic>
, and its inverse collection frequency -
<italic>idf(t)</italic>
-. TF*IDF calculations were done according to the implementation in the Python library scikit-learn v0.19.0 [
<xref ref-type="bibr" rid="CR56">56</xref>
].
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ weights(s,t) = f(t) \log \frac{1 + |S| }{|s \in S : t \in T| + 1} $$ \end{document}</tex-math>
<mml:math id="M2">
<mml:mtext mathvariant="italic">weights</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:mtext mathvariant="italic">log</mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12870_2019_1693_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
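<p>Equation 1 can be computed directly, as in the following standard-library sketch. It implements the weight exactly as written above and is not the scikit-learn code path: with its default settings, scikit-learn additionally adds 1 to the logarithmic term and L2-normalizes each row, so the values below would differ from its output. The function name tfidf_weights and the toy tokens are hypothetical.</p>

```python
import math
from collections import Counter


def tfidf_weights(token_lists):
    """Per-sequence token weights following Eq. 1:
    f(t) * log((1 + |S|) / (1 + number of sequences containing t))."""
    n_sequences = len(token_lists)
    # Count, for each token, the number of sequences in which it occurs.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights.append({
            t: f * math.log((1 + n_sequences) / (1 + doc_freq[t]))
            for t, f in counts.items()
        })
    return weights


# Two toy sequences already converted to tokens.
toy = [["AAC", "AAC", "ACG"], ["ACG"]]
w = tfidf_weights(toy)
```

<p>In this toy case the token present in every sequence receives a weight of zero, which illustrates how tokens that are too widely observed lose power to discriminate between classes.</p>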
<p>To generate a “bag-of-
<italic>k</italic>
-mers” model, each training data set is represented as a
<italic>x</italic>
matrix, with Ws -list of token weights- as rows, and a list
<italic>y</italic>
of sequence labels (1 for regulatory regions and 0 for control regions). The “bag-of-
<italic>k</italic>
-mers” model results from fitting a regression curve,
<italic>y = f(x)</italic>
(i.e., a logistic regression). The C parameter for the logistic regression was chosen by fivefold cross-validation using a grid search function. Logistic regression and grid search functions as used here correspond to the implementation of the python library scikit-learn v0.19.0 [
<xref ref-type="bibr" rid="CR56">56</xref>
].</p>
</sec>
<sec id="Sec15">
<title>Implementation of “vector-
<italic>k</italic>
-mers” model</title>
<p>To generate “vector-
<italic>k</italic>
-mers” models we used the implementation of word2vec algorithms from the python library gensim v1.0.0, which fits sequence representations (
<italic>k</italic>
-mer vectors -
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
) via stochastic gradient descent (SGD) to optimize an objective function that implicitly corresponds to the likelihood of
<italic>k</italic>
-mer co-occurrences [
<xref ref-type="bibr" rid="CR32">32</xref>
,
<xref ref-type="bibr" rid="CR57">57</xref>
]. Next, as shown for text classification, sequence representations -
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
- can be inverted via Bayes’ rule to determine the likelihood that a new sequence is part of a regulatory region based on its
<italic>k</italic>
-mer composition [
<xref ref-type="bibr" rid="CR33">33</xref>
]. This classification schema interprets the individual
<italic>v</italic>
<sub>
<italic>k</italic>
-mers</sub>
as components in a composite likelihood approximation that allows classification of sequences without extra modeling or estimation steps.</p>
<p>In brief, we trained a shallow (single hidden layer), fully connected neural network to maximize the probability of predicting a given
<italic>k</italic>
-mer (
<italic>k</italic>
-mer
<sub>target</sub>
) from its context, that is, from the co-occurring
<italic>k</italic>
-mers appearing anywhere within a small window around the target. For each data set, we ran word2vec for 30 iterations using hierarchical softmax and no negative sampling (options iter=30, hs=1, negative=0, size=300, min_count=0 and window=5; all other parameters were kept at their defaults) to obtain two independent geometric spaces (continuous spaces of sequence representations), one for the regulatory regions (
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
) and the other for the control regions (
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
).</p>
<p>For the classification step, we calculated the probability of every new sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
under each sequence representation –
<italic>V</italic>
<sub>
<italic>regulatory</italic>
</sub>
and
<italic>V</italic>
<sub>
<italic>random</italic>
</sub>
– by first calculating the likelihood of every window within a sentence (using the score function from gensim) and then averaging these likelihoods to obtain sentence likelihoods. Next, from the matrix of sentence likelihoods for the two categories (i.e.,
<italic>C</italic>
= regulatory and control) we derived the sequence probabilities -
<italic>pV</italic>
<sub>
<italic>regulatory</italic>
</sub>
(
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
) and
<italic>pV</italic>
<sub>
<italic>random</italic>
</sub>
(
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
). The category probabilities were calculated via Bayes’ rule, using as the prior
<italic>π</italic>
<sub>
<italic>c</italic>
</sub>
=
<italic>1/C</italic>
, such that the classification proceeds by assigning the category for which
<italic>pV</italic>
<sub>
<italic>category</italic>
</sub>
(
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
) is greater [
<xref ref-type="bibr" rid="CR33">33</xref>
].</p>
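<p>The classification step can be sketched as below. This is an illustrative sketch rather than the authors' implementation: it assumes that the averaged scores are log-likelihoods, and the function names and the example scores are hypothetical.</p>
<preformat>
```python
import math


def category_posteriors(log_likelihoods, priors=None):
    """Posterior p(category | sequence) from per-category log-likelihoods."""
    categories = list(log_likelihoods)
    if priors is None:
        # Uniform prior: pi_c = 1 / (number of categories).
        priors = {c: 1.0 / len(categories) for c in categories}
    # Work in log space and subtract the maximum for numerical stability.
    log_joint = {c: log_likelihoods[c] + math.log(priors[c]) for c in categories}
    shift = max(log_joint.values())
    unnormalized = {c: math.exp(v - shift) for c, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


def classify(log_likelihoods):
    """Assign the category with the greater posterior probability."""
    posteriors = category_posteriors(log_likelihoods)
    return max(posteriors, key=posteriors.get)


# Hypothetical averaged log-likelihoods of one sequence under the two models.
scores = {"regulatory": -41.2, "random": -43.0}
posteriors = category_posteriors(scores)
label = classify(scores)
```
</preformat>
<p>With a uniform prior the assigned category is simply the one with the larger likelihood; the prior only matters if the two categories are given unequal weights.</p>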
</sec>
<sec id="Sec16">
<title>Generation of PWM collections</title>
<p>For any given type of regulatory region we generated a collection of PWMs using the MEME-ChIP pipeline in discriminative mode. The PWMs were generated from the same training sets described above. Each collection of PWMs was then used to predict on the respective holdout set. To do so, we ran FIMO and considered a prediction “positive” for any sequence with a
<italic>p</italic>
-value of less than 1e-4 for any of the motifs and a PWM score greater than log2(10 000)=13.28 bits. These parameters have previously been defined as the “gold standard” for determining “positive PWM hits” [
<xref ref-type="bibr" rid="CR12">12</xref>
]. The collections of PWMs obtained with MEME-ChIP are available to the community at the CyVerse data store (
<ext-link ext-link-type="uri" xlink:href="http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/dataFromPubs/Mejia2018BMCBiology">http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/dataFromPubs/Mejia2018BMCBiology</ext-link>
).</p>
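<p>The hit-calling rule can be written as a short filter. This sketch assumes that a single motif hit must pass both cutoffs, which is one reading of the description above; the (p-value, score) pairs are hypothetical and do not come from FIMO output. Note that log2(10 000) evaluates to about 13.29 bits.</p>
<preformat>
```python
import math

P_VALUE_CUTOFF = 1e-4
SCORE_CUTOFF_BITS = math.log2(10000)  # about 13.29 bits


def is_positive(hits):
    """Return True if any (p_value, score) motif hit passes both cutoffs."""
    return any(
        P_VALUE_CUTOFF > p_value and score > SCORE_CUTOFF_BITS
        for p_value, score in hits
    )


# Hypothetical (p_value, score) pairs, one list per holdout sequence.
sequence_a = [(3e-5, 14.1), (2e-3, 6.0)]  # first hit passes both cutoffs
sequence_b = [(5e-5, 12.0), (1e-2, 15.0)]  # no single hit passes both
predictions = [is_positive(sequence_a), is_positive(sequence_b)]
```
</preformat>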
</sec>
<sec id="Sec17">
<title>Model evaluation</title>
<p>Confusion matrices, receiver operating characteristic (ROC) curves and precision-recall (PR) curves were generated using the Python library scikit-learn v0.19.0 [
<xref ref-type="bibr" rid="CR56">56</xref>
] and plotted with the Python library matplotlib v2.0.0 [
<xref ref-type="bibr" rid="CR58">58</xref>
].</p>
<p>In brief, for each trained model we obtained a confusion matrix by predicting on the holdout data and comparing the predictions against the true category to which each region belongs. As with training, each model was evaluated only on data from the same type of regulatory region on which it was trained. For instance, only FEA4 data were used for training and evaluation of FEA4 models.</p>
<p>From the confusion matrix we obtained:</p>
<p>
<list list-type="bullet">
<list-item>
<p>True positives (TP): Regions predicted as regulatory that truly belong to the regulatory category.</p>
</list-item>
<list-item>
<p>True negatives (TN): Regions predicted as control that truly belong to the control category.</p>
</list-item>
<list-item>
<p>False positives (FP): Regions predicted as regulatory that truly belong to the control category (also known as a “Type I error”).</p>
</list-item>
<list-item>
<p>False negatives (FN): Regions predicted as control that truly belong to the regulatory category (also known as a “Type II error”).</p>
</list-item>
</list>
</p>
<p>To evaluate the models, we computed from the output of the confusion matrix the following metrics:</p>
<p>
<list list-type="bullet">
<list-item>
<p>Accuracy: (TP + TN)/(TP + TN + FP + FN), i.e., over the total number of regions</p>
</list-item>
<list-item>
<p>Precision: TP /(TP + FP)</p>
</list-item>
<list-item>
<p>Recall: TP /(TP + FN)</p>
</list-item>
</list>
</p>
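<p>As a worked example of these definitions, the following sketch counts the four outcomes from true and predicted labels and derives the three metrics. The labels are made up for illustration, and the function names are hypothetical.</p>
<preformat>
```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for labels coded 1 (regulatory) and 0 (control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Accuracy, precision and recall as defined in the list above."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up holdout labels: four regulatory regions and six control regions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = evaluate(*counts)
```
</preformat>
<p>Here one regulatory region is missed and one control region is called regulatory, so accuracy is 8/10 while precision and recall are both 3/4.</p>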
<p>In addition to the metrics derived from the confusion matrix we generated ROC and PR curves for each model. The ROC curve shows the true positive rate as a function of the false positive rate across decision thresholds (each point corresponds to one sensitivity/specificity pair). The closer the ROC curve is to the upper left corner (auROC = 1), the better the performance of the classifier. The PR curve shows the trade-off between precision and recall across decision thresholds. A high area under the PR curve represents both high recall (low false negative rate) and high precision (low false positive rate). The PR curve is preferred over the ROC curve for measuring the performance of a binary classifier on imbalanced datasets [
<xref ref-type="bibr" rid="CR56">56</xref>
].</p>
</sec>
<sec id="Sec18">
<title>Prediction of open chromatin regions in the ZmW22 genome</title>
<p>To evaluate model performance in annotating a non-reference maize genome, we used the recently published W22 genome [
<xref ref-type="bibr" rid="CR41">41</xref>
]. First, we collected “ground truths” by aligning MNAseq regions from B73 to W22 using MUMmer4, a genome alignment system that can handle alignments of divergent DNA sequences [
<xref ref-type="bibr" rid="CR59">59</xref>
]. The hits in the W22 genome located on the corresponding chromosome were considered “truths” or homologous regions. Next, we used the bag-of-
<italic>k</italic>
-mers models trained on MNAseq data to score overlapping windows (length 300 bp, stride 150 bp) in a 4 kb region centered on the hit. We used the vector-
<italic>k</italic>
-mers models to score each window based on its similarity to B73 MNAseq regions. For this, we calculated the mean of the
<italic>k</italic>
-mer vectors to obtain a “centroid” that summarizes each evaluated window, and then calculated the cosine similarity between this centroid and the centroid vector of the B73 MNAseq regions. The best-scoring window was compared against the hits from MUMmer4 and counted as intersecting if at least half of the window length was included in the MUMmer4 hit. A file with the coordinates and the predictions from each model, as well as the MUMmer4 results, is available to the community at the CyVerse data store (
<ext-link ext-link-type="uri" xlink:href="http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/dataFromPubs/Mejia2018BMCBiology">http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/dataFromPubs/Mejia2018BMCBiology</ext-link>
).</p>
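<p>The window scan and the centroid comparison can be sketched as follows. This is a simplified illustration under stated assumptions: the three-dimensional vectors stand in for the 300-dimensional k-mer vectors, the coordinates are relative to the start of the 4 kb region, and all function names are hypothetical.</p>
<preformat>
```python
import numpy as np


def window_starts(region_length, window=300, stride=150):
    """Start coordinates of overlapping windows that fit inside the region."""
    return list(range(0, region_length - window + 1, stride))


def centroid(vectors):
    """Mean of the k-mer vectors found in one window."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def overlaps_half(window_start, hit_start, hit_end, window=300):
    """True if at least half of the window lies inside the alignment hit."""
    overlap = min(window_start + window, hit_end) - max(window_start, hit_start)
    return overlap >= window / 2


starts = window_starts(4000)  # 4 kb region, 300 bp windows, 150 bp stride

# Hypothetical 3-dimensional k-mer vectors for the first three windows.
reference_centroid = np.array([1.0, 0.0, 0.0])
window_vectors = {
    0: [[0.0, 1.0, 0.0], [0.0, 1.0, 0.2]],
    150: [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    300: [[0.0, 0.0, 1.0]],
}
scores = {
    start: cosine_similarity(centroid(vecs), reference_centroid)
    for start, vecs in window_vectors.items()
}
best_start = max(scores, key=scores.get)
```
</preformat>
<p>With these settings a 4 kb region yields 25 windows, and the best-scoring window can then be passed to overlaps_half together with the coordinates of an alignment hit.</p>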
</sec>
<sec id="Sec19">
<title>Calculation of
<italic>k</italic>
-mer complexity on a TF motifs database</title>
<p>The sequence complexity of any
<italic>k</italic>
-mer was approximated by the Shannon entropy of its succession of symbols, given by Eq.
<xref rid="Equ2" ref-type="">2</xref>
, where
<italic>p</italic>
<sub>
<italic>i</italic>
</sub>
corresponds to the probability of appearance of the
<italic>i</italic>
-th symbol in the
<italic>k</italic>
-mer.
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ entropy(k\text{-mer}) = - \sum_{i} p_{i} \log_{2} p_{i} $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mtext mathvariant="italic">entropy</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>-</mml:mo>
<mml:mtext mathvariant="italic">mer</mml:mtext>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mo>log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<graphic xlink:href="12870_2019_1693_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
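<p>A minimal sketch of this calculation is given below. It assumes that each probability is estimated as the relative frequency of a symbol within the k-mer and uses the standard Shannon entropy, the negative sum of p log2 p over the symbols present; the function name is hypothetical.</p>
<preformat>
```python
import math
from collections import Counter


def kmer_entropy(kmer):
    """Shannon entropy (in bits) of the symbol frequencies within a k-mer."""
    counts = Counter(kmer)
    n = len(kmer)
    return 0.0 - sum((c / n) * math.log2(c / n) for c in counts.values())


low = kmer_entropy("AAAAAAAA")   # a single repeated symbol
mid = kmer_entropy("AACC")       # two symbols, equally frequent
high = kmer_entropy("ACGT")      # four symbols, equally frequent
```
</preformat>
<p>For DNA the value ranges from 0 bits for a homopolymer to 2 bits when all four nucleotides are equally frequent, so low-complexity k-mers fall near the bottom of the scale.</p>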
<p>To empirically establish a threshold of complexity for
<italic>k</italic>
-mers within regulatory regions we calculated the
<italic>k</italic>
-mer complexity for any given
<italic>k</italic>
and for all the consensus sequences derived from transcription factor (TF) binding models represented as Position Weight Matrices (PWMs) in the HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) v11 [
<xref ref-type="bibr" rid="CR60">60</xref>
].</p>
</sec>
<sec id="Sec20">
<title>Motif enrichment analyses</title>
<p>To identify
<italic>k</italic>
-mers similar to transcription factor binding sites, we used TOMTOM from the MEME suite [
<xref ref-type="bibr" rid="CR61">61</xref>
] (
<ext-link ext-link-type="uri" xlink:href="http://meme-suite.org">http://meme-suite.org</ext-link>
) and two collections of
<italic>Arabidopsis thaliana</italic>
TF binding motifs derived from large-scale experiments [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. The enrichment was calculated according to (Eq.
<xref rid="Equ3" ref-type="">3</xref>
), in which
<italic>N</italic>
corresponds to the size of the
<italic>k</italic>
-mer vocabulary,
<italic>n</italic>
corresponds to the top 1% of the
<italic>k</italic>
-mer vocabulary after sorting by the weights obtained from the model,
<italic>M</italic>
corresponds to the number of
<italic>k</italic>
-mers with a significant hit against a TF motif and
<italic>m</italic>
to the number of
<italic>k</italic>
-mers that are in the top 1% and have a significant hit against a TF motif.
<disp-formula id="Equ3">
<label>3</label>
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ enrichment = \frac{m/n}{M/N} $$ \end{document}</tex-math>
<mml:math id="M6">
<mml:mtext mathvariant="italic">enrichment</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12870_2019_1693_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>The statistical significance of the enrichment was calculated using the hypergeometric test, as implemented in the Python library SciPy v0.18.1 (stats.hypergeom) [
<xref ref-type="bibr" rid="CR62">62</xref>
], after applying the Bonferroni correction for multiple hypothesis testing to the
<italic>α</italic>
(alpha) value required for statistical significance.</p>
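<p>The enrichment ratio and its significance can be sketched as follows. The counts, the alpha level of 0.05 and the number of tests are hypothetical, and the upper-tail probability is an assumption about how the test was applied. Note that scipy.stats.hypergeom takes its arguments in the order population size, number of successes in the population, number of draws, which differs from the symbols used in the text.</p>
<preformat>
```python
from scipy.stats import hypergeom


def enrichment(N, n, M, m):
    """Ratio of the hit fraction in the top set to the hit fraction overall."""
    return (m / n) / (M / N)


def enrichment_p_value(N, n, M, m):
    """P(X >= m) when drawing n k-mers from N, of which M have a motif hit."""
    # scipy order: sf(k, population size, successes in population, draws)
    return float(hypergeom.sf(m - 1, N, M, n))


# Hypothetical counts: vocabulary of 1000 k-mers, top 1% = 10 k-mers,
# 100 k-mers with a significant motif hit, 5 of them in the top 1%.
N, n, M, m = 1000, 10, 100, 5
ratio = enrichment(N, n, M, m)
p_value = enrichment_p_value(N, n, M, m)

# Bonferroni: divide alpha by the number of tests (both values hypothetical).
alpha, n_tests = 0.05, 20
is_significant = alpha / n_tests > p_value
```
</preformat>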
</sec>
</sec>
<sec sec-type="supplementary-material">
<title>Additional files</title>
<sec id="Sec21">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12870_2019_1693_MOESM1_ESM.xlsx">
<label>Additional file 1</label>
<caption>
<p>Supplementary
<bold>Tables S1</bold>
. (XLSX 36 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p>
<supplementary-material content-type="local-data" id="MOESM2">
<media xlink:href="12870_2019_1693_MOESM2_ESM.xlsx">
<label>Additional file 2</label>
<caption>
<p>Supplementary
<bold>Tables S2</bold>
. (XLSX 78 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p>
<supplementary-material content-type="local-data" id="MOESM3">
<media xlink:href="12870_2019_1693_MOESM3_ESM.pdf">
<label>Additional file 3</label>
<caption>
<p>Supplementary
<bold>Figures S1</bold>
to
<bold>S7</bold>
. (PDF 1253 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p>
<supplementary-material content-type="local-data" id="MOESM4">
<media xlink:href="12870_2019_1693_MOESM4_ESM.xlsx">
<label>Additional file 4</label>
<caption>
<p>Supplementary
<bold>Tables S3</bold>
. (XLSX 39 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p>
<supplementary-material content-type="local-data" id="MOESM5">
<media xlink:href="12870_2019_1693_MOESM5_ESM.xlsx">
<label>Additional file 5</label>
<caption>
<p>Supplementary
<bold>Tables S4</bold>
. (XLSX 66 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>auPRC</term>
<def>
<p>Area under the precision recall curve</p>
</def>
</def-item>
<def-item>
<term>auROC</term>
<def>
<p>Area under the receiver operating characteristic curve</p>
</def>
</def-item>
<def-item>
<term>CNN</term>
<def>
<p>Convolutional Neural Networks</p>
</def>
</def-item>
<def-item>
<term>NN</term>
<def>
<p>Neural Network</p>
</def>
</def-item>
<def-item>
<term>NLP</term>
<def>
<p>Natural Language Processing</p>
</def>
</def-item>
<def-item>
<term>PWM</term>
<def>
<p>Position Weight Matrix</p>
</def>
</def-item>
<def-item>
<term>RNN</term>
<def>
<p>Recurrent Neural Network</p>
</def>
</def-item>
<def-item>
<term>TF</term>
<def>
<p>Transcription Factor</p>
</def>
</def-item>
<def-item>
<term>TFBS</term>
<def>
<p>Transcription factor binding site</p>
</def>
</def-item>
<def-item>
<term>TF*IDF</term>
<def>
<p>Term frequency * inverse document frequency</p>
</def>
</def-item>
</def-list>
</glossary>
<ack>
<title>Acknowledgements</title>
<p>We thank the members of the Buckler lab for comments that greatly improved the manuscript, and especially Sara Miller for her assistance with language editing and proofreading.</p>
<sec id="d29e2331">
<title>Funding</title>
<p>This work has been funded by NSF Plant Genome Project (IOS #1238014) and the USDA-ARS. The funding sources had no role in the design of the study, data collection, data analysis, or manuscript writing.</p>
</sec>
<sec id="d29e2336" sec-type="data-availability">
<title>Availability of data and materials</title>
<p>All regulatory region sequences and their controls, as well as the code used to train the models and evaluate their performance, are available through a public Bitbucket repository (
<ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/bucklerlab/k-mer_grammar/">https://bitbucket.org/bucklerlab/k-mer_grammar/</ext-link>
) and through the CyVerse data store (
<ext-link ext-link-type="uri" xlink:href="http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Mejia2019BMCBiology/">http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Mejia2019BMCBiology/</ext-link>
).</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>MKMG, Conceptualization, Data curation, Software, Formal analysis, Methodology, Writing—original draft, Writing—review and editing; ESB, Conceptualization, Supervision, Funding acquisition, Writing—review and editing. Both authors read and approved the final version of the manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec>
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wallace</surname>
<given-names>JG</given-names>
</name>
<name>
<surname>Bradbury</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Gibon</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Stitt</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Buckler</surname>
<given-names>ES</given-names>
</name>
</person-group>
<article-title>Association mapping across numerous traits reveals patterns of functional variation in maize</article-title>
<source>PLoS Genet</source>
<year>2014</year>
<volume>10</volume>
<issue>12</issue>
<fpage>1004845</fpage>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Niu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Distant eQTLs and non-coding sequences play critical roles in regulating gene expression and quantitative trait variation in maize</article-title>
<source>Mol Plant</source>
<year>2017</year>
<volume>10</volume>
<issue>3</issue>
<fpage>414</fpage>
<lpage>26</lpage>
<pub-id pub-id-type="pmid">27381443</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodgers-Melnick</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Vera</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Bass</surname>
<given-names>HW</given-names>
</name>
<name>
<surname>Buckler</surname>
<given-names>ES</given-names>
</name>
</person-group>
<article-title>Open chromatin reveals the functional maize genome</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2016</year>
<volume>113</volume>
<issue>22</issue>
<fpage>3177</fpage>
<lpage>84</lpage>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Romay</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Glaubitz</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Bradbury</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Elshire</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Semagn</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hernandez</surname>
<given-names>AG</given-names>
</name>
<name>
<surname>Mikel</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Soifer</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Barad</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Buckler</surname>
<given-names>ES</given-names>
</name>
</person-group>
<article-title>High-resolution genetic mapping of maize pan-genome sequence anchors</article-title>
<source>Nat Commun</source>
<year>2015</year>
<volume>6</volume>
<fpage>6914</fpage>
<pub-id pub-id-type="pmid">25881062</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ajmone-Marsan</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Stella</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Commentary on the 6th international symposium of animal functional genomics</article-title>
<source>Genet Sel Evol</source>
<year>2016</year>
<volume>48</volume>
<issue>1</issue>
<fpage>97</fpage>
<pub-id pub-id-type="pmid">27938327</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Poland</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Breeding-assisted genomics</article-title>
<source>Curr Opin Plant Biol</source>
<year>2015</year>
<volume>24</volume>
<fpage>119</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="pmid">25795171</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Franco-Zorrilla</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>López-Vidriero</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Carrasco</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Godoy</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vera</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Solano</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>DNA-binding specificities of plant transcription factors and their potential to define target genes</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2014</year>
<volume>111</volume>
<issue>6</issue>
<fpage>2367</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="pmid">24477691</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>O’Malley</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>S-SC</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lewsey</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Bartlett</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nery</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Galli</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gallavotti</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ecker</surname>
<given-names>JR</given-names>
</name>
</person-group>
<article-title>Cistrome and epicistrome features shape the regulatory DNA landscape</article-title>
<source>Cell</source>
<year>2016</year>
<volume>166</volume>
<issue>6</issue>
<fpage>1598</fpage>
<pub-id pub-id-type="pmid">27610578</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lescot</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for
<italic>in silico</italic>
analysis of promoter sequences</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>1</issue>
<fpage>325</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">11752327</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Machanick</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
</person-group>
<article-title>MEME-ChIP: motif analysis of large DNA datasets</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>12</issue>
<fpage>1696</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">21486936</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zamanighomi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>WH</given-names>
</name>
</person-group>
<article-title>Predicting transcription factor binding motifs from DNA-binding domains, chromatin accessibility and gene expression data</article-title>
<source>Nucleic Acids Res</source>
<year>2017</year>
<volume>45</volume>
<issue>10</issue>
<fpage>5666</fpage>
<lpage>77</lpage>
<pub-id pub-id-type="pmid">28472398</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cuellar-Partida</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Buske</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Mcleay</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Whitington</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
</person-group>
<article-title>Epigenetic priors for identifying active transcription factor binding sites</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>28</volume>
<issue>1</issue>
<fpage>56</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="pmid">22072382</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kleftogiannis</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kalnis</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bajic</surname>
<given-names>VB</given-names>
</name>
</person-group>
<article-title>Progress and challenges in bioinformatics approaches for enhancer identification</article-title>
<source>Brief Bioinforma</source>
<year>2015</year>
<volume>17</volume>
<issue>6</issue>
<fpage>967</fpage>
<lpage>79</lpage>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Natarajan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Yardimci</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Sheffield</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Crawford</surname>
<given-names>GE</given-names>
</name>
<name>
<surname>Ohler</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Predicting cell-type-specific gene expression from regions of open chromatin</article-title>
<source>Genome Res</source>
<year>2012</year>
<volume>22</volume>
<issue>9</issue>
<fpage>1711</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">22955983</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huminiecki</surname>
<given-names>Ł</given-names>
</name>
<name>
<surname>Horbańczuk</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Can we predict gene expression by understanding proximal promoter architecture?</article-title>
<source>Trends Biotechnol</source>
<year>2017</year>
<volume>35</volume>
<issue>6</issue>
<fpage>530</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="pmid">28377102</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stringham</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Drewell</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Dresch</surname>
<given-names>JM</given-names>
</name>
</person-group>
<article-title>Flanking sequence context-dependent transcription factor binding in early Drosophila development</article-title>
<source>BMC Bioinformatics</source>
<year>2013</year>
<volume>14</volume>
<fpage>298</fpage>
<pub-id pub-id-type="pmid">24093548</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stampfel</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kazmar</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Wienerroither</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Reiter</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Stark</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Transcriptional regulators form diverse groups with context-dependent regulatory functions</article-title>
<source>Nature</source>
<year>2015</year>
<volume>528</volume>
<issue>7580</issue>
<fpage>147</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="pmid">26550828</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Crocker</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Abe</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Rinaldi</surname>
<given-names>L</given-names>
</name>
<name>
<surname>McGregor</surname>
<given-names>AP</given-names>
</name>
<name>
<surname>Frankel</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Alsawadi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Valenti</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Plaza</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Payre</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Stern</surname>
<given-names>DL</given-names>
</name>
</person-group>
<article-title>Low affinity binding site clusters confer Hox specificity and regulatory robustness</article-title>
<source>Cell</source>
<year>2015</year>
<volume>160</volume>
<issue>1-2</issue>
<fpage>191</fpage>
<lpage>203</lpage>
<pub-id pub-id-type="pmid">25557079</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raveh-Sadka</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Levo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shabi</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Shany</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Keren</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lotan-Pompan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zeevi</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sharon</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Weinberger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Manipulating nucleosome disfavoring sequences allows fine-tune regulation of gene expression in yeast</article-title>
<source>Nat Genet</source>
<year>2012</year>
<volume>44</volume>
<issue>7</issue>
<fpage>743</fpage>
<lpage>50</lpage>
<pub-id pub-id-type="pmid">22634752</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Farley</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Olson</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Rokhsar</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2016</year>
<volume>113</volume>
<issue>23</issue>
<fpage>6508</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="pmid">27155014</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yáñez-Cuna</surname>
<given-names>JO</given-names>
</name>
<name>
<surname>Kvon</surname>
<given-names>EZ</given-names>
</name>
<name>
<surname>Stark</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Deciphering the transcriptional cis-regulatory code</article-title>
<source>Trends Genet</source>
<year>2013</year>
<volume>29</volume>
<issue>1</issue>
<fpage>11</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">23102583</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Karchin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>Discriminative prediction of mammalian enhancers from DNA sequence</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<issue>12</issue>
<fpage>2167</fpage>
<lpage>80</lpage>
<pub-id pub-id-type="pmid">21875935</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Gorkin</surname>
<given-names>DU</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Strober</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Asoni</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>McCallion</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>A method to predict the impact of regulatory variants from DNA sequence</article-title>
<source>Nat Genet</source>
<year>2015</year>
<volume>47</volume>
<issue>8</issue>
<fpage>955</fpage>
<lpage>61</lpage>
<pub-id pub-id-type="pmid">26075791</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ghandi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Mohammad-Noori</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>Enhanced regulatory sequence prediction using gapped k-mer features</article-title>
<source>PLoS Comput Biol</source>
<year>2014</year>
<volume>10</volume>
<issue>7</issue>
<fpage>e1003711</fpage>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alipanahi</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Delong</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Weirauch</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Frey</surname>
<given-names>BJ</given-names>
</name>
</person-group>
<article-title>Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning</article-title>
<source>Nat Biotechnol</source>
<year>2015</year>
<volume>33</volume>
<issue>8</issue>
<fpage>831</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="pmid">26213851</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Troyanskaya</surname>
<given-names>OG</given-names>
</name>
</person-group>
<article-title>Predicting effects of noncoding variants with deep learning-based sequence model</article-title>
<source>Nat Methods</source>
<year>2015</year>
<volume>12</volume>
<issue>10</issue>
<fpage>931</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="pmid">26301843</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Snoek</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rinn</surname>
<given-names>JL</given-names>
</name>
</person-group>
<article-title>Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks</article-title>
<source>Genome Res</source>
<year>2016</year>
<volume>26</volume>
<issue>7</issue>
<fpage>990</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">27197224</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>D</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Lin</surname>
<given-names>CY</given-names>
</name>
<name>
<surname>Xue</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Relation classification: CNN or RNN?</article-title>
<source>Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102</source>
<year>2016</year>
<publisher-loc>Cham</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<mixed-citation publication-type="other">Yin W, Kann K, Yu M, Schütze H. Comparative study of CNN and RNN for natural language processing. ArXiv e-prints. 2017; abs/1702.01923.
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1702.01923">http://arxiv.org/abs/1702.01923</ext-link>
.</mixed-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manning</surname>
<given-names>CD</given-names>
</name>
<name>
<surname>Schütze</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Foundations of Statistical Natural Language Processing</article-title>
<source>MIT Press</source>
<year>1999</year>
<volume>5</volume>
<fpage>141</fpage>
<lpage>77</lpage>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<mixed-citation publication-type="other">Mikolov T, Chen K, Corrado GS, Dean J. Efficient estimation of word representations in vector space. ArXiv e-prints. 2013; abs/1301.3781.
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1301.3781">http://arxiv.org/abs/1301.3781</ext-link>
.</mixed-citation>
</ref>
<ref id="CR32">
<label>32</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Mikolov</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Corrado</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Dean</surname>
<given-names>J</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Burges</surname>
<given-names>CJC</given-names>
</name>
<name>
<surname>Bottou</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Welling</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ghahramani</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Weinberger</surname>
<given-names>KQ</given-names>
</name>
</person-group>
<article-title>Distributed representations of words and phrases and their compositionality</article-title>
<source>Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS’13), vol 2</source>
<year>2013</year>
<publisher-loc>USA</publisher-loc>
<publisher-name>Curran Associates, Inc.</publisher-name>
</element-citation>
</ref>
<ref id="CR33">
<label>33</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Taddy</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Document classification by inversion of distributed language representations</article-title>
<source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>
<year>2015</year>
<publisher-loc>Stroudsburg</publisher-loc>
<publisher-name>Association for Computational Linguistics</publisher-name>
</element-citation>
</ref>
<ref id="CR34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bolduc</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Yilmaz</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mejía-Guerra</surname>
<given-names>MK</given-names>
</name>
<name>
<surname>Morohashi</surname>
<given-names>K</given-names>
</name>
<name>
<surname>O’Connor</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Grotewold</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hake</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Unraveling the KNOTTED1 regulatory network in maize meristems</article-title>
<source>Genes Dev</source>
<year>2012</year>
<volume>26</volume>
<issue>15</issue>
<fpage>1685</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="pmid">22855831</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pautler</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Eveland</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>LaRue</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Weeks</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lunde</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Je</surname>
<given-names>BI</given-names>
</name>
<name>
<surname>Meeley</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Komatsu</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vollbrecht</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Sakai</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Jackson</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>FASCIATED EAR4 encodes a bZIP transcription factor that regulates shoot meristem size in maize</article-title>
<source>Plant Cell</source>
<year>2015</year>
<volume>27</volume>
<issue>1</issue>
<fpage>104</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="pmid">25616871</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alexandrov</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Brover</surname>
<given-names>VV</given-names>
</name>
<name>
<surname>Freidin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Troukhan</surname>
<given-names>ME</given-names>
</name>
<name>
<surname>Tatarinova</surname>
<given-names>TV</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Swaller</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y-P</given-names>
</name>
<name>
<surname>Bouck</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Flavell</surname>
<given-names>RB</given-names>
</name>
<name>
<surname>Feldmann</surname>
<given-names>KA</given-names>
</name>
</person-group>
<article-title>Insights into corn genes derived from large-scale cDNA sequencing</article-title>
<source>Plant Mol Biol</source>
<year>2009</year>
<volume>69</volume>
<issue>1-2</issue>
<fpage>179</fpage>
<lpage>94</lpage>
<pub-id pub-id-type="pmid">18937034</pub-id>
</element-citation>
</ref>
<ref id="CR37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soderlund</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Descour</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kudrna</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bomhoff</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Boyd</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Currie</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Angelova</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Collura</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wissotski</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ashley</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Morrow</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Fernandes</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Walbot</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Sequencing, mapping, and analysis of 27,455 maize full-length cDNAs</article-title>
<source>PLoS Genet</source>
<year>2009</year>
<volume>5</volume>
<issue>11</issue>
<fpage>e1000740</fpage>
</element-citation>
</ref>
<ref id="CR38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mejía-Guerra</surname>
<given-names>MK</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Galeano</surname>
<given-names>NF</given-names>
</name>
<name>
<surname>Vidal</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gray</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Doseff</surname>
<given-names>AI</given-names>
</name>
<name>
<surname>Grotewold</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Core promoter plasticity between maize tissues and genotypes contrasts with predominance of sharp transcription initiation sites</article-title>
<source>Plant Cell</source>
<year>2015</year>
<volume>27</volume>
<issue>12</issue>
<fpage>3309</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="pmid">26628745</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Gan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>A sequence-based method to predict the impact of regulatory variants using random forest</article-title>
<source>BMC Syst Biol</source>
<year>2017</year>
<volume>11</volume>
<issue>Suppl 2</issue>
<fpage>7</fpage>
<pub-id pub-id-type="pmid">28361702</pub-id>
</element-citation>
</ref>
<ref id="CR40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bolduc</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Hake</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>The maize transcription factor KNOTTED1 directly regulates the gibberellin catabolism gene ga2ox1</article-title>
<source>Plant Cell</source>
<year>2009</year>
<volume>21</volume>
<issue>6</issue>
<fpage>1647</fpage>
<lpage>58</lpage>
<pub-id pub-id-type="pmid">19567707</pub-id>
</element-citation>
</ref>
<ref id="CR41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Springer</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>SN</given-names>
</name>
<name>
<surname>Andorf</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Ahern</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Barad</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Barbazuk</surname>
<given-names>WB</given-names>
</name>
<name>
<surname>Bass</surname>
<given-names>HW</given-names>
</name>
<name>
<surname>Baruch</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ben-Zvi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Buckler</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Bukowski</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Cannon</surname>
<given-names>EKS</given-names>
</name>
<name>
<surname>Chomet</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Dawe</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Davenport</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dooner</surname>
<given-names>HK</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>LH</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Easterling</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Gault</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>J-C</given-names>
</name>
<name>
<surname>Hunter</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Jander</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Jiao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Koch</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Kol</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Köllner</surname>
<given-names>TG</given-names>
</name>
<name>
<surname>Kudo</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Mayfield-Jones</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Mei</surname>
<given-names>W</given-names>
</name>
<name>
<surname>McCarty</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Noshay</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Portwood</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Ronen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Settles</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Shem-Tov</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Soifer</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Stitzer</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Suzuki</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vera</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Vollbrecht</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Vrebalov</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Ware</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wimalanathan</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Woodhouse</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Brutnell</surname>
<given-names>TP</given-names>
</name>
</person-group>
<article-title>The maize W22 genome provides a foundation for functional genomics and transposon biology</article-title>
<source>Nat Genet</source>
<year>2018</year>
<volume>50</volume>
<issue>9</issue>
<fpage>1282</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="pmid">30061736</pub-id>
</element-citation>
</ref>
<ref id="CR42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tsuda</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kurata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Ohyanagi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Hake</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Genome-wide study of KNOX regulatory network reveals brassinosteroid catabolic genes important for shoot meristem function in rice</article-title>
<source>Plant Cell</source>
<year>2014</year>
<volume>26</volume>
<issue>9</issue>
<fpage>3488</fpage>
<lpage>500</lpage>
<pub-id pub-id-type="pmid">25194027</pub-id>
</element-citation>
</ref>
<ref id="CR43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhuang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Iyer</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Whitfield</surname>
<given-names>TW</given-names>
</name>
<name>
<surname>Greven</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Pierce</surname>
<given-names>BG</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Kundaje</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rando</surname>
<given-names>OJ</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Snyder</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Weng</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors</article-title>
<source>Genome Res</source>
<year>2012</year>
<volume>22</volume>
<issue>9</issue>
<fpage>1798</fpage>
<lpage>812</lpage>
<pub-id pub-id-type="pmid">22955990</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dror</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Mandel-Gutfreund</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>How motif environment influences transcription factor search dynamics: Finding a needle in a haystack</article-title>
<source>Bioessays</source>
<year>2016</year>
<volume>38</volume>
<issue>7</issue>
<fpage>605</fpage>
<lpage>12</lpage>
<pub-id pub-id-type="pmid">27192961</pub-id>
</element-citation>
</ref>
<ref id="CR45">
<label>45</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Levy</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Goldberg</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Linguistic regularities in sparse and explicit word representations</article-title>
<source>Proceedings of the Eighteenth Conference on Computational Natural Language Learning</source>
<year>2014</year>
<publisher-loc>Stroudsburg</publisher-loc>
<publisher-name>Association for Computational Linguistics</publisher-name>
</element-citation>
</ref>
<ref id="CR46">
<label>46</label>
<mixed-citation publication-type="other">Webber W, Moffat A, Zobel J. A similarity measure for indefinite rankings. ACM Trans Inf Syst. 2010; 28(4):38. doi:10.1145/1852102.1852106.</mixed-citation>
</ref>
<ref id="CR47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Peluso</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Stitzer</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>C-S</given-names>
</name>
<name>
<surname>Guill</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Regulski</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kumari</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Olson</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gent</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Wolfgruber</surname>
<given-names>TK</given-names>
</name>
<name>
<surname>May</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Springer</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Antoniou</surname>
<given-names>E</given-names>
</name>
<name>
<surname>McCombie</surname>
<given-names>WR</given-names>
</name>
<name>
<surname>Presting</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>McMullen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ross-Ibarra</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Dawe</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rank</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Ware</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Improved maize reference genome with single-molecule technologies</article-title>
<source>Nature</source>
<year>2017</year>
<volume>546</volume>
<issue>7659</issue>
<fpage>524</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">28605751</pub-id>
</element-citation>
</ref>
<ref id="CR48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alexander</surname>
<given-names>RP</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rozowsky</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Snyder</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>MB</given-names>
</name>
</person-group>
<article-title>Annotating non-coding regions of the genome</article-title>
<source>Nat Rev Genet</source>
<year>2010</year>
<volume>11</volume>
<issue>8</issue>
<fpage>559</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="pmid">20628352</pub-id>
</element-citation>
</ref>
<ref id="CR49">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buckler</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Gaut</surname>
<given-names>BS</given-names>
</name>
<name>
<surname>McMullen</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>Molecular and functional diversity of maize</article-title>
<source>Curr Opin Plant Biol</source>
<year>2006</year>
<volume>9</volume>
<issue>2</issue>
<fpage>172</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="pmid">16459128</pub-id>
</element-citation>
</ref>
<ref id="CR50">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Asgari</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Mofrad</surname>
<given-names>MRK</given-names>
</name>
</person-group>
<article-title>Continuous distributed representation of biological sequences for deep proteomics and genomics</article-title>
<source>PLoS ONE</source>
<year>2015</year>
<volume>10</volume>
<issue>11</issue>
<fpage>e0141287</fpage>
</element-citation>
</ref>
<ref id="CR51">
<label>51</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schnable</surname>
<given-names>PS</given-names>
</name>
<name>
<surname>Ware</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Fulton</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Pasternak</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fulton</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Graves</surname>
<given-names>TA</given-names>
</name>
<name>
<surname>Minx</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Reily</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Courtney</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Kruchowski</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Tomlinson</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Strong</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Delehaunty</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Fronick</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Courtney</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Rock</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Belter</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Abbott</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Cotton</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Marchetto</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ochoa</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jackson</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Gillam</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Higginbotham</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cardenas</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Waligorski</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Applebaum</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Phelps</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Falcone</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kanchi</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Thane</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Scimone</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Thane</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Henke</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ruppert</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Rotter</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Hodges</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ingenthron</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Cordes</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kohlberg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sgro</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Delgado</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Mead</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chinwalla</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Leonard</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Crouse</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Collura</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kudrna</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Currie</surname>
<given-names>J</given-names>
</name>
<name>
<surname>He</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Angelova</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rajasekar</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mueller</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Lomeli</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Scara</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Delaney</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wissotski</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lopez</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Campos</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Braidotti</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ashley</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Golser</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Dujmic</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Talag</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zuccolo</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Sebastian</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kramer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Spiegel</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Nascimento</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Zutavern</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ambroise</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Muller</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Spooner</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Narechania</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kumari</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Faga</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>McMahan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Van Buren</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Vaughn</surname>
<given-names>MW</given-names>
</name>
<name>
<surname>Ying</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Yeh</surname>
<given-names>C-T</given-names>
</name>
<name>
<surname>Emrich</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kalyanaraman</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hsia</surname>
<given-names>A-P</given-names>
</name>
<name>
<surname>Barbazuk</surname>
<given-names>WB</given-names>
</name>
<name>
<surname>Baucom</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Brutnell</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Carpita</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Chaparro</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Chia</surname>
<given-names>J-M</given-names>
</name>
<name>
<surname>Deragon</surname>
<given-names>J-M</given-names>
</name>
<name>
<surname>Estill</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Jeddeloh</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Lisch</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Nagel</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>McCann</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>SanMiguel</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Nettleton</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Penning</surname>
<given-names>BW</given-names>
</name>
<name>
<surname>Ponnala</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Schwartz</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Soderlund</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Springer</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Westerman</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wolfgruber</surname>
<given-names>TK</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Bennetzen</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Dawe</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Presting</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Wessler</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Martienssen</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Clifton</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>McCombie</surname>
<given-names>WR</given-names>
</name>
<name>
<surname>Wing</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>RK</given-names>
</name>
</person-group>
<article-title>The B73 maize genome: complexity, diversity, and dynamics</article-title>
<source>Science</source>
<year>2009</year>
<volume>326</volume>
<issue>5956</issue>
<fpage>1112</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="pmid">19965430</pub-id>
</element-citation>
</ref>
<ref id="CR52">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Paterson</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Bowers</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Bruggmann</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dubchak</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Grimwood</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gundlach</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Haberer</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Hellsten</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Mitros</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Poliakov</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Schmutz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Spannagl</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Wicker</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bharti</surname>
<given-names>AK</given-names>
</name>
<name>
<surname>Chapman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Feltus</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Gowik</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Grigoriev</surname>
<given-names>IV</given-names>
</name>
<name>
<surname>Lyons</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Maher</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Martis</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Narechania</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Otillar</surname>
<given-names>RP</given-names>
</name>
<name>
<surname>Penning</surname>
<given-names>BW</given-names>
</name>
<name>
<surname>Salamov</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Carpita</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Freeling</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gingle</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Hash</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Keller</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Klein</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kresovich</surname>
<given-names>S</given-names>
</name>
<name>
<surname>McCann</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Ming</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>DG</given-names>
</name>
<collab>Mehboob-ur-Rahman</collab>
<name>
<surname>Ware</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Westhoff</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Mayer</surname>
<given-names>KFX</given-names>
</name>
<name>
<surname>Messing</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rokhsar</surname>
<given-names>DS</given-names>
</name>
</person-group>
<article-title>The Sorghum bicolor genome and the diversification of grasses</article-title>
<source>Nature</source>
<year>2009</year>
<volume>457</volume>
<issue>7229</issue>
<fpage>551</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="pmid">19189423</pub-id>
</element-citation>
</ref>
<ref id="CR53">
<label>53</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>3</issue>
<fpage>25</fpage>
</element-citation>
</ref>
<ref id="CR54">
<label>54</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Handsaker</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wysoker</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Fennell</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Homer</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Marth</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Abecasis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<collab>1000 Genome Project Data Processing Subgroup</collab>
</person-group>
<article-title>The Sequence Alignment/Map format and SAMtools</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>16</issue>
<fpage>2078</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">19505943</pub-id>
</element-citation>
</ref>
<ref id="CR55">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Eeckhoute</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Bernstein</surname>
<given-names>BE</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>XS</given-names>
</name>
</person-group>
<article-title>Model-based analysis of ChIP-Seq (MACS)</article-title>
<source>Genome Biol</source>
<year>2008</year>
<volume>9</volume>
<issue>9</issue>
<fpage>137</fpage>
</element-citation>
</ref>
<ref id="CR56">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pedregosa</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Varoquaux</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gramfort</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Michel</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Thirion</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Grisel</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Blondel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Prettenhofer</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dubourg</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Vanderplas</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Passos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cournapeau</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Brucher</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Perrot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Duchesnay</surname>
<given-names>É</given-names>
</name>
</person-group>
<article-title>Scikit-learn: Machine learning in Python</article-title>
<source>J Mach Learn Res</source>
<year>2011</year>
<volume>12</volume>
<issue>Oct</issue>
<fpage>2825</fpage>
<lpage>30</lpage>
</element-citation>
</ref>
<ref id="CR57">
<label>57</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Rehurek</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sojka</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Software framework for topic modelling with large corpora</article-title>
<source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
<year>2010</year>
<publisher-loc>Valletta</publisher-loc>
<publisher-name>University of Malta</publisher-name>
</element-citation>
</ref>
<ref id="CR58">
<label>58</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hunter</surname>
<given-names>JD</given-names>
</name>
</person-group>
<article-title>Matplotlib: A 2D graphics environment</article-title>
<source>Comput Sci Eng</source>
<year>2007</year>
<volume>9</volume>
<issue>3</issue>
<fpage>90</fpage>
<lpage>5</lpage>
</element-citation>
</ref>
<ref id="CR59">
<label>59</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marçais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Coston</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Zimin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>MUMmer4: A fast and versatile genome alignment system</article-title>
<source>PLoS Comput Biol</source>
<year>2018</year>
<volume>14</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>14</lpage>
</element-citation>
</ref>
<ref id="CR60">
<label>60</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kulakovskiy</surname>
<given-names>IV</given-names>
</name>
<name>
<surname>Vorontsov</surname>
<given-names>IE</given-names>
</name>
<name>
<surname>Yevshin</surname>
<given-names>IS</given-names>
</name>
<name>
<surname>Soboleva</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Kasianov</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Ashoor</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ba-Alawi</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Bajic</surname>
<given-names>VB</given-names>
</name>
<name>
<surname>Medvedeva</surname>
<given-names>YA</given-names>
</name>
<name>
<surname>Kolpakov</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Makeev</surname>
<given-names>VJ</given-names>
</name>
</person-group>
<article-title>HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models</article-title>
<source>Nucleic Acids Res</source>
<year>2016</year>
<volume>44</volume>
<issue>D1</issue>
<fpage>116</fpage>
<lpage>25</lpage>
</element-citation>
</ref>
<ref id="CR61">
<label>61</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stamatoyannopoulos</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
</person-group>
<article-title>Quantifying similarity between motifs</article-title>
<source>Genome Biol</source>
<year>2007</year>
<volume>8</volume>
<issue>2</issue>
<fpage>24</fpage>
</element-citation>
</ref>
<ref id="CR62">
<label>62</label>
<mixed-citation publication-type="other">Jones E, Oliphant T, Peterson P, et al. SciPy: Open source scientific tools for Python. 2001.
<ext-link ext-link-type="uri" xlink:href="http://www.scipy.org/">http://www.scipy.org/</ext-link>. Accessed 18 Jan 2017.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000328 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000328 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:6419808
   |texte=   A k-mer grammar analysis to uncover maize regulatory architecture
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:30876396" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021