Informational laws of genome structures

Internal identifier: 000145 (Pmc/Corpus); previous: 000144; next: 000146

Authors: Vincenzo Bonnici; Vincenzo Manca

Source:

Scientific Reports [2045-2322]; 2016.

RBID : PMC:4937431

Abstract

In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = lg₂(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937431

DOI: 10.1038/srep28840
PubMed: 27354155
PubMed Central: 4937431

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Informational laws of genome structures</title>
<author><name sortKey="Bonnici, Vincenzo" sort="Bonnici, Vincenzo" uniqKey="Bonnici V" first="Vincenzo" last="Bonnici">Vincenzo Bonnici</name>
<affiliation><nlm:aff id="a1"><institution>University of Verona, Department of Computer Science, University of Verona</institution>
, Verona 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2"><institution>Center for BioMedical Computing, University of Verona</institution>
, Verona, 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Manca, Vincenzo" sort="Manca, Vincenzo" uniqKey="Manca V" first="Vincenzo" last="Manca">Vincenzo Manca</name>
<affiliation><nlm:aff id="a1"><institution>University of Verona, Department of Computer Science, University of Verona</institution>
, Verona 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2"><institution>Center for BioMedical Computing, University of Verona</institution>
, Verona, 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">27354155</idno>
<idno type="pmc">4937431</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937431</idno>
<idno type="RBID">PMC:4937431</idno>
<idno type="doi">10.1038/srep28840</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000145</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000145</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Informational laws of genome structures</title>
<author><name sortKey="Bonnici, Vincenzo" sort="Bonnici, Vincenzo" uniqKey="Bonnici V" first="Vincenzo" last="Bonnici">Vincenzo Bonnici</name>
<affiliation><nlm:aff id="a1"><institution>University of Verona, Department of Computer Science, University of Verona</institution>
, Verona 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2"><institution>Center for BioMedical Computing, University of Verona</institution>
, Verona, 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Manca, Vincenzo" sort="Manca, Vincenzo" uniqKey="Manca V" first="Vincenzo" last="Manca">Vincenzo Manca</name>
<affiliation><nlm:aff id="a1"><institution>University of Verona, Department of Computer Science, University of Verona</institution>
, Verona 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2"><institution>Center for BioMedical Computing, University of Verona</institution>
, Verona, 37134,<country>Italy</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Scientific Reports</title>
<idno type="eISSN">2045-2322</idno>
<imprint><date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>In recent years, the analysis of genomes by means of strings of length <italic>k</italic>
 occurring in the genomes, called <italic>k</italic>
-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of <italic>k</italic>
 for applying information theoretic concepts that express intrinsic aspects of genomes. The value <italic>k</italic>
 = lg<sub>2</sub>
(<italic>n</italic>
), where <italic>n</italic>
 is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances <italic>entropic</italic>
 and <italic>anti-entropic</italic>
 components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Conrad, M" uniqKey="Conrad M">M. Conrad</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Conrad, M" uniqKey="Conrad M">M. Conrad</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Holland, J" uniqKey="Holland J">J. Holland</name>
</author>
<author><name sortKey="Mallot, H" uniqKey="Mallot H">H. Mallot</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cercignani, C" uniqKey="Cercignani C">C. Cercignani</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Shannon, C E" uniqKey="Shannon C">C. E. Shannon</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pincus, S M" uniqKey="Pincus S">S. M. Pincus</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Crochemore, M" uniqKey="Crochemore M">M. Crochemore</name>
</author>
<author><name sortKey="Verin, R" uniqKey="Verin R">R. Vérin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Vinga, S" uniqKey="Vinga S">S. Vinga</name>
</author>
<author><name sortKey="Almeida, J S" uniqKey="Almeida J">J. S. Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Koslicki, D" uniqKey="Koslicki D">D. Koslicki</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wang, D" uniqKey="Wang D">D. Wang</name>
</author>
<author><name sortKey="Xu, J" uniqKey="Xu J">J. Xu</name>
</author>
<author><name sortKey="Yu, J" uniqKey="Yu J">J. Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Head, T" uniqKey="Head T">T. Head</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Deonier, R C" uniqKey="Deonier R">R. C. Deonier</name>
</author>
<author><name sortKey="Tavare, S" uniqKey="Tavare S">S. Tavaré</name>
</author>
<author><name sortKey="Waterman, M" uniqKey="Waterman M">M. Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
<author><name sortKey="Franco, G" uniqKey="Franco G">G. Franco</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Searls, D B" uniqKey="Searls D">D. B. Searls</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Vinga, S" uniqKey="Vinga S">S. Vinga</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gatlin, L L" uniqKey="Gatlin L">L. L. Gatlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kraskov, A" uniqKey="Kraskov A">A. Kraskov</name>
</author>
<author><name sortKey="Grassberger, P" uniqKey="Grassberger P">P. Grassberger</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Campbell, A" uniqKey="Campbell A">A. Campbell</name>
</author>
<author><name sortKey="Mrazek, J" uniqKey="Mrazek J">J. Mrázek</name>
</author>
<author><name sortKey="Karlin, S" uniqKey="Karlin S">S. Karlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ebeling, W" uniqKey="Ebeling W">W. Ebeling</name>
</author>
<author><name sortKey="Jimenez Monta O, M A" uniqKey="Jimenez Monta O M">M. A. Jiménez-Montaño</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Weiss, O" uniqKey="Weiss O">O. Weiss</name>
</author>
<author><name sortKey="Jimenez Monta O, M A" uniqKey="Jimenez Monta O M">M. A. Jiménez-Montaño</name>
</author>
<author><name sortKey="Herzel, H" uniqKey="Herzel H">H. Herzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Holste, D" uniqKey="Holste D">D. Holste</name>
</author>
<author><name sortKey="Grosse, I" uniqKey="Grosse I">I. Grosse</name>
</author>
<author><name sortKey="Herzel, H" uniqKey="Herzel H">H. Herzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fofanov, Y" uniqKey="Fofanov Y">Y. Fofanov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kurtz, S" uniqKey="Kurtz S">S. Kurtz</name>
</author>
<author><name sortKey="Narechania, A" uniqKey="Narechania A">A. Narechania</name>
</author>
<author><name sortKey="Stein, J C" uniqKey="Stein J">J. C. Stein</name>
</author>
<author><name sortKey="Ware, D" uniqKey="Ware D">D. Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chor, B" uniqKey="Chor B">B. Chor</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Castellini, A" uniqKey="Castellini A">A. Castellini</name>
</author>
<author><name sortKey="Franco, G" uniqKey="Franco G">G. Franco</name>
</author>
<author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bonnici, V" uniqKey="Bonnici V">V. Bonnici</name>
</author>
<author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wen, J" uniqKey="Wen J">J. Wen</name>
</author>
<author><name sortKey="Chan, R H" uniqKey="Chan R">R. H. Chan</name>
</author>
<author><name sortKey="Yau, S C" uniqKey="Yau S">S.-C. Yau</name>
</author>
<author><name sortKey="He, R L" uniqKey="He R">R. L. He</name>
</author>
<author><name sortKey="Yau, S S" uniqKey="Yau S">S. S. Yau</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Almirantis, Y" uniqKey="Almirantis Y">Y. Almirantis</name>
</author>
<author><name sortKey="Arndt, P" uniqKey="Arndt P">P. Arndt</name>
</author>
<author><name sortKey="Li, W" uniqKey="Li W">W. Li</name>
</author>
<author><name sortKey="Provata, A" uniqKey="Provata A">A. Provata</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hashim, E K M" uniqKey="Hashim E">E. K. M. Hashim</name>
</author>
<author><name sortKey="Abdullah, R" uniqKey="Abdullah R">R. Abdullah</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bonnici, V" uniqKey="Bonnici V">V. Bonnici</name>
</author>
<author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Manca, V" uniqKey="Manca V">V. Manca</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Knuth, D" uniqKey="Knuth D">D. Knuth</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kong, S G" uniqKey="Kong S">S. G. Kong</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jiang, Y" uniqKey="Jiang Y">Y. Jiang</name>
</author>
<author><name sortKey="Xu, C" uniqKey="Xu C">C. Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Witten, I H" uniqKey="Witten I">I. H. Witten</name>
</author>
<author><name sortKey="Moffat, A" uniqKey="Moffat A">A. Moffat</name>
</author>
<author><name sortKey="Bell, T C" uniqKey="Bell T">T. C. Bell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wiener, N" uniqKey="Wiener N">N. Wiener</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schrodinger, E" uniqKey="Schrodinger E">E. Schrödinger</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Brillouin, L" uniqKey="Brillouin L">L. Brillouin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Volkenstein, M V" uniqKey="Volkenstein M">M. V. Volkenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Venter, J C" uniqKey="Venter J">J. C. Venter</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lynch, M" uniqKey="Lynch M">M. Lynch</name>
</author>
<author><name sortKey="Conery, J S" uniqKey="Conery J">J. S. Conery</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kullback, S" uniqKey="Kullback S">S. Kullback</name>
</author>
<author><name sortKey="Leibler, R A" uniqKey="Leibler R">R. A. Leibler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Feller, W" uniqKey="Feller W">W. Feller</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rozenberg, G" uniqKey="Rozenberg G">G. Rozenberg</name>
</author>
<author><name sortKey="Salomaa, A" uniqKey="Salomaa A">A. Salomaa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Abouelhoda, M I" uniqKey="Abouelhoda M">M. I. Abouelhoda</name>
</author>
<author><name sortKey="Kurtz, S" uniqKey="Kurtz S">S. Kurtz</name>
</author>
<author><name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E. Ohlebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Federhen, S" uniqKey="Federhen S">S. Federhen</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Sci Rep</journal-id>
<journal-id journal-id-type="iso-abbrev">Sci Rep</journal-id>
<journal-title-group><journal-title>Scientific Reports</journal-title>
</journal-title-group>
<issn pub-type="epub">2045-2322</issn>
<publisher><publisher-name>Nature Publishing Group</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">27354155</article-id>
<article-id pub-id-type="pmc">4937431</article-id>
<article-id pub-id-type="pii">srep28840</article-id>
<article-id pub-id-type="doi">10.1038/srep28840</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Informational laws of genome structures</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Bonnici</surname>
<given-names>Vincenzo</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Manca</surname>
<given-names>Vincenzo</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<aff id="a1"><label>1</label>
<institution>University of Verona, Department of Computer Science, University of Verona</institution>
, Verona 37134,<country>Italy</country>
</aff>
<aff id="a2"><label>2</label>
<institution>Center for BioMedical Computing, University of Verona</institution>
, Verona, 37134,<country>Italy</country>
</aff>
</contrib-group>
<author-notes><corresp id="c1"><label>a</label>
<email>vincenzo.manca@univr.it</email>
</corresp>
</author-notes>
<pub-date pub-type="epub"><day>29</day>
<month>06</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection"><year>2016</year>
</pub-date>
<volume>6</volume>
<elocation-id>28840</elocation-id>
<history><date date-type="received"><day>01</day>
<month>02</month>
<year>2016</year>
</date>
<date date-type="accepted"><day>09</day>
<month>06</month>
<year>2016</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2016, Macmillan Publishers Limited</copyright-statement>
<copyright-year>2016</copyright-year>
<copyright-holder>Macmillan Publishers Limited</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/"><pmc-comment>author-paid</pmc-comment>
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
</license-p>
</license>
</permissions>
<abstract><p>In recent years, the analysis of genomes by means of strings of length <italic>k</italic>
 occurring in the genomes, called <italic>k</italic>
-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of <italic>k</italic>
 for applying information theoretic concepts that express intrinsic aspects of genomes. The value <italic>k</italic>
 = lg<sub>2</sub>
(<italic>n</italic>
), where <italic>n</italic>
 is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances <italic>entropic</italic>
 and <italic>anti-entropic</italic>
 components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.</p>
</abstract>
</article-meta>
</front>
<body><p>The study of complexity in biology is an old topic that often reemerges in theoretical biological investigations<xref ref-type="bibr" rid="b1">1</xref>
<xref ref-type="bibr" rid="b2">2</xref>
<xref ref-type="bibr" rid="b3">3</xref>
. The study of complexity has important implications for any deep understanding of the informational organization that life adopts in different species to realize their specific biological functions. Entropy is a fundamental scientific concept that is naturally related to complexity. It was the basis of the statistical physics founded by Ludwig Boltzmann and the essence of his famous H theorem, which related the arrow of time to Boltzmann’s equation, where entropy is expressed in terms of mechanical microstates<xref ref-type="bibr" rid="b4">4</xref>
. Essentially, the same function was the basis of the information theory founded by Claude Shannon in 1948<xref ref-type="bibr" rid="b5">5</xref>
, where entropy is defined on information sources, that is, probability distributions over finite sets of elements (symbols, words or signals). A genome is essentially a text; if read in pieces of length <italic>k</italic>
 (called <italic>k</italic>
-mers), a genome becomes an information source. Therefore, genomic <italic>k</italic>
-entropies can be easily defined, and the concepts and results of information theory can be applied<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b7">7</xref>
<xref ref-type="bibr" rid="b8">8</xref>
<xref ref-type="bibr" rid="b9">9</xref>
<xref ref-type="bibr" rid="b10">10</xref>
.</p>
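<p>As a minimal sketch of how a genome read in <italic>k</italic>-mers becomes an information source, the following Python fragment computes the Shannon entropy of the empirical <italic>k</italic>-mer distribution of a string. The function name <monospace>kmer_entropy</monospace> and the choice of overlapping windows at every position are illustrative assumptions; this is not the software package used in the paper.</p>
<preformat>
```python
from collections import Counter
from math import log2


def kmer_entropy(genome: str, k: int) -> float:
    """Shannon entropy (in bits) of the empirical k-mer distribution of genome.

    Assumption: k-mers are read as overlapping windows at every position.
    """
    if k not in range(1, len(genome) + 1):
        raise ValueError("k must be between 1 and len(genome)")
    total = len(genome) - k + 1  # number of k-mer occurrences
    counts = Counter(genome[i:i + k] for i in range(total))
    # H = sum over distinct k-mers of p * log2(1 / p), with p = count / total
    return sum((c / total) * log2(total / c) for c in counts.values())
```
</preformat>
<p>Under these assumptions, <monospace>kmer_entropy("ACGT", 1)</monospace> evaluates to 2 bits, because the four symbols are equiprobable, while a string made of a single repeated symbol has entropy 0 for every <italic>k</italic>.</p>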
<p>In recent years, many studies have approached the investigation of DNA strings and genomes by means of algorithms, information theory and formal languages<xref ref-type="bibr" rid="b11">11</xref>
<xref ref-type="bibr" rid="b12">12</xref>
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b14">14</xref>
<xref ref-type="bibr" rid="b15">15</xref>
<xref ref-type="bibr" rid="b16">16</xref>
<xref ref-type="bibr" rid="b17">17</xref>
<xref ref-type="bibr" rid="b18">18</xref>
<xref ref-type="bibr" rid="b19">19</xref>
<xref ref-type="bibr" rid="b20">20</xref>
<xref ref-type="bibr" rid="b21">21</xref>
<xref ref-type="bibr" rid="b22">22</xref>
, and methods were developed for investigating whole genome structures. In particular, dictionaries of words occurring in genomes, distributions defined over genomes, and concepts related to word occurrences and frequencies have been very useful and seem to characterize important genomic features relevant in biological contexts<xref ref-type="bibr" rid="b23">23</xref>
<xref ref-type="bibr" rid="b24">24</xref>
<xref ref-type="bibr" rid="b25">25</xref>
<xref ref-type="bibr" rid="b26">26</xref>
<xref ref-type="bibr" rid="b27">27</xref>
<xref ref-type="bibr" rid="b28">28</xref>
<xref ref-type="bibr" rid="b29">29</xref>
<xref ref-type="bibr" rid="b30">30</xref>
. Dictionaries are, in essence, finite formal languages. In genome analyses based on dictionaries, concepts from formal language theory, probability, and information theory are naturally combined, providing new perspectives on the investigation of genomes that may disclose the internal logic of their structures.</p>
<p>The set of all <italic>k</italic>
-mers occurring in a given genome is a particular dictionary. A crucial point in genome analyses based on <italic>k</italic>
-mers is the value of <italic>k</italic>
 that is most adequate for specific investigations. This issue becomes especially evident when computing the entropy of a genome. We prove that preferential lengths exist for computing entropies and that, at these lengths, some informational indexes can be defined that exhibit “informational laws” and characterize an informational structure of genomes. As already noted, there is a long tradition of investigating genomes by using <italic>k</italic>
-mers. However, comparing genomes of different lengths by using the same value of <italic>k</italic>
 (usually less than 12) may, in some cases, result in the loss of important regularities. In fact, the genomic laws that we discover emerge when the values of <italic>k</italic>
 are suitably defined from the logarithmic length of the genomes.</p>
<p>When genomic complexity is considered, it soon becomes clear that it cannot be easily measured by parameters such as genome length, number of genes, CG-content, basic repeatability indexes, or their combinations. Therefore, we follow an information-theoretic line of investigation based on <italic>k</italic>-mer dictionaries and entropies<xref ref-type="bibr" rid="b16">16</xref>
<xref ref-type="bibr" rid="b26">26</xref>
<xref ref-type="bibr" rid="b27">27</xref>
<xref ref-type="bibr" rid="b31">31</xref>
<xref ref-type="bibr" rid="b32">32</xref>
<xref ref-type="bibr" rid="b33">33</xref>
, which is aimed at defining and computing informational indexes for a representative set of genomes. This task is not trivial when genome sizes increase, so a specific software package is used to this end<xref ref-type="bibr" rid="b31">31</xref>
. Moreover, an aspect that is missing in Shannon’s classical conceptual apparatus is relevant in our approach: random strings, which can now be easily produced by pseudo-random generation algorithms and analyzed<xref ref-type="bibr" rid="b34">34</xref>
. In fact, it is natural to assume that the complexity of a genome increases with its “distance” from randomness<xref ref-type="bibr" rid="b35">35</xref>
<xref ref-type="bibr" rid="b36">36</xref>
, as identified by means of a suitable comparison between the genome under investigation and random genomes of the same length. This idea alone provides important clues about the correct <italic>k</italic>
-mer length to consider in our genome analyses, because theoretical and experimental analyses show that random genomes reach their entropic maxima for <italic>k</italic>
-mers of length lg<sub>2</sub>
(<italic>n</italic>
), where <italic>n</italic>
 is the genome length. No probability distribution of <italic>k</italic>
-mers is assumed or inferred (as in approaches based on Markov models); rather, data processing is developed on the basis of the empirical distributions of <italic>k</italic>
-mers computed over the investigated genomes.</p>
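<p>The statement that random genomes approach their entropic maximum for <italic>k</italic> close to lg<sub>2</sub>(<italic>n</italic>) can be explored numerically. The sketch below is only an illustration under stated assumptions: it draws a pseudo-random string with uniform, independent symbols using Python's <monospace>random</monospace> module and compares the empirical <italic>k</italic>-mer entropy with two elementary upper bounds, 2<italic>k</italic> bits (there are at most 4<sup><italic>k</italic></sup> distinct <italic>k</italic>-mers) and lg<sub>2</sub>(<italic>n</italic> - <italic>k</italic> + 1) bits (there are at most <italic>n</italic> - <italic>k</italic> + 1 occurrences). All function names are hypothetical.</p>
<preformat>
```python
import random
from collections import Counter
from math import log2


def kmer_entropy(genome: str, k: int) -> float:
    """Shannon entropy (bits) of the empirical distribution of overlapping k-mers."""
    total = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(total))
    return sum((c / total) * log2(total / c) for c in counts.values())


def random_genome(n: int, seed: int = 0) -> str:
    """Pseudo-random string of length n over ACGT (uniform, independent symbols)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


def entropy_and_bound(genome: str, k: int) -> tuple[float, float]:
    """Return the empirical k-mer entropy and the bound min(2k, lg2(n - k + 1))."""
    bound = min(2.0 * k, log2(len(genome) - k + 1))
    return kmer_entropy(genome, k), bound


if __name__ == "__main__":
    n = 4096
    genome = random_genome(n)
    # Short k-mers are limited by the 2k bound; k = lg2(n) is limited by genome length.
    for k in (2, 6, round(log2(n))):
        entropy, bound = entropy_and_bound(genome, k)
        print(k, round(entropy, 3), round(bound, 3))
```
</preformat>
<p>For small <italic>k</italic> the entropy of the random string is capped by 2<italic>k</italic>, whereas for <italic>k</italic> near lg<sub>2</sub>(<italic>n</italic>) the cap is set by the genome length, and the random string comes close to it.</p>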
<p>To this end, two basic indexes are introduced, which we call <italic>entropic</italic>
 and <italic>anti-entropic components</italic>
. These indexes, and other related indexes, are computed over the chosen seventy genomes, ranging from prokaryotes to primates. The obtained values suggest some laws of genome structure. These laws hold in all of the investigated genomes and motivate the definition of the genomic complexity measure <italic>BB</italic>
 proposed in the paper. This measure depends on the entire structure of a genome and considers the components of genomes jointly (e.g., repeats, CpG, and long-range correlations, all of which surely affect entropies) rather than separately. Moreover, as demonstrated below, <italic>BB</italic>
 is related to phylogeny but does not coincide with phylogenetic ordering. Certainly, primate genomes are usually more complex than, say, bacterial or insect genomes, but the situation is more subtle because evolution is always active, and a bacterium that we sequence today is not the type of bacterium that first arose in the tree of life. For this reason, genomes that are phylogenetically older can accumulate, even along different paths, “distances” from their corresponding random genomes comparable to those gained by “more evolved” genomes.</p>
<sec disp-level="1"><title>Results</title>
<p>The results presented in this paper are based on comparing real genomes with random genomes of the same length. As we show, any genome <inline-formula id="d33e423"><inline-graphic id="d33e424" xlink:href="srep28840-m1.jpg"></inline-graphic>
</inline-formula>
 of length <italic>n</italic>
 defines a partition of lg<sub>4</sub>
(<italic>n</italic>
) into two addends <inline-formula id="d33e435"><inline-graphic id="d33e436" xlink:href="srep28840-m2.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e439"><inline-graphic id="d33e440" xlink:href="srep28840-m3.jpg"></inline-graphic>
</inline-formula>
 such that <inline-formula id="d33e442"><inline-graphic id="d33e443" xlink:href="srep28840-m4.jpg"></inline-graphic>
</inline-formula>
.</p>
<sec disp-level="2"><title>The fundamental informational components of genomes</title>
<p>We denote by <inline-formula id="d33e450"><inline-graphic id="d33e451" xlink:href="srep28840-m5.jpg"></inline-graphic>
</inline-formula>
 the value <inline-formula id="d33e453"><inline-graphic id="d33e454" xlink:href="srep28840-m6.jpg"></inline-graphic>
</inline-formula>
. Of course, <inline-formula id="d33e456"><inline-graphic id="d33e457" xlink:href="srep28840-m7.jpg"></inline-graphic>
</inline-formula>
. We call <inline-formula id="d33e459"><inline-graphic id="d33e460" xlink:href="srep28840-m8.jpg"></inline-graphic>
</inline-formula>
 the logarithmic length of <inline-formula id="d33e462"><inline-graphic id="d33e463" xlink:href="srep28840-m9.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e466"><inline-graphic id="d33e467" xlink:href="srep28840-m10.jpg"></inline-graphic>
</inline-formula>
 the <italic>double logarithmic length</italic>
 of <inline-formula id="d33e472"><inline-graphic id="d33e473" xlink:href="srep28840-m11.jpg"></inline-graphic>
</inline-formula>
. When no possible confusion can arise, we avoid explicitly indicating <inline-formula id="d33e475"><inline-graphic id="d33e476" xlink:href="srep28840-m12.jpg"></inline-graphic>
</inline-formula>
, so we write in short <italic>LG</italic>
, and consequently we denote the entropy <inline-formula id="d33e481"><inline-graphic id="d33e482" xlink:href="srep28840-m13.jpg"></inline-graphic>
</inline-formula>
 over the <inline-formula id="d33e485"><inline-graphic id="d33e486" xlink:href="srep28840-m14.jpg"></inline-graphic>
</inline-formula>
-mers of <inline-formula id="d33e488"><inline-graphic id="d33e489" xlink:href="srep28840-m15.jpg"></inline-graphic>
</inline-formula>
 by <inline-formula id="d33e491"><inline-graphic id="d33e492" xlink:href="srep28840-m16.jpg"></inline-graphic>
</inline-formula>
 (analogous abbreviations are also adopted for other indexes). We also refer to the interval <inline-formula id="d33e494"><inline-graphic id="d33e495" xlink:href="srep28840-m17.jpg"></inline-graphic>
</inline-formula>
 as the <italic>critical entropic interval</italic>
. In the following, when 2<italic>LG</italic>
 is not an integer, <inline-formula id="d33e504"><inline-graphic id="d33e505" xlink:href="srep28840-m18.jpg"></inline-graphic>
</inline-formula>
 denotes the linear interpolation between <inline-formula id="d33e507"><inline-graphic id="d33e508" xlink:href="srep28840-m19.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e510"><inline-graphic id="d33e511" xlink:href="srep28840-m20.jpg"></inline-graphic>
</inline-formula>
, where <italic>k</italic>
<sub>1</sub>
, <italic>k</italic>
<sub>2</sub>
 are the consecutive integers such that <italic>k</italic>
<sub>1</sub>
 &lt; 2<italic>LG</italic>
 &lt; <italic>k</italic>
<sub>2</sub>
. In the case of the human genome, 2<italic>LG</italic>
 is between 31 and 32; in the genomes considered in this paper (from microbes to primates), it ranges between 16 and 36.</p>
<p>We prove, by using well-known results of information theory, that the values <italic>LG</italic>
 and 2<italic>LG</italic>
 have the following properties (see section <italic>Methods</italic>
):<list id="l1" list-type="roman-lower"><list-item><p><inline-formula id="d33e553"><inline-graphic id="d33e554" xlink:href="srep28840-m21.jpg"></inline-graphic>
</inline-formula>
 is an upper bound on the values that entropy can reach over genomes of the same length as <inline-formula id="d33e556"><inline-graphic id="d33e557" xlink:href="srep28840-m22.jpg"></inline-graphic>
</inline-formula>
;</p>
</list-item>
<list-item><p>if <italic>k</italic>
 belongs to the critical interval <inline-formula id="d33e564"><inline-graphic id="d33e565" xlink:href="srep28840-m23.jpg"></inline-graphic>
</inline-formula>
, and <inline-formula id="d33e567"><inline-graphic id="d33e568" xlink:href="srep28840-m24.jpg"></inline-graphic>
</inline-formula>
, then entropies <italic>E</italic>
<sub><italic>k</italic>
</sub>
, for <italic>k</italic>
 ≤ <italic>n</italic>
, reach, on suitable genomes, the best approximations to <inline-formula id="d33e583"><inline-graphic id="d33e584" xlink:href="srep28840-m25.jpg"></inline-graphic>
</inline-formula>
 with an error close to zero, which is smaller than <inline-formula id="d33e586"><inline-graphic id="d33e587" xlink:href="srep28840-m26.jpg"></inline-graphic>
</inline-formula>
, where <inline-formula id="d33e589"><inline-graphic id="d33e590" xlink:href="srep28840-m27.jpg"></inline-graphic>
</inline-formula>
 denotes the smallest integer greater than <italic>x</italic>
.</p>
</list-item>
<list-item><p>entropy <inline-formula id="d33e597"><inline-graphic id="d33e598" xlink:href="srep28840-m28.jpg"></inline-graphic>
</inline-formula>
 reaches its maximum in random genomes of length <inline-formula id="d33e600"><inline-graphic id="d33e601" xlink:href="srep28840-m29.jpg"></inline-graphic>
</inline-formula>
. This result follows from the fact that in random genomes of length <italic>n</italic>
 all lg<sub>2</sub>
(<italic>n</italic>
)-mers are hapaxes, that is, they occur once in the whole genome<xref ref-type="bibr" rid="b37">37</xref>
.</p>
</list-item>
</list>
</p>
<p>In conclusion, the maximum of <inline-formula id="d33e617"><inline-graphic id="d33e618" xlink:href="srep28840-m30.jpg"></inline-graphic>
</inline-formula>
 is almost equal to <inline-formula id="d33e620"><inline-graphic id="d33e621" xlink:href="srep28840-m31.jpg"></inline-graphic>
</inline-formula>
, and this maximum is reached by random genomes of length <inline-formula id="d33e623"><inline-graphic id="d33e624" xlink:href="srep28840-m32.jpg"></inline-graphic>
</inline-formula>
. We observed that the following inequality holds for all of the investigated genomes:</p>
<p><disp-formula id="eq33"><inline-graphic id="d33e628" xlink:href="srep28840-m33.jpg"></inline-graphic>
</disp-formula>
</p>
<p>Therefore, we know that <inline-formula id="d33e631"><inline-graphic id="d33e632" xlink:href="srep28840-m34.jpg"></inline-graphic>
</inline-formula>
 belongs to the (open) real interval of bounds <inline-formula id="d33e634"><inline-graphic id="d33e635" xlink:href="srep28840-m35.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e637"><inline-graphic id="d33e638" xlink:href="srep28840-m36.jpg"></inline-graphic>
</inline-formula>
. Then, we can define the following values <inline-formula id="d33e640"><inline-graphic id="d33e641" xlink:href="srep28840-m37.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e643"><inline-graphic id="d33e644" xlink:href="srep28840-m38.jpg"></inline-graphic>
</inline-formula>
, which we call <italic>Entropic Component</italic>
 and <italic>anti-entropic Component</italic>
 of <inline-formula id="d33e653"><inline-graphic id="d33e654" xlink:href="srep28840-m39.jpg"></inline-graphic>
</inline-formula>
, respectively:</p>
<p><disp-formula id="eq40"><inline-graphic id="d33e658" xlink:href="srep28840-m40.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq41"><inline-graphic id="d33e661" xlink:href="srep28840-m41.jpg"></inline-graphic>
</disp-formula>
</p>
<p>Summing Equations (2) and (3), we obtain <inline-formula id="d33e664"><inline-graphic id="d33e665" xlink:href="srep28840-m42.jpg"></inline-graphic>
</inline-formula>
. The value <inline-formula id="d33e667"><inline-graphic id="d33e668" xlink:href="srep28840-m43.jpg"></inline-graphic>
</inline-formula>
 corresponds to the gap between the double logarithmic entropy <inline-formula id="d33e670"><inline-graphic id="d33e671" xlink:href="srep28840-m44.jpg"></inline-graphic>
</inline-formula>
 and the logarithmic length <inline-formula id="d33e673"><inline-graphic id="d33e674" xlink:href="srep28840-m45.jpg"></inline-graphic>
</inline-formula>
, which is always positive according to the equations above. Moreover, <inline-formula id="d33e676"><inline-graphic id="d33e677" xlink:href="srep28840-m46.jpg"></inline-graphic>
</inline-formula>
 is the gap between the double logarithmic length <inline-formula id="d33e680"><inline-graphic id="d33e681" xlink:href="srep28840-m47.jpg"></inline-graphic>
</inline-formula>
 and the entropy <inline-formula id="d33e683"><inline-graphic id="d33e684" xlink:href="srep28840-m48.jpg"></inline-graphic>
</inline-formula>
, which is positive because <inline-formula id="d33e686"><inline-graphic id="d33e687" xlink:href="srep28840-m49.jpg"></inline-graphic>
</inline-formula>
 is an upper bound to the entropies in the critical entropic interval. The term “anti-entropic” stresses an important difference from the analogous concept of <italic>negentropy</italic>
, which is frequently used to denote the other side of the order/disorder dichotomy associated with entropy (and its time arrow)<xref ref-type="bibr" rid="b38">38</xref>
<xref ref-type="bibr" rid="b39">39</xref>
<xref ref-type="bibr" rid="b40">40</xref>
<xref ref-type="bibr" rid="b41">41</xref>
. In fact, in <italic>anti-entropy</italic>
, no change of sign is involved, but a difference from an upper bound of the entropy is instead considered.</p>
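<p>As an illustration only, the following Python sketch computes the quantities discussed above for a genome given as a plain string. It assumes, consistently with the remark that 2<italic>LG</italic> lies between 31 and 32 for the human genome, that <italic>LG</italic> is the base-4 logarithm of the genome length, so that 2<italic>LG</italic> = lg<sub>2</sub>(<italic>n</italic>), and that entropies are empirical Shannon entropies of <italic>k</italic>-mer occurrences measured in bits. The function names are hypothetical, the code is not the authors' implementation, and it is practical only for short sequences.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def kmer_entropy(genome: str, k: int) -> float:
    """Empirical Shannon entropy (bits) of the k-mer occurrences of genome."""
    total = len(genome) - k + 1  # number of k-mer occurrences
    counts = Counter(genome[i:i + k] for i in range(total))
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropic_components(genome: str) -> dict:
    """Return LG, E_2LG, EC and AC for a genome string.

    Assumption: LG = log4(n), hence 2LG = log2(n). A non-integer 2LG is
    handled by linear interpolation between the two neighbouring integers.
    """
    n = len(genome)
    two_lg = math.log2(n)
    lg = two_lg / 2
    k1, k2 = math.floor(two_lg), math.ceil(two_lg)
    if k1 == k2:
        e = kmer_entropy(genome, k1)
    else:
        e1, e2 = kmer_entropy(genome, k1), kmer_entropy(genome, k2)
        e = e1 + (two_lg - k1) * (e2 - e1)
    # EC: gap between the entropy and LG; AC: gap between 2LG and the entropy.
    return {"LG": lg, "E_2LG": e, "EC": e - lg, "AC": two_lg - e}
```
]]></preformat>
<p>By construction the returned values satisfy <italic>EC</italic> + <italic>AC</italic> = <italic>LG</italic>, which is the identity obtained by summing the two definitions.</p>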
</sec>
<sec disp-level="2"><title>Informational genomic laws</title>
<p>Let us define <inline-formula id="d33e702"><inline-graphic id="d33e703" xlink:href="srep28840-m50.jpg"></inline-graphic>
</inline-formula>
, called <italic>lexical index</italic>
, as the ratio:</p>
<p><disp-formula id="eq51"><inline-graphic id="d33e710" xlink:href="srep28840-m51.jpg"></inline-graphic>
</disp-formula>
</p>
<p>The numerator is essentially the number of words of length 2<italic>LG</italic>
 occurring in random genomes, which, as already noted, are all hapaxes; it therefore coincides with the number of possible occurrences of 2<italic>LG</italic>
-mers in <inline-formula id="d33e719"><inline-graphic id="d33e720" xlink:href="srep28840-m52.jpg"></inline-graphic>
</inline-formula>
. The denominator is the number of words of length 2<italic>LG</italic>
 occurring in <inline-formula id="d33e725"><inline-graphic id="d33e726" xlink:href="srep28840-m53.jpg"></inline-graphic>
</inline-formula>
. This ratio is related to the degree of order that <inline-formula id="d33e729"><inline-graphic id="d33e730" xlink:href="srep28840-m54.jpg"></inline-graphic>
</inline-formula>
 gains with respect to random genomes. In fact, in a random genome <italic>R</italic>
, we have <italic>LX</italic>
(<italic>R</italic>
) = 1; therefore, in a real genome <inline-formula id="d33e741"><inline-graphic id="d33e742" xlink:href="srep28840-m55.jpg"></inline-graphic>
</inline-formula>
, <inline-formula id="d33e744"><inline-graphic id="d33e745" xlink:href="srep28840-m56.jpg"></inline-graphic>
</inline-formula>
. The lexical index is smaller than the ratio <inline-formula id="d33e748"><inline-graphic id="d33e749" xlink:href="srep28840-m57.jpg"></inline-graphic>
</inline-formula>
 but is greater than <inline-formula id="d33e751"><inline-graphic id="d33e752" xlink:href="srep28840-m58.jpg"></inline-graphic>
</inline-formula>
. Moreover, by dividing and multiplying <italic>LX</italic>
 by <inline-formula id="d33e757"><inline-graphic id="d33e758" xlink:href="srep28840-m59.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e760"><inline-graphic id="d33e761" xlink:href="srep28840-m60.jpg"></inline-graphic>
</inline-formula>
, it is possible to obtain lower and upper bounds to <inline-formula id="d33e763"><inline-graphic id="d33e764" xlink:href="srep28840-m61.jpg"></inline-graphic>
</inline-formula>
. The value <inline-formula id="d33e767"><inline-graphic id="d33e768" xlink:href="srep28840-m62.jpg"></inline-graphic>
</inline-formula>
, given by <inline-formula id="d33e770"><inline-graphic id="d33e771" xlink:href="srep28840-m63.jpg"></inline-graphic>
</inline-formula>
, corresponds to the eccentricity of an ellipse associated with <inline-formula id="d33e773"><inline-graphic id="d33e774" xlink:href="srep28840-m64.jpg"></inline-graphic>
</inline-formula>
 (see <xref ref-type="supplementary-material" rid="S1">Supplementary Information</xref>
, <xref ref-type="supplementary-material" rid="S1">Sup. Fig. 3</xref>
). The product of <inline-formula id="d33e782"><inline-graphic id="d33e783" xlink:href="srep28840-m65.jpg"></inline-graphic>
</inline-formula>
 with <inline-formula id="d33e786"><inline-graphic id="d33e787" xlink:href="srep28840-m66.jpg"></inline-graphic>
</inline-formula>
 differs by 1 less than <inline-formula id="d33e789"><inline-graphic id="d33e790" xlink:href="srep28840-m67.jpg"></inline-graphic>
</inline-formula>
. In conclusion, the following laws hold for all seventy investigated genomes:</p>
<p><disp-formula id="eq68"><inline-graphic id="d33e794" xlink:href="srep28840-m68.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq69"><inline-graphic id="d33e797" xlink:href="srep28840-m69.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq70"><inline-graphic id="d33e800" xlink:href="srep28840-m70.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq71"><inline-graphic id="d33e803" xlink:href="srep28840-m71.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq72"><inline-graphic id="d33e806" xlink:href="srep28840-m72.jpg"></inline-graphic>
</disp-formula>
</p>
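<p>The lexical index described in this section can be sketched in a few lines of Python. The sketch assumes that the word length <italic>k</italic> is supplied by the caller as an integer approximation of 2<italic>LG</italic> (how a non-integer 2<italic>LG</italic> is treated for this index is not specified here), and the function name is hypothetical.</p>
<preformat><![CDATA[
```python
def lexical_index(genome: str, k: int) -> float:
    """Possible k-mer occurrences divided by the distinct k-mers observed."""
    occurrences = len(genome) - k + 1  # numerator: all positions of a k-mer
    distinct = len({genome[i:i + k] for i in range(occurrences)})
    return occurrences / distinct


# When every k-mer is a hapax the ratio is exactly 1;
# any repeated k-mer pushes it above 1.
all_hapaxes = lexical_index("acgt", 2)      # 3 occurrences, 3 distinct words
with_repeat = lexical_index("acgtacgt", 4)  # 5 occurrences, 4 distinct words
```
]]></preformat>
<p>The ratio is never smaller than 1, because the number of distinct words cannot exceed the number of occurrences.</p>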
</sec>
<sec disp-level="2"><title>Biobit: a measure of genomic complexity</title>
<p>As we already noticed, <italic>AC</italic>
 is an index measuring the informational distance between genomes and random genomes with the same length. This means that the more biological functions a genome <inline-formula id="d33e815"><inline-graphic id="d33e816" xlink:href="srep28840-m73.jpg"></inline-graphic>
</inline-formula>
 has acquired, the further the genome is from randomness. However, if we directly identify the complexity of <inline-formula id="d33e818"><inline-graphic id="d33e819" xlink:href="srep28840-m74.jpg"></inline-graphic>
</inline-formula>
 with <inline-formula id="d33e821"><inline-graphic id="d33e822" xlink:href="srep28840-m75.jpg"></inline-graphic>
</inline-formula>
, we obtain some biologically inconsistent results. For example, <italic>Zea mays</italic>
 has an <italic>LG</italic>
 value of 15.4701 but an <italic>AC</italic>
 value of 3.6678 (primates have <italic>AC</italic>
 less than 1). These types of anomalies suggested to us that <italic>AC</italic>
 is surely related to the biological complexity of a genome, but this complexity is not a linear function of <italic>AC</italic>
 because the <italic>EC</italic>
 component also has to be considered in a more comprehensive definition of complexity. Our search focused on a function that combines <italic>AC</italic>
 with <italic>EH</italic>
, which is strictly related to <italic>EC</italic>
. If <italic>x</italic>
 briefly denotes the <italic>anti-entropic fraction AF</italic>
 = <italic>AC</italic>
/<italic>LG</italic>
, it is easy to verify that because <italic>EC</italic>
 = <italic>LG</italic>
 − <italic>AC</italic>
, then <italic>EH</italic>
 = (<italic>EC</italic>
 − <italic>AC</italic>
)/<italic>LG</italic>
 = (1 − 2<italic>x</italic>
); therefore, the product <italic>AC</italic>
 * <italic>EH</italic>
 can be represented by:</p>
<p><disp-formula id="eq76"><inline-graphic id="d33e902" xlink:href="srep28840-m76.jpg"></inline-graphic>
</disp-formula>
</p>
<p>This function (after a simple change of variables) is a type of logistic map <italic>ax</italic>
(1 − <italic>x</italic>
), with <italic>a</italic>
 constant, and <italic>x</italic>
 variable ranging in [0, 1], which is very important in population dynamics.</p>
<p>If we generalize <italic>x</italic>
(1 − 2<italic>x</italic>
) in the class of functions <italic>x</italic>
<sup><italic>γ</italic>
</sup>
(1 − 2<italic>x</italic>
)<sup><italic>δ</italic>
</sup>
, with <italic>γ</italic>
 and <italic>δ</italic>
 positive rationals weighting the two factors, then we discover that these functions have maxima for values approaching zero when <italic>γ</italic>
 ≤ 1 decreases and <italic>δ</italic>
 increases. Therefore, because <italic>AC</italic>
 is supposed to have a predominant role in the complexity measure, we define <italic>BB</italic>
<sub><italic>γ</italic>
,<italic>δ</italic>
</sub>
 as <italic>BB</italic>
<sub><italic>γ</italic>
,<italic>δ</italic>
</sub>
 = <italic>x</italic>
<sup><italic>γ</italic>
</sup>
(1 − 2<italic>x</italic>
)<sup><italic>δ</italic>
</sup>
 by choosing the values of the exponents in such a way that maxima of <italic>BB</italic>
<sub><italic>γ</italic>
,<italic>δ</italic>
</sub>
 fall close to the values that the anti-entropic fraction <italic>AF</italic>
 assumes for the most part in genomes with high values of <italic>AC</italic>
 (almost all of them have medium horizontal eccentricity; see <xref ref-type="supplementary-material" rid="S1">Supplementary Information</xref>
, <xref ref-type="supplementary-material" rid="S1">Sup. Table 2</xref>
). No genome on our list reaches the maximum of the chosen function because their <italic>AF</italic>
 value is always smaller (suboptimal genomes) or greater (super-optimal genomes) than the value where the maximum is reached.</p>
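<p>The location of the maximum can be made explicit. Setting the derivative of <italic>γ</italic> ln <italic>x</italic> + <italic>δ</italic> ln(1 − 2<italic>x</italic>) to zero gives <italic>γ</italic>/<italic>x</italic> = 2<italic>δ</italic>/(1 − 2<italic>x</italic>), hence a maximum at <italic>x</italic> = <italic>γ</italic>/(2(<italic>γ</italic> + <italic>δ</italic>)), which moves toward zero as <italic>γ</italic> decreases and <italic>δ</italic> increases. The following Python sketch checks this closed form against a brute-force grid search; the function names and the grid resolution are illustrative assumptions.</p>
<preformat><![CDATA[
```python
def bb(x: float, gamma: float, delta: float) -> float:
    """Value of x**gamma * (1 - 2x)**delta for 0 < x < 1/2."""
    return x ** gamma * (1 - 2 * x) ** delta


def argmax_closed_form(gamma: float, delta: float) -> float:
    """Stationary point obtained from gamma/x = 2*delta/(1 - 2x)."""
    return gamma / (2 * (gamma + delta))


def argmax_grid(gamma: float, delta: float, steps: int = 100_000) -> float:
    """Brute-force maximiser on a uniform grid of the open interval (0, 1/2)."""
    xs = (0.5 * i / steps for i in range(1, steps))
    return max(xs, key=lambda x: bb(x, gamma, delta))
```
]]></preformat>
<p>For <italic>γ</italic> = <italic>δ</italic> = 1 the closed form gives 1/4, the vertex of the parabola <italic>x</italic>(1 − 2<italic>x</italic>).</p>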
<p>In conclusion, we conjecture that the genomic complexity is a non-linear function of <italic>AC</italic>
 having the form (apart from a multiplicative constant):</p>
<p><disp-formula id="eq77"><inline-graphic id="d33e1019" xlink:href="srep28840-m77.jpg"></inline-graphic>
</disp-formula>
</p>
<p>In particular, the following definition, which is an instance of (10), was taken to be the most appropriate (<inline-formula id="d33e1022"><inline-graphic id="d33e1023" xlink:href="srep28840-m78.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e1025"><inline-graphic id="d33e1026" xlink:href="srep28840-m79.jpg"></inline-graphic>
</inline-formula>
):</p>
<p><disp-formula id="eq80"><inline-graphic id="d33e1030" xlink:href="srep28840-m80.jpg"></inline-graphic>
</disp-formula>
</p>
<p>In <xref ref-type="fig" rid="f1">Fig. 1</xref>
, the biobit values, together with the other described informational indexes, of the seventy genomes are visualized in a diagram. In <xref ref-type="fig" rid="f2">Fig. 2</xref>
 a flowchart is given that, in general terms, expresses the main stages for computing the <italic>BB</italic>
 measure of a given genome.</p>
<p>A further law could be associated with the biobit index, according to which genomes <italic>evolve</italic>
 by increasing the value of the <italic>BB</italic>
 function. This means that an ordering, denoted by <inline-formula id="d33e1050"><inline-graphic id="d33e1051" xlink:href="srep28840-m81.jpg"></inline-graphic>
</inline-formula>
 (a reflexive, antisymmetric, and transitive relation), can be defined such that:</p>
<p><disp-formula id="eq82"><inline-graphic id="d33e1056" xlink:href="srep28840-m82.jpg"></inline-graphic>
</disp-formula>
</p>
<p><xref ref-type="table" rid="t1">Table 1</xref>
 reports the main informational indexes based on the two entropic components of the logarithmic length of genomes. <xref ref-type="fig" rid="f3">Figure 3</xref>
 depicts graphically the values of these informational indexes for all of the investigated genomes (see <xref ref-type="supplementary-material" rid="S1">Supplementary Information, Sup. Table 4</xref>
, for the exact numerical values). The lengths of genomes are naturally linearly ordered, thus allowing us to arrange them along the <italic>x</italic>
-axis. Apart from the <italic>EC</italic>
 curve, which is quite coincident with <italic>LG</italic>
, the other indexes present peaks that correspond to genomes differing only slightly in length but greatly in the other indexes.</p>
<p>It is interesting that, in essence, biological evolution is anti-entropic because the <italic>AC</italic>
 component, representing the tendency toward order, increases with the increase of biological functionalities, under the constraint of keeping the ratio <italic>AC</italic>
/<italic>EC</italic>
 under a threshold, as expressed by the factor (1 − 2<italic>x</italic>
)<sup>3</sup>
 of <italic>BB</italic>
.</p>
<p>A 3D-visualization of our seventy genomes, by means of the <italic>AC</italic>
, <italic>LX</italic>
, <italic>BB</italic>
 informational indexes (see <xref ref-type="supplementary-material" rid="S1">Supplementary Information, Sup. Fig. 2</xref>
), reveals that genomic complexity does not coincide with classical phylogenetic classifications, as argued in the next section.</p>
</sec>
</sec>
<sec disp-level="1"><title>Discussion</title>
<p>We think that our informational indexes, and the laws relating them, confirm a very simple and general intuition. If life is information represented and elaborated by means of (organic) molecules, then the laws of information necessarily have to reveal the deep logic of genome structures.</p>
<p>The laws presented in the previous section represent universal aspects of genome structure and may rarely hold for strings of the same length that are not genomes. Therefore, the genomic complexity measure BB, obtained by means of informational indexes, is not a mathematical trick but must be related to the way genomes are organized and to the way in which the genomes were generated. <xref ref-type="fig" rid="f2">Figure 2</xref>
 shows the values of BB along the 70 investigated genomes, and it is clear that BB is related to the evolutionary positions of organisms. However, our approach has an important biological implication in clarifying the difference between phylogenesis and genomic complexity, which are related but different concepts. In fact, several cases have been found (see <xref ref-type="fig" rid="f2">Fig. 2</xref>
 and <xref ref-type="supplementary-material" rid="S1">Sup. Fig. 2</xref>
 in Supplementary Information) where organisms that are phylogenetically more primitive than others, for example bacteria, have biobit values greater than those of “more evolved” organisms. The reason could be the following. A bacterium that we sequence today is an evolutionary product of some primitive bacterium. Even if we do not know the path from the bacterium’s (possibly unknown) ancestor to the bacterium, its complexity along this path grew over time because its evolutionary age is the same as H. sapiens (even along different branches). The genomic complexity of <inline-formula id="d33e1127"><inline-graphic id="d33e1128" xlink:href="srep28840-m83.jpg"></inline-graphic>
</inline-formula>
 is, in a sense, a measure of the relevant steps from random genomes to <inline-formula id="d33e1130"><inline-graphic id="d33e1131" xlink:href="srep28840-m84.jpg"></inline-graphic>
</inline-formula>
. Surely, these steps reflect the evolutionary pressure and the biological interactions and competition among species. However, if we forget this perspective, we lose an important aspect of evolutionary dynamics. This is why complexity-driven classifications that completely agree with phylogenesis are almost impossible. For example, we found that bacteria associated with human diseases have BB values significantly higher than those of other bacteria that are phylogenetically comparable to them. The BB measure is a sort of absolute distance from randomness, whereas phylogenesis concerns similarity or dissimilarity between species. Therefore, a very natural question arises, which suggests a further development of the presented theory. Can entropic divergences (Kullback-Leibler divergence or similar concepts) be applied to phylogenetic analysis of genomes by means of “common words” and their probability distributions in the compared genomes? Finally, what is the applicability of our indexes in the identification of informational features that are relevant in specific pathological genetic disorders? Of course, these questions deserve specific investigations; however, our informational indexes with the related laws, and computational tools, provide a framework on which these informational analyses may be fruitfully set. We argue that it is almost impossible that functional changes do not correspond to precise informational alterations in the relationships expressed by the genomic laws. The challenge is in discovering the specific keys of these correspondences.</p>
<p>We developed some computational experiments showing a direct applicability of informational indexes and related genomic laws to the emergent field of synthetic biology. In fact, recent experiments on minimal bacteria<xref ref-type="bibr" rid="b42">42</xref>
 are based on the search for genome sequences obtained by manipulating and reducing some real genomes. It has been shown that after removing some parts of the <italic>M</italic>
. <italic>mycoides</italic>
 genome, the resulting organism, JCVI-syn3.0 (531 kilobase pairs, 473 genes), is able to survive and has a genome smaller than that of any autonomously replicating cell found in nature (very close to <italic>M</italic>
. <italic>genitalium</italic>
). Of course, in this manner a better understanding of basic biological functions is gained, which relates directly to the investigated genome (removing essential portions results in life disruption). On the basis of this principle, we considered <italic>M</italic>
. <italic>genitalium</italic>
 and removed some portions of its genome through a greedy exploration of the huge space of possibilities. At every step of our genome modifications (of many different types), we checked the validity of our genomic laws. We found that, after removing portions of the genome, some of our laws do not hold in the resulting sequences (see Supplementary Information, <xref ref-type="supplementary-material" rid="S1">Sup. Table 6</xref>
). Of course, these methods need to be carefully analyzed and validated with other examples and comparisons. However, a clear indication seems to emerge about the applicability of informational indexes and laws, possibly after suitable improvements, to support and complement the development of genome synthesis and analysis, in the spirit of new trends in synthetic biology.</p>
<p>The starting point of our investigation was the comparison of real genomes with random genomes of the same length. For this purpose, a length of <italic>k</italic>
-mers equal to the double logarithmic length of genomes was identified as the most appropriate for this comparison because at this length random genomes reach their maximum entropy. The difference between entropies was considered a measure of the order acquired by real genomes and corresponded to their capability of realizing biological functions. This intuition was supported by the values of indexes that we computed for an initial list of genomes. In fact, <xref ref-type="supplementary-material" rid="S1">Sup. Table 3</xref>
 in Supplementary Information provides <italic>AC</italic>
 values that, apart from two evident exceptions, seem to confirm the increase of the <italic>AC</italic>
 value in accordance with the macroscopic biological complexity of organisms (independently of length, number of genes, or other typical genomic parameters). However, when we extended our analysis by including other genomes<xref ref-type="bibr" rid="b43">43</xref>
, we found <italic>AC</italic>
 values that were anomalous with respect to those already collected. In particular, plants provided extreme values, with no coherence with our interpretation of the <italic>AC</italic>
 index. To solve this puzzle, we considered a more comprehensive framework where <italic>AC</italic>
 and <italic>EC</italic>
 values interact in a trade-off between order and randomness. Genomes deviate from randomness, but only to some extent, because they need a level of randomness sufficient to preserve their evolutionary nature, which is based on a random exploration of new possibilities of life (filtered by natural selection).</p>
<p>In this picture, the two quantities <inline-formula id="d33e1190"><inline-graphic id="d33e1191" xlink:href="srep28840-m85.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e1193"><inline-graphic id="d33e1194" xlink:href="srep28840-m86.jpg"></inline-graphic>
</inline-formula>
 seem to correspond to the informational measure of two important aspects of genomes: <italic>evolvability</italic>
 and <italic>programmability</italic>
 (in the sense of<xref ref-type="bibr" rid="b2">2</xref>
). Evolvability measures the random component of genomes, whereas programmability measures the order that genomes gain with respect to pure random genomes by acquiring biological functions. The non-random meaning of <italic>AC</italic>
 can be mathematically characterized in terms of Kullback-Leibler entropic divergence between the probability distribution of words of <inline-formula id="d33e1208"><inline-graphic id="d33e1209" xlink:href="srep28840-m87.jpg"></inline-graphic>
</inline-formula>
 and the probability distribution of the same words in random genomes<xref ref-type="bibr" rid="b44">44</xref>
.</p>
<p>Genome evolution is realized through an interplay of programmability and evolvability. The anti-entropic component <italic>AC</italic>
 cannot increase beyond a percentage of the logarithmic length because <italic>LG</italic>
 = <italic>AC</italic>
 + <italic>EC</italic>
 and therefore an increase of <italic>AC</italic>
 implies a decrease of <italic>EC</italic>
, thereby reducing the evolutionary ability. Therefore, the only way to increase <italic>AC</italic>
, while keeping a good balance of the two components, is to increase the value of <italic>LG</italic>
, i.e., the genome length, which explains why genomes increase their length during evolution. However, this increase is only indirectly correlated with biological complexity, as apparent in <xref ref-type="fig" rid="f1">Fig. 1</xref>
 (see also Supplementary Information, <xref ref-type="supplementary-material" rid="S1">Sup. Table 3</xref>
).</p>
<p>The definition of genomic complexity, in terms of a nonlinear function of <italic>AC</italic>
, is related to the balance between <italic>AC</italic>
 and <italic>EC</italic>
 values. Some of the genome entropic laws also continue to hold for <italic>k</italic>
-mers with <inline-formula id="d33e1260"><inline-graphic id="d33e1261" xlink:href="srep28840-m88.jpg"></inline-graphic>
</inline-formula>
, but almost none of the laws continue to hold when <inline-formula id="d33e1264"><inline-graphic id="d33e1265" xlink:href="srep28840-m89.jpg"></inline-graphic>
</inline-formula>
. For example, for <italic>k</italic>
 = 6 and <inline-formula id="d33e1270"><inline-graphic id="d33e1271" xlink:href="srep28840-m90.jpg"></inline-graphic>
</inline-formula>
, the values of <italic>AC</italic>
 completely lose the logic that they have for <inline-formula id="d33e1276"><inline-graphic id="d33e1277" xlink:href="srep28840-m91.jpg"></inline-graphic>
</inline-formula>
, by showing dramatic changes with respect to <inline-formula id="d33e1279"><inline-graphic id="d33e1280" xlink:href="srep28840-m92.jpg"></inline-graphic>
</inline-formula>
, on which our indexes are based (see Supplementary Information, <xref ref-type="supplementary-material" rid="S1">Sup. Table 4</xref>
). Of course, we could also compare real and random genomes for lengths shorter than <inline-formula id="d33e1286"><inline-graphic id="d33e1287" xlink:href="srep28840-m93.jpg"></inline-graphic>
</inline-formula>
, but in this case, we need to generate random genomes and compute the corresponding entropies, whereas for <inline-formula id="d33e1289"><inline-graphic id="d33e1290" xlink:href="srep28840-m94.jpg"></inline-graphic>
</inline-formula>
, we do not need such generations and computations, because we know, by theoretical arguments (see Proposition 3), that in random genomes, entropies at double logarithmic lengths can be assumed to be equal to <inline-formula id="d33e1292"><inline-graphic id="d33e1293" xlink:href="srep28840-m95.jpg"></inline-graphic>
</inline-formula>
.</p>
<p>Our investigation can be compared to the astronomical observations measuring positions and times in the orbits of celestial objects. Kepler’s laws arose from the regularities found in planetary motions, and from Kepler’s laws, the laws of mechanics emerged. This astronomical comparison, which was an inspiring analogy, revealed a surprising coincidence when ellipses were introduced in the representation of entropic and anti-entropic components. Kepler’s laws were explained by Newton’s dynamical and gravitational principles. Continuing our analogy, probably deeper informational principles are the ultimate reason for the laws that we found.</p>
</sec>
<sec disp-level="1"><title>Methods</title>
<p>The seventy investigated genomes include prokaryotes, algae, amoebae, fungi, plants, and animals of different types. In <xref ref-type="supplementary-material" rid="S1">Sup. Table 5</xref>
 of Supplementary Information, source databases, assembly identifiers, genome lengths, and percentages of unknown nucleotides are given. Basic concepts from information theory, probability theory, and formal language theory can be found in classical texts in these fields<xref ref-type="bibr" rid="b5">5</xref>
<xref ref-type="bibr" rid="b45">45</xref>
<xref ref-type="bibr" rid="b46">46</xref>
.</p>
<sec disp-level="2"><title>Basic definitions and notation</title>
<p>Strings are finite sequences of contiguous symbols. Mathematically, strings are functions from a set of positions, viewed as a subset of the set <inline-formula id="d33e1312"><inline-graphic id="d33e1313" xlink:href="srep28840-m96.jpg"></inline-graphic>
</inline-formula>
 of natural numbers, <inline-formula id="d33e1315"><inline-graphic id="d33e1316" xlink:href="srep28840-m97.jpg"></inline-graphic>
</inline-formula>
 to a set of symbols, called <italic>alphabet</italic>
. The number <italic>n</italic>
 is called the length of the string. We denote generic strings with Greek letters (possibly with subscripts) and reserve <italic>λ</italic>
 for the empty string (useful for expressing mathematical properties of strings). The length of a string <italic>α</italic>
 is denoted by |<italic>α</italic>
|, and <italic>α</italic>
[<italic>i</italic>
] is the symbol occurring in <italic>α</italic>
 at position <italic>i</italic>
, whereas <italic>α</italic>
[<italic>i</italic>
, <italic>j</italic>
] is the string occurring in <italic>α</italic>
 between the positions <italic>i</italic>
 and <italic>j</italic>
 (both included).</p>
<p>Let us consider the genomic alphabet of four symbols (characters, or letters, associated with nucleotides) {<italic>a</italic>
, <italic>c</italic>
, <italic>g</italic>
, <italic>t</italic>
}. The set {<italic>a</italic>
, <italic>c</italic>
, <italic>g</italic>
, <italic>t</italic>
}*, as usual, denotes the set of all possible strings over {<italic>a</italic>
, <italic>c</italic>
, <italic>g</italic>
, <italic>t</italic>
}. A genome <inline-formula id="d33e1405"><inline-graphic id="d33e1406" xlink:href="srep28840-m98.jpg"></inline-graphic>
</inline-formula>
 is representable by a string of {<italic>a</italic>
, <italic>c</italic>
, <italic>g</italic>
, <italic>t</italic>
}*, where symbols that occur, from the first to the last position, are written in the order that they occur, from left to right, according to the standard writing system of Western languages, and according to the chemical orientation 5′–3′ of DNA molecules.</p>
<p>Substrings <inline-formula id="d33e1422"><inline-graphic id="d33e1423" xlink:href="srep28840-m99.jpg"></inline-graphic>
</inline-formula>
 of length <italic>k</italic>
, where <inline-formula id="d33e1428"><inline-graphic id="d33e1429" xlink:href="srep28840-m100.jpg"></inline-graphic>
</inline-formula>
, are also called <italic>k-words</italic>
, <italic>k-factors</italic>
, <italic>k-mers</italic>
 of <inline-formula id="d33e1441"><inline-graphic id="d33e1442" xlink:href="srep28840-m101.jpg"></inline-graphic>
</inline-formula>
 (<italic>k</italic>
 may be omitted when it is not relevant). We remark that the absolute value notation |−| used for string length has a different meaning when applied to sets or multisets. In fact, for a finite set <italic>A</italic>
, |<italic>A</italic>
| denotes its cardinality, whereas for a finite multiset <italic>X</italic>
 (a collection of elements that may occur in several “identical” copies, where the order of occurrence is irrelevant), |<italic>X</italic>
| denotes its size (the number of elements of <italic>X</italic>
, each counted as many times as it occurs).</p>
<p>A <italic>dictionary of</italic>
<inline-formula id="d33e1468"><inline-graphic id="d33e1469" xlink:href="srep28840-m102.jpg"></inline-graphic>
</inline-formula>
 is a set of strings occurring in <inline-formula id="d33e1471"><inline-graphic id="d33e1472" xlink:href="srep28840-m103.jpg"></inline-graphic>
</inline-formula>
. We denote by <inline-formula id="d33e1474"><inline-graphic id="d33e1475" xlink:href="srep28840-m104.jpg"></inline-graphic>
</inline-formula>
 the dictionary of all <italic>k</italic>
-mers occurring in <italic>G</italic>
. It is easy to verify that the number of occurrences of <italic>k</italic>
-mers in <inline-formula id="d33e1487"><inline-graphic id="d33e1488" xlink:href="srep28840-m105.jpg"></inline-graphic>
</inline-formula>
 is <inline-formula id="d33e1490"><inline-graphic id="d33e1491" xlink:href="srep28840-m106.jpg"></inline-graphic>
</inline-formula>
 (<inline-formula id="d33e1493"><inline-graphic id="d33e1494" xlink:href="srep28840-m107.jpg"></inline-graphic>
</inline-formula>
 is the length of <inline-formula id="d33e1496"><inline-graphic id="d33e1497" xlink:href="srep28840-m108.jpg"></inline-graphic>
</inline-formula>
) and corresponds to the maximum cardinality <inline-formula id="d33e1500"><inline-graphic id="d33e1501" xlink:href="srep28840-m109.jpg"></inline-graphic>
</inline-formula>
 reachable by a dictionary of <italic>k</italic>
-mers within genomes of the same length as <inline-formula id="d33e1506"><inline-graphic id="d33e1507" xlink:href="srep28840-m110.jpg"></inline-graphic>
</inline-formula>
.</p>
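<p>A minimal Python sketch of these definitions follows. It is only illustrative: the function names are hypothetical, and the count of occurrences is taken to be n − k + 1 for a genome of length n, which is the value used in the proofs of this section.</p>
<preformat>
```python
def kmer_occurrences(genome: str, k: int) -> list:
    """All occurrences of k-mers in the genome, one per start position."""
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


def dictionary(genome: str, k: int) -> set:
    """The dictionary of all distinct k-mers occurring in the genome."""
    return set(kmer_occurrences(genome, k))


genome = "acgtacgt"
k = 3
occurrences = kmer_occurrences(genome, k)
words = dictionary(genome, k)
# The number of occurrences is n - k + 1, which bounds the dictionary size.
print(len(occurrences), len(genome) - k + 1, len(words))
```
</preformat>
<p>In this toy genome some 3-mers occur twice, so the dictionary is smaller than the number of occurrences; the two values coincide only when every k-mer occurs once.</p>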
<p>A word <italic>α</italic>
 of <italic>D</italic>
 can occur in <inline-formula id="d33e1517"><inline-graphic id="d33e1518" xlink:href="srep28840-m111.jpg"></inline-graphic>
</inline-formula>
 many times. We denote by <inline-formula id="d33e1520"><inline-graphic id="d33e1521" xlink:href="srep28840-m112.jpg"></inline-graphic>
</inline-formula>
 its <italic>multiplicity</italic>
 in <inline-formula id="d33e1527"><inline-graphic id="d33e1528" xlink:href="srep28840-m113.jpg"></inline-graphic>
</inline-formula>
, that is, the number of times <italic>α</italic>
 occurs in <inline-formula id="d33e1533"><inline-graphic id="d33e1534" xlink:href="srep28840-m114.jpg"></inline-graphic>
</inline-formula>
. A word of <inline-formula id="d33e1536"><inline-graphic id="d33e1537" xlink:href="srep28840-m115.jpg"></inline-graphic>
</inline-formula>
 with multiplicity greater than 1 is called a <italic>repeat</italic>
 of <inline-formula id="d33e1542"><inline-graphic id="d33e1543" xlink:href="srep28840-m116.jpg"></inline-graphic>
</inline-formula>
, whereas a word with multiplicity equal to 1 is called a <italic>hapax</italic>
 of <inline-formula id="d33e1549"><inline-graphic id="d33e1550" xlink:href="srep28840-m117.jpg"></inline-graphic>
</inline-formula>
. The term hapax is used in the philological study of texts, and it is also adopted in document indexing and compression<xref ref-type="bibr" rid="b37">37</xref>
. The values of word multiplicities can be normalized if we divide the word multiplicities by the sum of the multiplicities of all the words occurring in <inline-formula id="d33e1554"><inline-graphic id="d33e1555" xlink:href="srep28840-m118.jpg"></inline-graphic>
</inline-formula>
. This normalization corresponds to replacing multiplicities with frequencies, which can be seen as percentages of multiplicity.</p>
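<p>Multiplicities, repeats, hapaxes, and frequencies can be sketched in a few lines of Python. The sketch is illustrative only; the function names are hypothetical and a plain counter replaces the suffix-array machinery used for real genomes.</p>
<preformat>
```python
from collections import Counter


def multiplicities(genome: str, k: int) -> Counter:
    """Map each k-mer of the genome to its number of occurrences."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def repeats(genome: str, k: int) -> set:
    """k-mers with multiplicity greater than 1."""
    return {w for w, m in multiplicities(genome, k).items() if m > 1}


def hapaxes(genome: str, k: int) -> set:
    """k-mers with multiplicity equal to 1."""
    return {w for w, m in multiplicities(genome, k).items() if m == 1}


def frequencies(genome: str, k: int) -> dict:
    """Multiplicities divided by the sum of all multiplicities."""
    counts = multiplicities(genome, k)
    total = sum(counts.values())
    return {w: m / total for w, m in counts.items()}


print(sorted(repeats("acgtacgt", 3)), sorted(hapaxes("acgtacgt", 3)))
print(frequencies("acgtacgt", 3))
```
</preformat>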
<p>Many important indexes related to characteristics of genome dictionaries can be defined on genomes. For example, <inline-formula id="d33e1559"><inline-graphic id="d33e1560" xlink:href="srep28840-m119.jpg"></inline-graphic>
</inline-formula>
 is the length of the longest repeats of <inline-formula id="d33e1562"><inline-graphic id="d33e1563" xlink:href="srep28840-m120.jpg"></inline-graphic>
</inline-formula>
. Of course, <inline-formula id="d33e1565"><inline-graphic id="d33e1566" xlink:href="srep28840-m121.jpg"></inline-graphic>
</inline-formula>
 is the minimum length, such that <italic>k</italic>
-mers with <italic>k</italic>
 greater than <inline-formula id="d33e1575"><inline-graphic id="d33e1576" xlink:href="srep28840-m122.jpg"></inline-graphic>
</inline-formula>
 are all hapaxes.</p>
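<p>The longest repeat length, written <italic>mrl</italic> later in this section, can be computed naively as in the following Python sketch. It is an illustration under the assumption of small inputs, with hypothetical function names; it relies on the fact that every prefix of a repeat is itself a repeat, so the search can stop at the first length with no repeats.</p>
<preformat>
```python
def has_repeat(genome: str, k: int) -> bool:
    """True if some k-mer occurs at least twice in the genome."""
    seen = set()
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if word in seen:
            return True
        seen.add(word)
    return False


def mrl(genome: str) -> int:
    """Length of the longest repeats (0 if no symbol occurs twice)."""
    k = 0
    # If no (k + 1)-mer repeats, no longer word can repeat either.
    while has_repeat(genome, k + 1):
        k += 1
    return k


genome = "acgtacgt"
longest = mrl(genome)
# For every k greater than mrl, all k-mers are hapaxes.
print(longest, has_repeat(genome, longest + 1))
```
</preformat>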
<p>Shannon used the term <italic>information source</italic>
 as synonymous with discrete probability distribution to introduce the notion of (information) <italic>entropy</italic>
. Given a distribution of probability <italic>p</italic>
, over a finite set <italic>A</italic>
, its entropy is given by <inline-formula id="d33e1592"><inline-graphic id="d33e1593" xlink:href="srep28840-m123.jpg"></inline-graphic>
</inline-formula>
. We remark that if −lg<sub>2</sub>
(<italic>p</italic>
(<italic>x</italic>
)) is considered to be the information associated with the occurrence of <italic>x</italic>
∈<italic>A</italic>
 (the more improbable <italic>x</italic>
 is, the more its occurrence is informative), then entropy is the mean (in a probabilistic sense) quantity of information emitted by the information source (<italic>A</italic>
, <italic>p</italic>
).</p>
<p>An intrinsic property of entropy is its <italic>Equipartition Property</italic>
, that is, in the finite case, the fact that entropy reaches its maximum value lg<sub>2</sub>
(|<italic>A</italic>
|), when <italic>p</italic>
 is uniformly distributed, that is, when <italic>p</italic>
(<italic>x</italic>
) = 1/|<italic>A</italic>
|, for all <italic>x</italic>
 ∈ <italic>A</italic>
 (|<italic>A</italic>
| is the number of elements of <italic>A</italic>
).</p>
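<p>The entropy of a finite information source, read as the mean of −lg<sub>2</sub>(<italic>p</italic>(<italic>x</italic>)) as described above, and its Equipartition Property can be checked numerically with a short Python sketch. The function name and the example distributions are illustrative assumptions.</p>
<preformat>
```python
import math


def entropy(p: dict) -> float:
    """Mean of -lg2(p(x)) over the source; zero-probability symbols are skipped."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)


uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
skewed = {"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1}

# The uniform source reaches lg2(|A|); any other source stays below it.
print(entropy(uniform), math.log2(len(uniform)))
print(entropy(skewed))
```
</preformat>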
<p>A genome <inline-formula id="d33e1659"><inline-graphic id="d33e1660" xlink:href="srep28840-m124.jpg"></inline-graphic>
</inline-formula>
 is any sequence over the alphabet {<italic>a</italic>
, <italic>c</italic>
, <italic>g</italic>
, <italic>t</italic>
}. This definition includes both real genomes and ideal genomes with no biological meaning; the latter are important in the mathematical analysis of genomes, just as “material points” are essential in physics for discovering the laws of motion. Any subsequence of contiguous symbols of <inline-formula id="d33e1675"><inline-graphic id="d33e1676" xlink:href="srep28840-m125.jpg"></inline-graphic>
</inline-formula>
 is called a string, word, or <italic>k</italic>
-mer of <inline-formula id="d33e1681"><inline-graphic id="d33e1682" xlink:href="srep28840-m126.jpg"></inline-graphic>
</inline-formula>
 (<italic>k</italic>
 explicitly expresses the length).</p>
<p>The <italic>empirical k-entropy</italic>
<inline-formula id="d33e1692"><inline-graphic id="d33e1693" xlink:href="srep28840-m127.jpg"></inline-graphic>
</inline-formula>
 of <inline-formula id="d33e1695"><inline-graphic id="d33e1696" xlink:href="srep28840-m128.jpg"></inline-graphic>
</inline-formula>
 is given by (the adjective empirical refers to the use of frequencies):</p>
<p><disp-formula id="eq129"><inline-graphic id="d33e1701" xlink:href="srep28840-m129.jpg"></inline-graphic>
</disp-formula>
</p>
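<p>A minimal Python sketch of the empirical <italic>k</italic>-entropy follows. It assumes that the entropy is taken over the frequencies of the <italic>k</italic>-mers occurring in the genome, with each frequency equal to the multiplicity divided by n − k + 1; the function name is hypothetical.</p>
<preformat>
```python
import math
from collections import Counter


def empirical_entropy(genome: str, k: int) -> float:
    """Entropy of the frequency distribution of the k-mers occurring in the genome."""
    total = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(total))
    return -sum((m / total) * math.log2(m / total) for m in counts.values())


genome = "acgtacgt"
for k in range(1, 6):
    print(k, empirical_entropy(genome, k))
```
</preformat>
<p>Only words that actually occur contribute to the sum, so no term with zero frequency has to be handled.</p>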
<p>We remark that the entropy <inline-formula id="d33e1704"><inline-graphic id="d33e1705" xlink:href="srep28840-m130.jpg"></inline-graphic>
</inline-formula>
 is computed only with the <italic>k</italic>
-mers occurring in <inline-formula id="d33e1710"><inline-graphic id="d33e1711" xlink:href="srep28840-m131.jpg"></inline-graphic>
</inline-formula>
 (see definition of <inline-formula id="d33e1713"><inline-graphic id="d33e1714" xlink:href="srep28840-m132.jpg"></inline-graphic>
</inline-formula>
). The computation of <inline-formula id="d33e1716"><inline-graphic id="d33e1717" xlink:href="srep28840-m133.jpg"></inline-graphic>
</inline-formula>
 becomes prohibitive when <inline-formula id="d33e1720"><inline-graphic id="d33e1721" xlink:href="srep28840-m134.jpg"></inline-graphic>
</inline-formula>
 has length of order 10<sup>9</sup>
 and <italic>k</italic>
 > 20. Therefore, we used suffix arrays<xref ref-type="bibr" rid="b47">47</xref>
 in the computation of genomic dictionaries.</p>
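<p>The idea of reading a <italic>k</italic>-mer dictionary off a suffix array can be sketched in Python as follows. This is not the IGTools implementation: the construction below is a naive one that sorts suffixes directly and is suitable only for short strings, and the function names are hypothetical. It shows that, once suffixes are sorted, all occurrences of the same <italic>k</italic>-mer are adjacent, so multiplicities can be obtained in one scan without a hash table of words.</p>
<preformat>
```python
from itertools import groupby


def suffix_array(genome: str) -> list:
    """Start positions of all suffixes, sorted lexicographically (naive construction)."""
    return sorted(range(len(genome)), key=lambda start: genome[start:])


def kmer_multiplicities(genome: str, k: int) -> list:
    """List of (k-mer, multiplicity) pairs, in lexicographic order."""
    n = len(genome)
    # Keep only suffixes long enough to contain a k-mer.
    starts = [s for s in suffix_array(genome) if n - s >= k]
    # Suffixes sharing their first k symbols are adjacent in the sorted order.
    runs = groupby(starts, key=lambda s: genome[s:s + k])
    return [(word, len(list(run))) for word, run in runs]


print(suffix_array("acgtacgt"))
print(kmer_multiplicities("acgtacgt", 3))
```
</preformat>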
<p>A <italic>Bernoullian</italic>
, or random, genome is a synthetic genome generated by random (blind) extractions with replacement (each ball is reinserted after extraction) from an urn containing four types of balls, in equal numbers of copies, identical apart from their colors, which are denoted by the genomic letters <italic>a</italic>
, <italic>c</italic>
, <italic>g</italic>
, <italic>t</italic>
. Pseudo-Bernoullian genomes can be generated by means of (pseudo) random generators available in programming languages (by suitable encoding of genomic symbols). We denote by <italic>RND</italic>
<sub><italic>n</italic>
</sub>
 the class of Bernoullian genomes of length <italic>n</italic>
.</p>
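<p>A pseudo-Bernoullian genome can be generated, for instance, with the following Python sketch. The function name and the optional seed parameter are illustrative assumptions; the standard library generator stands in for the urn with replacement.</p>
<preformat>
```python
import random


def bernoullian_genome(n: int, seed=None) -> str:
    """Draw n symbols independently and uniformly from the genomic alphabet."""
    rng = random.Random(seed)
    # choices() samples with replacement, like reinserting each ball in the urn.
    return "".join(rng.choices("acgt", k=n))


genome = bernoullian_genome(60, seed=1)
print(genome)
```
</preformat>
<p>Fixing the seed makes the generated genome reproducible, which is convenient when comparing indexes across runs.</p>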
<p>The computations of the main informational indexes, given in <xref ref-type="table" rid="t1">Table 1</xref>
, extract the set of <inline-formula id="d33e1763"><inline-graphic id="d33e1764" xlink:href="srep28840-m135.jpg"></inline-graphic>
</inline-formula>
-mers occurring in the considered genomes, where <inline-formula id="d33e1766"><inline-graphic id="d33e1767" xlink:href="srep28840-m136.jpg"></inline-graphic>
</inline-formula>
 varies from 16 to 36, by means of dedicated software based on suffix arrays, called InfoGenomics Tools (IGTools for short)<xref ref-type="bibr" rid="b31">31</xref>
, which is an efficient suite of interactive tools mainly designed for extracting <italic>k</italic>
-dictionaries, computing on them distributions and set-theoretic operations, and finally evaluating empirical entropies <italic>E</italic>
<sub><italic>k</italic>
</sub>
, and informational indexes, for different and even very large values of <italic>k</italic>
.</p>
<p>In <xref ref-type="supplementary-material" rid="S1">Supplementary Information</xref>
, a 3D-visualization (<xref ref-type="supplementary-material" rid="S1">Sup. Fig. 2</xref>
) of 70 genomes is given with respect to <italic>BB</italic>
, <italic>AC</italic>
, <italic>LX</italic>
 axes, where Principal Component Analysis is applied for a better visualization. A taxonomy tree of the 70 genomes has been built via the NCBI taxonomy<xref ref-type="bibr" rid="b48">48</xref>
 (see <xref ref-type="supplementary-material" rid="S1">Supplementary Information, Sup. Fig. 1</xref>
).</p>
</sec>
<sec disp-level="2"><title>Mathematical Backgrounds</title>
<p>In the following, some propositions are given, which were essential to the identification of parameters on which information entropies are computed. Let us start with the following question. Given a genome length <italic>n</italic>
 and a value <italic>k</italic>
 ≤ <italic>n</italic>
, what is the maximum value of <inline-formula id="d33e1821"><inline-graphic id="d33e1822" xlink:href="srep28840-m137.jpg"></inline-graphic>
</inline-formula>
 in the class of genomes of length <italic>n</italic>
? We answer this question with Proposition 3, which is based on two lemmas.</p>
<p><bold>Lemma 1</bold>
<italic>Given a genome</italic>
<inline-formula id="d33e1834"><inline-graphic id="d33e1835" xlink:href="srep28840-m138.jpg"></inline-graphic>
</inline-formula>
<italic>of length n</italic>
, <italic>if</italic>
<inline-formula id="d33e1843"><inline-graphic id="d33e1844" xlink:href="srep28840-m139.jpg"></inline-graphic>
</inline-formula>
, <italic>then</italic>
<inline-formula id="d33e1850"><inline-graphic id="d33e1851" xlink:href="srep28840-m140.jpg"></inline-graphic>
</inline-formula>
<italic>is the maximum value that E</italic>
<sub><italic>k</italic>
</sub>
<italic>can reach in the class of all possible genomes of length n</italic>
.</p>
<p><bold>Proof.</bold>
 The minimum value of <italic>k</italic>
 such that all <italic>k</italic>
-mers are hapaxes of <inline-formula id="d33e1872"><inline-graphic id="d33e1873" xlink:href="srep28840-m141.jpg"></inline-graphic>
</inline-formula>
 is <inline-formula id="d33e1875"><inline-graphic id="d33e1876" xlink:href="srep28840-m142.jpg"></inline-graphic>
</inline-formula>
. Therefore, if <inline-formula id="d33e1878"><inline-graphic id="d33e1879" xlink:href="srep28840-m143.jpg"></inline-graphic>
</inline-formula>
, then <inline-formula id="d33e1882"><inline-graphic id="d33e1883" xlink:href="srep28840-m144.jpg"></inline-graphic>
</inline-formula>
 is maximum, according to the entropy Equipartition Property, because we have the maximum number of words occurring once in <inline-formula id="d33e1885"><inline-graphic id="d33e1886" xlink:href="srep28840-m145.jpg"></inline-graphic>
</inline-formula>
, and all these words have the same probability of occurring in <inline-formula id="d33e1888"><inline-graphic id="d33e1889" xlink:href="srep28840-m146.jpg"></inline-graphic>
</inline-formula>
. ☐</p>
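<p>The argument of Lemma 1 can be checked on a toy genome with the following Python sketch. It assumes that the maximum in question is lg<sub>2</sub>(n − k + 1), the entropy of a uniform distribution over n − k + 1 distinct words; the function name is hypothetical.</p>
<preformat>
```python
import math
from collections import Counter


def empirical_entropy(genome: str, k: int) -> float:
    """Entropy of the k-mer frequency distribution of the genome."""
    total = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(total))
    return -sum((m / total) * math.log2(m / total) for m in counts.values())


genome = "acgtacgt"   # its longest repeat, acgt, has length 4
n = len(genome)
for k in (3, 5):
    bound = math.log2(n - k + 1)
    # For k = 5 all k-mers are hapaxes and the bound is reached.
    print(k, empirical_entropy(genome, k), bound)
```
</preformat>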
<p><bold>Lemma 2</bold>
<italic>If R is a random genome of length n</italic>
, <italic>then</italic>
</p>
<p><disp-formula id="eq147"><inline-graphic id="d33e1902" xlink:href="srep28840-m147.jpg"></inline-graphic>
</disp-formula>
</p>
<p><bold>Proof.</bold>
 Let <italic>RND</italic>
<sub><italic>n</italic>
</sub>
 be the class of random genomes of length <italic>n</italic>
. If <italic>k</italic>
 = <italic>mrl</italic>
(<italic>R</italic>
) + 1, the probability that a <italic>k</italic>
-mer occurs in <italic>R</italic>
∈<italic>RND</italic>
<sub><italic>n</italic>
</sub>
 is (<italic>n</italic>
 − <italic>k</italic>
 + 1)/4<sup><italic>k</italic>
</sup>
, and the probability that it occurs exactly once in <italic>R</italic>
 (all <italic>k</italic>
-mers being hapaxes) is 1/(<italic>n</italic>
 − <italic>k</italic>
 + 1). Therefore, by equating these two probabilities we get:</p>
<p><disp-formula id="eq148"><inline-graphic id="d33e1963" xlink:href="srep28840-m148.jpg"></inline-graphic>
</disp-formula>
</p>
<p>that is:</p>
<p><disp-formula id="eq149"><inline-graphic id="d33e1968" xlink:href="srep28840-m149.jpg"></inline-graphic>
</disp-formula>
</p>
<p>which implies (<italic>k</italic>
 has to be an integer) that the minimum length <italic>k</italic>
 for having all hapaxes in <italic>R</italic>
 is:</p>
<p><disp-formula id="eq150"><inline-graphic id="d33e1983" xlink:href="srep28840-m150.jpg"></inline-graphic>
</disp-formula>
</p>
<p>whence</p>
<p><disp-formula id="eq151"><inline-graphic id="d33e1988" xlink:href="srep28840-m151.jpg"></inline-graphic>
</disp-formula>
</p>
<p>that is</p>
<p><disp-formula id="eq152"><inline-graphic id="d33e1993" xlink:href="srep28840-m152.jpg"></inline-graphic>
</disp-formula>
</p>
<p>therefore</p>
<p><disp-formula id="eq153"><inline-graphic id="d33e1998" xlink:href="srep28840-m153.jpg"></inline-graphic>
</disp-formula>
</p>
<p>which implies the asserted inequality. ☐</p>
<p><xref ref-type="table" rid="t2">Table 2</xref>
 shows an experimental validation of Lemma 2. It confirms that lg<sub>2</sub>
(|<italic>R</italic>
) turns out to be a good estimate of the average of <italic>mrl</italic>
(<italic>R</italic>
) + 1 in <inline-formula id="d33e2017"><inline-graphic id="d33e2018" xlink:href="srep28840-m154.jpg"></inline-graphic>
</inline-formula>
.</p>
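<p>A small-scale version of this kind of experiment can be sketched in Python as follows. It does not reproduce Table 2: the genome length, the number of trials, the seed, and the function names are all illustrative assumptions, and the naive search for repeats is practical only for short genomes.</p>
<preformat>
```python
import math
import random


def has_repeat(genome: str, k: int) -> bool:
    """True if some k-mer occurs at least twice in the genome."""
    seen = set()
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if word in seen:
            return True
        seen.add(word)
    return False


def mrl(genome: str) -> int:
    """Length of the longest repeats of the genome."""
    k = 0
    while has_repeat(genome, k + 1):
        k += 1
    return k


def average_mrl_plus_one(n: int, trials: int, seed: int = 0) -> float:
    """Average of mrl(R) + 1 over pseudo-random genomes R of length n."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        genome = "".join(rng.choices("acgt", k=n))
        values.append(mrl(genome) + 1)
    return sum(values) / trials


n = 2000
print(average_mrl_plus_one(n, trials=5), math.log2(n))
```
</preformat>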
<p><bold>Proposition 3</bold>
<italic>In the class of genomes of length n</italic>
, <italic>for every k</italic>
 < <italic>n</italic>
, <italic>the following relation holds</italic>
</p>
<p><disp-formula id="eq155"><inline-graphic id="d33e2037" xlink:href="srep28840-m155.jpg"></inline-graphic>
</disp-formula>
</p>
<p><italic>Moreover, random genomes of length n have entropies differing from the upper bound</italic>
 lg<sub>2</sub>
(<italic>n</italic>
) <italic>less than</italic>
<inline-formula id="d33e2052"><inline-graphic id="d33e2053" xlink:href="srep28840-m156.jpg"></inline-graphic>
</inline-formula>
 (<italic>close to zero</italic>
).</p>
<p><bold>Proof.</bold>
 According to Lemma 1, <inline-formula id="d33e2062"><inline-graphic id="d33e2063" xlink:href="srep28840-m157.jpg"></inline-graphic>
</inline-formula>
 reaches its maximum, when <inline-formula id="d33e2065"><inline-graphic id="d33e2066" xlink:href="srep28840-m158.jpg"></inline-graphic>
</inline-formula>
. In this case:</p>
<p><disp-formula id="eq159"><inline-graphic id="d33e2070" xlink:href="srep28840-m159.jpg"></inline-graphic>
</disp-formula>
</p>
<p>therefore, the difference <inline-formula id="d33e2073"><inline-graphic id="d33e2074" xlink:href="srep28840-m160.jpg"></inline-graphic>
</inline-formula>
 is given by:</p>
<p><disp-formula id="eq161"><inline-graphic id="d33e2078" xlink:href="srep28840-m161.jpg"></inline-graphic>
</disp-formula>
</p>
<p>If <inline-formula id="d33e2081"><inline-graphic id="d33e2082" xlink:href="srep28840-m162.jpg"></inline-graphic>
</inline-formula>
 belongs to the class of random genomes of length <italic>n</italic>
, according to Lemmas 1 and 2, the maximum entropy is given by <inline-formula id="d33e2087"><inline-graphic id="d33e2088" xlink:href="srep28840-m163.jpg"></inline-graphic>
</inline-formula>
, for <inline-formula id="d33e2090"><inline-graphic id="d33e2091" xlink:href="srep28840-m164.jpg"></inline-graphic>
</inline-formula>
, with <inline-formula id="d33e2093"><inline-graphic id="d33e2094" xlink:href="srep28840-m165.jpg"></inline-graphic>
</inline-formula>
. Therefore, by substituting in equation (22) the upper bound of <italic>k</italic>
, giving the upper bound of lg<sub>2</sub>
(<italic>n</italic>
/(<italic>n</italic>
 − <italic>k</italic>
 + 1)), we get: <inline-formula id="d33e2112"><inline-graphic id="d33e2113" xlink:href="srep28840-m166.jpg"></inline-graphic>
</inline-formula>
.☐</p>
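<p>The size of the difference lg<sub>2</sub>(<italic>n</italic>/(<italic>n</italic> − <italic>k</italic> + 1)) appearing in the proof can be appreciated numerically with the Python sketch below. The function name is hypothetical, and taking <italic>k</italic> as the smallest integer not below lg<sub>2</sub>(<italic>n</italic>) is an illustrative assumption rather than the exact bound of Lemma 2.</p>
<preformat>
```python
import math


def entropy_gap(n: int, k: int) -> float:
    """lg2(n / (n - k + 1)): distance between lg2(n) and lg2(n - k + 1)."""
    # n / (n - k + 1) equals 1 / (1 - (k - 1) / n); log1p keeps precision
    # when (k - 1) / n is very small.
    return -math.log1p(-(k - 1) / n) / math.log(2)


for n in (10**6, 10**9):
    k = math.ceil(math.log2(n))
    print(n, k, entropy_gap(n, k))
```
</preformat>
<p>For genome lengths of order 10<sup>9</sup> the difference is of order 10<sup>−8</sup>, in line with the statement that it is close to zero.</p>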
</sec>
</sec>
<sec disp-level="1"><title>Additional Information</title>
<p><bold>How to cite this article</bold>
: Bonnici, V. and Manca, V. Informational laws of genome structures. <italic>Sci. Rep</italic>
. <bold>6</bold>
, 28840; doi: 10.1038/srep28840 (2016).</p>
</sec>
<sec sec-type="supplementary-material" id="S1"><title>Supplementary Material</title>
<supplementary-material id="d33e220" content-type="local-data"><caption><title>Supplementary Information</title>
</caption>
<media xlink:href="srep28840-s1.pdf"></media>
</supplementary-material>
</sec>
</body>
<back><ack><p>The authors are grateful to Andres Moya for his support and help, and to Rosalba Giugno for her important suggestions.</p>
</ack>
<ref-list><ref id="b1"><mixed-citation publication-type="journal"><name><surname>Conrad</surname>
<given-names>M.</given-names>
</name>
<source>Adaptability</source>
 (Plenum Press, <year>2001</year>)
.</mixed-citation>
</ref>
<ref id="b2"><mixed-citation publication-type="journal"><name><surname>Conrad</surname>
<given-names>M.</given-names>
</name>
<article-title>The price of programmability</article-title>
. In <source>A half-century survey on The Universal Turing Machine</source>
, <fpage>285</fpage>
–<lpage>307</lpage>
 (Oxford University Press, <year>1988</year>
).</mixed-citation>
</ref>
<ref id="b3"><mixed-citation publication-type="journal"><name><surname>Holland</surname>
<given-names>J.</given-names>
</name>
 & <name><surname>Mallot</surname>
<given-names>H.</given-names>
</name>
<article-title>Emergence: from chaos to order</article-title>
. <source>Nature</source>
<volume>395</volume>
, <fpage>342</fpage>
–<lpage>342</lpage>
 (<year>1998</year>
).</mixed-citation>
</ref>
<ref id="b4"><mixed-citation publication-type="journal"><name><surname>Cercignani</surname>
<given-names>C.</given-names>
</name>
<source>The Boltzmann Equation and Its Application</source>
 (Springer, <year>1988</year>)
.</mixed-citation>
</ref>
<ref id="b5"><mixed-citation publication-type="journal"><name><surname>Shannon</surname>
<given-names>C. E.</given-names>
</name>
<article-title>A mathematical theory of communication</article-title>
. <source>Bell Sys Tech J</source>
<volume>27</volume>
, <fpage>623</fpage>
–<lpage>656</lpage>
 (<year>1948</year>
).</mixed-citation>
</ref>
<ref id="b6"><mixed-citation publication-type="journal"><name><surname>Pincus</surname>
<given-names>S. M.</given-names>
</name>
<article-title>Approximate entropy as a measure of system complexity</article-title>
. <source>P Nat Acad Sci</source>
<volume>88</volume>
, <fpage>2297</fpage>
–<lpage>2301</lpage>
 (<year>1991</year>
).</mixed-citation>
</ref>
<ref id="b7"><mixed-citation publication-type="journal"><name><surname>Crochemore</surname>
<given-names>M.</given-names>
</name>
 & <name><surname>Vérin</surname>
<given-names>R.</given-names>
</name>
<article-title>Zones of low entropy in genomic sequences</article-title>
. <source>Computers & chemistry</source>
<volume>23</volume>
, <fpage>275</fpage>
–<lpage>282</lpage>
 (<year>1999</year>
).<pub-id pub-id-type="pmid">10404620</pub-id>
</mixed-citation>
</ref>
<ref id="b8"><mixed-citation publication-type="journal"><name><surname>Vinga</surname>
<given-names>S.</given-names>
</name>
 & <name><surname>Almeida</surname>
<given-names>J. S.</given-names>
</name>
<article-title>Local Renyi entropic profiles of DNA sequences</article-title>
. <source>BMC bioinformatics</source>
<volume>8</volume>
, <fpage>393</fpage>
 (<year>2007</year>
).<pub-id pub-id-type="pmid">17939871</pub-id>
</mixed-citation>
</ref>
<ref id="b9"><mixed-citation publication-type="journal"><name><surname>Koslicki</surname>
<given-names>D.</given-names>
</name>
<article-title>Topological entropy of DNA sequences</article-title>
. <source>Bioinformatics</source>
<volume>27</volume>
, <fpage>1061</fpage>
–<lpage>1067</lpage>
 (<year>2011</year>
).<pub-id pub-id-type="pmid">21317142</pub-id>
</mixed-citation>
</ref>
<ref id="b10"><mixed-citation publication-type="journal"><name><surname>Wang</surname>
<given-names>D.</given-names>
</name>
, <name><surname>Xu</surname>
<given-names>J.</given-names>
</name>
 & <name><surname>Yu</surname>
<given-names>J.</given-names>
</name>
<article-title>KGCAK: a K-mer based database for genome-wide phylogeny and complexity evaluation</article-title>
. <source>Biol direct</source>
<volume>10</volume>
(1), <fpage>1</fpage>
–<lpage>5</lpage>
 (<year>2015</year>
).<pub-id pub-id-type="pmid">25564011</pub-id>
</mixed-citation>
</ref>
<ref id="b11"><mixed-citation publication-type="journal"><name><surname>Head</surname>
<given-names>T.</given-names>
</name>
<article-title>Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors</article-title>
. <source>B Math Biol</source>
<volume>49</volume>
, <fpage>737</fpage>
–<lpage>759</lpage>
 (<year>1987</year>
).</mixed-citation>
</ref>
<ref id="b12"><mixed-citation publication-type="journal"><name><surname>Deonier</surname>
<given-names>R. C.</given-names>
</name>
, <name><surname>Tavaré</surname>
<given-names>S.</given-names>
</name>
 & <name><surname>Waterman</surname>
<given-names>M.</given-names>
</name>
<source>Computational genome analysis: an introduction</source>
 (Springer, <year>2005</year>)
.</mixed-citation>
</ref>
<ref id="b13"><mixed-citation publication-type="journal"><name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
 & <name><surname>Franco</surname>
<given-names>G.</given-names>
</name>
<article-title>Computing by polymerase chain reaction</article-title>
. <source>Math Biosci</source>
<volume>211</volume>
, <fpage>282</fpage>
–<lpage>298</lpage>
 (<year>2008</year>
).<pub-id pub-id-type="pmid">17931667</pub-id>
</mixed-citation>
</ref>
<ref id="b14"><mixed-citation publication-type="journal"><name><surname>Searls</surname>
<given-names>D. B.</given-names>
</name>
<article-title>Molecules, languages and automata</article-title>
. In <source>Grammatical Inference: Theoretical Results and Applications</source>
, <fpage>5</fpage>
–<lpage>10</lpage>
 (Springer, <year>2010</year>)
.</mixed-citation>
</ref>
<ref id="b15"><mixed-citation publication-type="journal"><name><surname>Vinga</surname>
<given-names>S.</given-names>
</name>
<article-title>Information theory applications for biological sequence analysis</article-title>
. <source>Brief Bioinform</source>
, doi: <pub-id pub-id-type="doi">10.1093/bib/bbt068</pub-id>
 (<year>2013</year>
).</mixed-citation>
</ref>
<ref id="b16"><mixed-citation publication-type="journal"><name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
<source>Infobiotics: information in biotic systems</source>
 (Springer, <year>2013</year>)
.</mixed-citation>
</ref>
<ref id="b17"><mixed-citation publication-type="journal"><name><surname>Gatlin</surname>
<given-names>L. L.</given-names>
</name>
<article-title>The information content of DNA</article-title>
. <source>J Theor Biol</source>
<volume>10</volume>
(2), <fpage>281</fpage>
–<lpage>300</lpage>
 (<year>1966</year>
).<pub-id pub-id-type="pmid">5964394</pub-id>
</mixed-citation>
</ref>
<ref id="b18"><mixed-citation publication-type="journal"><name><surname>Kraskov</surname>
<given-names>A.</given-names>
</name>
 & <name><surname>Grassberger</surname>
<given-names>P.</given-names>
</name>
<article-title>MIC: mutual information based hierarchical clustering</article-title>
. <source>Info Theor Stat Learn</source>
, <fpage>101</fpage>
–<lpage>123</lpage>
 (Springer, <year>2009</year>)
.</mixed-citation>
</ref>
<ref id="b19"><mixed-citation publication-type="journal"><name><surname>Campbell</surname>
<given-names>A.</given-names>
</name>
, <name><surname>Mrázek</surname>
<given-names>J.</given-names>
</name>
 & <name><surname>Karlin</surname>
<given-names>S.</given-names>
</name>
<article-title>Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA</article-title>
. <source>P Nat Acad Sci</source>
<volume>96</volume>
(16), <fpage>9184</fpage>
–<lpage>9189</lpage>
 (<year>1999</year>
).</mixed-citation>
</ref>
<ref id="b20"><mixed-citation publication-type="journal"><name><surname>Ebeling</surname>
<given-names>W.</given-names>
</name>
 & <name><surname>Jiménez-Montaño</surname>
<given-names>M. A.</given-names>
</name>
<article-title>On grammars, complexity, and information measures of biological macromolecules</article-title>
. <source>Math Biosci</source>
<volume>52</volume>
(1), <fpage>53</fpage>
–<lpage>71</lpage>
 (<year>1980</year>
).</mixed-citation>
</ref>
<ref id="b21"><mixed-citation publication-type="journal"><name><surname>Weiss</surname>
<given-names>O.</given-names>
</name>
, <name><surname>Jiménez-Montaño</surname>
<given-names>M. A.</given-names>
</name>
 & <name><surname>Herzel</surname>
<given-names>H.</given-names>
</name>
<article-title>Information content of protein sequences</article-title>
. <source>J Theor Biol</source>
<volume>206</volume>
(3), <fpage>379</fpage>
–<lpage>386</lpage>
 (<year>2000</year>
).<pub-id pub-id-type="pmid">10988023</pub-id>
</mixed-citation>
</ref>
<ref id="b22"><mixed-citation publication-type="journal"><name><surname>Holste</surname>
<given-names>D.</given-names>
</name>
, <name><surname>Grosse</surname>
<given-names>I.</given-names>
</name>
 & <name><surname>Herzel</surname>
<given-names>H.</given-names>
</name>
<article-title>Statistical analysis of the DNA sequence of human chromosome 22</article-title>
. <source>Phys Rev E</source>
<volume>64</volume>
(4), <fpage>041917</fpage>
 (<year>2001</year>
).</mixed-citation>
</ref>
<ref id="b23"><mixed-citation publication-type="journal"><name><surname>Fofanov</surname>
<given-names>Y.</given-names>
</name>
<etal></etal>
. <article-title>How independent are the appearances of n-mers in different genomes?</article-title>
<source>Bioinformatics</source>
<volume>20</volume>
, <fpage>2421</fpage>
–<lpage>2428</lpage>
 (<year>2004</year>
).<pub-id pub-id-type="pmid">15087315</pub-id>
</mixed-citation>
</ref>
<ref id="b24"><mixed-citation publication-type="journal"><name><surname>Kurtz</surname>
<given-names>S.</given-names>
</name>
, <name><surname>Narechania</surname>
<given-names>A.</given-names>
</name>
, <name><surname>Stein</surname>
<given-names>J. C.</given-names>
</name>
 & <name><surname>Ware</surname>
<given-names>D.</given-names>
</name>
<article-title>A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</article-title>
. <source>BMC genomics</source>
<volume>9</volume>
(1), <fpage>517</fpage>
 (<year>2008</year>
).<pub-id pub-id-type="pmid">18976482</pub-id>
</mixed-citation>
</ref>
<ref id="b25"><mixed-citation publication-type="journal"><name><surname>Chor</surname>
<given-names>B.</given-names>
</name>
<etal></etal>
. <article-title>Genomic DNA k-mer spectra: models and modalities</article-title>
. <source>Genome Biol</source>
<volume>10</volume>
, <fpage>R108</fpage>
 (<year>2009</year>
).<pub-id pub-id-type="pmid">19814784</pub-id>
</mixed-citation>
</ref>
<ref id="b26"><mixed-citation publication-type="journal"><name><surname>Castellini</surname>
<given-names>A.</given-names>
</name>
, <name><surname>Franco</surname>
<given-names>G.</given-names>
</name>
 & <name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
<article-title>A dictionary based informational genome analysis</article-title>
. <source>BMC genomics</source>
<volume>13</volume>
, <fpage>485</fpage>
 (<year>2012</year>
).<pub-id pub-id-type="pmid">22985068</pub-id>
</mixed-citation>
</ref>
<ref id="b27"><mixed-citation publication-type="journal"><name><surname>Bonnici</surname>
<given-names>V.</given-names>
</name>
 & <name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
<article-title>Recurrence distance distributions in computational genomics</article-title>
. <source>Am J Bioinformat Comput Biol</source>
<volume>3</volume>
, <fpage>5</fpage>
–<lpage>23</lpage>
 (<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b28"><mixed-citation publication-type="journal"><name><surname>Wen</surname>
<given-names>J.</given-names>
</name>
, <name><surname>Chan</surname>
<given-names>R. H.</given-names>
</name>
, <name><surname>Yau</surname>
<given-names>S.-C.</given-names>
</name>
, <name><surname>He</surname>
<given-names>R. L.</given-names>
</name>
 & <name><surname>Yau</surname>
<given-names>S. S.</given-names>
</name>
<article-title>k-mer natural vector and its application to the phylogenetic analysis of genetic sequences</article-title>
. <source>Gene</source>
<volume>546</volume>
, <fpage>25</fpage>
–<lpage>34</lpage>
 (<year>2014</year>
).<pub-id pub-id-type="pmid">24858075</pub-id>
</mixed-citation>
</ref>
<ref id="b29"><mixed-citation publication-type="journal"><name><surname>Almirantis</surname>
<given-names>Y.</given-names>
</name>
, <name><surname>Arndt</surname>
<given-names>P.</given-names>
</name>
, <name><surname>Li</surname>
<given-names>W.</given-names>
</name>
 & <name><surname>Provata</surname>
<given-names>A.</given-names>
</name>
<article-title>Editorial: Complexity in genomes</article-title>
. <source>Comp Biol Chem</source>
<volume>53</volume>
, <fpage>1</fpage>
–<lpage>4</lpage>
 (<year>2014</year>
).</mixed-citation>
</ref>
<ref id="b30"><mixed-citation publication-type="journal"><name><surname>Hashim</surname>
<given-names>E. K. M.</given-names>
</name>
 & <name><surname>Abdullah</surname>
<given-names>R.</given-names>
</name>
<article-title>Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter</article-title>
. <source>J Theor Biol</source>
<volume>387</volume>
, <fpage>88</fpage>
–<lpage>100</lpage>
 (<year>2015</year>
).<pub-id pub-id-type="pmid">26427337</pub-id>
</mixed-citation>
</ref>
<ref id="b31"><mixed-citation publication-type="journal"><name><surname>Bonnici</surname>
<given-names>V.</given-names>
</name>
 & <name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
<article-title>Infogenomics tools: A computational suite for informational analysis of genomes</article-title>
. <source>J Bioinfo Proteomics Rev</source>
<volume>1</volume>
, <fpage>8</fpage>
–<lpage>14</lpage>
 (<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b32"><mixed-citation publication-type="journal"><name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
<source>Infogenomics: genomes as information sources.</source>
 Chap. 21, <fpage>317</fpage>
–<lpage>324</lpage>
 (Elsevier, Morgan Kaufmann, <year>2016</year>
).</mixed-citation>
</ref>
<ref id="b33"><mixed-citation publication-type="journal"><name><surname>Manca</surname>
<given-names>V.</given-names>
</name>
<article-title>Information theory in genome analysis</article-title>
. In <source>Membrane Computing</source>
, LNCS 9504, <fpage>3</fpage>
–<lpage>18</lpage>
 (Springer, <year>2015</year>
).</mixed-citation>
</ref>
<ref id="b34"><mixed-citation publication-type="journal"><name><surname>Knuth</surname>
<given-names>D.</given-names>
</name>
<source>The art of computer programming</source>
, volume <volume>2</volume>
: Seminumerical algorithms (Addison-Wesley, <year>1998</year>
).</mixed-citation>
</ref>
<ref id="b35"><mixed-citation publication-type="journal"><name><surname>Kong</surname>
<given-names>S. G.</given-names>
</name>
<etal></etal>
. <article-title>Quantitative measure of randomness and order for complete genomes</article-title>
. <source>Phys Rev E</source>
<volume>79</volume>
(6), <fpage>061911</fpage>
 (<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b36"><mixed-citation publication-type="journal"><name><surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
 & <name><surname>Xu</surname>
<given-names>C.</given-names>
</name>
<article-title>The calculation of information and organismal complexity</article-title>
. <source>Biol Direct</source>
<volume>5</volume>
(59), <fpage>565</fpage>
 (<year>2010</year>
).</mixed-citation>
</ref>
<ref id="b37"><mixed-citation publication-type="journal"><name><surname>Witten</surname>
<given-names>I. H.</given-names>
</name>
, <name><surname>Moffat</surname>
<given-names>A.</given-names>
</name>
 & <name><surname>Bell</surname>
<given-names>T. C.</given-names>
</name>
<source>Managing gigabytes: compressing and indexing documents and images</source>
 (Morgan Kaufmann, <year>1999</year>
).</mixed-citation>
</ref>
<ref id="b38"><mixed-citation publication-type="journal"><name><surname>Wiener</surname>
<given-names>N.</given-names>
</name>
<source>Cybernetics or control and communication in the animal and the machine</source>
 (Hermann, Paris, <year>1948</year>
).</mixed-citation>
</ref>
<ref id="b39"><mixed-citation publication-type="journal"><name><surname>Schrödinger</surname>
<given-names>E.</given-names>
</name>
<source>What Is Life? The Physical Aspect of the Living Cell and Mind</source>
 (Cambridge University Press, <year>1944</year>
).</mixed-citation>
</ref>
<ref id="b40"><mixed-citation publication-type="journal"><name><surname>Brillouin</surname>
<given-names>L.</given-names>
</name>
<article-title>The negentropy principle of information</article-title>
. <source>J Appl Phys</source>
<volume>24</volume>
, <fpage>1152</fpage>
–<lpage>1163</lpage>
 (<year>1953</year>
).</mixed-citation>
</ref>
<ref id="b41"><mixed-citation publication-type="journal"><name><surname>Volkenstein</surname>
<given-names>M. V.</given-names>
</name>
<source>Entropy and information</source>
 (Springer, <year>2009</year>
).</mixed-citation>
</ref>
<ref id="b42"><mixed-citation publication-type="journal"><name><surname>Venter</surname>
<given-names>J. C.</given-names>
</name>
<etal></etal>
. <article-title>Design and synthesis of a minimal bacterial genome</article-title>
. <source>Science</source>
<volume>351</volume>
, <fpage>6280</fpage>
 (<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b43"><mixed-citation publication-type="journal"><name><surname>Lynch</surname>
<given-names>M.</given-names>
</name>
 & <name><surname>Conery</surname>
<given-names>J. S.</given-names>
</name>
<article-title>The origins of genome complexity</article-title>
. <source>Science</source>
<volume>302</volume>
, <fpage>1401</fpage>
–<lpage>1404</lpage>
 (<year>2003</year>
).<pub-id pub-id-type="pmid">14631042</pub-id>
</mixed-citation>
</ref>
<ref id="b44"><mixed-citation publication-type="journal"><name><surname>Kullback</surname>
<given-names>S.</given-names>
</name>
 & <name><surname>Leibler</surname>
<given-names>R. A.</given-names>
</name>
<article-title>On information and sufficiency</article-title>
. <source>Ann Math Stat</source>
<volume>22</volume>
, <fpage>79</fpage>
–<lpage>86</lpage>
 (<year>1951</year>
).</mixed-citation>
</ref>
<ref id="b45"><mixed-citation publication-type="journal"><name><surname>Feller</surname>
<given-names>W.</given-names>
</name>
<source>An Introduction to Probability Theory and Its Applications</source>
 (Wiley & Sons, <year>1968</year>
).</mixed-citation>
</ref>
<ref id="b46"><mixed-citation publication-type="journal"><name><surname>Rozenberg</surname>
<given-names>G.</given-names>
</name>
 & <name><surname>Salomaa</surname>
<given-names>A.</given-names>
</name>
<source>Handbook of Formal Languages: Beyond Words</source>
 vol. <volume>3</volume>
 (Springer, <year>1997</year>
).</mixed-citation>
</ref>
<ref id="b47"><mixed-citation publication-type="journal"><name><surname>Abouelhoda</surname>
<given-names>M. I.</given-names>
</name>
, <name><surname>Kurtz</surname>
<given-names>S.</given-names>
</name>
 & <name><surname>Ohlebusch</surname>
<given-names>E.</given-names>
</name>
<article-title>Replacing suffix trees with enhanced suffix arrays</article-title>
. <source>J Discrete Algorithms</source>
<volume>2</volume>
, <fpage>53</fpage>
–<lpage>86</lpage>
 (<year>2004</year>
).</mixed-citation>
</ref>
<ref id="b48"><mixed-citation publication-type="journal"><name><surname>Federhen</surname>
<given-names>S.</given-names>
</name>
<article-title>The NCBI taxonomy database</article-title>
. <source>Nucleic Acids Res</source>
<volume>40</volume>
, <fpage>D136</fpage>
–<lpage>D143</lpage>
 (<year>2012</year>
).<pub-id pub-id-type="pmid">22139910</pub-id>
</mixed-citation>
</ref>
</ref-list>
<fn-group><fn><p><bold>Author Contributions</bold>
 V.M. conceived the theoretical and mathematical setting of the paper, V.B. developed the software and computations, and V.M. and V.B. analyzed the results. V.M. wrote the paper, and both authors reviewed the manuscript.</p>
</fn>
</fn-group>
</back>
<floats-group><fig id="f1"><label>Figure 1</label>
<caption><title>The left side of the figure shows the 70 analyzed genomes plotted on a Cartesian plane with their logarithmic length <inline-formula id="d33e2138"><inline-graphic id="d33e2139" xlink:href="srep28840-m167.jpg"></inline-graphic>
</inline-formula>
 as the abscissa and their biobit value <inline-formula id="d33e2141"><inline-graphic id="d33e2142" xlink:href="srep28840-m168.jpg"></inline-graphic>
</inline-formula>
 as the ordinate.</title>
<p>An enlargement of the top-right region, which is highlighted with a dashed line, is shown on the right side of the image.</p>
</caption>
<graphic xlink:href="srep28840-f1"></graphic>
</fig>
<fig id="f2"><label>Figure 2</label>
<caption><title>A flowchart of the computational steps involved in calculating <inline-formula id="d33e2150"><inline-graphic id="d33e2151" xlink:href="srep28840-m169.jpg"></inline-graphic>
</inline-formula>
.</title>
<p>Given an input genome <inline-formula id="d33e2156"><inline-graphic id="d33e2157" xlink:href="srep28840-m170.jpg"></inline-graphic>
</inline-formula>
, an upper bound on the maximum entropy is calculated; its value equals <inline-formula id="d33e2159"><inline-graphic id="d33e2160" xlink:href="srep28840-m171.jpg"></inline-graphic>
</inline-formula>
, which also defines the appropriate word length. The entropic and anti-entropic components are then computed as, respectively, <inline-formula id="d33e2162"><inline-graphic id="d33e2163" xlink:href="srep28840-m172.jpg"></inline-graphic>
</inline-formula>
 and <inline-formula id="d33e2165"><inline-graphic id="d33e2166" xlink:href="srep28840-m173.jpg"></inline-graphic>
</inline-formula>
 and are subsequently normalized and combined by a weighted product into <inline-formula id="d33e2168"><inline-graphic id="d33e2169" xlink:href="srep28840-m174.jpg"></inline-graphic>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="srep28840-f2"></graphic>
</fig>
<fig id="f3"><label>Figure 3</label>
<caption><title>A chart of the main informational indexes.</title>
<p>Some measures have been rescaled by a factor of ten (×10) or one hundred (×100) so that all indexes fit in a single overview. Species are arranged on the horizontal axis by genome length, increasing from left to right.</p>
</caption>
<graphic xlink:href="srep28840-f3"></graphic>
</fig>
<table-wrap position="float" id="t1"><label>Table 1</label>
<caption><title>Main informational genomic indexes.</title>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="left"></col>
</colgroup>
<tbody valign="top"><tr><td align="left" valign="top" charoff="50"><inline-formula id="d33e2188"><inline-graphic id="d33e2189" xlink:href="srep28840-m175.jpg"></inline-graphic>
</inline-formula>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Logarithmic length</td>
</tr>
<tr><td align="left" valign="top" charoff="50"><inline-formula id="d33e2196"><inline-graphic id="d33e2197" xlink:href="srep28840-m176.jpg"></inline-graphic>
</inline-formula>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Entropic component</td>
</tr>
<tr><td align="left" valign="top" charoff="50"><inline-formula id="d33e2204"><inline-graphic id="d33e2205" xlink:href="srep28840-m177.jpg"></inline-graphic>
</inline-formula>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Anti-entropic component</td>
</tr>
<tr><td align="left" valign="top" charoff="50"><inline-formula id="d33e2212"><inline-graphic id="d33e2213" xlink:href="srep28840-m178.jpg"></inline-graphic>
</inline-formula>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Lexical index</td>
</tr>
<tr><td align="left" valign="top" charoff="50"><italic>AF</italic>
 = <italic>AC</italic>
/<italic>LG</italic>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Anti-entropic fraction</td>
</tr>
<tr><td align="left" valign="top" charoff="50"><italic>EH</italic>
 = (<italic>EC</italic>
 − <italic>AC</italic>
)/<italic>LG</italic>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Horizontal eccentricity</td>
</tr>
<tr><td align="left" valign="top" charoff="50"><inline-formula id="d33e2251"><inline-graphic id="d33e2252" xlink:href="srep28840-m179.jpg"></inline-graphic>
</inline-formula>
</td>
<td align="center" valign="top" charoff="50">=</td>
<td align="left" valign="top" charoff="50">Biobit</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="t1-fn1"><p>|<italic>D</italic>
<sub>2<italic>LG</italic>
</sub>
| is the number of 2<italic>LG</italic>
-mers occurring in <inline-formula id="d33e2272"><inline-graphic id="d33e2273" xlink:href="srep28840-m180.jpg"></inline-graphic>
</inline-formula>
, and <inline-formula id="d33e2275"><inline-graphic id="d33e2276" xlink:href="srep28840-m181.jpg"></inline-graphic>
</inline-formula>
 is the length of <inline-formula id="d33e2278"><inline-graphic id="d33e2279" xlink:href="srep28840-m182.jpg"></inline-graphic>
</inline-formula>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="t2"><label>Table 2</label>
<caption><title>For each genome length, 100 trials were performed.</title>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center" char="."></col>
<col align="center" char="."></col>
<col align="center" char="."></col>
</colgroup>
<thead valign="bottom"><tr><th align="left" valign="top" charoff="50"><italic>length</italic>
</th>
<th align="center" valign="top" charoff="50"><italic>min</italic>
</th>
<th align="center" valign="top" charoff="50"><italic>max</italic>
</th>
<th align="center" valign="top" char="." charoff="50"><italic>sd</italic>
</th>
<th align="center" valign="top" char="." charoff="50"><italic>avg</italic>
</th>
<th align="center" valign="top" char="." charoff="50"><italic>lg</italic>
<sub>2</sub>
(|<italic>R</italic>
|)</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="top" charoff="50">1,000</td>
<td align="center" valign="top" charoff="50">9</td>
<td align="center" valign="top" charoff="50">15</td>
<td align="center" valign="top" char="." charoff="50">1.07</td>
<td align="center" valign="top" char="." charoff="50">10.2</td>
<td align="center" valign="top" char="." charoff="50">9.97</td>
</tr>
<tr><td align="left" valign="top" charoff="50">100,000</td>
<td align="center" valign="top" charoff="50">15</td>
<td align="center" valign="top" charoff="50">20</td>
<td align="center" valign="top" char="." charoff="50">0.95</td>
<td align="center" valign="top" char="." charoff="50">16.67</td>
<td align="center" valign="top" char="." charoff="50">16.61</td>
</tr>
<tr><td align="left" valign="top" charoff="50">200,000</td>
<td align="center" valign="top" charoff="50">16</td>
<td align="center" valign="top" charoff="50">21</td>
<td align="center" valign="top" char="." charoff="50">0.86</td>
<td align="center" valign="top" char="." charoff="50">17.78</td>
<td align="center" valign="top" char="." charoff="50">17.61</td>
</tr>
<tr><td align="left" valign="top" charoff="50">500,000</td>
<td align="center" valign="top" charoff="50">18</td>
<td align="center" valign="top" charoff="50">23</td>
<td align="center" valign="top" char="." charoff="50">0.91</td>
<td align="center" valign="top" char="." charoff="50">19.09</td>
<td align="center" valign="top" char="." charoff="50">18.93</td>
</tr>
<tr><td align="left" valign="top" charoff="50">1,000,000</td>
<td align="center" valign="top" charoff="50">18</td>
<td align="center" valign="top" charoff="50">24</td>
<td align="center" valign="top" char="." charoff="50">0.96</td>
<td align="center" valign="top" char="." charoff="50">20.14</td>
<td align="center" valign="top" char="." charoff="50">19.93</td>
</tr>
<tr><td align="left" valign="top" charoff="50">10,000,000</td>
<td align="center" valign="top" charoff="50">22</td>
<td align="center" valign="top" charoff="50">26</td>
<td align="center" valign="top" char="." charoff="50">0.97</td>
<td align="center" valign="top" char="." charoff="50">23.49</td>
<td align="center" valign="top" char="." charoff="50">23.25</td>
</tr>
<tr><td align="left" valign="top" charoff="50">20,000,000</td>
<td align="center" valign="top" charoff="50">23</td>
<td align="center" valign="top" charoff="50">27</td>
<td align="center" valign="top" char="." charoff="50">0.93</td>
<td align="center" valign="top" char="." charoff="50">24.31</td>
<td align="center" valign="top" char="." charoff="50">24.25</td>
</tr>
<tr><td align="left" valign="top" charoff="50">30,000,000</td>
<td align="center" valign="top" charoff="50">24</td>
<td align="center" valign="top" charoff="50">30</td>
<td align="center" valign="top" char="." charoff="50">1.14</td>
<td align="center" valign="top" char="." charoff="50">25.08</td>
<td align="center" valign="top" char="." charoff="50">24.84</td>
</tr>
<tr><td align="left" valign="top" charoff="50">50,000,000</td>
<td align="center" valign="top" charoff="50">24</td>
<td align="center" valign="top" charoff="50">31</td>
<td align="center" valign="top" char="." charoff="50">1.17</td>
<td align="center" valign="top" char="." charoff="50">25.86</td>
<td align="center" valign="top" char="." charoff="50">25.58</td>
</tr>
<tr><td align="left" valign="top" charoff="50">75,000,000</td>
<td align="center" valign="top" charoff="50">25</td>
<td align="center" valign="top" charoff="50">29</td>
<td align="center" valign="top" char="." charoff="50">0.85</td>
<td align="center" valign="top" char="." charoff="50">26.44</td>
<td align="center" valign="top" char="." charoff="50">26.16</td>
</tr>
<tr><td align="left" valign="top" charoff="50">100,000,000</td>
<td align="center" valign="top" charoff="50">25</td>
<td align="center" valign="top" charoff="50">30</td>
<td align="center" valign="top" char="." charoff="50">1.02</td>
<td align="center" valign="top" char="." charoff="50">26.89</td>
<td align="center" valign="top" char="." charoff="50">26.58</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="t2-fn1"><p>The minimum, the maximum and the average, together with the standard deviation, of <italic>mrl</italic>
 + 1 were computed for each trial set. To a good approximation, lg<sub>2</sub>
(|<italic>R</italic>
|) ≈ <italic>avg</italic>
(<italic>mrl</italic>
(<italic>R</italic>
) + 1).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</floats-group>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000145 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000145 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4937431
   |texte=   Informational laws of genome structures
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:27354155" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
