MERS exploration server


Internal identifier: 000C92 (Pmc/Corpus); previous: 000C919; next: 000C930

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">
<italic>K</italic>
-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features</title>
<author>
<name sortKey="Sievers, Aaron" sort="Sievers, Aaron" uniqKey="Sievers A" first="Aaron" last="Sievers">Aaron Sievers</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bosiek, Katharina" sort="Bosiek, Katharina" uniqKey="Bosiek K" first="Katharina" last="Bosiek">Katharina Bosiek</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bisch, Marc" sort="Bisch, Marc" uniqKey="Bisch M" first="Marc" last="Bisch">Marc Bisch</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Dreessen, Chris" sort="Dreessen, Chris" uniqKey="Dreessen C" first="Chris" last="Dreessen">Chris Dreessen</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Riedel, Jascha" sort="Riedel, Jascha" uniqKey="Riedel J" first="Jascha" last="Riedel">Jascha Riedel</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fro, Patrick" sort="Fro, Patrick" uniqKey="Fro P" first="Patrick" last="Fro">Patrick Froß</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hausmann, Michael" sort="Hausmann, Michael" uniqKey="Hausmann M" first="Michael" last="Hausmann">Michael Hausmann</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hildenbrand, Georg" sort="Hildenbrand, Georg" uniqKey="Hildenbrand G" first="Georg" last="Hildenbrand">Georg Hildenbrand</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="af2-genes-08-00122">Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28422050</idno>
<idno type="pmc">5406869</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406869</idno>
<idno type="RBID">PMC:5406869</idno>
<idno type="doi">10.3390/genes8040122</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000C92</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000C92</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">
<italic>K</italic>
-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features</title>
<author>
<name sortKey="Sievers, Aaron" sort="Sievers, Aaron" uniqKey="Sievers A" first="Aaron" last="Sievers">Aaron Sievers</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bosiek, Katharina" sort="Bosiek, Katharina" uniqKey="Bosiek K" first="Katharina" last="Bosiek">Katharina Bosiek</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bisch, Marc" sort="Bisch, Marc" uniqKey="Bisch M" first="Marc" last="Bisch">Marc Bisch</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Dreessen, Chris" sort="Dreessen, Chris" uniqKey="Dreessen C" first="Chris" last="Dreessen">Chris Dreessen</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Riedel, Jascha" sort="Riedel, Jascha" uniqKey="Riedel J" first="Jascha" last="Riedel">Jascha Riedel</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fro, Patrick" sort="Fro, Patrick" uniqKey="Fro P" first="Patrick" last="Fro">Patrick Froß</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hausmann, Michael" sort="Hausmann, Michael" uniqKey="Hausmann M" first="Michael" last="Hausmann">Michael Hausmann</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hildenbrand, Georg" sort="Hildenbrand, Georg" uniqKey="Hildenbrand G" first="Georg" last="Hildenbrand">Georg Hildenbrand</name>
<affiliation>
<nlm:aff id="af1-genes-08-00122">Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="af2-genes-08-00122">Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Genes</title>
<idno type="eISSN">2073-4425</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>In genome analysis,
<italic>k-mer</italic>
-based comparison methods have become standard tools. However, although they deliver reliable results, other algorithms seem to work better in some cases. To improve
<italic>k</italic>
-mer-based DNA sequence analysis and comparison, we tested, with positive results, whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local
<italic>k</italic>
-mer spectra (frequency distribution of
<italic>k</italic>
-mers) was developed and used. The results were analyzed with positional information, by visualizing them as genomic maps and by applying basic vector correlation methods. The analysis concentrated on small word lengths (1 ≤
<italic>k</italic>
≤ 4) in relatively small viral genomes of
<italic>Papillomaviridae</italic>
and
<italic>Herpesviridae</italic>
, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in
<italic>Papillomaviridae</italic>
and
<italic>Herpesviridae</italic>
formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the
<italic>k</italic>
-mer content and several genes in these genomes were found, showing similarities between classified and unclassified viruses that may be useful for further taxonomic research. Furthermore, previously unknown
<italic>k</italic>
-mer correlations in the genomes of human herpesviruses (HHVs), which probably have a major biological function, were found and described. Using the currently known chromosome sequences of chimpanzee and human, identities between the species were reproduced on every analyzed chromosome. This demonstrates the feasibility of our approach for large data sets of complex genomes. Based on these results, we suggest
<italic>k</italic>
-mer analysis with positional resolution as a method for closing the gap between the effectiveness of alignment-based methods (such as NCBI BLAST) and the speed of standard
<italic>k</italic>
-mer analysis.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, S F" uniqKey="Altschul S">S.F. Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W. Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W. Miller</name>
</author>
<author>
<name sortKey="Myers, E W" uniqKey="Myers E">E.W. Myers</name>
</author>
<author>
<name sortKey="Lipman, D J" uniqKey="Lipman D">D.J. Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C.X. Chan</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M.A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alsop, E B" uniqKey="Alsop E">E.B. Alsop</name>
</author>
<author>
<name sortKey="Raymond, J" uniqKey="Raymond J">J. Raymond</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brendel, V" uniqKey="Brendel V">V. Brendel</name>
</author>
<author>
<name sortKey="Beckmann, J S" uniqKey="Beckmann J">J.S. Beckmann</name>
</author>
<author>
<name sortKey="Trifonov, E N" uniqKey="Trifonov E">E.N. Trifonov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, F" uniqKey="Zhou F">F. Zhou</name>
</author>
<author>
<name sortKey="Olman, V" uniqKey="Olman V">V. Olman</name>
</author>
<author>
<name sortKey="Xu, Y" uniqKey="Xu Y">Y. Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bultrini, E" uniqKey="Bultrini E">E. Bultrini</name>
</author>
<author>
<name sortKey="Pizzi, E" uniqKey="Pizzi E">E. Pizzi</name>
</author>
<author>
<name sortKey="Del Giudice, P" uniqKey="Del Giudice P">P. Del Giudice</name>
</author>
<author>
<name sortKey="Frontali, C" uniqKey="Frontali C">C. Frontali</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pizzi, E" uniqKey="Pizzi E">E. Pizzi</name>
</author>
<author>
<name sortKey="Frontali, C" uniqKey="Frontali C">C. Frontali</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hacker, J" uniqKey="Hacker J">J. Hacker</name>
</author>
<author>
<name sortKey="Kaper, J B" uniqKey="Kaper J">J.B. Kaper</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Navarre, W W" uniqKey="Navarre W">W.W. Navarre</name>
</author>
<author>
<name sortKey="Porwollik, S" uniqKey="Porwollik S">S. Porwollik</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y. Wang</name>
</author>
<author>
<name sortKey="Mcclelland, M" uniqKey="Mcclelland M">M. McClelland</name>
</author>
<author>
<name sortKey="Rosen, H" uniqKey="Rosen H">H. Rosen</name>
</author>
<author>
<name sortKey="Libby, S J" uniqKey="Libby S">S.J. Libby</name>
</author>
<author>
<name sortKey="Fang, F C" uniqKey="Fang F">F.C. Fang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pizzi, E" uniqKey="Pizzi E">E. Pizzi</name>
</author>
<author>
<name sortKey="Frontali, C" uniqKey="Frontali C">C. Frontali</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pozzoli, U" uniqKey="Pozzoli U">U. Pozzoli</name>
</author>
<author>
<name sortKey="Menozzi, G" uniqKey="Menozzi G">G. Menozzi</name>
</author>
<author>
<name sortKey="Fumagalli, M" uniqKey="Fumagalli M">M. Fumagalli</name>
</author>
<author>
<name sortKey="Cereda, M" uniqKey="Cereda M">M. Cereda</name>
</author>
<author>
<name sortKey="Comi, G P" uniqKey="Comi G">G.P. Comi</name>
</author>
<author>
<name sortKey="Cagliani, R" uniqKey="Cagliani R">R. Cagliani</name>
</author>
<author>
<name sortKey="Bresolin, N" uniqKey="Bresolin N">N. Bresolin</name>
</author>
<author>
<name sortKey="Sironi, M" uniqKey="Sironi M">M. Sironi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chae, H" uniqKey="Chae H">H. Chae</name>
</author>
<author>
<name sortKey="Jinwoo, P" uniqKey="Jinwoo P">P. Jinwoo</name>
</author>
<author>
<name sortKey="Seong Whan, L" uniqKey="Seong Whan L">L. Seong-Whan</name>
</author>
<author>
<name sortKey="Kenneth, P N" uniqKey="Kenneth P">P.N. Kenneth</name>
</author>
<author>
<name sortKey="Sun, K" uniqKey="Sun K">K. Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, D A" uniqKey="Benson D">D.A. Benson</name>
</author>
<author>
<name sortKey="Ilene, K M" uniqKey="Ilene K">K.M. Ilene</name>
</author>
<author>
<name sortKey="Lipman, D J" uniqKey="Lipman D">D.J. Lipman</name>
</author>
<author>
<name sortKey="Ostell, J" uniqKey="Ostell J">J. Ostell</name>
</author>
<author>
<name sortKey="Wheeler, D L" uniqKey="Wheeler D">D.L. Wheeler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, K" uniqKey="Pearson K">K. Pearson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C. Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S. Karlin</name>
</author>
<author>
<name sortKey="Mrazek, J" uniqKey="Mrazek J">J. Mrázek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hunter, J D" uniqKey="Hunter J">J.D. Hunter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Acland, A" uniqKey="Acland A">A. Acland</name>
</author>
<author>
<name sortKey="Agarwala, R" uniqKey="Agarwala R">R. Agarwala</name>
</author>
<author>
<name sortKey="Barrett, T" uniqKey="Barrett T">T. Barrett</name>
</author>
<author>
<name sortKey="Beck, J" uniqKey="Beck J">J. Beck</name>
</author>
<author>
<name sortKey="Benson, D A" uniqKey="Benson D">D.A. Benson</name>
</author>
<author>
<name sortKey="Bollin, C" uniqKey="Bollin C">C. Bollin</name>
</author>
<author>
<name sortKey="Bolton, E" uniqKey="Bolton E">E. Bolton</name>
</author>
<author>
<name sortKey="Bryant, S H" uniqKey="Bryant S">S.H. Bryant</name>
</author>
<author>
<name sortKey="Canese, K" uniqKey="Canese K">K. Canese</name>
</author>
<author>
<name sortKey="Church, D M" uniqKey="Church D">D.M. Church</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zheng, Z M" uniqKey="Zheng Z">Z.M. Zheng</name>
</author>
<author>
<name sortKey="Baker, C C" uniqKey="Baker C">C.C. Baker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davison, A J" uniqKey="Davison A">A.J. Davison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elson, D" uniqKey="Elson D">D. Elson</name>
</author>
<author>
<name sortKey="Chargaff, E" uniqKey="Chargaff E">E. Chargaff</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dominguez, G" uniqKey="Dominguez G">G. Dominguez</name>
</author>
<author>
<name sortKey="Dambaugh, T R" uniqKey="Dambaugh T">T.R. Dambaugh</name>
</author>
<author>
<name sortKey="Stamey, F R" uniqKey="Stamey F">F.R. Stamey</name>
</author>
<author>
<name sortKey="Dewhurst, S N" uniqKey="Dewhurst S">S.N. Dewhurst</name>
</author>
<author>
<name sortKey="Inoue, S" uniqKey="Inoue S">S. Inoue</name>
</author>
<author>
<name sortKey="Pellett, P E" uniqKey="Pellett P">P.E. Pellett</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dolan, A" uniqKey="Dolan A">A. Dolan</name>
</author>
<author>
<name sortKey="Addison, C" uniqKey="Addison C">C. Addison</name>
</author>
<author>
<name sortKey="Gatherer, D" uniqKey="Gatherer D">D. Gatherer</name>
</author>
<author>
<name sortKey="Davison, A J" uniqKey="Davison A">A.J. Davison</name>
</author>
<author>
<name sortKey="Mcgeoch, D J" uniqKey="Mcgeoch D">D.J. McGeoch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Megaw, A G" uniqKey="Megaw A">A.G. Megaw</name>
</author>
<author>
<name sortKey="Rapaport, D" uniqKey="Rapaport D">D. Rapaport</name>
</author>
<author>
<name sortKey="Avidor, B" uniqKey="Avidor B">B. Avidor</name>
</author>
<author>
<name sortKey="Frenkel, N" uniqKey="Frenkel N">N. Frenkel</name>
</author>
<author>
<name sortKey="Davison, A J" uniqKey="Davison A">A.J. Davison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yunis, J J" uniqKey="Yunis J">J.J. Yunis</name>
</author>
<author>
<name sortKey="Sawyer, J R" uniqKey="Sawyer J">J.R. Sawyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pratas, D" uniqKey="Pratas D">D. Pratas</name>
</author>
<author>
<name sortKey="Silva, R M" uniqKey="Silva R">R.M. Silva</name>
</author>
<author>
<name sortKey="Pinho, A J" uniqKey="Pinho A">A.J. Pinho</name>
</author>
<author>
<name sortKey="Ferreira, P J S G" uniqKey="Ferreira P">P.J.S.G. Ferreira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Winzeler, E A" uniqKey="Winzeler E">E.A. Winzeler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hoelzer, K" uniqKey="Hoelzer K">K. Hoelzer</name>
</author>
<author>
<name sortKey="Shackelton, L A" uniqKey="Shackelton L">L.A. Shackelton</name>
</author>
<author>
<name sortKey="Parrish, C R" uniqKey="Parrish C">C.R. Parrish</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clay, O" uniqKey="Clay O">O. Clay</name>
</author>
<author>
<name sortKey="Caccio, S" uniqKey="Caccio S">S. Caccio</name>
</author>
<author>
<name sortKey="Zoubak, S" uniqKey="Zoubak S">S. Zoubak</name>
</author>
<author>
<name sortKey="Mouchiroud, D" uniqKey="Mouchiroud D">D. Mouchiroud</name>
</author>
<author>
<name sortKey="Bernardi, G" uniqKey="Bernardi G">G. Bernardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Duret, L" uniqKey="Duret L">L. Duret</name>
</author>
<author>
<name sortKey="Mouchiroud, D" uniqKey="Mouchiroud D">D. Mouchiroud</name>
</author>
<author>
<name sortKey="Gautier, C" uniqKey="Gautier C">C. Gautier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fullerton, S M" uniqKey="Fullerton S">S.M. Fullerton</name>
</author>
<author>
<name sortKey="Carvalho, A B" uniqKey="Carvalho A">A.B. Carvalho</name>
</author>
<author>
<name sortKey="Clark, A G" uniqKey="Clark A">A.G. Clark</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Genes (Basel)</journal-id>
<journal-id journal-id-type="iso-abbrev">Genes (Basel)</journal-id>
<journal-id journal-id-type="publisher-id">genes</journal-id>
<journal-title-group>
<journal-title>Genes</journal-title>
</journal-title-group>
<issn pub-type="epub">2073-4425</issn>
<publisher>
<publisher-name>MDPI</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28422050</article-id>
<article-id pub-id-type="pmc">5406869</article-id>
<article-id pub-id-type="doi">10.3390/genes8040122</article-id>
<article-id pub-id-type="publisher-id">genes-08-00122</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>
<italic>K</italic>
-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Sievers</surname>
<given-names>Aaron</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bosiek</surname>
<given-names>Katharina</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bisch</surname>
<given-names>Marc</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Dreessen</surname>
<given-names>Chris</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Riedel</surname>
<given-names>Jascha</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Froß</surname>
<given-names>Patrick</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hausmann</surname>
<given-names>Michael</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hildenbrand</surname>
<given-names>Georg</given-names>
</name>
<xref ref-type="aff" rid="af1-genes-08-00122">1</xref>
<xref ref-type="aff" rid="af2-genes-08-00122">2</xref>
<xref rid="c1-genes-08-00122" ref-type="corresp">*</xref>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Corominas</surname>
<given-names>Montserrat</given-names>
</name>
<role>Academic Editor</role>
</contrib>
</contrib-group>
<aff id="af1-genes-08-00122">
<label>1</label>
Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany;
<email>Sievers_Aaron@web.de</email>
(A.S.);
<email>KatharinaBosiek@gmx.de</email>
(K.B.);
<email>MarcBisch@gmx.de</email>
(M.B.);
<email>chrisdreessen@yahoo.de</email>
(C.D.);
<email>jaschelite@googlemail.com</email>
(J.R.);
<email>Fross@stud.uni-heidelberg.de</email>
(P.F.);
<email>hausmann@kip.uni-heidelberg.de</email>
(M.H.)</aff>
<aff id="af2-genes-08-00122">
<label>2</label>
Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany</aff>
<author-notes>
<corresp id="c1-genes-08-00122">
<label>*</label>
Correspondence:
<email>hilden@kip.uni-heidelberg.de</email>
; Tel.: +49-151-559-63919</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>19</day>
<month>4</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<month>4</month>
<year>2017</year>
</pub-date>
<volume>8</volume>
<issue>4</issue>
<elocation-id>122</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>2</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>4</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© 2017 by the authors.</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="open-access">
<license-p>Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
).</license-p>
</license>
</permissions>
<abstract>
<p>In genome analysis,
<italic>k-mer</italic>
-based comparison methods have become standard tools. However, although they deliver reliable results, other algorithms seem to work better in some cases. To improve
<italic>k</italic>
-mer-based DNA sequence analysis and comparison, we tested, with positive results, whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local
<italic>k</italic>
-mer spectra (frequency distribution of
<italic>k</italic>
-mers) was developed and used. The results were analyzed with positional information, by visualizing them as genomic maps and by applying basic vector correlation methods. The analysis concentrated on small word lengths (1 ≤
<italic>k</italic>
≤ 4) in relatively small viral genomes of
<italic>Papillomaviridae</italic>
and
<italic>Herpesviridae</italic>
, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in
<italic>Papillomaviridae</italic>
and
<italic>Herpesviridae</italic>
formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the
<italic>k</italic>
-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown
<italic>k</italic>
-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. Based on these results, we suggest
<italic>k</italic>
-mer analysis with positional resolution as a method for closing the gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high speed of standard
<italic>k</italic>
-mer analysis.</p>
</abstract>
<kwd-group>
<kwd>
<italic>k</italic>
-mer</kwd>
<kwd>
<italic>k</italic>
-mer analysis</kwd>
<kwd>sequence analysis</kwd>
<kwd>alignment-free</kwd>
<kwd>positional features</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="sec1-genes-08-00122">
<title>1. Introduction</title>
<p>In recent years,
<italic>k</italic>
-mer-based analysis and comparison methods have become standard tools for the analysis of large DNA sequences such as chromosomes, whole genomes, or even metagenomes. The big advantage of
<italic>k</italic>
-mer-based methods compared to alignment-based methods such as the well-established National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) software family [
<xref rid="B1-genes-08-00122" ref-type="bibr">1</xref>
], is the shorter computation time or, more precisely, the better scaling of computation time with sequence length [
<xref rid="B2-genes-08-00122" ref-type="bibr">2</xref>
].</p>
<p>For many purposes, standard
<italic>k</italic>
-mer methods deliver reliable results. However, there are also cases where their results are still unsatisfactory compared with those of other methods, for example, in determining the phylogenetic distance between two genomes, where the alignment of short DNA motifs such as highly conserved ribosomal RNA genes delivers very reliable results, while the results of
<italic>k</italic>
-mer methods are often uncertain [
<xref rid="B3-genes-08-00122" ref-type="bibr">3</xref>
]. Perhaps the main difference between the results of an alignment-based and
<italic>k</italic>
-mer-based method when used on the same data set (e.g., comparison of two genome sequences), is that the results of the alignment method can include the exact position (in bp) and quality of similarity of every part of the sequence within the data set. In contrast, the standard
<italic>k</italic>
-mer method interprets the sequences only as “bags of words”, thereby neglecting any positional information [
<xref rid="B4-genes-08-00122" ref-type="bibr">4</xref>
]. The result of a standard
<italic>k</italic>
-mer analysis mostly consists only of a number representing the similarity between each pair of sequences.</p>
<p>To improve the results of
<italic>k</italic>
-mer methods, it seems reasonable to add positional resolution to the analysis, while maintaining the advantage of a faster computation time. Algorithms performing such a local
<italic>k</italic>
-mer analysis have been developed and published [
<xref rid="B5-genes-08-00122" ref-type="bibr">5</xref>
,
<xref rid="B6-genes-08-00122" ref-type="bibr">6</xref>
]. A mapping of local
<italic>k</italic>
-mer spectra on chromosomes (called “genomic barcode”) was used to successfully correlate uncommon (with respect to the rest of the sequence) local
<italic>k</italic>
-mer structures with regions subject to horizontal gene transfer [
<xref rid="B5-genes-08-00122" ref-type="bibr">5</xref>
]. In another study, the coding and non-coding parts of the genome were compared between and inside of different organisms [
<xref rid="B6-genes-08-00122" ref-type="bibr">6</xref>
]. A correlation of local sequence features and
<italic>k</italic>
-mer-related data (namely local A/T ratio) was accomplished for parts of smaller eukaryote genomes [
<xref rid="B7-genes-08-00122" ref-type="bibr">7</xref>
]. Besides these positive prior results, it has long been known that certain DNA motifs of major scientific interest, like pathogenicity islands [
<xref rid="B8-genes-08-00122" ref-type="bibr">8</xref>
], target regions for gene silencing [
<xref rid="B9-genes-08-00122" ref-type="bibr">9</xref>
], low complexity regions [
<xref rid="B7-genes-08-00122" ref-type="bibr">7</xref>
], non-globular domains [
<xref rid="B10-genes-08-00122" ref-type="bibr">10</xref>
], transposons, or simply genes in general [
<xref rid="B11-genes-08-00122" ref-type="bibr">11</xref>
], are associated with peculiar local G+C content or monomer contents, respectively. Knowing this, we ask two questions in this paper. First: Is it possible to use local
<italic>k</italic>
-mer analysis results to effectively detect local DNA features formerly only visible through synteny- or alignment-based methods, or even those which are completely unknown? Second, since global
<italic>k</italic>
-mer features are known to be evolutionarily conserved [
<xref rid="B12-genes-08-00122" ref-type="bibr">12</xref>
]: Is it possible to correlate the presence, position, or characteristics of such features with evolutionary constraints, in order to show whether they are evolutionarily conserved?</p>
</sec>
<sec id="sec2-genes-08-00122">
<title>2. Materials and Methods</title>
<p>For our analysis, we used publicly available unmasked nucleotide sequences from the NCBI website in FASTA format (
<xref ref-type="table" rid="genes-08-00122-t001">Table 1</xref>
) [
<xref rid="B13-genes-08-00122" ref-type="bibr">13</xref>
]. In some cases, data in the GenBank format were also used, if a comparison with gene positions was required. Unmasked versions were used because one of the main goals was to limit prior computation and the use of prior knowledge. This also allows the identification of DNA features like low complexity regions, which are known to show peculiar
<italic>k</italic>
-mer patterns in some organisms [
<xref rid="B7-genes-08-00122" ref-type="bibr">7</xref>
] and which would most likely be affected by the masking of, e.g., repetitions.</p>
<sec id="sec2dot1-genes-08-00122">
<title>2.1.
<italic>K</italic>
-mer Analysis</title>
<p>In this article, a
<italic>k</italic>
-mer analysis of a DNA sequence is considered as the extraction and counting of every DNA word with length
<italic>k</italic>
(
<italic>k</italic>
bases along one strand), using a “sliding window” approach [
<xref rid="B4-genes-08-00122" ref-type="bibr">4</xref>
] to eliminate the influence of an arbitrarily chosen starting point. Therefore, we extract one word for every position within the analyzed sequence. We have chosen word lengths of 1 ≤
<italic>k</italic>
≤ 4 because many monomer (
<italic>k</italic>
= 1) features are well described, and because
<italic>k</italic>
= 4 seems to be a reasonable limit to produce low computation times while preventing a bias resulting from the influence of a potentially underlying amino acid code (for
<italic>k</italic>
= 3). We generated results for a range of
<italic>k</italic>
(1 ≤
<italic>k</italic>
≤ 4), instead of just using, e.g.,
<italic>k</italic>
= 4, because we wanted to analyze whether a detected feature is only present for a certain value of <italic>k</italic> or for a range of different values of <italic>k</italic>. This could provide insights into the word lengths responsible for generating these features, which could lead to different interpretations.</p>
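<p>The sliding-window extraction described above can be illustrated with a minimal Python sketch. It is not the program used for this article; the function name kmer_spectrum and the choice to skip words containing ambiguous bases (e.g., N) are assumptions made for illustration only.</p>
<preformat>
```python
from collections import Counter
from itertools import product


def kmer_spectrum(sequence, k):
    """Count every word of length k using a sliding window with step 1."""
    sequence = sequence.upper()
    # One word starts at every position that still leaves room for k bases.
    words = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: words containing anything other than A, C, G, T are skipped.
    counts = Counter(w for w in words if set(w).issubset("ACGT"))
    # Report all 4**k possible words in alphabetical order, including zeros.
    return {"".join(p): counts["".join(p)] for p in product("ACGT", repeat=k)}


spectrum = kmer_spectrum("ACGTACGT", 2)
print(spectrum["AC"], spectrum["CG"], spectrum["TA"])  # prints: 2 2 1
```
</preformat>
<p>Because the window advances by one base, the sequence ACGTACGT yields seven overlapping words for a word length of 2, independent of any chosen starting point.</p>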
<p>The result of such a
<italic>k</italic>
-mer analysis of a single sequence, meaning the frequencies/contents of each DNA word of length <italic>k</italic>, is called the associated “
<italic>k</italic>
-mer spectrum”. Such a
<italic>k</italic>
-mer spectrum can also be interpreted as a 4
<sup>
<italic>k</italic>
</sup>
-dimensional vector, and as such, vector distances can be applied to compare two or more
<italic>k</italic>
-mer spectra, in order to calculate a value which can be interpreted as a measure of similarity between the associated DNA sequences. For better normalization, the Pearson correlation function [
<xref rid="B14-genes-08-00122" ref-type="bibr">14</xref>
] was used instead of the Euclidean distance. Using the Spearman correlation function to prevent issues due to extreme values/contents did not result in any significant advantages. The Pearson correlation function is a normalized measure of the linear relationship between two vectors, with values between −1 and +1, where +1 stands for a perfect correlation and −1 for a perfect anticorrelation. Accordingly, values near one were considered a “good/high” correlation and values near zero a “bad/low” correlation.</p>
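<p>As an illustration of how two spectra can be compared, the following Python sketch computes the Pearson correlation of two vectors from its textbook definition. The function name pearson and the example counts are hypothetical, and returning NaN for a constant vector is our own convention for this sketch.</p>
<preformat>
```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant vector has no defined correlation; return NaN in that case.
    return numerator / denominator if denominator else float("nan")


# Hypothetical monomer counts (A, C, G, T) of two sequence segments.
spectrum_a = [10, 20, 30, 40]
spectrum_b = [20, 40, 60, 80]
print(pearson(spectrum_a, spectrum_b))        # proportional counts
print(pearson(spectrum_a, spectrum_b[::-1]))  # reversed preferences
```
</preformat>
<p>Since the mean is subtracted and the result is normalized, two segments with proportional word counts correlate perfectly even if their lengths differ, which is one reason a correlation can be preferable to the Euclidean distance.</p>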
<p>For the allocation of the individual
<italic>k</italic>
-mers, the bp position numbers (of the first base of each word) in the respective genomes were used. To prevent confusion when DNA words of different lengths are discussed, we will always use a slash if two (or more) separate sequences are meant. For example, “A/C” means the two different
<italic>k</italic>
= 1 words “A” and “C”. “AC” means the single
<italic>k</italic>
= 2 word, consisting of one “A”, followed by one “C”.</p>
</sec>
<sec id="sec2dot2-genes-08-00122">
<title>2.2. Local
<italic>k</italic>
-mer Analysis</title>
<p>To obtain a local
<italic>k</italic>
-mer spectrum from a large DNA sequence, a two-step approach was applied. First, we wrote an efficient program in C/C++ to save the positions of every single
<italic>k</italic>
-mer of a given length
<italic>k</italic>
using a hash map (using the DNA words as keys for the mapping). Our program saves the positions in a binary file and simultaneously creates an index structure for fast access at a later point in time. In a second step, a simple binning algorithm discretized our data by position. In the end, the results are equivalent to cutting the whole sequence into segments of equal length (corresponding to the bin width) and performing a
<italic>k</italic>
-mer analysis as described above for each of the resulting segments/cuts. We use this two-step approach instead of a more direct method because it allows us to generate results for different segment sizes (i.e., bin widths) more efficiently, by saving intermediate results to disk.</p>
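<p>The two-step approach can be sketched in Python as follows. This is a simplified illustration and not the C/C++ program described above: an in-memory dictionary stands in for the binary file and its index, and the function names index_positions and bin_positions are hypothetical.</p>
<preformat>
```python
from collections import defaultdict


def index_positions(sequence, k):
    """Step 1: map every word of length k to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return positions


def bin_positions(positions, sequence_length, bin_width):
    """Step 2: turn the position lists into one local spectrum per bin."""
    n_bins = -(-sequence_length // bin_width)  # ceiling division
    local_spectra = [defaultdict(int) for _ in range(n_bins)]
    for word, starts in positions.items():
        for start in starts:
            # A word is assigned to the bin that contains its first base.
            local_spectra[start // bin_width][word] += 1
    return local_spectra


sequence = "AAAACCCC"
index = index_positions(sequence, 1)
for bin_width in (4, 2):  # the index is reused for each bin width
    print(bin_width, [dict(s) for s in bin_positions(index, len(sequence), bin_width)])
```
</preformat>
<p>The position index is computed once and then binned repeatedly, which mirrors the motivation given above for storing intermediate results. In this sketch, a word that crosses a bin boundary is counted in the bin of its first base.</p>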
<p>The computation times of other
<italic>k</italic>
-mer tools, such as JellyFish [
<xref rid="B15-genes-08-00122" ref-type="bibr">15</xref>
], are not comparable to those of our tool, because we need to use a more complex algorithm to perform a local
<italic>k</italic>
-mer analysis (especially concerning writing operations to store the position information). However, the analysis of large chromosomes (e.g.,
<italic>Homo sapiens</italic>
c2) is possible in less than 20 min on ordinary desktop machines.</p>
</sec>
<sec id="sec2dot3-genes-08-00122">
<title>2.3. Relative Spectra</title>
<p>We created artificial spectra with a DNA word length of
<italic>k</italic>
+ 1, based on the extracted spectra with word length
<italic>k</italic>
using zero-order Markov models for
<italic>k</italic>
= 1 and higher order Markov chain models for
<italic>k</italic>
> 1.</p>
<p>By dividing the frequencies of the extracted spectra by the corresponding values of the artificially created spectra with the same word length
<italic>k</italic>
, one obtains relative spectra where the influence of the
<italic>k</italic>
= 1 frequencies, e.g., the influence of the G/C-content for
<italic>k</italic>
= 2, is removed. A similar method was developed by [
<xref rid="B16-genes-08-00122" ref-type="bibr">16</xref>
]. In the following, such spectra will be called “relative spectra” and their DNA word contents will be known as “relative contents”. If the spectra or DNA words without the prefix “relative” are mentioned, we always refer to the directly extracted spectra/words. A relative analysis separates the visible features obtained by a word length
<italic>k</italic>
from features originating from a word length
<italic>k</italic>
− 1. For example, the relative
<italic>k</italic>
= 3 spectrum shows information additional to that of the
<italic>k</italic>
= 2 spectrum, because it has been corrected for any trivial correlations.</p>
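<p>For the simplest case, a relative spectrum for words of length 2 under a zero-order Markov model, the calculation can be sketched in Python as shown below. The function name and the input dictionaries are hypothetical, and the higher-order models used for longer words are not covered by this sketch.</p>
<preformat>
```python
from itertools import product


def relative_dimer_spectrum(monomer_freq, dimer_freq):
    """Divide observed two-letter frequencies by zero-order expectations."""
    relative = {}
    for first, second in product("ACGT", repeat=2):
        # Zero-order model: bases are treated as independent, so the expected
        # frequency of a two-letter word is the product of its letter frequencies.
        expected = monomer_freq[first] * monomer_freq[second]
        word = first + second
        relative[word] = dimer_freq[word] / expected if expected else float("nan")
    return relative


# Hypothetical input: uniform base composition with a depleted word "CG".
monomers = {base: 0.25 for base in "ACGT"}
dimers = {"".join(p): 1 / 16 for p in product("ACGT", repeat=2)}
dimers["CG"] = 1 / 32
print(relative_dimer_spectrum(monomers, dimers)["CG"])  # prints: 0.5
```
</preformat>
<p>A relative content of 1 means that a word occurs exactly as often as its base composition alone would predict; values below or above 1 indicate depletion or enrichment that cannot be explained by the monomer content.</p>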
</sec>
<sec id="sec2dot4-genes-08-00122">
<title>2.4. Mapping</title>
<p>The results of a local
<italic>k</italic>
-mer analysis were visualized by mapping comparable to the method of [
<xref rid="B5-genes-08-00122" ref-type="bibr">5</xref>
]. A linear representation of the sequence from 5′ at the top, to 3′ at the bottom, was set up (even if some analyzed genome sequences are circular molecules in reality), with a column for each DNA word (sorted alphabetically). A linear mapping correlating color and
<italic>k</italic>
-mer frequency was used. White corresponds to the minimum value and a pure color to the maximum value found within the bins of the local
<italic>k</italic>
-mer analysis (see
<xref ref-type="sec" rid="sec2dot2-genes-08-00122">Section 2.2</xref>
). This approach ensures that the maximum color range, and thus the maximum available color resolution, is always exploited. The visualization was used to identify and localize particular features; to produce quantitative results concerning such a feature, we always extracted and analyzed the exact local
<italic>k</italic>
-mer spectrum of a specified region, even if we do not always explicitly show the associated data.</p>
</sec>
<sec id="sec2dot5-genes-08-00122">
<title>2.5. Correlation Heatmaps and Mean Correlations</title>
<p>The results of a local
<italic>k</italic>
-mer analysis of two or more sequences (e.g., genomes or parts of genomes) can be seen as a list of
<italic>k</italic>
-mer spectra, ordered by their position within the sequence. The Pearson correlation function, as mentioned above, was used to calculate a correlation value between each (local)
<italic>k</italic>
-mer spectrum of one set with each
<italic>k</italic>
-mer spectrum of the other set (or number of sets). The result is a two-dimensional matrix-like structure, where each row is associated with a region in one sequence (more precisely, with a bin associated with a certain region) and each column is associated with a position in the other sequence (or sequences). Similar structures were analyzed by [
<xref rid="B6-genes-08-00122" ref-type="bibr">6</xref>
]. Such a matrix can be displayed as a heatmap, by mapping the correlation value to a color scale. An area inside of such a heatmap is then equivalent to the correlation of local
<italic>k</italic>
-mer structures.</p>
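<p>The construction of such a correlation matrix can be sketched in Python as follows. The sketch assumes that each local spectrum is given as a list of word counts in a fixed word order; the function names and example values are hypothetical and do not reproduce the authors' scripts.</p>
<preformat>
```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator if denominator else float("nan")


def correlation_matrix(spectra_a, spectra_b):
    """Rows follow the bins of sequence A, columns the bins of sequence B."""
    return [[pearson(a, b) for b in spectra_b] for a in spectra_a]


# Hypothetical local monomer spectra (A, C, G, T) for two short sequences.
bins_a = [[1, 2, 3, 4], [4, 3, 2, 1]]
bins_b = [[2, 4, 6, 8], [1, 2, 3, 4], [4, 3, 2, 1]]
for row in correlation_matrix(bins_a, bins_b):
    print([round(value, 2) for value in row])
```
</preformat>
<p>Passing the resulting nested list to a plotting function that maps values to colors yields the heatmap; bins at similar positions with similar spectra then appear as elements along the diagonal.</p>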
<p>The mean values of such areas or complete heatmaps were taken to quantify the level of correlation between specific regions, while the standard deviation was used for an error estimation. Two mean values with overlapping error ranges were not assumed to be significantly different in a strict statistical sense. For example, a mean value of 0.75 ± 0.5 and a mean value of 0.99 ± 0.01 would not be considered significantly different, because their error ranges overlap. However, the relative accuracy of the two values indicates a large difference in the spread of the analyzed data sets.</p>
<p>To create the image files (not only the maps and heatmaps), we used simple Python scripts written by the authors, based on the well-known matplotlib library for Python 2.7 [
<xref rid="B17-genes-08-00122" ref-type="bibr">17</xref>
].</p>
</sec>
<sec id="sec2dot6-genes-08-00122">
<title>2.6. Heatmap Summary Images</title>
<p>Since showing many such heatmaps is not practical, and since heatmaps are often hard to interpret and compare, we use a visualization method producing “heatmap summary images”. To generate these images, we took the mean values and standard deviations of the correlation values in the heatmaps (see
<xref ref-type="sec" rid="sec2dot5-genes-08-00122">Section 2.5</xref>
), or of interesting areas within the heatmaps. We generated an image with these mean values as data points and the standard error of this mean value (derived from the standard deviation) as error bars. In this way, we created an easy-to-compare, two-variable summary of many heatmaps within one image.</p>
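<p>The summary statistics for one heatmap area can be sketched in Python as follows. Whether the sample or the population standard deviation was used is not stated in the text; the sketch assumes the sample standard deviation, and the function name summarize_area is hypothetical.</p>
<preformat>
```python
import math
import statistics


def summarize_area(values):
    """Return mean, sample standard deviation and standard error of the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    sem = std / math.sqrt(len(values))
    return mean, std, sem


# Hypothetical correlation values taken from one area of a heatmap.
area = [0.6, 0.7, 0.8, 0.9]
mean, std, sem = summarize_area(area)
print(round(mean, 3), round(std, 3), round(sem, 3))
```
</preformat>
<p>The mean would then be plotted as a data point and the standard error as its error bar, giving one point per heatmap or heatmap area.</p>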
<p>We should mention that the correlation values do not necessarily follow a normal distribution; therefore, such a representation is not guaranteed to be a good summary of heatmap data. However, since it reproduces the observations made within the heatmaps, it seems to be a reliable and objective method.</p>
</sec>
<sec id="sec2dot7-genes-08-00122">
<title>2.7. Software</title>
<p>All of the code and scripts (including visualization) used for this article, as well as an English manual, are freely available online at
<uri xlink:href="http://www.kip.uni-heidelberg.de/biophysik/software">http://www.kip.uni-heidelberg.de/biophysik/software</uri>
.</p>
</sec>
</sec>
<sec id="sec3-genes-08-00122">
<title>3. Results</title>
<sec id="sec3dot1-genes-08-00122">
<title>3.1. Viral Genomes</title>
<p>As a proof of principle for our approach, several viral genomes were chosen. Viral genomes are generally very short and therefore very fast to analyze using our
<italic>k</italic>
-mer algorithms. They contain only a small absolute number of genes and other DNA motifs, while nearly the entire genomic sequence has known and well-understood biological functions. This allows us to verify the results of a local
<italic>k</italic>
-mer analysis, as well as to gain new insights into the currently unknown functional correlations of characteristics of
<italic>k</italic>
-mer spectra.</p>
<sec id="sec3dot1dot1-genes-08-00122">
<title>3.1.1.
<italic>Papillomaviridae</italic>
</title>
<p>Human Papillomavirus (HPV) is a double-stranded, non-enveloped, circular DNA virus. We analyzed and compared genomic sequences of 11 different HPV types covering a wide phylogenetic range [
<xref rid="B18-genes-08-00122" ref-type="bibr">18</xref>
], also including some very close relatives.</p>
<p>All HPV genomes analyzed show at least three regions with distinguishable
<italic>k</italic>
-mer structures (
<xref ref-type="fig" rid="genes-08-00122-f001">Figure 1</xref>
). For an analysis of these regions, the contents of C and T were chosen as criteria. The first region is located in the top ~40% of each genome and shows a relatively low C content (~20% lower than the HPV average), associated with the genes labeled E1, E6, and E7 for each HPV genome shown (
<xref ref-type="fig" rid="genes-08-00122-f002">Figure 2</xref>
). This is followed by a relatively small region (~10%–20% in size) at the center of each genome showing a very high content of C (~20% over HPV average) and low content of T (~18% lower than HPV average) associated with the genes E2, E4, and E5 for every HPV type shown. Lastly, a large region (~40% in size) with a slight increase in the C content, but otherwise without clear monomer preferences located at the bottom, is associated with the late genes (L1, L2). For higher word lengths (1 ≤
<italic>k</italic>
≤ 4), the stated regions are also visible and show similar word contents for all HPV types (
<xref ref-type="app" rid="app1-genes-08-00122">Figures S1 and S2</xref>
).</p>
<p>The local mean correlation values between HPV types support the impression of distinctive
<italic>k</italic>
-mer structures inside each of the described regions (e.g.,
<xref ref-type="fig" rid="genes-08-00122-f003">Figure 3</xref>
shows a correlation heatmap between HPV 4 and HPV 5). According to the
<italic>k</italic>
-mer content in the map of
<xref ref-type="fig" rid="genes-08-00122-f001">Figure 1</xref>
and heatmap values of
<xref ref-type="fig" rid="genes-08-00122-f003">Figure 3</xref>
and
<xref ref-type="fig" rid="genes-08-00122-f004">Figure 4</xref>
, the relation between the top (E1, E6, E7) and bottom region (L1, L2) is dominated by good correlation values, while the central region (E2, E4, E5) is clearly recognizable as a larger region of low correlation with either of the other two. These observations are supported and displayed for all of the HPV types in
<xref ref-type="fig" rid="genes-08-00122-f005">Figure 5</xref>
. The high correlation values (0.6–0.9) in
<xref ref-type="fig" rid="genes-08-00122-f005">Figure 5</xref>
A,D confirm that the high similarity within the first and third identified region is present for, and also between, all HPV types analyzed, respectively. The presence of the low correlation between the first and the second region is confirmed for and between all HPV types by the relatively low values (−0.5–0.1) displayed in
<xref ref-type="fig" rid="genes-08-00122-f005">Figure 5</xref>
B. The correlation within the second region (see
<xref ref-type="fig" rid="genes-08-00122-f005">Figure 5</xref>
C) is not as high as within the first or third region, but is still significantly higher than between the first and second region for most HPV types.</p>
<p>This means that the gene-related regions described in [
<xref rid="B19-genes-08-00122" ref-type="bibr">19</xref>
] are visible through
<italic>k</italic>
-mer features. The relations between their
<italic>k</italic>
-mer structures differ from the relations between the genes when viewed in terms of early and late genes. It is also remarkable that the central C-rich region is the only region containing overlaps of a significant size between genes (ranging from 28% up to 100% for E2, E4, and E5, whereas the overlap of all other genes is below 10%, and for most of them below 5%). This could mean that the identified
<italic>k</italic>
-mer features are not directly associated with the time of gene activity, but with the gene density or gene overlap.</p>
<p>The longer we set the word length
<italic>k</italic>
, the more extreme the correlation values become. This means, for example, for
<italic>k</italic>
= 4, only very similar regions show a good correlation, while the mean correlation value becomes almost zero (visible in
<xref ref-type="fig" rid="genes-08-00122-f003">Figure 3</xref>
and
<xref ref-type="fig" rid="genes-08-00122-f004">Figure 4</xref>
). Therefore, we found a feature for
<italic>k</italic>
= 4 that was not visible at a smaller
<italic>k</italic>
value, namely a slightly better correlation of a linearly distributed number of bins, e.g., visible between HPV 4 and HPV 5 as diagonal elements in
<xref ref-type="fig" rid="genes-08-00122-f003">Figure 3</xref>
B. This linear structure is equivalent to the fact that the linear positional distribution pattern of
<italic>k</italic>
-mers is conserved between two sequences, and is therefore related to similar sequences of DNA words and probably related to a good alignment result. Such linear structures exist in every HPV genome analyzed. It is remarkable that the linear structure is weakest for any correlation of the genomes with HPV 7, which is the only analyzed virus classified as
<italic>Alpha Papillomaviridae</italic>
. This may represent the evolutionary distance to all other types, classified as
<italic>Beta</italic>
and
<italic>Gamma Papillomaviridae</italic>
. All of the
<italic>Beta Papillomaviridae</italic>
(HPV 5, HPV 9, HPV 49, HPV 92, and HPV 96) show a strong correlation among themselves and a similar but slightly weaker correlation with HPV 4, the only
<italic>Gamma Papillomaviridae</italic>
. HPV 4 itself shows a very strong linear structure when correlated with any of the unclassified species HPV 136, HPV 140, HPV 154, and HPV 178 (examples given in
<xref ref-type="fig" rid="genes-08-00122-f004">Figure 4</xref>
). All unclassified species show strong linear structures when compared to each other or to HPV 4. Slight correlations may be found when HPV 136 and HPV 140 are compared with HPV 92. This could mean that the unclassified (with respect to HPV subphyla) HPV types are not related to HPV 7 and are presumably more closely linked to the subphylum of
<italic>Gamma Papillomaviridae</italic>
, namely HPV 4, than to the
<italic>Beta Papillomaviridae</italic>
.</p>
</sec>
<sec id="sec3dot1dot2-genes-08-00122">
<title>3.1.2.
<italic>Herpesviridae</italic>
</title>
<p>As a second group of representatives, we chose a wide phylogenetic range of the Human Herpesvirus (HHV) types [
<xref rid="B20-genes-08-00122" ref-type="bibr">20</xref>
]. HHV is an enveloped virus with a linear genome and is well known to have one of the largest and most complex genomes of all viruses.</p>
<p>The analyzed HHV types cover a relatively wide range of global G/C contents, from G+C above 64% for HHV1 and HHV2 to below 40% for HHV7. All of the HHV genomes analyzed obey Chargaff's second rule [
<xref rid="B21-genes-08-00122" ref-type="bibr">21</xref>
] on a global scale. There is a clear tendency for higher C/G values in a region near to, but not necessarily always directly at, the ends and the beginnings of all HHV genomes. Moreover, every HHV type seems to have a small part, 1500 bp–15,000 bp, with an extremely high C and/or G content (and a low A/T content, respectively) in the 3′ part, accounting for 10%–25% of their genomes (
<xref ref-type="fig" rid="genes-08-00122-f006">Figure 6</xref>
). The local contents of G/C and A/T, especially in these regions, often clearly violate Chargaff's second rule, thereby justifying the separate treatment of G and C (and of A and T, respectively) for HHV. These regions are also visible for higher
<italic>k</italic>
values (
<xref ref-type="app" rid="app1-genes-08-00122">Figures S3 and S4</xref>
) and are the only feature clearly visible on relative
<italic>k</italic>
-mer maps (
<xref ref-type="fig" rid="genes-08-00122-f007">Figure 7</xref>
). In contrast to
<italic>k</italic>
= 1, the features on the relative maps do not share a very uniform structure considering relative
<italic>k</italic>
-mer contents, except for very close relatives (HHV6A, HHV6B, and HHV4 type 1 and 2).</p>
<p>HHV6A and HHV6B are closely related, but due to differences discovered through alignment methods, are not considered as a single species [
<xref rid="B22-genes-08-00122" ref-type="bibr">22</xref>
]. Therefore, their genome sequences should not be as similar as the two HHV4 types are. HHV6A and HHV6B show very similar structures in
<xref ref-type="fig" rid="genes-08-00122-f006">Figure 6</xref>
and
<xref ref-type="fig" rid="genes-08-00122-f007">Figure 7</xref>
, but are far less similar to each other than the two types of HHV4 are. Again, this confirms former classifications (see
<xref ref-type="fig" rid="genes-08-00122-f008">Figure 8</xref>
and Figure 11B). A comparison between HHV6A and 6B using an alignment method was made in [
<xref rid="B22-genes-08-00122" ref-type="bibr">22</xref>
]. The authors identified regions of extremely low, high, and extremely high conservation, and again associated their location with certain genes (all marked with bars in
<xref ref-type="fig" rid="genes-08-00122-f009">Figure 9</xref>
).</p>
<p>When looking at the maps, the
<italic>k</italic>
= 1 patterns of HHV6A and 6B look quite similar, whereas a correlation visualized in
<xref ref-type="fig" rid="genes-08-00122-f009">Figure 9</xref>
shows that the two genome sequences have complex structural relations. At least five different regions with a relatively high degree of self-similarity (red regions in the correlation map) become visible for the monomer structures (
<xref ref-type="fig" rid="genes-08-00122-f009">Figure 9</xref>
A). The regions at 5′ and 3′ show similarities among themselves, both being present in low conservation regions. The region between U3 and U41 shows a structure clearly distinguishable from the structure between U41 and U90 for HHV6A and 6B. The first region is associated with regions of high identity (blue and green bars) in [
<xref rid="B22-genes-08-00122" ref-type="bibr">22</xref>
]. The second is also mostly associated with regions of high identity, but also spans the region of extremely low identity around U90. In the map, this region does not have a very clear representation on a monomer level (
<xref ref-type="fig" rid="genes-08-00122-f006">Figure 6</xref>
) and is inhomogeneous in its correlation values for
<italic>k</italic>
= 1. It should also be noted that [
<xref rid="B22-genes-08-00122" ref-type="bibr">22</xref>
] mentioned issues with the alignment in this specific region and therefore changed the parameters of their analysis.</p>
<p>Unlike the monomer structures, the relative
<italic>k</italic>
= 2 structures (
<xref ref-type="fig" rid="genes-08-00122-f009">Figure 9</xref>
B) of HHV6A and 6B clearly show differences between regions of low and high conservation. The larger regions, with a relatively high self-similarity, seem to match regions of high conservation to a higher degree. The region around U90 is especially visible as a very small area of self-similarity, but does not exhibit a strong relation with any other area of self-similarity. This supports the usefulness of considering relative
<italic>k</italic>
-mers, while also proving that not all
<italic>k</italic>
-mer features are simply based on the monomer content (or even C+G content).</p>
<p>HHV7 is also a close relative of 6A and 6B [
<xref rid="B24-genes-08-00122" ref-type="bibr">24</xref>
]. Therefore, it is expected to show similar
<italic>k</italic>
-mer structures. It was shown in [
<xref rid="B24-genes-08-00122" ref-type="bibr">24</xref>
] that the differences between 6A, 6B, and 7 are located mainly at the so-called “repeat regions” at the beginnings and ends of the genomes. Correlations of
<italic>k</italic>
-mer structures confirm these results (
<xref ref-type="fig" rid="genes-08-00122-f010">Figure 10</xref>
). Besides the fact that the beginnings and ends show low mean conservation values, it is remarkable that their linear relative
<italic>k</italic>
-mer structure within one genome is strictly conserved when comparing the 3′ region with the 5′ region (
<xref ref-type="fig" rid="genes-08-00122-f011">Figure 11</xref>
). This feature exists only for 6A, 6B, and 7. For 6A and 6B, the structure is also conserved between the two genomes, although this is not true when compared with 7 (
<xref ref-type="fig" rid="genes-08-00122-f011">Figure 11</xref>
).</p>
<p>In contrast to the analyzed HPV types, only a small number of closely related groups of HHV show a conserved overall linear
<italic>k</italic>
-mer structure (
<xref ref-type="fig" rid="genes-08-00122-f011">Figure 11</xref>
). Conserved linear structures were only visible between HHV1 and 2, which have a relation similar to HHV6A/B and 7 [
<xref rid="B20-genes-08-00122" ref-type="bibr">20</xref>
], between both types of HHV4, and between HHV6A/6B and 7. Most of them are also only clearly visible for
<italic>k</italic>
= 4. To check if the absence of conserved linear
<italic>k</italic>
-mer structures depends on the larger bin width, we repeated the analysis with a bin width of 100 bp, which did not change the result.</p>
</sec>
</sec>
<sec id="sec3dot2-genes-08-00122">
<title>3.2. Homo sapiens</title>
<p>To check whether our methods are also reliable and efficiently usable on larger scale sequences, we created maps of the relatively large human (
<italic>Homo sapiens</italic>
) chromosome 2 (HSc2) and chimpanzee (
<italic>Pan troglodytes</italic>
) chromosomes 2A and 2B (PTc2A, PTc2B respectively). It is a well-known fact that HSc2 is the result of a fusion of chromosomes related to PTc2A and PTc2B of an ancestor of
<italic>H. sapiens</italic>
[
<xref rid="B25-genes-08-00122" ref-type="bibr">25</xref>
]. By using our methods on HSc2 and PTc2A/B, a number of linear conserved regions could be identified (
<xref ref-type="fig" rid="genes-08-00122-f012">Figure 12</xref>
). We should mention that, in contrast to the analyzed viral genomes, it was necessary to use a very high threshold for the color scale of the correlation heatmap to recognize linear structures along the diagonal, even for
<italic>k</italic>
= 4. This seems to be mostly due to the relatively large bin width of 2.5 Mbp that we used to create the images. A value of 2.5 Mbp is arguably large enough for a segment to show a specific genomic
<italic>k</italic>
-mer structure. This effect of a specific genomic
<italic>k</italic>
-mer structure for segments of sufficient size was previously described and used in [
<xref rid="B5-genes-08-00122" ref-type="bibr">5</xref>
]. An association between the human and chimpanzee chromosomal sequences to identify closely related regions was demonstrated in [
<xref rid="B26-genes-08-00122" ref-type="bibr">26</xref>
]. The regions that exhibited an association between HSc2 and the two PTc2 chromosomes are visualized in
<xref ref-type="fig" rid="genes-08-00122-f012">Figure 12</xref>
by bars. Boxes based on these bars have been drawn on the correlation heatmap. The linearly conserved regions, identified by our method as diagonals in a correlation heatmap, seem to be comparable to the results of [
<xref rid="B26-genes-08-00122" ref-type="bibr">26</xref>
]. This illustrates and confirms the functionality of our methods for large sequences.</p>
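<p>As a minimal sketch of how a binned correlation heatmap of this kind could be computed (this is not the authors' software), the following Python code assumes that each sequence is cut into consecutive bins of a fixed width, that each bin is described by its relative
<italic>k</italic>
-mer frequencies, and that bins are compared by the Pearson correlation coefficient. The function names (kmer_profile, binned_profiles, pearson, correlation_heatmap) and the handling of ambiguous bases are illustrative assumptions.</p>
<preformat>
```python
from itertools import product
from math import sqrt


def kmer_profile(segment, k):
    """Relative frequency of every k-mer over ACGT, in alphabetical order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(segment) - k + 1):
        word = segment[i:i + k]
        if word in counts:  # assumption: words with N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def binned_profiles(sequence, k, bin_width):
    """Split the sequence into consecutive bins and profile each bin."""
    return [kmer_profile(sequence[start:start + bin_width], k)
            for start in range(0, len(sequence), bin_width)]


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm if norm else 0.0


def correlation_heatmap(seq_a, seq_b, k, bin_width):
    """Correlation matrix: rows are bins of seq_a, columns are bins of seq_b."""
    prof_a = binned_profiles(seq_a, k, bin_width)
    prof_b = binned_profiles(seq_b, k, bin_width)
    return [[pearson(pa, pb) for pb in prof_b] for pa in prof_a]


if __name__ == "__main__":
    # Toy sequences only; real input would be whole genomes or chromosomes.
    toy = "ACGTACGTAAAA" + "GGGGCCCCGGCC"
    for row in correlation_heatmap(toy, toy, k=2, bin_width=12):
        print([round(value, 2) for value in row])
```
</preformat>
<p>Under these assumptions, a linearly conserved region between two sequences would show up as a run of high values along a diagonal of the returned matrix. The bin width and the color-scale threshold applied to the matrix remain free parameters, as discussed above.</p>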
</sec>
</sec>
<sec id="sec4-genes-08-00122">
<title>4. Discussion</title>
<p>In this article, two questions were addressed. First, we showed that local
<italic>k</italic>
-mer structures correspond to the results of genome feature analysis. We identified several local
<italic>k</italic>
-mer features representing regions with high or low conservation, as confirmed by independent methods (mostly alignment or gene based). We distinguished the
<italic>k</italic>
-mer structures of the early and late gene regions for all HPV types and obtained a measure for the quality of conservation between different types of HPV genomes. Expanding our analysis to human and chimpanzee chromosomes confirmed the expected homologies. The second question focused on whether such features are evolutionarily conserved. Regions identified as conserved by other methods were also visible in the
<italic>k</italic>
-mer-based results and were reasonably associated with phylogenetic distances. Some features (e.g., the linear conservation of the beginning and ending regions in HHV6A/6B and 7) were only visible for close relatives. The conservation of linear structures visible on correlation heatmaps seems to clearly indicate closely related sequence regions. Accordingly, a classification of some unclassified HPV types was feasible. Nevertheless, these results should be verified by other techniques in order to demonstrate their usefulness for the classification of other genomes without the requirement of prior knowledge (e.g., gene function), which is even more probable for
<italic>k</italic>
-mer analysis with positional resolution as a whole.</p>
<p>Furthermore, a number of local
<italic>k</italic>
-mer features were identified for which no description or explanation could be found in any publication, and which thus may not be detectable by other methods, or only with difficulty. This is the case for the repetitive region at the top of the HPV4 genomes and also for the linearly conserved beginnings and endings of HHV6A/6B and 7. Highly conserved sequences are often an indication of evolutionary pressure and therefore of biological or physical functionality, which leads to the assumption that the identified regions are functionally important.</p>
<p>The hypothesis that the overall
<italic>k</italic>
-mer content, especially the G/C content, or local aberrations from the global
<italic>k</italic>
-mer structure are involved in, for example, immune evasion is not new [
<xref rid="B9-genes-08-00122" ref-type="bibr">9</xref>
,
<xref rid="B27-genes-08-00122" ref-type="bibr">27</xref>
,
<xref rid="B28-genes-08-00122" ref-type="bibr">28</xref>
]. Many other interesting genomic features and motifs like codon usage [
<xref rid="B29-genes-08-00122" ref-type="bibr">29</xref>
], gene length [
<xref rid="B30-genes-08-00122" ref-type="bibr">30</xref>
], and the distribution and classification of repetitive elements [
<xref rid="B31-genes-08-00122" ref-type="bibr">31</xref>
], are known to be associated with local G/C content. Higher word lengths
<italic>k</italic>
could give different perspectives on certain genomic features or even lead to the detection of currently unknown features. Therefore,
<italic>k</italic>
-mer analysis with positional resolution might deliver new insights into rules for genome organization, structuring, and evolution.</p>
<p>In summary,
<italic>k</italic>
-mer analysis with positional resolution, like standard
<italic>k</italic>
-mer analysis, is very fast, even for sequences far beyond the length scales suitable for alignment methods (e.g., human chromosomes), while generating reliable results even for sequences with a very low level of similarity (where alignment methods often fail). Most of the features detected are not visible by a standard
<italic>k</italic>
-mer analysis (without positional resolution), since their location and colocalization with other DNA motifs were essential for their detection and identification, and none of them required secondary biological or other information sources such as gene databases. Of course, a local
<italic>k</italic>
-mer analysis does not outclass every other
<italic>k</italic>
-mer method, and alignment methods like NCBI BLAST remain very powerful tools, but local
<italic>k</italic>
-mer analysis delivers an interesting additional perspective on sequence data and may close a gap between alignment and alignment-free methods. We believe that further analysis of more DNA sequences and more specific identification and categorization for the cataloging of
<italic>k</italic>
-mer features might provide new insights into the mysteries of complex genomes.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>We acknowledge financial support by Deutsche Forschungsgemeinschaft and Ruprecht-Karls-Universität Heidelberg within the funding programme Open Access Publishing.</p>
</ack>
<app-group>
<app id="app1-genes-08-00122">
<title>Supplementary Materials</title>
<p>The following are available online at
<uri xlink:href="www.mdpi.com/2073-4425/8/4/122/s1">www.mdpi.com/2073-4425/8/4/122/s1</uri>
. Table S1: HPV Region Boundaries. Figure S1: Map of HPV viruses for
<italic>k</italic>
= 2. From left to right: HPV4, 5, 7, 9, 49, 92, 96, 136, 140, 154, 178. A bin width of 100 bp was used. Words are ordered alphabetically from the left in each genome, starting with AA. Genes are represented by colored bars on the right side of the linear representation of the circular HPV genomes (E1 red, E2 blue, E4 green, E5 yellow, E6 orange, E7 purple, L1 magenta, L2 grey). The box indicates region 2; for the borders of all regions, see Table S1. Figure S2: Map of HPV viruses for
<italic>k</italic>
= 3. From left to right: HPV4, 5, 7, 9, 49, 92, 96, 136, 140, 154, 178. A bin width of 100 bp was used. Words are ordered alphabetically from the left in each genome, starting with AAA. The box indicates region 2; for the borders of all regions, see Table S1. Figure S3: Map of HHV genomes for
<italic>k</italic>
= 2, bin width of 500 bp. From left to right: HHV1, 2, 3, 5, 6A, 6B, 7, 4 type 1, 4 type 2, 8. Words are ordered alphabetically from the left in each genome, starting with AA. Figure S4: Map of HHV genomes for
<italic>k</italic>
= 3, bin width of 500 bp. From left to right: HHV1, 2, 3, 5, 6A, 6B, 7, 4 type 1, 4 type 2, 8. Words are ordered alphabetically from the left in each genome, starting with AAA.</p>
<supplementary-material content-type="local-data" id="genes-08-00122-s001">
<media xlink:href="genes-08-00122-s001.zip">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</app>
</app-group>
<notes>
<title>Author Contributions</title>
<p>A.S. developed and wrote the software used for conversion, analysis, storage, and visualization; A.S. and G.H. designed and discussed the methods and algorithms, and interpreted and discussed the results; A.S. and G.H. also wrote the paper in close cooperation with M.H.; A.S. did the necessary literature searches and other research under the supervision of M.H. and G.H.; A.S., K.B., M.B., J.R., C.D., and P.F. tested the software developed by A.S. and applied it to the different data sets referred to within this paper.</p>
</notes>
<notes notes-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflict of interest.</p>
</notes>
<ref-list>
<title>References</title>
<ref id="B1-genes-08-00122">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>S.F.</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E.W.</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D.J.</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J. Mol. Biol.</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="B2-genes-08-00122">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chan</surname>
<given-names>C.X.</given-names>
</name>
<name>
<surname>Ragan</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
<article-title>Next-generation phylogenetics</article-title>
<source>Biol. Direct</source>
<year>2013</year>
<volume>8</volume>
<pub-id pub-id-type="doi">10.1186/1745-6150-8-3</pub-id>
<pub-id pub-id-type="pmid">23339707</pub-id>
</element-citation>
</ref>
<ref id="B3-genes-08-00122">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alsop</surname>
<given-names>E.B.</given-names>
</name>
<name>
<surname>Raymond</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Resolving prokaryotic taxonomy without rRNA: Longer oligonucleotide word lengths improve genome and metagenome taxonomic classification</article-title>
<source>PLoS ONE</source>
<year>2013</year>
<volume>8</volume>
<elocation-id>e67337</elocation-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0067337</pub-id>
<pub-id pub-id-type="pmid">23840870</pub-id>
</element-citation>
</ref>
<ref id="B4-genes-08-00122">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brendel</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Beckmann</surname>
<given-names>J.S.</given-names>
</name>
<name>
<surname>Trifonov</surname>
<given-names>E.N.</given-names>
</name>
</person-group>
<article-title>Linguistics of nucleotide sequences: morphology and comparison of vocabularies</article-title>
<source>J. Biomol. Struct. Dyn.</source>
<year>1986</year>
<volume>4</volume>
<fpage>11</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="doi">10.1080/07391102.1986.10507643</pub-id>
<pub-id pub-id-type="pmid">3078230</pub-id>
</element-citation>
</ref>
<ref id="B5-genes-08-00122">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Olman</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<article-title>Barcodes for genomes and applications</article-title>
<source>BMC Bioinform.</source>
<year>2008</year>
<volume>9</volume>
<elocation-id>546</elocation-id>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-546</pub-id>
<pub-id pub-id-type="pmid">19091119</pub-id>
</element-citation>
</ref>
<ref id="B6-genes-08-00122">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bultrini</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Del Giudice</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Frontali</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>Pentamer vocabularies characterizing introns and intron-like intergenic tracts from
<italic>Caenorhabditis elegans</italic>
and
<italic>Drosophila melanogaster</italic>
</article-title>
<source>Gene</source>
<year>2003</year>
<volume>304</volume>
<fpage>183</fpage>
<lpage>192</lpage>
<pub-id pub-id-type="doi">10.1016/S0378-1119(02)01206-4</pub-id>
<pub-id pub-id-type="pmid">12568727</pub-id>
</element-citation>
</ref>
<ref id="B7-genes-08-00122">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pizzi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Frontali</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>Low-complexity regions in
<italic>Plasmodium falciparum</italic>
proteins</article-title>
<source>Genome Res.</source>
<year>2001</year>
<volume>11</volume>
<fpage>218</fpage>
<lpage>229</lpage>
<pub-id pub-id-type="doi">10.1101/gr.GR-1522R</pub-id>
<pub-id pub-id-type="pmid">11157785</pub-id>
</element-citation>
</ref>
<ref id="B8-genes-08-00122">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hacker</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kaper</surname>
<given-names>J.B.</given-names>
</name>
</person-group>
<article-title>Pathogenicity islands and the evolution of microbes</article-title>
<source>Annu. Rev. Microbiol.</source>
<year>2000</year>
<volume>54</volume>
<fpage>641</fpage>
<lpage>679</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.micro.54.1.641</pub-id>
<pub-id pub-id-type="pmid">11018140</pub-id>
</element-citation>
</ref>
<ref id="B9-genes-08-00122">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Navarre</surname>
<given-names>W.W.</given-names>
</name>
<name>
<surname>Porwollik</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>McClelland</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Rosen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Libby</surname>
<given-names>S.J.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>F.C.</given-names>
</name>
</person-group>
<article-title>Selective silencing of foreign DNA with low GC content by the H-NS protein in Salmonella</article-title>
<source>Science</source>
<year>2006</year>
<volume>313</volume>
<fpage>236</fpage>
<lpage>238</lpage>
<pub-id pub-id-type="doi">10.1126/science.1128794</pub-id>
<pub-id pub-id-type="pmid">16763111</pub-id>
</element-citation>
</ref>
<ref id="B10-genes-08-00122">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pizzi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Frontali</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>Divergence of noncoding sequences and of insertions encoding nonglobular domains at a genomic region well conserved in plasmodia</article-title>
<source>J. Mol. Evolut.</source>
<year>2000</year>
<volume>50</volume>
<fpage>474</fpage>
<lpage>480</lpage>
<pub-id pub-id-type="doi">10.1007/s002390010050</pub-id>
<pub-id pub-id-type="pmid">10824091</pub-id>
</element-citation>
</ref>
<ref id="B11-genes-08-00122">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pozzoli</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Menozzi</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Fumagalli</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cereda</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Comi</surname>
<given-names>G.P.</given-names>
</name>
<name>
<surname>Cagliani</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Bresolin</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Sironi</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Both selective and neutral processes drive GC content evolution in the human genome</article-title>
<source>BMC Evolut. Biol.</source>
<year>2008</year>
<volume>8</volume>
<elocation-id>99</elocation-id>
<pub-id pub-id-type="doi">10.1186/1471-2148-8-99</pub-id>
<pub-id pub-id-type="pmid">18371205</pub-id>
</element-citation>
</ref>
<ref id="B12-genes-08-00122">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chae</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Jinwoo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Seong-Whan</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kenneth</surname>
<given-names>P.N.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>K.</given-names>
</name>
</person-group>
<article-title>Comparative analysis using k-mer and k-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes</article-title>
<source>Nucleic Acids Res.</source>
<year>2013</year>
<volume>41</volume>
<fpage>4783</fpage>
<lpage>4791</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkt144</pub-id>
<pub-id pub-id-type="pmid">23519616</pub-id>
</element-citation>
</ref>
<ref id="B13-genes-08-00122">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benson</surname>
<given-names>D.A.</given-names>
</name>
<name>
<surname>Ilene</surname>
<given-names>K.M.</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D.J.</given-names>
</name>
<name>
<surname>Ostell</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wheeler</surname>
<given-names>D.L.</given-names>
</name>
</person-group>
<article-title>GenBank</article-title>
<source>Nucleic Acids Res.</source>
<year>2005</year>
<volume>33</volume>
<fpage>D34</fpage>
<lpage>D38</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gki063</pub-id>
<pub-id pub-id-type="pmid">15608212</pub-id>
</element-citation>
</ref>
<ref id="B14-genes-08-00122">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pearson</surname>
<given-names>K.</given-names>
</name>
</person-group>
<article-title>Note on regression and inheritance in the case of two parents</article-title>
<source>Proc. R. Soc. Lond.</source>
<year>1895</year>
<volume>58</volume>
<fpage>240</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="doi">10.1098/rspl.1895.0041</pub-id>
</element-citation>
</ref>
<ref id="B15-genes-08-00122">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marçais</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="B16-genes-08-00122">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karlin</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mrázek</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Compositional differences within and between eukaryotic genomes</article-title>
<source>Proc. Natl. Acad. Sci. USA</source>
<year>1997</year>
<volume>94</volume>
<fpage>10227</fpage>
<lpage>10232</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.94.19.10227</pub-id>
<pub-id pub-id-type="pmid">9294192</pub-id>
</element-citation>
</ref>
<ref id="B17-genes-08-00122">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hunter</surname>
<given-names>J.D.</given-names>
</name>
</person-group>
<article-title>Matplotlib: A 2D graphics environment</article-title>
<source>Comput. Sci. Eng.</source>
<year>2007</year>
<volume>9</volume>
<fpage>90</fpage>
<lpage>95</lpage>
<pub-id pub-id-type="doi">10.1109/MCSE.2007.55</pub-id>
</element-citation>
</ref>
<ref id="B18-genes-08-00122">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Acland</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Agarwala</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Barrett</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Beck</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Benson</surname>
<given-names>D.A.</given-names>
</name>
<name>
<surname>Bollin</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Bolton</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>S.H.</given-names>
</name>
<name>
<surname>Canese</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>D.M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Database resources of the National Center for Biotechnology Information</article-title>
<source>Nucleic Acids Res.</source>
<year>2009</year>
<volume>40</volume>
<fpage>D13</fpage>
<lpage>D25</lpage>
</element-citation>
</ref>
<ref id="B19-genes-08-00122">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname>
<given-names>Z.M.</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>C.C.</given-names>
</name>
</person-group>
<article-title>Papillomavirus genome structure, expression, and post-transcriptional regulation</article-title>
<source>Front. Biosci.</source>
<year>2006</year>
<volume>11</volume>
<fpage>2286</fpage>
<lpage>2302</lpage>
<pub-id pub-id-type="doi">10.2741/1971</pub-id>
<pub-id pub-id-type="pmid">16720315</pub-id>
</element-citation>
</ref>
<ref id="B20-genes-08-00122">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Davison</surname>
<given-names>A.J.</given-names>
</name>
</person-group>
<article-title>Evolution of sexually transmitted and sexually transmissible human herpesviruses</article-title>
<source>Ann. N. Y. Acad. Sci.</source>
<year>2011</year>
<volume>1230</volume>
<fpage>E37</fpage>
<lpage>E49</lpage>
<pub-id pub-id-type="doi">10.1111/j.1749-6632.2011.06358.x</pub-id>
<pub-id pub-id-type="pmid">22417106</pub-id>
</element-citation>
</ref>
<ref id="B21-genes-08-00122">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elson</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Chargaff</surname>
<given-names>E.</given-names>
</name>
</person-group>
<article-title>On the desoxyribonucleic acid content of sea urchin gametes</article-title>
<source>Experientia</source>
<year>1952</year>
<volume>8</volume>
<fpage>143</fpage>
<lpage>145</lpage>
<pub-id pub-id-type="doi">10.1007/BF02170221</pub-id>
</element-citation>
</ref>
<ref id="B22-genes-08-00122">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dominguez</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dambaugh</surname>
<given-names>T.R.</given-names>
</name>
<name>
<surname>Stamey</surname>
<given-names>F.R.</given-names>
</name>
<name>
<surname>Dewhurst</surname>
<given-names>S.N.</given-names>
</name>
<name>
<surname>Inoue</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pellett</surname>
<given-names>P.E.</given-names>
</name>
</person-group>
<article-title>Human herpesvirus 6B genome sequence: Coding content and comparison with human herpesvirus 6A</article-title>
<source>J. Virol.</source>
<year>1999</year>
<volume>73</volume>
<fpage>8040</fpage>
<lpage>8052</lpage>
</element-citation>
</ref>
<ref id="B23-genes-08-00122">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dolan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Addison</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gatherer</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Davison</surname>
<given-names>A.J.</given-names>
</name>
<name>
<surname>McGeoch</surname>
<given-names>D.J.</given-names>
</name>
</person-group>
<article-title>The genome of Epstein-Barr virus type 2 strain AG876</article-title>
<source>Virology</source>
<year>2006</year>
<volume>350</volume>
<fpage>164</fpage>
<lpage>170</lpage>
<pub-id pub-id-type="doi">10.1016/j.virol.2006.01.015</pub-id>
<pub-id pub-id-type="pmid">16490228</pub-id>
</element-citation>
</ref>
<ref id="B24-genes-08-00122">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Megaw</surname>
<given-names>A.G.</given-names>
</name>
<name>
<surname>Rapaport</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Avidor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Frenkel</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Davison</surname>
<given-names>A.J.</given-names>
</name>
</person-group>
<article-title>The DNA sequence of the RK strain of human herpesvirus 7</article-title>
<source>Virology</source>
<year>1998</year>
<volume>244</volume>
<fpage>119</fpage>
<lpage>132</lpage>
<pub-id pub-id-type="doi">10.1006/viro.1998.9105</pub-id>
<pub-id pub-id-type="pmid">9581785</pub-id>
</element-citation>
</ref>
<ref id="B25-genes-08-00122">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yunis</surname>
<given-names>J.J.</given-names>
</name>
<name>
<surname>Sawyer</surname>
<given-names>J.R.</given-names>
</name>
</person-group>
<article-title>The striking resemblance of high-resolution G-banded chromosomes of man and chimpanzee</article-title>
<source>Science</source>
<year>1980</year>
<volume>208</volume>
<fpage>1145</fpage>
<lpage>1148</lpage>
<pub-id pub-id-type="doi">10.1126/science.7375922</pub-id>
<pub-id pub-id-type="pmid">7375922</pub-id>
</element-citation>
</ref>
<ref id="B26-genes-08-00122">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pratas</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Silva</surname>
<given-names>R.M.</given-names>
</name>
<name>
<surname>Pinho</surname>
<given-names>A.J.</given-names>
</name>
<name>
<surname>Ferreira</surname>
<given-names>P.J.S.G.</given-names>
</name>
</person-group>
<article-title>An alignment-free method to find and visualise rearrangements between pairs of DNA sequences</article-title>
<source>Sci. Rep.</source>
<year>2015</year>
<volume>5</volume>
<pub-id pub-id-type="doi">10.1038/srep10203</pub-id>
<pub-id pub-id-type="pmid">25984837</pub-id>
</element-citation>
</ref>
<ref id="B27-genes-08-00122">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Winzeler</surname>
<given-names>E.A.</given-names>
</name>
</person-group>
<article-title>Malaria research in the post-genomic era</article-title>
<source>Nature</source>
<year>2008</year>
<volume>455</volume>
<fpage>751</fpage>
<lpage>756</lpage>
<pub-id pub-id-type="doi">10.1038/nature07361</pub-id>
<pub-id pub-id-type="pmid">18843360</pub-id>
</element-citation>
</ref>
<ref id="B28-genes-08-00122">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hoelzer</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Shackelton</surname>
<given-names>L.A.</given-names>
</name>
<name>
<surname>Parrish</surname>
<given-names>C.R.</given-names>
</name>
</person-group>
<article-title>Presence and role of cytosine methylation in DNA viruses of animals</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>2825</fpage>
<lpage>2837</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn121</pub-id>
<pub-id pub-id-type="pmid">18367473</pub-id>
</element-citation>
</ref>
<ref id="B29-genes-08-00122">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clay</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Caccio</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zoubak</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mouchiroud</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Bernardi</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>Human coding and noncoding DNA: Compositional correlations</article-title>
<source>Mol. Phylogenet. Evolut.</source>
<year>1996</year>
<volume>5</volume>
<fpage>2</fpage>
<lpage>12</lpage>
<pub-id pub-id-type="doi">10.1006/mpev.1996.0002</pub-id>
<pub-id pub-id-type="pmid">8673288</pub-id>
</element-citation>
</ref>
<ref id="B30-genes-08-00122">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Duret</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mouchiroud</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Gautier</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores</article-title>
<source>J. Mol. Evolut.</source>
<year>1995</year>
<volume>40</volume>
<fpage>308</fpage>
<lpage>317</lpage>
<pub-id pub-id-type="doi">10.1007/BF00163235</pub-id>
</element-citation>
</ref>
<ref id="B31-genes-08-00122">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fullerton</surname>
<given-names>S.M.</given-names>
</name>
<name>
<surname>Carvalho</surname>
<given-names>A.B.</given-names>
</name>
<name>
<surname>Clark</surname>
<given-names>A.G.</given-names>
</name>
</person-group>
<article-title>Local rates of recombination are positively correlated with GC content in the human genome</article-title>
<source>Mol. Biol. Evolut.</source>
<year>2001</year>
<volume>18</volume>
<fpage>1139</fpage>
<lpage>1142</lpage>
<pub-id pub-id-type="doi">10.1093/oxfordjournals.molbev.a003886</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="genes-08-00122-f001" orientation="portrait" position="float">
<label>Figure 1</label>
<caption>
<p>Map of Human Papillomaviruses (HPVs) for
<italic>k</italic>
= 1. From left to right: HPV4, 5, 7, 9, 49, 92, 96, 136, 140, 154, 178. A bin width of 100 bp was used. Genes are represented by colored bars on the right side of the linear representation of the circular HPV genomes (E1 red, E2 blue, E4 green, E5 yellow, E6 orange, E7 purple, L1 magenta, L2 grey). The orange boxes indicate the boundaries of the three regions with different
<italic>k</italic>
-mer structures (the regions above and below the middle region in the box do not have their own boxes for easier readability).</p>
</caption>
<graphic xlink:href="genes-08-00122-g001"></graphic>
</fig>
<fig id="genes-08-00122-f002" orientation="portrait" position="float">
<label>Figure 2</label>
<caption>
<p>Histogram of monomer contents of different regions relative to the monomer content of the whole genome of HPV4. The top region (
<bold>red</bold>
) is associated with the genes E1, E6, and E7. The central region (
<bold>blue</bold>
) is associated with E2, E4, and E5. The E region (
<bold>purple</bold>
) is the region covered by all early genes. The bottom/L region (
<bold>magenta</bold>
) is associated with the late genes (L1, L2). The two NC regions (
<bold>green</bold>
) on the left and right are the non-coding regions at the top and bottom of the linear representation used in
<xref ref-type="fig" rid="genes-08-00122-f001">Figure 1</xref>
, respectively.</p>
</caption>
<graphic xlink:href="genes-08-00122-g002"></graphic>
</fig>
<fig id="genes-08-00122-f003" orientation="portrait" position="float">
<label>Figure 3</label>
<caption>
<p>Correlation heatmaps between HPV4 and 5 for
<italic>k</italic>
= 1 (
<bold>A</bold>
) and
<italic>k</italic>
= 4 (
<bold>B</bold>
) (bin width of 100 bp). The colored bars at the edges indicate the locations of the top region (
<bold>red</bold>
), central region (
<bold>blue</bold>
), bottom region (
<bold>magenta</bold>
), and NC region (
<bold>green</bold>
) according to gene annotation borders (not by
<italic>k</italic>
-mer content).</p>
</caption>
<graphic xlink:href="genes-08-00122-g003"></graphic>
</fig>
<fig id="genes-08-00122-f004" orientation="portrait" position="float">
<label>Figure 4</label>
<caption>
<p>Correlation heatmaps of different HPV types for different
<italic>k</italic>
values (
<italic>k</italic>
= 1 in
<bold>A</bold>
+
<bold>C</bold>
,
<italic>k</italic>
= 4 in
<bold>B</bold>
+
<bold>D</bold>
), with a bin width of 100 bp. The colored bars at the edges indicate the positions of the top region (red), central region (blue), bottom region (magenta), and NC region (green) according to the borders of associated genes (not by
<italic>k</italic>
-mer content). The peculiar structure of the NC region of HPV7 is highlighted with a green border in (
<bold>A</bold>
,
<italic>k</italic>
= 1). The linear structures between HPV4 and HPV136 are considered as “strong” between HPV7 and HPV4, and are “weak” between HPV7 and HPV136.</p>
</caption>
<graphic xlink:href="genes-08-00122-g004"></graphic>
</fig>
<fig id="genes-08-00122-f005" orientation="portrait" position="float">
<label>Figure 5</label>
<caption>
<p>Heatmap Summary Images of HPV Regions. Shown are the summaries of some heatmaps (
<italic>k</italic>
= 1) of the three regions defined by different
<italic>k</italic>
-mer contents for all of the HPV types analyzed (positions of regions can be found in
<xref ref-type="app" rid="app1-genes-08-00122">Table S1</xref>
). (
<bold>A</bold>
) Good correlation within the first region between all HPV types; (
<bold>B</bold>
) Poor correlation (values around zero or lower) between the first and second regions; (
<bold>C</bold>
) Relatively good correlation in the second region for most values; (
<bold>D</bold>
) Good correlation amongst all of the third regions of HPV. The data points are equally spaced and sorted numerically, therefore no additional information is provided by their horizontal alignment. The labels next to the data points indicate the corresponding HPV types whose regions were correlated.</p>
</caption>
<graphic xlink:href="genes-08-00122-g005"></graphic>
</fig>
<fig id="genes-08-00122-f006" orientation="portrait" position="float">
<label>Figure 6</label>
<caption>
<p>Map of Human Herpesvirus (HHV) genomes for
<italic>k</italic>
= 1, with a bin width of 500 bp. From left to right: HHV1, 2, 3, 5, 6A, 6B, 7, 4 type 1, 4 type 2, 8. Genes associated with poorly conserved regions on HHV6A, 6B, 4 type 1, and 4 type 2 are shown as gray bars at the side of the genomes. In the bottom right corner, a small region around the EBNA genes of both HHV4 types is shown to illustrate small differences in the local
<italic>k</italic>
-mer structure.</p>
</caption>
<graphic xlink:href="genes-08-00122-g006"></graphic>
</fig>
<fig id="genes-08-00122-f007" orientation="portrait" position="float">
<label>Figure 7</label>
<caption>
<p>Map of HHV genomes for relative
<italic>k</italic>
= 2, with a bin width of 500 bp. From left to right: HHV1, 2, 3, 5, 6A, 6B, 7, 4 type 1, 4 type 2, 8. Peculiar patterns associated with high C or C/G content for
<italic>k</italic>
= 1 are marked orange, and the iterative structure at the top of 6A and 6B is marked green.</p>
</caption>
<graphic xlink:href="genes-08-00122-g007"></graphic>
</fig>
<fig id="genes-08-00122-f008" orientation="portrait" position="float">
<label>Figure 8</label>
<caption>
<p>Correlation heatmap between HHV4 type 1 and HHV4 type 2 (
<italic>k</italic>
= 1 and bin width of 500 bp). The gray bars indicate genes associated with regions of low conservation derived with alignment methods. Genes marked with * were not found for the sequences used in our analysis; their positions are therefore only approximated using the data from [
<xref rid="B23-genes-08-00122" ref-type="bibr">23</xref>
].</p>
</caption>
<graphic xlink:href="genes-08-00122-g008"></graphic>
</fig>
<fig id="genes-08-00122-f009" orientation="portrait" position="float">
<label>Figure 9</label>
<caption>
<p>Correlation heatmap between HHV6A and 6B for
<italic>k</italic>
= 1 (
<bold>A</bold>
) and relative
<italic>k</italic>
= 2 (
<bold>B</bold>
) (bin width 500 bp). Genes are represented by grey bars at the borders. Regions with a low identity score in [
<xref rid="B22-genes-08-00122" ref-type="bibr">22</xref>
] are marked orange, extremely low values are red, high identity scores are blue, and extremely high values are green.</p>
</caption>
<graphic xlink:href="genes-08-00122-g009"></graphic>
</fig>
<fig id="genes-08-00122-f010" orientation="portrait" position="float">
<label>Figure 10</label>
<caption>
<p>Correlation heatmaps of HHV6A with HHV7 (
<bold>A</bold>
) and HHV6B with HHV7 (
<bold>B</bold>
) based on
<italic>k</italic>
= 1 (bin width 500 bp).</p>
</caption>
<graphic xlink:href="genes-08-00122-g010"></graphic>
</fig>
<fig id="genes-08-00122-f011" orientation="portrait" position="float">
<label>Figure 11</label>
<caption>
<p>Correlation heatmaps of HHV7 with itself (
<bold>A</bold>
) and HHV6A with HHV6B (
<bold>B</bold>
) based on relative
<italic>k</italic>
= 4 (bin width 500 bp). Beginning and ending regions are highlighted with green boxes.</p>
</caption>
<graphic xlink:href="genes-08-00122-g011"></graphic>
</fig>
<fig id="genes-08-00122-f012" orientation="portrait" position="float">
<label>Figure 12</label>
<caption>
<p>Correlation heatmap between
<italic>Homo sapiens</italic>
chromosome 2 (HSc2) and
<italic>Pan troglodytes</italic>
chromosome 2A (PTc2A) and 2B, based on
<italic>k</italic>
= 4 (bin width 2.5 Mbp) with a highly increased threshold (see scale on the right). The colored boxes indicate regions of association between the chromosomes.</p>
</caption>
<graphic xlink:href="genes-08-00122-g012"></graphic>
</fig>
<table-wrap id="genes-08-00122-t001" orientation="portrait" position="float">
<object-id pub-id-type="pii">genes-08-00122-t001_Table 1</object-id>
<label>Table 1</label>
<caption>
<p>Accession numbers of DNA sequences used.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Species</th>
<th align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Accession Number</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 4 (HPV4)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001457.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 5 (HPV5)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001531.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 7 (HPV7)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001595.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 9 (HPV9)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001596.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 49 (HPV49)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001591.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 92 (HPV92)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_004500.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 96 (HPV96)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_005134.2</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 136 (HPV136)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_017994.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 140 (HPV140)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_017996.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 154 (HPV154)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_021483.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Papillomavirus 178 (HPV178)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_023891.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 1 (HHV1)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001806.2</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 2 (HHV2)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001798.2</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 3 (HHV3)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001348.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 4 type 1 (HHV4 type 1)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_007605.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 4 type 2 (HHV4 type 2)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_009334.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 5 (HHV5)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_006273.2</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 6A (HHV6A)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001664.2</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 6B (HHV6B)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_000898.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 7 (HHV7)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_001716.2</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Human Herpesvirus 8 (HHV8)</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_009333.1</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">
<italic>Homo sapiens</italic>
Chromosome 2</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_000002.12</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">
<italic>Pan troglodytes</italic>
Chromosome 2A</td>
<td align="center" valign="middle" rowspan="1" colspan="1">NC_006469.3</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">
<italic>Pan troglodytes</italic>
Chromosome 2B</td>
<td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">NC_006470.3</td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C92  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000C92  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021