MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000B959 (Pmc/Corpus); previous: 000B958; next: 000B960



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">K-mer-Based Motif Analysis in Insect Species across
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
Genera and Its Application to Species Classification</title>
<author>
<name sortKey="Cserhati, Matyas" sort="Cserhati, Matyas" uniqKey="Cserhati M" first="Matyas" last="Cserhati">Matyas Cserhati</name>
<affiliation>
<nlm:aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xiao, Peng" sort="Xiao, Peng" uniqKey="Xiao P" first="Peng" last="Xiao">Peng Xiao</name>
<affiliation>
<nlm:aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Guda, Chittibabu" sort="Guda, Chittibabu" uniqKey="Guda C" first="Chittibabu" last="Guda">Chittibabu Guda</name>
<affiliation>
<nlm:aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31827584</idno>
<idno type="pmc">6881769</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6881769</idno>
<idno type="RBID">PMC:6881769</idno>
<idno type="doi">10.1155/2019/4259479</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000B95</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B95</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">K-mer-Based Motif Analysis in Insect Species across
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
Genera and Its Application to Species Classification</title>
<author>
<name sortKey="Cserhati, Matyas" sort="Cserhati, Matyas" uniqKey="Cserhati M" first="Matyas" last="Cserhati">Matyas Cserhati</name>
<affiliation>
<nlm:aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xiao, Peng" sort="Xiao, Peng" uniqKey="Xiao P" first="Peng" last="Xiao">Peng Xiao</name>
<affiliation>
<nlm:aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Guda, Chittibabu" sort="Guda, Chittibabu" uniqKey="Guda C" first="Chittibabu" last="Guda">Chittibabu Guda</name>
<affiliation>
<nlm:aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Computational and Mathematical Methods in Medicine</title>
<idno type="ISSN">1748-670X</idno>
<idno type="eISSN">1748-6718</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5′ UTR, intron, and 3′ UTR regions from 58 insect species belonging to three genera of
<italic>Diptera</italic>
that include
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7–9 bp for the whole genome, 5′ and 3′ UTR regions as well as the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together quite visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated tightly with each other as compared to other genera. The genomes of two
<italic>Aedes</italic>
and one
<italic>Culex</italic>
species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature for each of the 58 Dipteran species by making all-to-all pairwise comparison of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Cserhati, M" uniqKey="Cserhati M">M. Cserháti</name>
</author>
<author>
<name sortKey="Tur Czy, Z" uniqKey="Tur Czy Z">Z. Turóczy</name>
</author>
<author>
<name sortKey="Dudits, D" uniqKey="Dudits D">D. Dudits</name>
</author>
<author>
<name sortKey="Gyorgyey, J" uniqKey="Gyorgyey J">J. Györgyey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cserhati, M" uniqKey="Cserhati M">M. Cserhati</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cserhati, M F" uniqKey="Cserhati M">M. F. Cserhati</name>
</author>
<author>
<name sortKey="Mooter, M E" uniqKey="Mooter M">M.-E. Mooter</name>
</author>
<author>
<name sortKey="Peterson, L" uniqKey="Peterson L">L. Peterson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S. Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J. Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pollard, D A" uniqKey="Pollard D">D. A. Pollard</name>
</author>
<author>
<name sortKey="Iyer, V N" uniqKey="Iyer V">V. N. Iyer</name>
</author>
<author>
<name sortKey="Moses, A M" uniqKey="Moses A">A. M. Moses</name>
</author>
<author>
<name sortKey="Eisen, M B" uniqKey="Eisen M">M. B. Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, K" uniqKey="Yang K">K. Yang</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L. Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mchardy, A C" uniqKey="Mchardy A">A. C. McHardy</name>
</author>
<author>
<name sortKey="Martin, H G" uniqKey="Martin H">H. G. Martín</name>
</author>
<author>
<name sortKey="Tsirigos, A" uniqKey="Tsirigos A">A. Tsirigos</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P. Hugenholtz</name>
</author>
<author>
<name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I. Rigoutsos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Diaz, N N" uniqKey="Diaz N">N. N. Diaz</name>
</author>
<author>
<name sortKey="Krause, L" uniqKey="Krause L">L. Krause</name>
</author>
<author>
<name sortKey="Goesmann, A" uniqKey="Goesmann A">A. Goesmann</name>
</author>
<author>
<name sortKey="Niehaus, K" uniqKey="Niehaus K">K. Niehaus</name>
</author>
<author>
<name sortKey="Nattkemper, T W" uniqKey="Nattkemper T">T. W. Nattkemper</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nalbantoglu, O U" uniqKey="Nalbantoglu O">O. U. Nalbantoglu</name>
</author>
<author>
<name sortKey="Way, S F" uniqKey="Way S">S. F. Way</name>
</author>
<author>
<name sortKey="Hinrichs, S H" uniqKey="Hinrichs S">S. H. Hinrichs</name>
</author>
<author>
<name sortKey="Sayood, K" uniqKey="Sayood K">K. Sayood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wiegmann, A" uniqKey="Wiegmann A">A. Wiegmann</name>
</author>
<author>
<name sortKey="Richards, S" uniqKey="Richards S">S. Richards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kiszewski, A" uniqKey="Kiszewski A">A. Kiszewski</name>
</author>
<author>
<name sortKey="Sachs, S E" uniqKey="Sachs S">S. E. Sachs</name>
</author>
<author>
<name sortKey="Mellinger, A" uniqKey="Mellinger A">A. Mellinger</name>
</author>
<author>
<name sortKey="Malaney, P" uniqKey="Malaney P">P. Malaney</name>
</author>
<author>
<name sortKey="Sachs, J" uniqKey="Sachs J">J. Sachs</name>
</author>
<author>
<name sortKey="Spielman, A" uniqKey="Spielman A">A. Spielman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foster, P G" uniqKey="Foster P">P. G. Foster</name>
</author>
<author>
<name sortKey="Bergo, E S" uniqKey="Bergo E">E. S. Bergo</name>
</author>
<author>
<name sortKey="Bourke, B P" uniqKey="Bourke B">B. P. Bourke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krzywinski, J" uniqKey="Krzywinski J">J. Krzywinski</name>
</author>
<author>
<name sortKey="Besansky, N J" uniqKey="Besansky N">N. J. Besansky</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yassin, A" uniqKey="Yassin A">A. Yassin</name>
</author>
<author>
<name sortKey="Orgogozo, V" uniqKey="Orgogozo V">V. Orgogozo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Throckmorton, L H" uniqKey="Throckmorton L">L. H. Throckmorton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Izumitani, H F" uniqKey="Izumitani H">H. F. Izumitani</name>
</author>
<author>
<name sortKey="Kusaka, Y" uniqKey="Kusaka Y">Y. Kusaka</name>
</author>
<author>
<name sortKey="Koshikawa, S" uniqKey="Koshikawa S">S. Koshikawa</name>
</author>
<author>
<name sortKey="Toda, M J" uniqKey="Toda M">M. J. Toda</name>
</author>
<author>
<name sortKey="Katoh, T" uniqKey="Katoh T">T. Katoh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Neafsey, D E" uniqKey="Neafsey D">D. E. Neafsey</name>
</author>
<author>
<name sortKey="Waterhouse, R M" uniqKey="Waterhouse R">R. M. Waterhouse</name>
</author>
<author>
<name sortKey="Abai, M R" uniqKey="Abai M">M. R. Abai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krafsur, E" uniqKey="Krafsur E">E. Krafsur</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gooding, R H" uniqKey="Gooding R">R. H. Gooding</name>
</author>
<author>
<name sortKey="Krafsur, E S" uniqKey="Krafsur E">E. S. Krafsur</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elsen, P" uniqKey="Elsen P">P. Elsen</name>
</author>
<author>
<name sortKey="Amoudi, M A" uniqKey="Amoudi M">M. A. Amoudi</name>
</author>
<author>
<name sortKey="Leclercq, M" uniqKey="Leclercq M">M. Leclercq</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lichtenberg, J" uniqKey="Lichtenberg J">J. Lichtenberg</name>
</author>
<author>
<name sortKey="Yilmaz, A" uniqKey="Yilmaz A">A. Yilmaz</name>
</author>
<author>
<name sortKey="Welch, J D" uniqKey="Welch J">J. D. Welch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M. Tompa</name>
</author>
<author>
<name sortKey="Li, N" uniqKey="Li N">N. Li</name>
</author>
<author>
<name sortKey="Bailey, T L" uniqKey="Bailey T">T. L. Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pesole, G" uniqKey="Pesole G">G. Pesole</name>
</author>
<author>
<name sortKey="Liuni, S" uniqKey="Liuni S">S. Liuni</name>
</author>
<author>
<name sortKey="Grillo, G" uniqKey="Grillo G">G. Grillo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gelfand, Y" uniqKey="Gelfand Y">Y. Gelfand</name>
</author>
<author>
<name sortKey="Rodriguez, A" uniqKey="Rodriguez A">A. Rodriguez</name>
</author>
<author>
<name sortKey="Benson, G" uniqKey="Benson G">G. Benson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Khan, A" uniqKey="Khan A">A. Khan</name>
</author>
<author>
<name sortKey="Fornes, O" uniqKey="Fornes O">O. Fornes</name>
</author>
<author>
<name sortKey="Stigliani, A" uniqKey="Stigliani A">A. Stigliani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Moses, A M" uniqKey="Moses A">A. M. Moses</name>
</author>
<author>
<name sortKey="Pollard, D A" uniqKey="Pollard D">D. A. Pollard</name>
</author>
<author>
<name sortKey="Nix, D A" uniqKey="Nix D">D. A. Nix</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, Q" uniqKey="Zhou Q">Q. Zhou</name>
</author>
<author>
<name sortKey="Bachtrog, D" uniqKey="Bachtrog D">D. Bachtrog</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hao, Y J" uniqKey="Hao Y">Y. J. Hao</name>
</author>
<author>
<name sortKey="Zou, Y L" uniqKey="Zou Y">Y. L. Zou</name>
</author>
<author>
<name sortKey="Ding, Y R" uniqKey="Ding Y">Y. R. Ding</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freitas, L A" uniqKey="Freitas L">L. A. Freitas</name>
</author>
<author>
<name sortKey="Russo, C A" uniqKey="Russo C">C. A. Russo</name>
</author>
<author>
<name sortKey="Voloch, C M" uniqKey="Voloch C">C. M. Voloch</name>
</author>
<author>
<name sortKey="Mutaquiha, O C" uniqKey="Mutaquiha O">O. C. Mutaquiha</name>
</author>
<author>
<name sortKey="Marques, L P" uniqKey="Marques L">L. P. Marques</name>
</author>
<author>
<name sortKey="Schrago, C G" uniqKey="Schrago C">C. G. Schrago</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beebe, N W" uniqKey="Beebe N">N. W. Beebe</name>
</author>
<author>
<name sortKey="Russell, T" uniqKey="Russell T">T. Russell</name>
</author>
<author>
<name sortKey="Burkot, T R" uniqKey="Burkot T">T. R. Burkot</name>
</author>
<author>
<name sortKey="Cooper, R D" uniqKey="Cooper R">R. D. Cooper</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Comput Math Methods Med</journal-id>
<journal-id journal-id-type="iso-abbrev">Comput Math Methods Med</journal-id>
<journal-id journal-id-type="publisher-id">CMMM</journal-id>
<journal-title-group>
<journal-title>Computational and Mathematical Methods in Medicine</journal-title>
</journal-title-group>
<issn pub-type="ppub">1748-670X</issn>
<issn pub-type="epub">1748-6718</issn>
<publisher>
<publisher-name>Hindawi</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">31827584</article-id>
<article-id pub-id-type="pmc">6881769</article-id>
<article-id pub-id-type="doi">10.1155/2019/4259479</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>K-mer-Based Motif Analysis in Insect Species across
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
Genera and Its Application to Species Classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Cserhati</surname>
<given-names>Matyas</given-names>
</name>
<xref ref-type="aff" rid="I1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Xiao</surname>
<given-names>Peng</given-names>
</name>
<xref ref-type="aff" rid="I1"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid" authenticated="false">https://orcid.org/0000-0002-5393-9316</contrib-id>
<name>
<surname>Guda</surname>
<given-names>Chittibabu</given-names>
</name>
<email>babu.guda@unmc.edu</email>
<xref ref-type="aff" rid="I1"></xref>
</contrib>
</contrib-group>
<aff id="I1">Department of Genetics, Cell Biology &amp; Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</aff>
<author-notes>
<fn fn-type="other">
<p>Guest Editor: Dimitrios Vlachakis</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<pub-date pub-type="epub">
<day>15</day>
<month>11</month>
<year>2019</year>
</pub-date>
<volume>2019</volume>
<elocation-id>4259479</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>5</month>
<year>2019</year>
</date>
<date date-type="rev-recd">
<day>18</day>
<month>9</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>9</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2019 Matyas Cserhati et al.</copyright-statement>
<copyright-year>2019</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5′ UTR, intron, and 3′ UTR regions from 58 insect species belonging to three genera of
<italic>Diptera</italic>
that include
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7–9 bp for the whole genome, 5′ and 3′ UTR regions as well as the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together quite visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated tightly with each other as compared to other genera. The genomes of two
<italic>Aedes</italic>
and one
<italic>Culex</italic>
species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature for each of the 58 Dipteran species by making all-to-all pairwise comparison of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.</p>
</abstract>
<funding-group>
<award-group>
<funding-source>University of Nebraska Medical Center</funding-source>
</award-group>
<award-group>
<funding-source>National Institutes of Health</funding-source>
<award-id>P20GM103427</award-id>
<award-id>P30CA036727</award-id>
<award-id>3P30MH062261</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>1. Introduction</title>
<p>DNA k-mers are short recurring elements in the genomes of all living species. These elements are both conserved and diverged across species owing to their functional significance, which makes k-mer signatures ideal for species identification. Several recent studies have described the distribution of statistically significant k-mers in the genomes and several regulatory subregions (core, proximal, and distal promoters, and 3′ and 5′ UTRs) of a small number of plant species as well as modern and archaic humans [
<xref rid="B1" ref-type="bibr">1</xref>
–
<xref rid="B3" ref-type="bibr">3</xref>
]. A k-mer is a short oligonucleotide of length k. K-mers can form part of the core segments of transcription factor binding sites or of regulatory elements that take part in protein binding and gene regulation in different subregions of the genome.</p>
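<p>To make the definition above concrete, the following minimal Python sketch counts every overlapping k-mer in a DNA string. The function name count_kmers, the toy sequence, and the choice to skip windows containing ambiguous bases such as N are illustrative assumptions and are not taken from the original study.</p>

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer of length k in a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if set(kmer).issubset("ACGT"):
            counts[kmer] += 1
    return counts


# Toy sequence for illustration only; real input would be a whole genome.
toy_counts = count_kmers("ACGTACGTNACG", 3)
```

<p>In this toy case, the three windows that overlap the N are ignored, so only unambiguous 3-mers are tallied.</p>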
<p>The present version of the algorithm is an alignment-free k-mer sequence comparison method. Such methods involve statistical analysis and comparison of k-mers between the genomes of two species. These methods vary in the statistical measures applied, such as the comparison of word frequency, incorporation of information theory, universal sequence maps, and the measurement of complexity [
<xref rid="B4" ref-type="bibr">4</xref>
]. The advantages of k-mer-based alignment-free methods over alignment-based phylogenetic algorithms are that they can process the data much faster and eliminate biases that could be induced by using
<italic>a priori</italic>
-defined guide trees when performing the alignment, and subjective selection of alignment scoring parameters, such as gap opening and extension [
<xref rid="B5" ref-type="bibr">5</xref>
,
<xref rid="B6" ref-type="bibr">6</xref>
].</p>
<p>Related methods include metagenomic algorithms that are capable of identifying bacterial taxonomic groups based on metagenomic sequence data. Such methods focus on taxon identification but not taxon comparison. One such algorithm is PhyloPythia, which applies a multiclass support vector machine (SVM) using relative frequency profiles of short oligonucleotides to classify genome fragments as short as 1 kbp into taxonomic ranks between genus and phylum with high specificity [
<xref rid="B7" ref-type="bibr">7</xref>
]. Another method, TACOA, uses abundance profiles to represent whole-genome sequences and uses a k-nearest-neighbor-classification-based method [
<xref rid="B8" ref-type="bibr">8</xref>
]. It correctly classified between 39% and 76% of fragments larger than 800 bp at the genus and superkingdom levels, respectively [
<xref rid="B8" ref-type="bibr">8</xref>
]. These methods perform quite well at oligonucleotide lengths as low as 4 bp. Yet another algorithm, RAIphy, calculates the log odds ratio between the observed and expected occurrence of each k-mer based on Markov assumptions for the k-mer probabilities [
<xref rid="B9" ref-type="bibr">9</xref>
]. This algorithm assigns a genomic fragment to a specific taxon based on a comparison between the two [
<xref rid="B9" ref-type="bibr">9</xref>
]. A possible weakness of these metagenomic methods is that they typically use only very short fragments compared with the whole genome, which may skew their oligonucleotide frequency profiles. In contrast, an analysis of the whole genome gives a more certain and complete picture of these profiles.</p>
<p>Compared to our previous work in this area [
<xref rid="B1" ref-type="bibr">1</xref>
–
<xref rid="B3" ref-type="bibr">3</xref>
], the current improved algorithm scores k-mer significance based on a normalized scale of −1 to +1, which is used to calculate k-mer signatures so as not only to predict statistically significant and biologically relevant k-mers, but also to make the genomes of two given species comparable based on their k-mer signatures. Therefore, the goal of this study is to further develop a k-mer prediction method, which can be used to predict biologically significant k-mers and then to use these k-mers in species comparison and clustering.</p>
<p>The present method is novel in that it measures the Pearson correlation coefficient values of the normalized k-mer relevance scores (not just the k-mer frequencies) between the whole genomes of a number of species and assigns the species to clusters. While the underlying algorithm is similar to that of our previous work, certain changes were made to differentiate statistically significant over- and underrepresented k-mers (see Materials and Methods). Furthermore, the k-mer prediction algorithm is now applied to a wider range of animal species as compared to the plant species in the previous studies. This is because different genera of species provide sufficient diversity for cross-comparison, and the whole-genome sequences for these species were also available.</p>
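<p>As a rough illustration of the comparison described above, the sketch below computes the Pearson correlation coefficient between two k-mer signatures, each represented as a mapping from k-mer to a score on the −1 to +1 scale. The dictionary representation, the function names, the restriction to shared k-mers, and all score values are hypothetical and only illustrate the idea; they do not reproduce the authors' implementation.</p>

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def signature_correlation(sig_a, sig_b):
    """Correlate two signatures (dicts of k-mer to score) over shared k-mers."""
    # Assumption: only k-mers scored in both species are compared.
    shared = sorted(set(sig_a).intersection(sig_b))
    return pearson([sig_a[k] for k in shared], [sig_b[k] for k in shared])


# Hypothetical scores on the -1 to +1 scale, for illustration only.
species_a = {"AAAAAAA": 0.9, "CCCCCCC": -0.5, "GGGGGGG": 0.1}
species_b = {"AAAAAAA": 0.45, "CCCCCCC": -0.25, "GGGGGGG": 0.05}
species_c = {"AAAAAAA": -0.9, "CCCCCCC": 0.5, "GGGGGGG": -0.1}

r_ab = signature_correlation(species_a, species_b)
r_ac = signature_correlation(species_a, species_c)
```

<p>Because the correlation is insensitive to a uniform rescaling of the scores, species_a and species_b correlate perfectly here, whereas species_c, whose scores have the opposite sign, is perfectly anticorrelated with species_a.</p>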
<p>In this study, the whole-genome sequences of 22
<italic>Anopheles</italic>
species, 30
<italic>Drosophila</italic>
species, and six
<italic>Glossina</italic>
species from the NCBI database were downloaded (58 in total), analyzed, and compared with each other. We also included the whole-genome sequences of
<italic>Apis mellifera</italic>
and
<italic>Caenorhabditis briggsae</italic>
as outliers. Aside from the previously mentioned 58 species, the whole genomes of two
<italic>Aedes</italic>
and one
<italic>Culex</italic>
 species were also downloaded and compared with the three Dipteran genera. These species were used as outliers in order to assess how the genomes of such unrelated species compare with those of the species in the three genera under study. With a larger number of species involved, more general inferences can be made about k-mer content, as well as about the phylogenetic aspects of genera in relation to each other. These analyses have become possible because the whole-genome sequences of 110 fly species are now available, which facilitates comparative studies of gene content, genetic mechanisms, and genome structure [
<xref rid="B10" ref-type="bibr">10</xref>
].</p>
<p>
<italic>Anopheles</italic>
is a genus of mosquitoes, belonging to the family Culicidae and suborder Nematocera.
<italic>Anopheles</italic>
has 485 species, 100 of which can transmit malaria via the genus
<italic>Plasmodium</italic>
, and 41 of the 100 species cause human malaria. Fourteen of the 22
<italic>Anopheles</italic>
 species in this study are among those that cause malaria [
<xref rid="B11" ref-type="bibr">11</xref>
]. They are global in distribution and are studied mainly because of their epidemiological importance, with malaria causing around 200 million cases of illness in 2013 [
<xref rid="B12" ref-type="bibr">12</xref>
]. The taxonomy of the subfamily Anophelinae is unstable. For example, a supposedly sister genus,
<italic>Bironella</italic>
, was classified by different groups either to be outside or within
<italic>Anopheles</italic>
. Morphological traits and DNA sequence data were studied to address the relationships between
<italic>Anopheles</italic>
,
<italic>Bironella</italic>
, and
<italic>Chagasia</italic>
, but were not able to produce stable results [
<xref rid="B13" ref-type="bibr">13</xref>
]. Therefore, the genome k-mer analysis of
<italic>Anopheles</italic>
species is a timely task.</p>
<p>The genus
<italic>Drosophila</italic>
(pomace, vinegar, or wine flies) [
<xref rid="B14" ref-type="bibr">14</xref>
] includes multiple subgenera and clades and is widely distributed in the northern hemisphere. According to Throckmorton [
<xref rid="B15" ref-type="bibr">15</xref>
], the faunal disjunction of
<italic>Drosophila</italic>
species between the Old and New World occurred in five lineages (the Scaptodrosophilan, Sophophoran, virilis-repleta, immigrans-tripunctata, and the Hirtodrosophilan) [
<xref rid="B16" ref-type="bibr">16</xref>
]. The fruit fly species
<italic>Drosophila melanogaster</italic>
 is probably the best-known and most widely studied insect in the world, owing to its easy culturing, high reproductive rate, short generation time, and small body size.</p>
<p>On a genomic level,
<italic>Anopheles</italic>
and
<italic>Drosophila</italic>
 have several marked differences. In general, anophelines have greater intron loss compared to drosophilids. They also have more genes that result from gene fusion and fission events, which affect an average of 10.1% of all genes in the genomes of the 10 species with the most contiguous genome assemblies. Furthermore, codon usage is more uniform in anopheline genomes than in drosophilids [
<xref rid="B17" ref-type="bibr">17</xref>
].</p>
<p>Species of the genus
<italic>Glossina</italic>
 (family Glossinidae, suborder Brachycera) or tsetse flies are characterized by difficult culturing, long generation times, and low reproductive rates. These fly species are studied because of the medical and economic importance of their parasitism and their role as vectors of trypanosomes [
<xref rid="B18" ref-type="bibr">18</xref>
]. The genus consists of three subgenera,
<italic>Austenina</italic>
,
<italic>Nemorhina</italic>
, and
<italic>Glossina</italic>
, represented by the species,
<italic>G. fusca</italic>
,
<italic>G. palpalis</italic>
, and
<italic>G. morsitans</italic>
. Their 22 species were previously classified within five species complexes [
<xref rid="B19" ref-type="bibr">19</xref>
]. Extant tsetse flies are distributed in sub-Saharan Africa as well as the Saudi Arabian Peninsula [
<xref rid="B20" ref-type="bibr">20</xref>
].</p>
<p>With a novel method in hand, the goal of this study is to predict biologically important k-mers of lengths 7–9 bp for species identification by statistically examining all lexicographically possible k-mers for the 58
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
 species. K-mers that are 8 bp long correspond to the typical length of DNA recognized by transcription factors. These k-mers therefore have the length of typical core transcription factor binding sites [
<xref rid="B21" ref-type="bibr">21</xref>
,
<xref rid="B22" ref-type="bibr">22</xref>
]. We also allowed for a ±1 bp wobble; this is why we chose a range of 7–9 bp. We achieve this by scoring the k-mers of the 58 species' motifome (defined as all lexicographically possible k-mers of a given length in the genome) based on their whole-genome sequences. Furthermore, with such a whole-genome k-mer signature (WGKS) available for each species (that is, the available scores of all k-mers in the genome of a given species), it will be possible to do an all-versus-all comparison for all the study species. This way, we could assign the species into different species clusters based on high correlations of the WGKS among members of the same cluster.</p>
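<p>For a sense of scale, the number of lexicographically possible k-mers of length k over the four-letter DNA alphabet is 4^k. The short Python sketch below enumerates them in lexicographic order and tallies the sizes for k = 7, 8, and 9. The function and variable names are illustrative assumptions and are not part of the original method.</p>

```python
from itertools import islice, product


def all_kmers(k):
    """Yield all 4**k possible DNA k-mers in lexicographic order."""
    for letters in product("ACGT", repeat=k):
        yield "".join(letters)


# Number of possible k-mers for each length used in the study (7-9 bp).
motifome_sizes = {k: 4 ** k for k in (7, 8, 9)}
total_kmers = sum(motifome_sizes.values())

# First few 7-mers in lexicographic order, for illustration.
first_three = list(islice(all_kmers(7), 3))
```

<p>This gives 16,384 possible 7-mers, 65,536 possible 8-mers, and 262,144 possible 9-mers, or 344,064 k-mers in total to be scored per genome.</p>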
</sec>
<sec id="sec2">
<title>2. Materials and Methods</title>
<sec id="sec2.1">
<title>2.1. Sequence Data</title>
<p>The whole-genome sequences were downloaded from the NCBI Genome database (
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/genome/">https://www.ncbi.nlm.nih.gov/genome/</ext-link>
) for 22
<italic>Anopheles</italic>
species, 30
<italic>Drosophila</italic>
species, and six
<italic>Glossina</italic>
species. A genomic summary of these species including the names of the whole-genome sequences, the number of chromosomes/contigs/scaffolds present in their individual genomes, the size of their genomes, as well as the A/C/G/T % of their genomes is in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
. Two
<italic>Anopheles</italic>
species,
<italic>A. farauti</italic>
and
<italic>A. gambiae</italic>
, have two separate genomes in the database. One of their genomes was broken up into a large number of shorter fragments. Therefore, the genome with a smaller number of contigs was selected. The genome size and the ACGT% have been plotted for each of the 58 species in Supplemental Figures
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
, respectively.</p>
<p>The 5′ and 3′ UTR sequences for seven
<italic>Drosophila</italic>
species and the intron sequences for 12
<italic>Drosophila</italic>
species were downloaded from the FlyBase database (
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.flybase.net/genomes/">ftp://ftp.flybase.net/genomes/</ext-link>
). The 5′ and 3′ UTR sequence sets for
<italic>Anopheles gambiae</italic>
were also downloaded from the UTR database (
<ext-link ext-link-type="uri" xlink:href="http://utrdb.ba.itb.cnr.it/home/download">http://utrdb.ba.itb.cnr.it/home/download</ext-link>
) [
<xref rid="B23" ref-type="bibr">23</xref>
] as a comparison with the
<italic>Drosophila</italic>
species. Summary statistics for these species for 5′ and 3′ UTR regions as well as introns can also be seen in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
.</p>
<p>The 28 mitochondrial genomes were downloaded from the NCBI database:
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/genome/browse">https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles</ext-link>
. The genomes were aligned with the CLUSTALW2 software and trimmed so as to make the alignment less variable at the ends.</p>
</sec>
<sec id="sec2.2">
<title>2.2. K-mer Scoring Algorithm</title>
<p>The original k-mer scoring algorithm was described in the study of Lichtenberg et al. [
<xref rid="B21" ref-type="bibr">21</xref>
] and Cserhati et al. [
<xref rid="B1" ref-type="bibr">1</xref>
]. The algorithm is briefly described below; however, more details on the mathematical background for scoring the significance of a given k-mer can be found in the original publications. A flowchart showing the individual steps of the algorithm with inputs and outputs is shown in
<xref ref-type="fig" rid="fig1">Figure 1</xref>
.</p>
<p>The adapted algorithm used in the analysis is an enumeration algorithm, which counts the total occurrence of all possible k-mers of a given length
<italic>k</italic>
bp. A k-mer is viewed, for example, as the core section of a transcription factor binding site (TFBS) that different kinds of regulatory factors bind to, but it could also be a k-mer with any other kind of functional relevance. The k-mer sequence corresponds to a DNA surface that can specifically bind a regulatory protein. K-mers of lengths 7–9 bp were analyzed in this study (heptamers, octamers, and nonamers). For any length k, there are 4
<sup>k</sup>
combinatorially possible k-mers, which together make up the so-called motifome, as mentioned in the introduction. The longer the k-mer, the more specific its sequence and the better defined its binding surface. The observed occurrence O was calculated for each possible k-mer.</p>
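<p>The enumeration step can be sketched as follows. This is a minimal illustration and not the published script: the function names, the single-strand overlapping sliding window, and the skipping of windows with ambiguous bases are all assumptions made for the example.</p>

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every k-mer in `sequence` with a sliding window of width k."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if set(kmer).issubset("ACGT"):
            counts[kmer] += 1
    return counts


def motifome(k):
    """Return all 4**k possible k-mers in lexicographic order."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


toy_genome = "ACGTACGTNACGT"  # hypothetical toy sequence
observed = count_kmers(toy_genome, 3)
print(observed["ACG"], len(motifome(3)))
```

<p>K-mers of the motifome that never occur in the sequence simply have an observed occurrence of zero in the counter.</p>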
<p>For each genome, the background base pair distribution was also calculated in percent (A/C/G/T%). With this information, the probability of any given k-mer can be calculated under a zero-order Markov assumption, in which each base is treated as independent of its neighbors:
<disp-formula id="EEq1">
<label>(1)</label>
<mml:math id="M1">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")" separators="|">
<mml:mrow>
<mml:mtext>expected</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mrow>
<mml:mstyle>
<mml:mo stretchy="true">∏</mml:mo>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
where
<italic>p</italic>
<sub>
<italic>i</italic>
</sub>
is the background frequency (the percentage occurrence expressed as a fraction) of the base at position
<italic>i</italic>
in the k-mer. These probabilities are multiplied together to get the expected probability of the given k-mer. The expected occurrence
<italic>E</italic>
of a given k-mer is equal to the length of the genome multiplied by the probability of the k-mer:
<disp-formula id="EEq2">
<label>(2)</label>
<mml:math id="M2">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>E</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>genome</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")" separators="|">
<mml:mrow>
<mml:mtext>expected</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
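<p>Equations (1) and (2) can be illustrated with a short sketch. The function names and the toy base frequencies below are hypothetical, and the background frequencies are expressed as fractions rather than percentages.</p>

```python
from collections import Counter
from math import prod


def base_frequencies(sequence):
    """Background frequency of A, C, G and T as fractions summing to 1."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def expected_occurrence(kmer, frequencies, genome_length):
    """E = l_genome * product of the background frequencies of each base."""
    probability = prod(frequencies[base] for base in kmer)
    return genome_length * probability


# Hypothetical AT-rich genome of one million bases.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(expected_occurrence("ACGT", freqs, 1_000_000))
```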
<p>In the previous works, a scoring algorithm was used to measure how much the actual occurrence O of a k-mer deviated from the expected occurrence E:
<disp-formula id="EEq3">
<label>(3)</label>
<mml:math id="M3">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="|">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mi>O</mml:mi>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
where
<italic>S</italic>
<sub>k-mer</sub>
is the calculated score and
<italic>O</italic>
and
<italic>E</italic>
are the observed and expected occurrences of a given k-mer, respectively. The purpose of this score is to filter out meaningless k-mers that could simply occur by chance. If the expected and observed occurrences are about the same, the score is close to 0. However, if the observed occurrence of the k-mer is much greater than the expected occurrence, then the score is close to 1. If the expected occurrence
<italic>E</italic>
is much greater than O, then the score goes to infinity.</p>
<p>In this current study, equation (
<xref ref-type="disp-formula" rid="EEq3">3</xref>
) was modified to differentiate overrepresented and underrepresented k-mers. The new scoring equation is as follows:
<disp-formula id="EEq4">
<label>(4)</label>
<mml:math id="M4">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>With this setup, there are three possible cases:
<disp-formula id="EEq5">
<label>(5)</label>
<mml:math id="M5">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>O</mml:mi>
<mml:mo>≫</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>≈</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mtext>overrepresented k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer,</mml:mtext>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>O</mml:mi>
<mml:mo>≪</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>≈</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mtext>underrepresented k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer,</mml:mtext>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>O</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>≈</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mtext>randomly occurring k</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>mer.</mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>This way, all possible k-mers 7–9 bp long were scored for all 22
<italic>Anopheles</italic>
, all 30
<italic>Drosophila</italic>
, and all six
<italic>Glossina</italic>
species, as well as the two outlier species,
<italic>A. mellifera</italic>
and
<italic>C. briggsae</italic>
.</p>
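<p>The signed score of equation (4) and its three limiting cases can be checked numerically with a small sketch. The function name and the toy counts are hypothetical, and the guard against a zero denominator is an added assumption.</p>

```python
def kmer_score(observed, expected):
    """Signed score (O - E) / (O + E) from equation (4); ranges from -1 to 1."""
    if observed + expected == 0:
        raise ValueError("observed and expected cannot both be zero")
    return (observed - expected) / (observed + expected)


# Overrepresented, underrepresented, random, and twice-as-often-as-expected.
for o, e in [(1000, 1), (1, 1000), (50, 50), (200, 100)]:
    print(o, e, round(kmer_score(o, e), 3))
```

<p>The last pair shows that a k-mer observed twice as often as expected (O = 2E) scores (2E − E)/(2E + E) = 1/3.</p>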
<p>The input at this first stage of the algorithm is the whole-genome sequences of all species in the study. The whole-genome sequence is used to calculate the expected and observed occurrences of all 4
<sup>k</sup>
k-mers. The output is the WGKS for all species, a two-column list including the k-mers and their scores.</p>
<p>The lists of k-mers with their occurrence and score values are available in the Supplemental material online. The Python script that performs the analysis is publicly available on GitHub at
<ext-link ext-link-type="uri" xlink:href="https://github.com/csmatyi/motif_analysis">https://github.com/csmatyi/motif_analysis</ext-link>
for interested users.</p>
</sec>
<sec id="sec2.3">
<title>2.3. Calculation of Correlation between Any Two Species Based on K-mer Scores and Heatmap</title>
<p>The input of this stage of the algorithm is the WGKSs for all species in the study from the previous step. The output is a symmetric matrix of Pearson correlation coefficients (CC) for all species pairs showing how well their WGKSs correlate with one another.</p>
<p>The Pearson correlation coefficient between any two species was calculated based on their k-mer scores for any given k-mer length (here 7–9 bp). For any two species under consideration, all possible k-mers of length
<italic>k</italic>
were sorted lexicographically from
<italic>A</italic>
<sub>
<italic>k</italic>
</sub>
to
<italic>T</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>k</italic>
 = 7–9). If any k-mer was missing from either species, then it was omitted. The correlation coefficient was calculated based on the scores for each k-mer present in both species. These correlation coefficient values were depicted in a heatmap, one for each k-mer length from 7 to 9 bp.</p>
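<p>The correlation step can be sketched as follows, assuming each WGKS is held as a mapping from k-mer to score. The function name and the toy scores are hypothetical; k-mers missing from either species are dropped before the Pearson coefficient is computed, as described above.</p>

```python
from math import sqrt


def pearson_shared(scores_a, scores_b):
    """Pearson correlation over the k-mers scored in both species."""
    shared = sorted(set(scores_a).intersection(scores_b))
    xs = [scores_a[kmer] for kmer in shared]
    ys = [scores_b[kmer] for kmer in shared]
    n = len(shared)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy WGKSs: "AAT" and "TTT" are each missing from one species and are omitted.
species_1 = {"AAA": 0.9, "AAC": -0.2, "AAG": 0.1, "AAT": 0.5}
species_2 = {"AAA": 0.8, "AAC": -0.1, "AAG": 0.2, "TTT": 0.7}
print(round(pearson_shared(species_1, species_2), 3))
```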
<p>We have three genera (
<italic>Drosophila</italic>
,
<italic>Anopheles</italic>
, and
<italic>Glossina</italic>
) in this study. For each genus, the group is defined as all species in that genus and the nongroup as all the remaining species. To test whether the CC values between species within a group differ significantly from the CC values between a species in the group and a species in the nongroup, we performed Welch's
<italic>t</italic>
-test (unequal variance) for each comparison.</p>
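<p>As an illustration of the test statistic, the sketch below computes Welch's t and the Welch-Satterthwaite degrees of freedom from two lists of CC values. The function name and the CC values are hypothetical, and the p value itself is not computed here.</p>

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of each mean
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


within_group = [0.97, 0.98, 0.99, 0.96, 0.98]             # hypothetical CC values
group_vs_nongroup = [0.70, 0.82, 0.65, 0.78, 0.74, 0.80]  # hypothetical CC values
t_stat, dof = welch_t(within_group, group_vs_nongroup)
print(round(t_stat, 2), round(dof, 1))
```

<p>In practice, a library routine such as scipy.stats.ttest_ind with equal_var=False returns the corresponding p value directly.</p>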
</sec>
<sec id="sec2.4">
<title>2.4. Creation of Plots</title>
<p>In the last step of the algorithm, the symmetric CC matrix is transformed into a heatmap to depict the relationship between all species in the study. Barplots, boxplots, and heatmaps were generated using the barplot, boxplot, and heatmap functions in R, version 3.4.3. Phylogenetic trees for the three insect genera were created in R with the upgma command of the phangorn library. The CC values for octamers were subtracted from 1 to obtain distance values, which were then passed to the upgma command. Venn diagrams were created with the online software at
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.psb.ugent.be/webtools/Venn/">http://bioinformatics.psb.ugent.be/webtools/Venn/</ext-link>
.</p>
</sec>
<sec id="sec2.5">
<title>2.5. Phylogenetic Trees</title>
<p>Phylogenetic trees were created for all three insect genera using the phangorn library in R. The distance metric was 1−CC for all species pairs. Trees were created using the UPGMA, WPGMA, and NJ methods, using the upgma, wpgma, and nj commands.</p>
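<p>The study used the phangorn library in R for tree building. Purely to illustrate how a 1 − CC distance matrix leads to a UPGMA topology, a minimal Python sketch is given below. It is not the phangorn implementation; the function name, the taxon labels, and the CC values are hypothetical, and branch lengths are omitted.</p>

```python
def upgma(names, dist):
    """UPGMA clustering; returns the tree topology as nested tuples.

    `dist[a][b]` is the distance (here 1 - CC) between taxa a and b.
    """
    sizes = {name: 1 for name in names}  # number of leaves per cluster
    d = {a: {b: dist[a][b] for b in names if b != a} for a in names}
    while len(sizes) > 1:
        # Pick the closest pair of clusters.
        a, b = min(
            ((x, y) for x in d for y in d[x]),
            key=lambda pair: d[pair[0]][pair[1]],
        )
        merged, size = (a, b), sizes[a] + sizes[b]
        # Leaf-count-weighted mean distance from the new cluster to the rest.
        new_row = {
            c: (sizes[a] * d[a][c] + sizes[b] * d[b][c]) / size
            for c in sizes
            if c not in (a, b)
        }
        for old in (a, b):
            del sizes[old], d[old]
        for c in new_row:
            del d[c][a], d[c][b]
            d[c][merged] = new_row[c]
        d[merged] = dict(new_row)
        sizes[merged] = size
    return next(iter(sizes))


# Hypothetical CC values for three taxa; distance = 1 - CC.
cc = {("A", "B"): 0.98, ("A", "C"): 0.70, ("B", "C"): 0.72}
names = ["A", "B", "C"]
dist = {a: {} for a in names}
for (a, b), value in cc.items():
    dist[a][b] = dist[b][a] = 1 - value
tree = upgma(names, dist)
print(tree)
```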
</sec>
<sec id="sec2.6">
<title>2.6. Taxonomical Comparisons</title>
<p>The classification of
<italic>Drosophila</italic>
species was matched to data (Genus/Subgenus/Group/Complex) in the TaxoDros Database at
<ext-link ext-link-type="uri" xlink:href="http://www.taxodros.uzh.ch/">http://www.taxodros.uzh.ch</ext-link>
.</p>
</sec>
<sec id="sec2.7">
<title>2.7. Acquisition of Tandem Repeat Sequences</title>
<p>Tandem repeat sequences were retrieved from the Tandem Repeats Database at
<ext-link ext-link-type="uri" xlink:href="https://tandem.bu.edu/cgi-bin/trdb/trdb.exe">https://tandem.bu.edu/cgi-bin/trdb/trdb.exe</ext-link>
[
<xref rid="B24" ref-type="bibr">24</xref>
]. Repeats of length 8 with 0 mismatches were selected for
<italic>D. mojavensis</italic>
.</p>
</sec>
<sec id="sec2.8">
<title>2.8. Matching Biologically Relevant Genome K-mers against Position Weight Matrixes in the JASPAR Database</title>
<p>Position weight matrixes (PWM) for 140 transcription factor binding sites (TFBS) in
<italic>D. melanogaster</italic>
were downloaded from the JASPAR website [
<xref rid="B25" ref-type="bibr">25</xref>
] at
<ext-link ext-link-type="uri" xlink:href="http://jaspar.genereg.net/">http://jaspar.genereg.net</ext-link>
. For all 30
<italic>Drosophila</italic>
species, all biologically relevant candidate octamers were matched against all 140 of these PWMs in a sliding-window manner (since the octamers were not always as long as these PWMs). A cutoff of 80% sequence similarity was used to call a match between a given k-mer and a JASPAR motif (which is the default used by the JASPAR database). Each JASPAR database hit is listed next to each k-mer in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
.</p>
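<p>One way to picture the sliding-window match is sketched below. The relative-score definition (summed probability of the k-mer's bases divided by the summed column maxima), the function name, and the toy PWM are assumptions made for illustration; the exact similarity measure used in the study is not specified here.</p>

```python
def best_relative_score(kmer, pwm):
    """Slide `kmer` along `pwm` and return the best relative score (0 to 1).

    `pwm` is a list of columns, each a dict mapping a base to its probability.
    Assumption: similarity is the summed probability of the k-mer's bases
    divided by the summed maximum probability of the same columns.
    """
    k, width = len(kmer), len(pwm)
    if k > width:
        # PWMs shorter than the k-mer are not handled in this sketch.
        return 0.0
    best = 0.0
    for offset in range(width - k + 1):
        window = pwm[offset:offset + k]
        got = sum(column[base] for base, column in zip(kmer, window))
        top = sum(max(column.values()) for column in window)
        best = max(best, got / top)
    return best


# Hypothetical four-column PWM whose consensus is ACGT.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
print(best_relative_score("CGT", toy_pwm) >= 0.8)
```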
<p>Putative biologically relevant genome k-mers were determined for a given species by calculating the mean and standard deviation of the k-mer scores for that species and using the mean + 2 SD value as a cutoff. All k-mers with a score above this limit were predicted to be biologically relevant. This is because in the normal distribution, about 5% of all values lie outside ±1.96 standard deviations of the mean, that is, about 2.5% lie above the +1.96
<italic>z</italic>
-score limit. This cutoff was also used in the k-mer prediction study of modern and archaic humans [
<xref rid="B3" ref-type="bibr">3</xref>
].</p>
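<p>The cutoff can be sketched as follows. The function name and the toy scores are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.</p>

```python
from statistics import mean, stdev


def relevant_kmers(scores, n_sd=2.0):
    """Return k-mers whose score exceeds mean + n_sd * standard deviation."""
    values = list(scores.values())
    cutoff = mean(values) + n_sd * stdev(values)
    return {kmer: s for kmer, s in scores.items() if s > cutoff}


# Nine placeholder k-mers scoring 0.0 and one strongly overrepresented octamer.
toy_scores = {f"kmer{i}": 0.0 for i in range(9)}
toy_scores["GATAAGAT"] = 0.9
print(relevant_kmers(toy_scores))
```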
</sec>
</sec>
<sec id="sec3">
<title>3. Results and Discussion</title>
<sec id="sec3.1">
<title>3.1. Whole-Genome Sequence Analysis</title>
<p>For each species, the whole-genome motifome was enumerated and scored for k-mers of lengths 7–9 bp. Then, the whole-genome k-mer content was compared in an all-versus-all pairwise fashion, yielding correlation coefficients for 1,953 comparisons in total (including comparisons between the same species). These values are available in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
for k-mers of lengths 7–9 bp.</p>
<p>The CC value represents how similar the WGKSs are between two species. Similar WGKSs of two species in turn reflect how similar the genomes of these two species are. A more similar pair of species will have a more similar distribution of k-mers throughout their genomes and thus a higher CC value. This is because on a macroscopic level, the genomes of similar species have not had enough time to diverge and accumulate many mutational differences. Conversely, the distribution of k-mers in the genomes of dissimilar species is different, and thus, their WGKSs are also different. Therefore, they also contain k-mers (for example, transcription factor binding sites) with different functions. For example, in a study of
<italic>D. melanogaster</italic>
,
<italic>D. simulans</italic>
,
<italic>D. erecta</italic>
, and
<italic>D. yakuba</italic>
, 5% of functional Zeste transcription factor binding sites were gained and/or lost compared to the other lineages [
<xref rid="B26" ref-type="bibr">26</xref>
].</p>
<p>The CC matrixes for the 63 species are depicted in the heatmaps in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
based on octamers. In the heatmap, a lighter, yellower color denotes a higher CC value, closer to 1, indicating species pairs with similar WGKSs. Darker, redder colors denote CC values closer to 0, indicating species pairs with unrelated WGKSs. The three genera,
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
separate well from one another and also from the two outliers.</p>
<sec id="sec3.1.1">
<title>3.1.1.
<italic>Drosophila</italic>
</title>
<p>Within the
<italic>Drosophila</italic>
cluster, a smaller subgroup can be seen including eight species:
<italic>D. albomicans</italic>
,
<italic>D. americana</italic>
,
<italic>D. arizonae</italic>
,
<italic>D. grimshawi</italic>
,
<italic>D. mojavensis</italic>
,
<italic>D. nasuta</italic>
,
<italic>D. navajoa</italic>
, and
<italic>D. virilis</italic>
. These species represent a separate monophyletic group within the genus
<italic>Drosophila</italic>
and correspond to the subgenus
<italic>Drosophila</italic>
. All of the other species belong to the subgenus
<italic>Sophophora</italic>
. Within
<italic>Sophophora</italic>
, four species can be seen which themselves form a small, compact group:
<italic>D. miranda</italic>
,
<italic>D. obscura</italic>
,
<italic>D. persimilis</italic>
, and
<italic>D. pseudoobscura</italic>
. These four species belong to the obscura species group within
<italic>Sophophora</italic>
. The phylogenetic tree for the genus
<italic>Drosophila</italic>
can be seen in
<xref ref-type="fig" rid="fig3">Figure 3(a)</xref>
. Trees using the UPGMA, WPGMA, and NJ methods were drawn as described in the Materials and Methods section. The subgenera
<italic>Drosophila</italic>
and
<italic>Sophophora</italic>
separate well from one another. The outlier species
<italic>D. ananassae</italic>
,
<italic>busckii</italic>
, and
<italic>willistoni</italic>
also separate well from all of the other species.</p>
<p>This algorithm was used not only for measuring species similarity based on correlation of k-mer content, but also for predicting biologically relevant genome k-mers in all three Dipteran genera, as described in the Materials and Methods section. A list of all putative biologically relevant octamers is provided in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
. A summary of these predicted k-mers can be seen in
<xref rid="tab1" ref-type="table">Table 1</xref>
.</p>
<p>Shorter k-mers were found to be more conserved, because longer stretches of DNA are harder to conserve. However, the shorter the k-mer, the fewer distinct k-mers can be studied. Shortening k-mers therefore loses information, whereas longer k-mers enlarge the k-mer signature and make the calculation of the CC value more precise. For octamers, the mean CC value was 0.857 (std. dev. 0.07). A
<italic>p</italic>
value (Welch's <italic>t</italic>-test, unequal variance) of 2.3 × 10
<sup>−247</sup>
was calculated for CC values within
<italic>Drosophila</italic>
and CC values between
<italic>Drosophila</italic>
and non-
<italic>Drosophila</italic>
species. A Cohen's
<italic>d</italic>
-value of 3.18 (CI of 3.03–3.32, 95% confidence level) was calculated, which is very high.</p>
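<p>For reference, Cohen's d for two sets of CC values can be sketched as below. The pooled-standard-deviation form and the function name are assumptions, since the exact variant used in the study is not stated.</p>

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled standard deviation of two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


# Hypothetical within-genus and between-genus CC values.
within = [0.86, 0.90, 0.82, 0.88]
between = [0.60, 0.72, 0.55, 0.66]
print(round(cohens_d(within, between), 2))
```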
</sec>
<sec id="sec3.1.2">
<title>3.1.2.
<italic>Drosophila</italic>
Species with High Repeat Content in Their Genomes</title>
<p>Another species of
<italic>Drosophila</italic>
(
<italic>D. busckii</italic>
) is seemingly misplaced on the heatmap, between
<italic>Anopheles</italic>
and
<italic>Glossina</italic>
, away from the other
<italic>Drosophila</italic>
species. This species has the lowest average CC value compared to all other
<italic>Drosophila</italic>
species (0.753, octamers, Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
). However, when the CC value of
<italic>D. busckii</italic>
was compared to that of the six members of the genus
<italic>Glossina</italic>
, the average CC value was 0.699 (looking at octamers). When comparing CC values between
<italic>D. busckii</italic>
and
<italic>Glossina</italic>
versus
<italic>D. busckii</italic>
and all other
<italic>Drosophila</italic>
, the
<italic>p</italic>
value was 0.006. When comparing
<italic>D. busckii</italic>
only with the eight members of this small monophyletic group within
<italic>Drosophila</italic>
, an average CC of 0.891 can be calculated, with a
<italic>p</italic>
value of 4.7 × 10
<sup>−9</sup>
, when comparing CC values between
<italic>D. busckii</italic>
and
<italic>Glossina</italic>
versus
<italic>D. busckii</italic>
and these eight
<italic>Drosophila</italic>
species. The TaxoDros Database also classifies
<italic>D. busckii</italic>
in its own separate species group (the busckii species group, which is part of the
<italic>Dorsilopha</italic>
subgenus). It is not exactly certain why
<italic>D. busckii</italic>
clusters the way it does. Zhou and Bachtrog [
<xref rid="B27" ref-type="bibr">27</xref>
] have observed that 60% of the neo-Y-linked genes have become nonfunctional in
<italic>D. busckii</italic>
. Therefore, it is possible that due to this, the regulatory motifs in their promoter regions have also undergone differential mutations, thereby altering the k-mer content of this species.</p>
<p>
<italic>D. ananassae</italic>
, a species belonging to the
<italic>Sophophora</italic>
subgenus, shows lower resemblance to the other members of the
<italic>Sophophora</italic>
subgenus. This can be seen well in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
.
<italic>D. ananassae</italic>
is the species with the next lowest average CC value (0.800, octamers, Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
) to all other
<italic>Drosophila</italic>
species. This could be due to the fact that its genome has the highest percent content of repetitive elements (24.93%), followed by
<italic>D. willistoni</italic>
(15.57%), also with the fifth lowest average CC value among the drosophilids (0.832, octamers, Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
). A high repetitive element content in a species' genome increases the observed occurrence of many k-mers, thereby skewing their scores. This in turn decreases the CC value between the given species and other species that do not have a high repetitive element content. These two species also have the highest number of pseudo-transfer RNA (pseudo-tRNA) genes (
<italic>D. ananassae</italic>
—165/472;
<italic>D. willistoni</italic>
—164/484). Indeed, of the 81 of the 98 reverse complement k-mers of
<italic>D. ananassae</italic>
with a minimum score of 0.8 and a minimum occurrence of 10,000, only 6–14 were also found in any of the other
<italic>Drosophila</italic>
species, also with a minimum score of 0.8 and a minimum occurrence of 10,000. For
<italic>D. willistoni</italic>
, of the 44 of the top 46 such abundant high-scoring reverse complement k-mers, only 8–22 were also found to be high-scoring in the genome of any other
<italic>Drosophila</italic>
species, except for the genome of
<italic>D. mojavensis</italic>
, which had 30 such high-scoring abundant reverse complement k-mers (see Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
for lists). This indicates that these abundant, high-scoring repetitive k-mers might be the reason for skewing the CC values between
<italic>D. ananassae</italic>
and
<italic>D. willistoni</italic>
and all other species.</p>
<p>
<italic>D. mojavensis</italic>
is another species that clusters well with other species from the subgenus
<italic>Drosophila</italic>
, but still had the third lowest CC value (0.823, octamers, Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
). 592 octamer k-mers for this species without any mismatches were selected from the Tandem Repeats Database (TRDB) [
<xref rid="B24" ref-type="bibr">24</xref>
]. K-mers with a score less than 0.333 were filtered out. According to equation (
<xref ref-type="disp-formula" rid="EEq4">4</xref>
) in the Materials and Methods section, a score of 0.333 corresponds to a k-mer that occurs twice as often as expected, and it therefore serves as a reasonable score cutoff to gauge functional biological relevance. Of these 592 k-mers, 245 (41.4%) from the WGKS of
<italic>D. mojavensis</italic>
had a score higher than or equal to 0.333.
<italic>D. mojavensis</italic>
had 86 abundant, high-scoring reverse complement k-mers (see filtering criteria in the previous paragraph). Five other
<italic>Drosophila</italic>
species had at least half as many (43) such specific k-mers, including
<italic>D. busckii</italic>
, which as seen before had the lowest mean CC value with all other
<italic>Drosophila</italic>
species (Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
).
<italic>D. mojavensis</italic>
and
<italic>D. busckii</italic>
have a CC value of 0.88 (octamer level), which is above both the mean and the median within
<italic>Drosophila</italic>
. This also indicates that the high repeat k-mer content of this species may be skewing its CC values with other
<italic>Drosophila</italic>
species as well.</p>
</sec>
<sec id="sec3.1.3">
<title>3.1.3.
<italic>Glossina</italic>
</title>
<p>The
<italic>Anopheles</italic>
and
<italic>Glossina</italic>
clusters are much more compact than
<italic>Drosophila</italic>
. The mean CC between all six
<italic>Glossina</italic>
species was 0.978 with a std. dev. of 0.02 (looking at octamers, see
<xref rid="tab2" ref-type="table">Table 2</xref>
), whereas the average CC between
<italic>Glossina</italic>
and non-
<italic>Glossina</italic>
species was 0.761 with a std. dev. of 0.143 (
<xref rid="tab2" ref-type="table">Table 2</xref>
). The
<italic>p</italic>
value is 1.5 × 10
<sup>−18</sup>
comparing within-
<italic>Glossina</italic>
CC values versus CC values between
<italic>Glossina</italic>
and non-
<italic>Glossina</italic>
species. A Cohen's
<italic>d</italic>
-value of 8.48 (CI of 7.79–9.18, 95% confidence level) was calculated, which is very high.</p>
<p>Supplementary Figures
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
also show that both the genome size (315–380 Mbp) and the ACGT% are relatively invariable compared to those of the other two Dipteran genera. This might be due to the relatively small number of species examined and also the close relationship of the six species examined. On the heatmap,
<italic>G. brevipalpis</italic>
is correctly classified into its own group, corresponding to the subgenus
<italic>Austenina</italic>
.
<italic>G. morsitans morsitans</italic>
and
<italic>G. pallidipes</italic>
on the heatmap correctly cluster together and belong to the subgenus
<italic>Glossina</italic>
.
<italic>G. fuscipes</italic>
and
<italic>G. palpalis gambiensis</italic>
also cluster together on the heatmap as part of the subgenus
<italic>Nemorhina</italic>
. One species,
<italic>G. austeni</italic>
, however clusters together with the palpalis group, whereas according to NCBI taxonomy it belongs to the subgenus
<italic>Glossina</italic>
. These species relationships are also mirrored in the phylogenetic tree in
<xref ref-type="fig" rid="fig3">Figure 3(b)</xref>
. All three phylogenetic algorithms produce the same species relationships as described previously.</p>
</sec>
<sec id="sec3.1.4">
<title>3.1.4.
<italic>Anopheles</italic>
</title>
<p>The mean CC value calculated for octamers between
<italic>Anopheles</italic>
species was 0.948 (std. dev. 0.023). A
<italic>p</italic>
value of 0.0 was calculated for CC values within
<italic>Anopheles</italic>
and CC values between
<italic>Anopheles</italic>
and non-
<italic>Anopheles</italic>
species (meaning that the
<italic>p</italic>
value was so low that its negative logarithm could not be displayed). A Cohen's
<italic>d</italic>
-value of 3.18 (CI of 5.03–5.47, 95% confidence level) was calculated, which is very high.</p>
<p>Hao et al. [
<xref rid="B28" ref-type="bibr">28</xref>
] performed a phylogenetic analysis based on 13 conserved mitochondrial protein-coding genes from 50 mosquito species. Based on their phylogenetic tree, the species from the
<italic>Anopheles</italic>
cluster as well as the
<italic>Aedes</italic>
and
<italic>C. quinquefasciatus</italic>
clustered similarly in Hao et al.'s study and in the present study. For example,
<italic>A. darlingi</italic>
is located on a separate major branch in the Hao study, and in the present study, this species as well as
<italic>A. albimanus</italic>
grouped together within the
<italic>Anopheles</italic>
cluster, within the subgenus
<italic>Nyssorhynchus</italic>
(within the genus
<italic>Anopheles</italic>
, subfamily Anophelinae). These two species had a CC value of 0.989 (looking at octamers, see Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
), whereas the average CC value between these two species and the rest of the
<italic>Anopheles</italic>
cluster is 0.955 (see Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
). In both the heatmaps and Figure 6 of Hao et al. [
<xref rid="B28" ref-type="bibr">28</xref>
],
<italic>A. arabiensis</italic>
,
<italic>gambiae</italic>
,
<italic>melas</italic>
, and
<italic>merus</italic>
cluster together, corresponding to the
<italic>gambiae</italic>
species complex of the subgenus
<italic>Cellia</italic>
of the genus
<italic>Anopheles</italic>
. In the heatmaps,
<italic>A. farauti</italic>
and
<italic>koliensis</italic>
cluster together, and in the phylogenetic tree of Hao et al.'s study, these two species also cluster on the same major branch. Also, the species
<italic>A. cracens</italic>
and
<italic>dirus</italic>
cluster together closely in both the heatmap and the phylogenetic tree. In the Hao et al. study [
<xref rid="B28" ref-type="bibr">28</xref>
] and also on the heatmap, the species
<italic>A. sinensis</italic>
and
<italic>atroparvus</italic>
also cluster together. These two species are members of the
<italic>Anopheles</italic>
subgenus of the genus
<italic>Anopheles</italic>
.</p>
<p>In another study by Freitas et al. [
<xref rid="B29" ref-type="bibr">29</xref>
], cytochrome oxidase subunits I and II (COI and COII) as well as the 5.8S ribosomal subunit were analyzed to study the phylogenetic relationships between 47
<italic>Anopheles</italic>
species. In their study as well as the present one,
<italic>A. farauti</italic>
,
<italic>koliensis</italic>
and
<italic>punctulatus</italic>
all clustered together; they are part of the
<italic>Anopheles punctulatus</italic>
group, whose members are major malaria vectors in the Southwest Pacific [
<xref rid="B30" ref-type="bibr">30</xref>
].
<italic>A. arabiensis</italic>
,
<italic>gambiae</italic>
,
<italic>melas</italic>
, and
<italic>merus</italic>
also clustered closely in both studies, just as they did in Hao et al.'s study [
<xref rid="B28" ref-type="bibr">28</xref>
]. However, whereas in the heatmaps
<italic>A. dirus</italic>
and
<italic>stephensi</italic>
clustered together, they were located on separate branches of the phylogenetic trees in both the Hao and the Freitas studies.</p>
<p>This difference might be due to both the Hao and Freitas studies having analyzed only the mitochondrial genome as opposed to the whole-genome studies in this paper. Nevertheless, these close clusterings between all three studies are remarkable in that very similar results were derived from analyzing a handful of mitochondrial proteins as well as from a global sequence analysis of the whole genome. The phylogenetic tree for
<italic>Anopheles</italic>
can be seen in
<xref ref-type="fig" rid="fig3">Figure 3(c)</xref>
. The UPGMA and WPGMA trees look similar to one another, whereas the NJ tree looks somewhat different.</p>
</sec>
<sec id="sec3.1.5">
<title>3.1.5. Species Relationships Based on Alignment of Mitochondrial DNA</title>
<p>The mitochondrial whole-genome sequence for 29 species (19
<italic>Anopheles</italic>
, 6
<italic>Drosophila</italic>
, 2
<italic>Aedes</italic>
, 1
<italic>Culex</italic>
, and 1
<italic>Apis</italic>
) was downloaded from the NCBI database. These sequences were aligned, and the percent identity was then calculated for every pair of species. These identity values are depicted in
<xref ref-type="fig" rid="fig4">Figure 4</xref>
and are also available in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
.</p>
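<p>The pairwise identity calculation described above can be sketched as follows. This is a minimal illustration only: it assumes the sequences are already aligned to equal length, and it skips any column in which either sequence has a gap, which is one of several possible conventions and not necessarily the one used in this study. The function names and the toy alignment are hypothetical.</p>
<preformat>
```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(alignment):
    """Return {(name_i, name_j): percent identity} for every unordered pair."""
    names = sorted(alignment)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            result[(first, second)] = percent_identity(
                alignment[first], alignment[second]
            )
    return result


# Toy alignment (hypothetical sequences, not data from the study).
toy = {
    "sp1": "ACGT-ACGTA",
    "sp2": "ACGTTACGTA",
    "sp3": "ACCT-ACGAA",
}
matrix = identity_matrix(toy)
```
</preformat>
<p>In this toy example, sp1 and sp2 are identical over the nine ungapped columns, whereas sp1 and sp3 differ at two of them.</p>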
<p>
<xref ref-type="fig" rid="fig4">Figure 4</xref>
depicts species from the genera
<italic>Anopheles</italic>
and
<italic>Drosophila</italic>
segregating into two well-defined groups. The
<italic>p</italic>
value for
<italic>Anopheles</italic>
is 6.2 × 10
<sup>−81</sup>
, whereas for
<italic>Drosophila</italic>
it is 8.9 × 10
<sup>−9</sup>
. The two
<italic>Aedes</italic>
species group together, along with
<italic>Culex quinquefasciatus</italic>
. The outlier species,
<italic>Apis mellifera</italic>
, groups well away from all of the other species.</p>
<p>Within the genus
<italic>Drosophila</italic>
, only
<italic>D. albomicans</italic>
belongs to the subgenus
<italic>Sophophora</italic>
, whereas the other five species belong to the subgenus
<italic>Drosophila</italic>
, supporting previous results coming from the analysis of the WGKS.</p>
<p>Within the genus
<italic>Anopheles</italic>
, four species,
<italic>A. arabiensis</italic>
,
<italic>gambiae</italic>
,
<italic>melas</italic>
, and
<italic>merus</italic>
are very similar according to their mtDNA, which reinforces the previous results from the analysis of the WGKS. Another group of species that cluster tightly together consists of
<italic>A. farauti</italic>
,
<italic>punctulatus</italic>
,
<italic>cracens</italic>
, and
<italic>dirus</italic>
. These four species also cluster tightly on
<xref ref-type="fig" rid="fig2">Figure 2</xref>
. Five other species,
<italic>A. culicifacies</italic>
,
<italic>epiroticus</italic>
,
<italic>funestus</italic>
,
<italic>minimus</italic>
, and
<italic>stephensi</italic>
, form a further group in the mtDNA comparison, but these species do not cluster together in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
. This difference could simply arise because the k-mer profiles of the 28 species in question here reflect the k-mer distribution of the mtDNA only, and not that of the whole genome.</p>
</sec>
<sec id="sec3.1.6">
<title>3.1.6. Classification of New Species Based on WGKS</title>
<p>Since the taxonomy of many insect groups is in flux, it was interesting to see how several species from different genera were classified according to this algorithm. The WGKS of two
<italic>Aedes</italic>
species,
<italic>A. aegypti</italic>
and
<italic>A. albopictus</italic>
, and of
<italic>Culex quinquefasciatus</italic>
(all three being mosquito species) were analyzed and compared to species of
<italic>Anopheles</italic>
, to see if they form a separate group, or if they possibly form a monophyletic group together with
<italic>Anopheles</italic>
. Whole-genome sequences for species in the genera
<italic>Bironella</italic>
or
<italic>Chagasia</italic>
, the two closest genera to
<italic>Anopheles</italic>
in the subfamily Anophelinae, were not available at NCBI. In
<xref ref-type="fig" rid="fig2">Figure 2</xref>
, all three species separate from the genus
<italic>Anopheles</italic>
. The two
<italic>Aedes</italic>
species have an average CC of 0.651 with
<italic>Anopheles</italic>
, whereas they have a CC of 0.847 between themselves (when looking at octamers, see Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
). When comparing the CC values between
<italic>Aedes</italic>
and
<italic>Anopheles</italic>
to the CC values within the genus
<italic>Anopheles</italic>
itself (Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
), a
<italic>p</italic>
value of 9.1 × 10
<sup>−54</sup>
can be calculated. Thus, it can be concluded that
<italic>Aedes</italic>
form a group separate from
<italic>Anopheles</italic>
. When comparing
<italic>C. quinquefasciatus</italic>
to
<italic>Anopheles</italic>
, the mean CC value is 0.706. This is significantly different from the mean CC of
<italic>Anopheles</italic>
species among themselves (0.948, see
<xref rid="tab2" ref-type="table">Table 2</xref>
, also looking at octamers). The
<italic>p</italic>
value between these two sets of CC values is 5.7 × 10
<sup>−23</sup>
. Therefore, it can be inferred that
<italic>C. quinquefasciatus</italic>
is also separate from the genus
<italic>Anopheles</italic>
.</p>
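<p>The CC comparisons above rest on correlating k-mer profiles of two genomes. The sketch below illustrates the idea under a simplifying assumption: raw overlapping k-mer counts stand in for the score-based WGKS used in this paper, so the numbers it produces are not comparable to the reported CC values. All function names and the toy sequences are hypothetical.</p>
<preformat>
```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(sequence, k):
    """Count overlapping k-mers made only of A, C, G, T."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer).issubset("ACGT"):
            counts[kmer] += 1
    return counts


def signature(sequence, k):
    """Fixed-order vector of counts over all 4**k possible k-mers.

    Assumption: raw counts, not the score-based signature of the paper.
    """
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Toy sequences (hypothetical); real genomes would be read from FASTA files.
seq_a = "ACGTACGTTTGACCA"
seq_b = "ACGTACGTTTGACCT"
cc = pearson(signature(seq_a, 3), signature(seq_b, 3))
```
</preformat>
<p>Collecting such CC values for all within-genus pairs and all between-genus pairs gives the two sets of values whose difference is then tested statistically.</p>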
<p>This shows that the present method is useful for classifying organisms that are as yet unknown and for which only the whole-genome sequence is available. Comparing WGKS is more useful than methods that analyze only groups of genes, which make up only a fraction of the entire genome sequence. Phylogenies based on different genes often conflict with each other [
<xref rid="B6" ref-type="bibr">6</xref>
].</p>
</sec>
<sec id="sec3.1.7">
<title>3.1.7. Divergence and Similarities between Genera</title>
<p>To measure the divergence of the genera from each other, boxplots were created comparing the range of CC values within the genera
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
; between the three genera themselves; between
<italic>C. briggsae</italic>
and each of the three insect genera; and between
<italic>A. mellifera</italic>
and each of the three insect genera. This was done for k-mers of size 7–9 bp, and the boxplots can be seen in
<xref ref-type="fig" rid="fig5">Figure 5</xref>
(octamers) and Supplementary Figures
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
(heptamers and nonamers). The minimum, median, mean, and maximum CC values for each of the seven comparisons, as well as their standard deviations, can be seen in
<xref rid="tab2" ref-type="table">Table 2</xref>
.</p>
<p>The mean CC values within the three genera are much higher than for all other comparisons (e.g., 0.955 within
<italic>Anopheles</italic>
and 0.869 within
<italic>Drosophila</italic>
for heptamers,
<xref rid="tab2" ref-type="table">Table 2</xref>
), whether they are between the two genera
<italic>Anopheles</italic>
and
<italic>Drosophila</italic>
or between either one of the two outlier species and either one of these two genera. This trend is consistent for all k-mer lengths. The minimum, mean, median, and maximum CC values decrease with increasing motif length, but this is because the number of possible k-mers grows exponentially with motif length (fourfold for each added base), and therefore, CC values also tend to decrease. These tendencies all illustrate the clear genomic content differences between the genera
<italic>Drosophila</italic>
,
<italic>Anopheles</italic>
, and
<italic>Glossina</italic>
.</p>
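<p>The growth of the k-mer space with motif length mentioned above can be made concrete with a short calculation. The function name is hypothetical; the only assumption is a four-letter DNA alphabet.</p>
<preformat>
```python
def kmer_space(k, alphabet_size=4):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


for k in (7, 8, 9):
    print(k, kmer_space(k))
# prints:
# 7 16384
# 8 65536
# 9 262144
```
</preformat>
<p>Each additional base multiplies the number of possible k-mers by four, so a nonamer profile has 16 times as many entries as a heptamer profile.</p>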
<p>It was also interesting to see which nonrepetitive genome k-mers (i.e., k-mers that do not consist of dimer or trimer repeats) were the most common among
<italic>Anopheles, Drosophila</italic>
, and
<italic>Glossina</italic>
for k-mers of lengths 7–9 bp. For this, all k-mers with a score of at least 0.5 (such k-mers occur three times more frequently than expected) and which occurred in at least half of all species in a given genus were selected (at least 11
<italic>Anopheles</italic>
species and at least 15
<italic>Drosophila</italic>
species, but at least 5 species in
<italic>Glossina</italic>
). These k-mers are listed in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
for lengths 7–9 bp. Common k-mers between all three genera are also listed in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and are visualized in
<xref ref-type="fig" rid="fig6">Figure 6</xref>
(octamers) and Supplementary Figures
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
(heptamers and nonamers).</p>
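<p>The selection described above (nonrepetitive k-mers with a score of at least 0.5 that occur in at least a given number of species) can be sketched as follows. This is an illustration under stated assumptions: the test for dimer and trimer repeats, which also flags homopolymers and allows a truncated final repeat unit, is a guess at a reasonable definition, and the function names, input layout, and toy scores are hypothetical.</p>
<preformat>
```python
def is_short_repeat(kmer):
    """True if the k-mer is a run of one repeated 2- or 3-base unit.

    Assumption: a truncated final unit is allowed (e.g. ACACACA), and
    homopolymers count as dimer repeats.
    """
    for unit_len in (2, 3):
        unit = kmer[:unit_len]
        repeats = -(-len(kmer) // unit_len)  # ceiling division
        if (unit * repeats)[:len(kmer)] == kmer:
            return True
    return False


def common_kmers(scores_by_species, min_score, min_species):
    """Nonrepetitive k-mers scoring at least min_score in at least min_species species."""
    support = {}
    for scores in scores_by_species.values():
        for kmer, score in scores.items():
            if score >= min_score and not is_short_repeat(kmer):
                support[kmer] = support.get(kmer, 0) + 1
    return sorted(k for k, n in support.items() if n >= min_species)


# Toy scores (hypothetical values, not results from the study).
toy_scores = {
    "species_1": {"ACGTGCA": 0.7, "ACACACA": 0.9, "TTGACCA": 0.2},
    "species_2": {"ACGTGCA": 0.6, "ACACACA": 0.8, "TTGACCA": 0.6},
    "species_3": {"ACGTGCA": 0.4, "TTGACCA": 0.9},
}
selected = common_kmers(toy_scores, min_score=0.5, min_species=2)
```
</preformat>
<p>Here the repeat ACACACA is discarded despite its high scores, and the two remaining heptamers each pass the threshold in two of the three toy species.</p>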
</sec>
</sec>
<sec id="sec3.2">
<title>3.2. Analysis of 5′ and 3′ UTRs</title>
<p>Besides the whole genome, k-mer analysis was done for 5′ and 3′ UTRs for seven
<italic>Drosophila</italic>
species (
<italic>D. ananassae</italic>
,
<italic>erecta</italic>
,
<italic>grimshawi</italic>
,
<italic>melanogaster</italic>
,
<italic>mojavensis</italic>
,
<italic>pseudoobscura</italic>
, and
<italic>simulans</italic>
) and also
<italic>Anopheles gambiae</italic>
as an outlier species which was compared to these
<italic>Drosophila</italic>
species. Besides the WGKS, a species' 5PKS, 3PKS, and IKS (5′ k-mer signature, 3′ k-mer signature, and intron k-mer signature) can also be defined. Sequence statistics for the selected
<italic>Drosophila</italic>
species and
<italic>A. gambiae</italic>
are available in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
. However, since 5′ and 3′ UTR sequences were not available for many species besides
<italic>Drosophila</italic>
, we could only do a restricted analysis, instead of analyzing species relationships on a heatmap as in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
.</p>
<p>Figures
<xref ref-type="fig" rid="fig7">7(a)</xref>
and
<xref ref-type="fig" rid="fig7">7(b)</xref>
depict, as boxplots, the CC ranges both within the genus
<italic>Drosophila</italic>
and between
<italic>A. gambiae</italic>
and the genus
<italic>Drosophila</italic>
for k-mer lengths 7–9 bp for 5′ and 3′ UTRs, respectively. Both figures show that the CC range for comparisons between
<italic>A. gambiae</italic>
and
<italic>Drosophila</italic>
is much lower than that for within
<italic>Drosophila</italic>
itself. This difference between the two genera is more pronounced in 3′ UTRs than in 5′ UTRs. The CC values are provided as matrices for 5′ and 3′ UTRs in Supplementary Files
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
, respectively.</p>
<p>Summary statistics for CC values within
<italic>Drosophila</italic>
and between
<italic>A. gambiae</italic>
and
<italic>Drosophila</italic>
can be seen in
<xref rid="tab3" ref-type="table">Table 3</xref>
. The
<italic>p</italic>
values for 5′ and 3′ UTRs for k-mer lengths 7–9 bp are all statistically significant at the 5% level. This indicates that the same kind of genetic difference between the two genera is also present in the 5′ and 3′ UTR regions.
<xref ref-type="fig" rid="fig8">Figure 8</xref>
shows the number of 5′ UTR nonrepetitive k-mers that are common to all seven
<italic>Drosophila</italic>
species: 104, 602, and 2128 for motif lengths 7, 8, and 9 bp, respectively. For 3′ UTRs, there are 70, 451, and 1396 such motifs of lengths 7, 8, and 9 bp, respectively. This reflects the lower overall CC range for 3′ UTR k-mers than for 5′ UTR k-mers seen earlier (Figures
<xref ref-type="fig" rid="fig7">7(a)</xref>
and
<xref ref-type="fig" rid="fig7">7(b)</xref>
). The number of common k-mers increases in a roughly proportionate manner as the length of the k-mer increases, due to increasing k-mer space (e.g., there are more possible nonamers than octamers). These common k-mers are listed in Supplementary Files
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
and
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
for 5′ and 3′ UTRs.</p>
</sec>
<sec id="sec3.3">
<title>3.3. Analysis of Introns</title>
<p>The intron regions of twelve
<italic>Drosophila</italic>
species (
<italic>ananassae</italic>
,
<italic>erecta</italic>
,
<italic>grimshawi</italic>
,
<italic>melanogaster</italic>
,
<italic>mojavensis</italic>
,
<italic>persimilis</italic>
,
<italic>pseudoobscura</italic>
,
<italic>sechellia</italic>
,
<italic>simulans</italic>
,
<italic>virilis</italic>
,
<italic>willistoni</italic>
, and
<italic>yakuba</italic>
) were analyzed in a way similar to the whole genome as well as the 5′ and 3′ UTR regions. Intron sequences were available for only
<italic>Drosophila</italic>
; therefore, we could not perform any species comparisons between this genus and
<italic>Anopheles</italic>
or
<italic>Glossina</italic>
.</p>
<p>
<xref ref-type="fig" rid="fig9">Figure 9</xref>
depicts the range of CC values for k-mer lengths 7–9 bp for these twelve species. Summary statistics for all k-mer lengths are available in
<xref rid="tab3" ref-type="table">Table 3</xref>
. The number of common k-mers to all twelve species is depicted in
<xref ref-type="fig" rid="fig8">Figure 8</xref>
(37, 344, and 1890 for motif lengths 7, 8, and 9 bp, respectively), and these k-mers are listed in Supplementary
<xref ref-type="supplementary-material" rid="supplementary-material-1"></xref>
, where the CC matrix is also available for k-mer lengths 7–9 bp.</p>
<p>As with the 5′ and 3′ UTR regions, the number of common k-mers increases with k-mer length from 7 to 9 bp. For heptamers and octamers, but not for nonamers, the number of common intron k-mers is lower than the number of common 3′ UTR k-mers, which in turn is lower than the number of common 5′ UTR k-mers (see
<xref ref-type="fig" rid="fig8">Figure 8</xref>
). This indicates that, for these two k-mer lengths, as the size of the sequence regions decreases, the number of common k-mers increases.</p>
</sec>
</sec>
<sec id="sec4">
<title>4. Conclusion</title>
<p>The motif prediction algorithm presented in previous works has been refined, expanded, and applied to a much larger selection of species, allowing broader inferences to be made from the analysis. Furthermore, once the WGKS of an as yet unknown species has been defined, the species can be classified into existing taxonomic categories. This algorithm is one more tool with which to characterize and classify new species, as in the case of
<italic>A. aegypti</italic>
and
<italic>albopictus</italic>
and
<italic>C. quinquefasciatus</italic>
. The WGKS, but also motif signatures from other subgenomic regions, can be useful in separating species into individual genera, sharply separated from one another. We believe that this algorithm can be put to use to not only predict biologically relevant whole-genome and subgenomic motifs, but also cluster species into taxonomic groups based on similarities and differences among their motif signatures. This algorithm has only been used to analyze insect species, but could also be applied to compare species from other phyla.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>This work was supported by the development funds to CG from the University of Nebraska Medical Center. The authors are thankful to the Bioinformatics and Systems Biology core that receives partial support from NIH grants (P20GM103427, P30CA036727, and 3P30MH062261). The authors wish to thank Frank Sherwin for his professional knowledge on the ecological differences in
<italic>Drosophila</italic>
species and how to classify them.</p>
</ack>
<sec sec-type="data-availability">
<title>Data Availability</title>
<p>Processed data and results are available in the supplementary files. Raw genomic data will be made available upon request.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<sec>
<title>Authors' Contributions</title>
<p>MC designed the entire analysis, performed all of the calculations, created all of the figures, and wrote the manuscript. CG and PX contributed to the conception of this work and made essential improvements to the manuscript.</p>
</sec>
<sec sec-type="supplementary-material" id="supplementary-material-1">
<title>Supplementary Materials</title>
<supplementary-material content-type="local-data" id="supp-1">
<label>Supplementary Materials</label>
<caption>
<p>Supplemental Figure 1: genome size for all 58 studied species. The size of the genome of each species is given in Mbp.
<italic>Anopheles</italic>
species colored in blue,
<italic>Drosophila</italic>
species in red, and
<italic>Glossina</italic>
species in green. Supplemental Figure 2: ACGT% content for all 58 studied species. The ACGT% is given for each species as a stacked barplot, with the values adding up to one. Supplemental Figure 3(a): heatmap depicting species relationships between the 63 species included in the analysis based on the whole-genome k-mer signature for heptamers. Supplemental Figure 3(b): heatmap depicting species relationships between the 63 species included in the analysis based on the whole-genome k-mer signature for nonamers. Supplemental Figure 4(a): Pearson correlation coefficient between species of
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
as well as the two control species,
<italic>A. mellifera</italic>
and
<italic>C. briggsae</italic>
for heptamers. Supplemental Figure 4(b): Pearson correlation coefficient between species of
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
as well as the two control species,
<italic>A. mellifera</italic>
and
<italic>C. briggsae</italic>
for nonamers. Supplemental Figure 5(a): common nonrepetitive (nondimer and nontrimer) heptamer content between 11
<italic>Anopheles</italic>
, 15
<italic>Drosophila</italic>
, and 5
<italic>Glossina</italic>
species. Each included heptamer had a minimum score of 0.5. Supplemental Figure 5(b): common nonrepetitive (nondimer and nontrimer) nonamer content between 11
<italic>Anopheles</italic>
, 15
<italic>Drosophila</italic>
and 5
<italic>Glossina</italic>
species. Each included nonamer had a minimum score of 0.5. Supplemental File 1: statistics of whole genome, 5′ and 3′ UTR, and intron sequences for the studied species. The species, file name, number of contigs, genome/subgenomic region size, and ACGT% are provided for each species. The pairwise sequence identity for all species pairs is included for the mitochondrial genome comparisons. Supplemental File 2: Pearson correlation matrix for whole-genome k-mer signatures. The Pearson correlation matrix between all pairs of the studied species is provided for k-mers of lengths 7–9 bp. Supplemental File 3: predicted biologically relevant whole-genome k-mers (octamers). Biologically relevant octamers were predicted by the k-mer prediction algorithm for the 22
<italic>Anopheles</italic>
, 30
<italic>Drosophila</italic>
, and six
<italic>Glossina</italic>
species. For the
<italic>Drosophila</italic>
species, all predicted octamers were matched against 140
<italic>Drosophila</italic>
PWMs from the JASPAR database with a cutoff of 0.8. Supplemental File 4: high-scoring repetitive motif content of three
<italic>Drosophila</italic>
species. High-scoring and high-occurring octamer palindrome k-mers are listed for
<italic>Drosophila ananassae</italic>
,
<italic>mojavensis</italic>
, and
<italic>willistoni</italic>
. These k-mers were matched with the k-mers from other
<italic>Drosophila</italic>
species for comparison. Nonamers from
<italic>D. mojavensis</italic>
were also matched with motifs from the TRDB. Supplemental File 5: nonrepetitive frequent k-mers from the three genera. Nonrepetitive k-mers (i.e., not dimer or trimer repeats) of lengths 7–9 bp that were found in a majority of the species from the three genera are listed. K-mers common to all three genera were also noted. Supplemental File 6: Pearson correlation matrix for 5′ UTR k-mer signatures. The Pearson correlation matrix between all pairs of species involving
<italic>A. gambiae</italic>
and seven
<italic>Drosophila</italic>
species is provided for k-mers of lengths 7–9 bp in the 5′ UTR regions. Common heptamers, octamers, and nonamers are also provided. Supplemental File 7: Pearson correlation matrix for 3′ UTR k-mer signatures. The Pearson correlation matrix between all pairs of species involving
<italic>A. gambiae</italic>
and seven
<italic>Drosophila</italic>
species is provided for k-mers of lengths 7–9 bp in the 3′ UTR regions. Common heptamers, octamers, and nonamers are also provided. Supplemental File 8: Pearson correlation matrix for intron k-mer signatures. The Pearson correlation matrix between all pairs of species involving 12
<italic>Drosophila</italic>
species is provided for k-mers of lengths 7–9 bp in the intron regions. Common heptamers, octamers, and nonamers are also provided.</p>
</caption>
<media xlink:href="4259479.f1.zip">
</media>
</supplementary-material>
</sec>
<ref-list>
<ref id="B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cserháti</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Turóczy</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Dudits</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Györgyey</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>The rice word landscape: a detailed catalogue of the rice motif content in the non-coding regions</article-title>
<source>
<italic toggle="yes">OMICS: A Journal of Integrative Biology</italic>
</source>
<year>2012</year>
<volume>16</volume>
<issue>6</issue>
<fpage>334</fpage>
<lpage>342</lpage>
<pub-id pub-id-type="doi">10.1089/omi.2011.0056</pub-id>
<pub-id pub-id-type="other">2-s2.0-84862501108</pub-id>
<pub-id pub-id-type="pmid">22702246</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cserhati</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Motif content comparison between monocot and dicot species</article-title>
<source>
<italic toggle="yes">Genomics Data</italic>
</source>
<year>2015</year>
<volume>3</volume>
<fpage>128</fpage>
<lpage>136</lpage>
<pub-id pub-id-type="doi">10.1016/j.gdata.2014.12.006</pub-id>
<pub-id pub-id-type="other">2-s2.0-84922895481</pub-id>
<pub-id pub-id-type="pmid">26484161</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cserhati</surname>
<given-names>M. F.</given-names>
</name>
<name>
<surname>Mooter</surname>
<given-names>M.-E.</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>L.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Motifome comparison between modern human, Neanderthal and Denisovan</article-title>
<source>
<italic toggle="yes">BMC Genomics</italic>
</source>
<year>2018</year>
<volume>19</volume>
<issue>1</issue>
<fpage>472</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-018-4710-1</pub-id>
<pub-id pub-id-type="other">2-s2.0-85048714105</pub-id>
</element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison—a review</article-title>
<source>
<italic toggle="yes">Bioinformatics</italic>
</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>523</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="other">2-s2.0-0037342499</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pollard</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Iyer</surname>
<given-names>V. N.</given-names>
</name>
<name>
<surname>Moses</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>M. B.</given-names>
</name>
</person-group>
<article-title>Widespread discordance of gene trees with species tree in
<italic>Drosophila</italic>
: evidence for incomplete lineage sorting</article-title>
<source>
<italic toggle="yes">PLoS Genetics</italic>
</source>
<year>2006</year>
<volume>2</volume>
<issue>10</issue>
<fpage>e173</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pgen.0020173</pub-id>
<pub-id pub-id-type="other">2-s2.0-33750437728</pub-id>
</element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
</person-group>
<article-title>Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction</article-title>
<source>
<italic toggle="yes">Nucleic Acids Research</italic>
</source>
<year>2008</year>
<volume>36</volume>
<issue>5</issue>
<fpage>e33</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn075</pub-id>
<pub-id pub-id-type="other">2-s2.0-41149132297</pub-id>
<pub-id pub-id-type="pmid">18296485</pub-id>
</element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McHardy</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Martín</surname>
<given-names>H. G.</given-names>
</name>
<name>
<surname>Tsirigos</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Rigoutsos</surname>
<given-names>I.</given-names>
</name>
</person-group>
<article-title>Accurate phylogenetic classification of variable-length DNA fragments</article-title>
<source>
<italic toggle="yes">Nature Methods</italic>
</source>
<year>2007</year>
<volume>4</volume>
<issue>1</issue>
<fpage>63</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth976</pub-id>
<pub-id pub-id-type="other">2-s2.0-33845957530</pub-id>
<pub-id pub-id-type="pmid">17179938</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Diaz</surname>
<given-names>N. N.</given-names>
</name>
<name>
<surname>Krause</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Goesmann</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Niehaus</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Nattkemper</surname>
<given-names>T. W.</given-names>
</name>
</person-group>
<article-title>TACOA - taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach</article-title>
<source>
<italic toggle="yes">BMC Bioinformatics</italic>
</source>
<year>2009</year>
<volume>10</volume>
<issue>1</issue>
<fpage>56</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-56</pub-id>
<pub-id pub-id-type="other">2-s2.0-62549109116</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nalbantoglu</surname>
<given-names>O. U.</given-names>
</name>
<name>
<surname>Way</surname>
<given-names>S. F.</given-names>
</name>
<name>
<surname>Hinrichs</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Sayood</surname>
<given-names>K.</given-names>
</name>
</person-group>
<article-title>RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles</article-title>
<source>
<italic toggle="yes">BMC Bioinformatics</italic>
</source>
<year>2011</year>
<volume>12</volume>
<issue>1</issue>
<fpage>41</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-41</pub-id>
<pub-id pub-id-type="other">2-s2.0-79251560623</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wiegmann</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Richards</surname>
<given-names>S.</given-names>
</name>
</person-group>
<article-title>Genomes of
<italic>Diptera</italic>
</article-title>
<source>
<italic toggle="yes">Current Opinion in Insect Science</italic>
</source>
<year>2018</year>
<volume>25</volume>
<fpage>116</fpage>
<lpage>124</lpage>
<pub-id pub-id-type="doi">10.1016/j.cois.2018.01.007</pub-id>
<pub-id pub-id-type="other">2-s2.0-85042116131</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kiszewski</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sachs</surname>
<given-names>S. E.</given-names>
</name>
<name>
<surname>Mellinger</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Malaney</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Sachs</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Spielman</surname>
<given-names>A.</given-names>
</name>
</person-group>
<article-title>A global index representing the stability of malaria transmission</article-title>
<source>
<italic toggle="yes">The American Journal of Tropical Medicine and Hygiene</italic>
</source>
<year>2004</year>
<volume>70</volume>
<issue>5</issue>
<fpage>486</fpage>
<lpage>498</lpage>
<pub-id pub-id-type="doi">10.4269/ajtmh.2004.70.486</pub-id>
<pub-id pub-id-type="pmid">15155980</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Foster</surname>
<given-names>P. G.</given-names>
</name>
<name>
<surname>Bergo</surname>
<given-names>E. S.</given-names>
</name>
<name>
<surname>Bourke</surname>
<given-names>B. P.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Phylogenetic analysis and DNA-based species confirmation in
<italic>Anopheles</italic>
(Nyssorhynchus)</article-title>
<source>
<italic toggle="yes">PLoS One</italic>
</source>
<year>2013</year>
<volume>8</volume>
<issue>2</issue>
<pub-id pub-id-type="publisher-id">e54063</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0054063</pub-id>
<pub-id pub-id-type="other">2-s2.0-84873503568</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krzywinski</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Besansky</surname>
<given-names>N. J.</given-names>
</name>
</person-group>
<article-title>Molecular systematics of
<italic>Anopheles</italic>
: from subgenera to subpopulations</article-title>
<source>
<italic toggle="yes">Annual Review of Entomology</italic>
</source>
<year>2003</year>
<volume>48</volume>
<issue>1</issue>
<fpage>111</fpage>
<lpage>139</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.ento.48.091801.112647</pub-id>
<pub-id pub-id-type="other">2-s2.0-0037209058</pub-id>
</element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yassin</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Orgogozo</surname>
<given-names>V.</given-names>
</name>
</person-group>
<article-title>Coevolution between male and female genitalia in the
<italic>Drosophila melanogaster</italic>
species subgroup</article-title>
<source>
<italic toggle="yes">PLoS One</italic>
</source>
<year>2013</year>
<volume>8</volume>
<issue>2</issue>
<pub-id pub-id-type="publisher-id">e57158</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0057158</pub-id>
<pub-id pub-id-type="other">2-s2.0-84874535685</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Throckmorton</surname>
<given-names>L. H.</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>King</surname>
<given-names>R. C.</given-names>
</name>
</person-group>
<article-title>The phylogeny, ecology and geography of
<italic>Drosophila</italic>
</article-title>
<source>
<italic toggle="yes">Handbook of Genetics</italic>
</source>
<year>1975</year>
<volume>3</volume>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>Plenum Press</publisher-name>
</element-citation>
</ref>
<ref id="B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Izumitani</surname>
<given-names>H. F.</given-names>
</name>
<name>
<surname>Kusaka</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Koshikawa</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Toda</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Katoh</surname>
<given-names>T.</given-names>
</name>
</person-group>
<article-title>Phylogeography of the subgenus
<italic>Drosophila</italic>
(Diptera:
<italic>Drosophilidae</italic>
): evolutionary history of faunal divergence between the Old and the New Worlds</article-title>
<source>
<italic toggle="yes">PLoS One</italic>
</source>
<year>2016</year>
<volume>11</volume>
<issue>7</issue>
<pub-id pub-id-type="publisher-id">e0160051</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0160051</pub-id>
<pub-id pub-id-type="other">2-s2.0-85012890939</pub-id>
</element-citation>
</ref>
<ref id="B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Neafsey</surname>
<given-names>D. E.</given-names>
</name>
<name>
<surname>Waterhouse</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Abai</surname>
<given-names>M. R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Mosquito genomics. Highly evolvable malaria vectors: the genomes of 16
<italic>Anopheles</italic>
mosquitoes</article-title>
<source>
<italic toggle="yes">Science</italic>
</source>
<year>2015</year>
<volume>347</volume>
<issue>6217</issue>
<pub-id pub-id-type="publisher-id">1258522</pub-id>
<pub-id pub-id-type="doi">10.1126/science.1258522</pub-id>
<pub-id pub-id-type="other">2-s2.0-84922482170</pub-id>
</element-citation>
</ref>
<ref id="B18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krafsur</surname>
<given-names>E.</given-names>
</name>
</person-group>
<article-title>Tsetse flies: genetics, evolution, and role as vectors</article-title>
<source>
<italic toggle="yes">Infection, Genetics and Evolution</italic>
</source>
<year>2009</year>
<volume>9</volume>
<issue>1</issue>
<fpage>124</fpage>
<lpage>141</lpage>
<pub-id pub-id-type="doi">10.1016/j.meegid.2008.09.010</pub-id>
<pub-id pub-id-type="other">2-s2.0-57549104676</pub-id>
</element-citation>
</ref>
<ref id="B19">
<label>19</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Gooding</surname>
<given-names>R. H.</given-names>
</name>
<name>
<surname>Krafsur</surname>
<given-names>E. S.</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Maudlin</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>P. H.</given-names>
</name>
<name>
<surname>Miles</surname>
<given-names>M. A.</given-names>
</name>
</person-group>
<source>
<italic toggle="yes">Tsetse Genetics: Applications to Biology and Systematics</italic>
</source>
<year>2004</year>
<publisher-loc>Wallingford, UK</publisher-loc>
<publisher-name>CAB International</publisher-name>
</element-citation>
</ref>
<ref id="B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elsen</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Amoudi</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Leclercq</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>First record of
<italic>Glossina fuscipes</italic>
Newstead 1910 and
<italic>Glossina morsitans submorsitans</italic>
Newstead 1910 in south-western Saudi Arabia</article-title>
<source>
<italic toggle="yes">Annales de la Société Belge de Médecine Tropicale</italic>
</source>
<year>1990</year>
<volume>70</volume>
<fpage>281</fpage>
<lpage>287</lpage>
<pub-id pub-id-type="pmid">2291693</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lichtenberg</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yilmaz</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Welch</surname>
<given-names>J. D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The word landscape of the non-coding segments of the
<italic>Arabidopsis thaliana</italic>
genome</article-title>
<source>
<italic toggle="yes">BMC Genomics</italic>
</source>
<year>2009</year>
<volume>10</volume>
<issue>1</issue>
<fpage>463</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-10-463</pub-id>
<pub-id pub-id-type="other">2-s2.0-70449713663</pub-id>
<pub-id pub-id-type="pmid">19814816</pub-id>
</element-citation>
</ref>
<ref id="B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tompa</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>T. L.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Assessing computational tools for the discovery of transcription factor binding sites</article-title>
<source>
<italic toggle="yes">Nature Biotechnology</italic>
</source>
<year>2005</year>
<volume>23</volume>
<issue>1</issue>
<fpage>137</fpage>
<lpage>144</lpage>
<pub-id pub-id-type="doi">10.1038/nbt1053</pub-id>
<pub-id pub-id-type="other">2-s2.0-21144439147</pub-id>
</element-citation>
</ref>
<ref id="B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pesole</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liuni</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Grillo</surname>
<given-names>G.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>UTRdb: a specialized database of 5′ and 3′ untranslated regions of eukaryotic mRNAs</article-title>
<source>
<italic toggle="yes">Nucleic Acids Research</italic>
</source>
<year>1999</year>
<volume>27</volume>
<issue>1</issue>
<fpage>188</fpage>
<lpage>191</lpage>
<pub-id pub-id-type="doi">10.1093/nar/27.1.188</pub-id>
<pub-id pub-id-type="other">2-s2.0-0032943759</pub-id>
<pub-id pub-id-type="pmid">9847176</pub-id>
</element-citation>
</ref>
<ref id="B24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gelfand</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Rodriguez</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Benson</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>TRDB--the Tandem Repeats Database</article-title>
<source>
<italic toggle="yes">Nucleic Acids Research</italic>
</source>
<year>2007</year>
<volume>35</volume>
<issue>Database</issue>
<fpage>D80</fpage>
<lpage>D87</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl1013</pub-id>
<pub-id pub-id-type="other">2-s2.0-33846070952</pub-id>
<pub-id pub-id-type="pmid">17175540</pub-id>
</element-citation>
</ref>
<ref id="B25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Khan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Fornes</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Stigliani</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework</article-title>
<source>
<italic toggle="yes">Nucleic Acids Research</italic>
</source>
<year>2018</year>
<volume>46</volume>
<issue>D1</issue>
<fpage>D260</fpage>
<lpage>D266</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkx1126</pub-id>
<pub-id pub-id-type="other">2-s2.0-85040936337</pub-id>
<pub-id pub-id-type="pmid">29140473</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moses</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Pollard</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Nix</surname>
<given-names>D. A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Large-scale turnover of functional transcription factor binding sites in
<italic>Drosophila</italic>
</article-title>
<source>
<italic toggle="yes">PLoS Computational Biology</italic>
</source>
<year>2006</year>
<volume>2</volume>
<issue>10</issue>
<fpage>e130</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.0020130</pub-id>
<pub-id pub-id-type="other">2-s2.0-33750442876</pub-id>
</element-citation>
</ref>
<ref id="B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Bachtrog</surname>
<given-names>D.</given-names>
</name>
</person-group>
<article-title>Ancestral chromatin configuration constrains chromatin evolution on differentiating sex chromosomes in
<italic>Drosophila</italic>
</article-title>
<source>
<italic toggle="yes">PLoS Genetics</italic>
</source>
<year>2015</year>
<volume>11</volume>
<issue>6</issue>
<pub-id pub-id-type="publisher-id">e1005331</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pgen.1005331</pub-id>
<pub-id pub-id-type="other">2-s2.0-84937789496</pub-id>
</element-citation>
</ref>
<ref id="B28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hao</surname>
<given-names>Y. J.</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>Y. L.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y. R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Complete mitochondrial genomes of
<italic>Anopheles stephensi</italic>
and
<italic>An. dirus</italic>
and comparative evolutionary mitochondriomics of 50 mosquitoes</article-title>
<source>
<italic toggle="yes">Scientific Reports</italic>
</source>
<year>2017</year>
<volume>7</volume>
<issue>1</issue>
<fpage>7666</fpage>
<pub-id pub-id-type="doi">10.1038/s41598-017-07977-0</pub-id>
<pub-id pub-id-type="other">2-s2.0-85027149873</pub-id>
</element-citation>
</ref>
<ref id="B29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Freitas</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Russo</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Voloch</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Mutaquiha</surname>
<given-names>O. C.</given-names>
</name>
<name>
<surname>Marques</surname>
<given-names>L. P.</given-names>
</name>
<name>
<surname>Schrago</surname>
<given-names>C. G.</given-names>
</name>
</person-group>
<article-title>Diversification of the genus
<italic>Anopheles</italic>
and a Neotropical clade from the Late Cretaceous</article-title>
<source>
<italic toggle="yes">PLoS One</italic>
</source>
<year>2015</year>
<volume>10</volume>
<issue>8</issue>
<pub-id pub-id-type="publisher-id">e0134462</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0134462</pub-id>
<pub-id pub-id-type="other">2-s2.0-84941985122</pub-id>
</element-citation>
</ref>
<ref id="B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beebe</surname>
<given-names>N. W.</given-names>
</name>
<name>
<surname>Russell</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Burkot</surname>
<given-names>T. R.</given-names>
</name>
<name>
<surname>Cooper</surname>
<given-names>R. D.</given-names>
</name>
</person-group>
<article-title><italic>Anopheles punctulatus</italic> group: evolution, distribution, and control</article-title>
<source>
<italic toggle="yes">Annual Review of Entomology</italic>
</source>
<year>2015</year>
<volume>60</volume>
<issue>1</issue>
<fpage>335</fpage>
<lpage>350</lpage>
<pub-id pub-id-type="doi">10.1146/annurev-ento-010814-021206</pub-id>
<pub-id pub-id-type="other">2-s2.0-84920842075</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="fig1" orientation="portrait" position="float">
<label>Figure 1</label>
<caption>
<p>Flowchart depicting the algorithm. First, the whole-genome sequences or subgenomic regions of interest for all species are analyzed to produce a whole-genome k-mer signature (WGKS) for each species: a list of all possible k-mers together with their normalized score values. The WGKSs are then compared in an all-versus-all manner using the Pearson correlation coefficient (CC). This produces a CC matrix, which is visualized as a heatmap depicting species relationships.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.001"></graphic>
</fig>
<fig id="fig2" orientation="portrait" position="float">
<label>Figure 2</label>
<caption>
<p>Heatmap depicting CC values calculated in an all-versus-all pairwise manner between the 63 species included in the analysis based on the whole-genome k-mer signature for octamers. Colors closer to yellow or white indicate higher CC values, while those closer to red indicate lower CC values. The range of the CC values in this matrix is from 0.259 to 1.0.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.002"></graphic>
</fig>
<fig id="fig3" orientation="portrait" position="float">
<label>Figure 3</label>
<caption>
<p>UPGMA, WPGMA, and NJ trees for 1−CC values for all species pairs from each of the three genera: (a)
<italic>Drosophila</italic>
, (b)
<italic>Glossina</italic>
, and (c)
<italic>Anopheles</italic>
.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.003"></graphic>
</fig>
<fig id="fig4" orientation="portrait" position="float">
<label>Figure 4</label>
<caption>
<p>Heatmap depicting similarity of the mitochondrial genomes across 28 species. Lower similarity values (closer to 0%) are shown in darker, redder colors, whereas higher similarity values (closer to 100%) are shown in brighter, yellow/white colors. Similarity values range from 0 to 100%.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.004"></graphic>
</fig>
<fig id="fig5" orientation="portrait" position="float">
<label>Figure 5</label>
<caption>
<p>Pearson correlation coefficient (CC) values between species of
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
as well as the two control species,
<italic>A. mellifera</italic>
and
<italic>C. elegans</italic>
, for octamers. The first three columns represent CC values between all pairs of species within each genus (
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
, respectively); columns 4–6 represent comparisons across species from the three genera; columns 7–9 represent comparisons between
<italic>C. elegans</italic>
and the three genera; and columns 10–12 represent comparisons between
<italic>A. mellifera</italic>
and the three genera.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.005"></graphic>
</fig>
<fig id="fig6" orientation="portrait" position="float">
<label>Figure 6</label>
<caption>
<p>Common nonrepetitive (nondimer and nontrimer) octamer content between 11
<italic>Anopheles</italic>
, 15
<italic>Drosophila</italic>
, and 5
<italic>Glossina</italic>
species. Each included octamer had a minimum score of 0.5.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.006"></graphic>
</fig>
<fig id="fig7" orientation="portrait" position="float">
<label>Figure 7</label>
<caption>
<p>Comparison of the similarity in the 5′ and 3′ UTRs within the genus
<italic>Drosophila</italic>
and between the species of
<italic>Drosophila</italic>
and
<italic>A. gambiae</italic>
, using k-mers of lengths 7–9 bp: (a) 5′ UTR and (b) 3′ UTR. Yellow bars represent comparisons among
<italic>Drosophila</italic>
species, and green bars represent comparison between
<italic>Drosophila</italic>
species and
<italic>A. gambiae</italic>
.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.007"></graphic>
</fig>
<fig id="fig8" orientation="portrait" position="float">
<label>Figure 8</label>
<caption>
<p>Number of common k-mers of lengths 7–9 bp for all seven
<italic>Drosophila</italic>
species for 5′ and 3′ UTRs and introns.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.008"></graphic>
</fig>
<fig id="fig9" orientation="portrait" position="float">
<label>Figure 9</label>
<caption>
<p>Range of Pearson correlation coefficient values from the all-versus-all comparison of the twelve
<italic>Drosophila</italic>
species that have intron-region data, for k-mer lengths 7–9 bp.</p>
</caption>
<graphic xlink:href="CMMM2019-4259479.009"></graphic>
</fig>
<table-wrap id="tab1" orientation="portrait" position="float">
<label>Table 1</label>
<caption>
<p>Number of statistically significant genome k-mers, minimum score, and number of hits in the JASPAR database for each species.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Species</th>
<th align="center" rowspan="1" colspan="1">No. of significant k-mers</th>
<th align="center" rowspan="1" colspan="1">Min. score</th>
<th align="center" rowspan="1" colspan="1">No. of hits in JASPAR database</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_albimanus</td>
<td align="center" rowspan="1" colspan="1">1646</td>
<td align="center" rowspan="1" colspan="1">0.383</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_arabiensis</td>
<td align="center" rowspan="1" colspan="1">1629</td>
<td align="center" rowspan="1" colspan="1">0.349</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_atroparvus</td>
<td align="center" rowspan="1" colspan="1">1425</td>
<td align="center" rowspan="1" colspan="1">0.346</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_christyi</td>
<td align="center" rowspan="1" colspan="1">1366</td>
<td align="center" rowspan="1" colspan="1">0.414</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_cracens</td>
<td align="center" rowspan="1" colspan="1">1523</td>
<td align="center" rowspan="1" colspan="1">0.433</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_culicifacies</td>
<td align="center" rowspan="1" colspan="1">1440</td>
<td align="center" rowspan="1" colspan="1">0.371</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_darlingi</td>
<td align="center" rowspan="1" colspan="1">1646</td>
<td align="center" rowspan="1" colspan="1">0.413</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_dirus</td>
<td align="center" rowspan="1" colspan="1">1648</td>
<td align="center" rowspan="1" colspan="1">0.387</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_epiroticus</td>
<td align="center" rowspan="1" colspan="1">1562</td>
<td align="center" rowspan="1" colspan="1">0.375</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_farauti</td>
<td align="center" rowspan="1" colspan="1">1397</td>
<td align="center" rowspan="1" colspan="1">0.435</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_funestus</td>
<td align="center" rowspan="1" colspan="1">1579</td>
<td align="center" rowspan="1" colspan="1">0.340</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_gambiae</td>
<td align="center" rowspan="1" colspan="1">1509</td>
<td align="center" rowspan="1" colspan="1">0.394</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_koliensis</td>
<td align="center" rowspan="1" colspan="1">1309</td>
<td align="center" rowspan="1" colspan="1">0.461</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_maculatus</td>
<td align="center" rowspan="1" colspan="1">1613</td>
<td align="center" rowspan="1" colspan="1">0.377</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_melas</td>
<td align="center" rowspan="1" colspan="1">1551</td>
<td align="center" rowspan="1" colspan="1">0.379</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_merus</td>
<td align="center" rowspan="1" colspan="1">1755</td>
<td align="center" rowspan="1" colspan="1">0.281</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_minimus</td>
<td align="center" rowspan="1" colspan="1">1406</td>
<td align="center" rowspan="1" colspan="1">0.379</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_nili</td>
<td align="center" rowspan="1" colspan="1">1206</td>
<td align="center" rowspan="1" colspan="1">0.427</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_punctulatus</td>
<td align="center" rowspan="1" colspan="1">1276</td>
<td align="center" rowspan="1" colspan="1">0.456</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_quadriannulatus</td>
<td align="center" rowspan="1" colspan="1">1771</td>
<td align="center" rowspan="1" colspan="1">0.270</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_sinensis</td>
<td align="center" rowspan="1" colspan="1">1381</td>
<td align="center" rowspan="1" colspan="1">0.419</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Anopheles_stephensi</td>
<td align="center" rowspan="1" colspan="1">1666</td>
<td align="center" rowspan="1" colspan="1">0.369</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_albomicans</td>
<td align="center" rowspan="1" colspan="1">2279</td>
<td align="center" rowspan="1" colspan="1">0.428</td>
<td align="center" rowspan="1" colspan="1">23</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_americana</td>
<td align="center" rowspan="1" colspan="1">2209</td>
<td align="center" rowspan="1" colspan="1">0.428</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_ananassae</td>
<td align="center" rowspan="1" colspan="1">2067</td>
<td align="center" rowspan="1" colspan="1">0.481</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_arizonae</td>
<td align="center" rowspan="1" colspan="1">2293</td>
<td align="center" rowspan="1" colspan="1">0.405</td>
<td align="center" rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_biarmipes</td>
<td align="center" rowspan="1" colspan="1">1899</td>
<td align="center" rowspan="1" colspan="1">0.475</td>
<td align="center" rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_bipectinata</td>
<td align="center" rowspan="1" colspan="1">1934</td>
<td align="center" rowspan="1" colspan="1">0.449</td>
<td align="center" rowspan="1" colspan="1">15</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_busckii</td>
<td align="center" rowspan="1" colspan="1">2406</td>
<td align="center" rowspan="1" colspan="1">0.442</td>
<td align="center" rowspan="1" colspan="1">25</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_elegans</td>
<td align="center" rowspan="1" colspan="1">1768</td>
<td align="center" rowspan="1" colspan="1">0.519</td>
<td align="center" rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_erecta</td>
<td align="center" rowspan="1" colspan="1">2047</td>
<td align="center" rowspan="1" colspan="1">0.470</td>
<td align="center" rowspan="1" colspan="1">17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_eugracilis</td>
<td align="center" rowspan="1" colspan="1">1838</td>
<td align="center" rowspan="1" colspan="1">0.424</td>
<td align="center" rowspan="1" colspan="1">21</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_ficusphila</td>
<td align="center" rowspan="1" colspan="1">1591</td>
<td align="center" rowspan="1" colspan="1">0.435</td>
<td align="center" rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_grimshawi</td>
<td align="center" rowspan="1" colspan="1">2377</td>
<td align="center" rowspan="1" colspan="1">0.465</td>
<td align="center" rowspan="1" colspan="1">16</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_kikkawai</td>
<td align="center" rowspan="1" colspan="1">1834</td>
<td align="center" rowspan="1" colspan="1">0.468</td>
<td align="center" rowspan="1" colspan="1">16</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_melanogaster</td>
<td align="center" rowspan="1" colspan="1">1805</td>
<td align="center" rowspan="1" colspan="1">0.472</td>
<td align="center" rowspan="1" colspan="1">20</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_miranda</td>
<td align="center" rowspan="1" colspan="1">1973</td>
<td align="center" rowspan="1" colspan="1">0.429</td>
<td align="center" rowspan="1" colspan="1">28</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_mojavensis</td>
<td align="center" rowspan="1" colspan="1">2435</td>
<td align="center" rowspan="1" colspan="1">0.435</td>
<td align="center" rowspan="1" colspan="1">17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_nasuta</td>
<td align="center" rowspan="1" colspan="1">1981</td>
<td align="center" rowspan="1" colspan="1">0.468</td>
<td align="center" rowspan="1" colspan="1">15</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_navojoa</td>
<td align="center" rowspan="1" colspan="1">2239</td>
<td align="center" rowspan="1" colspan="1">0.508</td>
<td align="center" rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_obscura</td>
<td align="center" rowspan="1" colspan="1">2029</td>
<td align="center" rowspan="1" colspan="1">0.500</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_persimilis</td>
<td align="center" rowspan="1" colspan="1">2111</td>
<td align="center" rowspan="1" colspan="1">0.423</td>
<td align="center" rowspan="1" colspan="1">24</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_pseudoobscura</td>
<td align="center" rowspan="1" colspan="1">2046</td>
<td align="center" rowspan="1" colspan="1">0.393</td>
<td align="center" rowspan="1" colspan="1">26</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_rhopaloa</td>
<td align="center" rowspan="1" colspan="1">1757</td>
<td align="center" rowspan="1" colspan="1">0.427</td>
<td align="center" rowspan="1" colspan="1">17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_sechellia</td>
<td align="center" rowspan="1" colspan="1">1883</td>
<td align="center" rowspan="1" colspan="1">0.456</td>
<td align="center" rowspan="1" colspan="1">20</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_serrata</td>
<td align="center" rowspan="1" colspan="1">1820</td>
<td align="center" rowspan="1" colspan="1">0.410</td>
<td align="center" rowspan="1" colspan="1">15</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_simulans</td>
<td align="center" rowspan="1" colspan="1">1758</td>
<td align="center" rowspan="1" colspan="1">0.496</td>
<td align="center" rowspan="1" colspan="1">21</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_suzukii</td>
<td align="center" rowspan="1" colspan="1">1937</td>
<td align="center" rowspan="1" colspan="1">0.442</td>
<td align="center" rowspan="1" colspan="1">21</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_takahashi</td>
<td align="center" rowspan="1" colspan="1">1834</td>
<td align="center" rowspan="1" colspan="1">0.364</td>
<td align="center" rowspan="1" colspan="1">20</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_virilis</td>
<td align="center" rowspan="1" colspan="1">2415</td>
<td align="center" rowspan="1" colspan="1">0.475</td>
<td align="center" rowspan="1" colspan="1">23</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_willistoni</td>
<td align="center" rowspan="1" colspan="1">2223</td>
<td align="center" rowspan="1" colspan="1">0.425</td>
<td align="center" rowspan="1" colspan="1">21</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Drosophila_yakuba</td>
<td align="center" rowspan="1" colspan="1">1843</td>
<td align="center" rowspan="1" colspan="1">0.410</td>
<td align="center" rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Glossina_austeni</td>
<td align="center" rowspan="1" colspan="1">1741</td>
<td align="center" rowspan="1" colspan="1">0.367</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Glossina_brevipalpis</td>
<td align="center" rowspan="1" colspan="1">1973</td>
<td align="center" rowspan="1" colspan="1">0.360</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Glossina_fuscipes</td>
<td align="center" rowspan="1" colspan="1">1787</td>
<td align="center" rowspan="1" colspan="1">0.370</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Glossina_pallidipes</td>
<td align="center" rowspan="1" colspan="1">1732</td>
<td align="center" rowspan="1" colspan="1">0.373</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Glossina_palpalis_gambiensis</td>
<td align="center" rowspan="1" colspan="1">1810</td>
<td align="center" rowspan="1" colspan="1">0.342</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Glossina_morsitans_morsitans</td>
<td align="center" rowspan="1" colspan="1">1735</td>
<td align="center" rowspan="1" colspan="1">0.377</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="tab2" orientation="portrait" position="float">
<label>Table 2</label>
<caption>
<p>CC statistics for k-mers of lengths 7–9 bp for different combinations of the genera under study.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Group comparison</th>
<th align="center" rowspan="1" colspan="1">Min</th>
<th align="center" rowspan="1" colspan="1">Median</th>
<th align="center" rowspan="1" colspan="1">Mean</th>
<th align="center" rowspan="1" colspan="1">Max</th>
<th align="center" rowspan="1" colspan="1">Std. dev.</th>
<th align="center" rowspan="1" colspan="1">No. of comparisons</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="7" rowspan="1">Heptamers</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.913</td>
<td align="center" rowspan="1" colspan="1">0.957</td>
<td align="center" rowspan="1" colspan="1">0.955</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.022</td>
<td align="center" rowspan="1" colspan="1">231</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.590</td>
<td align="center" rowspan="1" colspan="1">0.833</td>
<td align="center" rowspan="1" colspan="1">0.837</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.087</td>
<td align="center" rowspan="1" colspan="1">630</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.590</td>
<td align="center" rowspan="1" colspan="1">0.874</td>
<td align="center" rowspan="1" colspan="1">0.869</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.072</td>
<td align="center" rowspan="1" colspan="1">435</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.677</td>
<td align="center" rowspan="1" colspan="1">0.938</td>
<td align="center" rowspan="1" colspan="1">0.882</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.104</td>
<td align="center" rowspan="1" colspan="1">378</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.965</td>
<td align="center" rowspan="1" colspan="1">0.994</td>
<td align="center" rowspan="1" colspan="1">0.986</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.014</td>
<td align="center" rowspan="1" colspan="1">15</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.441</td>
<td align="center" rowspan="1" colspan="1">0.739</td>
<td align="center" rowspan="1" colspan="1">0.772</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.144</td>
<td align="center" rowspan="1" colspan="1">1326</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.441</td>
<td align="center" rowspan="1" colspan="1">0.648</td>
<td align="center" rowspan="1" colspan="1">0.644</td>
<td align="center" rowspan="1" colspan="1">0.770</td>
<td align="center" rowspan="1" colspan="1">0.059</td>
<td align="center" rowspan="1" colspan="1">660</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.677</td>
<td align="center" rowspan="1" colspan="1">0.740</td>
<td align="center" rowspan="1" colspan="1">0.744</td>
<td align="center" rowspan="1" colspan="1">0.787</td>
<td align="center" rowspan="1" colspan="1">0.027</td>
<td align="center" rowspan="1" colspan="1">132</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Drosophila</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.642</td>
<td align="center" rowspan="1" colspan="1">0.749</td>
<td align="center" rowspan="1" colspan="1">0.745</td>
<td align="center" rowspan="1" colspan="1">0.812</td>
<td align="center" rowspan="1" colspan="1">0.033</td>
<td align="center" rowspan="1" colspan="1">180</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.528</td>
<td align="center" rowspan="1" colspan="1">0.559</td>
<td align="center" rowspan="1" colspan="1">0.562</td>
<td align="center" rowspan="1" colspan="1">0.643</td>
<td align="center" rowspan="1" colspan="1">0.030</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.266</td>
<td align="center" rowspan="1" colspan="1">0.620</td>
<td align="center" rowspan="1" colspan="1">0.573</td>
<td align="center" rowspan="1" colspan="1">0.667</td>
<td align="center" rowspan="1" colspan="1">0.102</td>
<td align="center" rowspan="1" colspan="1">30</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.485</td>
<td align="center" rowspan="1" colspan="1">0.492</td>
<td align="center" rowspan="1" colspan="1">0.499</td>
<td align="center" rowspan="1" colspan="1">0.534</td>
<td align="center" rowspan="1" colspan="1">0.018</td>
<td align="center" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.568</td>
<td align="center" rowspan="1" colspan="1">0.617</td>
<td align="center" rowspan="1" colspan="1">0.629</td>
<td align="center" rowspan="1" colspan="1">0.702</td>
<td align="center" rowspan="1" colspan="1">0.043</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.242</td>
<td align="center" rowspan="1" colspan="1">0.484</td>
<td align="center" rowspan="1" colspan="1">0.474</td>
<td align="center" rowspan="1" colspan="1">0.567</td>
<td align="center" rowspan="1" colspan="1">0.065</td>
<td align="center" rowspan="1" colspan="1">30</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.570</td>
<td align="center" rowspan="1" colspan="1">0.590</td>
<td align="center" rowspan="1" colspan="1">0.589</td>
<td align="center" rowspan="1" colspan="1">0.617</td>
<td align="center" rowspan="1" colspan="1">0.017</td>
<td align="center" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td align="left" colspan="7" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td align="left" colspan="7" rowspan="1">Octamers</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.904</td>
<td align="center" rowspan="1" colspan="1">0.950</td>
<td align="center" rowspan="1" colspan="1">0.948</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.023</td>
<td align="center" rowspan="1" colspan="1">231</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.588</td>
<td align="center" rowspan="1" colspan="1">0.824</td>
<td align="center" rowspan="1" colspan="1">0.822</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.089</td>
<td align="center" rowspan="1" colspan="1">630</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.588</td>
<td align="center" rowspan="1" colspan="1">0.858</td>
<td align="center" rowspan="1" colspan="1">0.857</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">0.069</td>
<td align="center" rowspan="1" colspan="1">435</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.655</td>
<td align="center" rowspan="1" colspan="1">0.930</td>
<td align="center" rowspan="1" colspan="1">0.869</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.113</td>
<td align="center" rowspan="1" colspan="1">378</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.948</td>
<td align="center" rowspan="1" colspan="1">0.988</td>
<td align="center" rowspan="1" colspan="1">0.978</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.020</td>
<td align="center" rowspan="1" colspan="1">15</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.443</td>
<td align="center" rowspan="1" colspan="1">0.723</td>
<td align="center" rowspan="1" colspan="1">0.761</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.143</td>
<td align="center" rowspan="1" colspan="1">1326</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.443</td>
<td align="center" rowspan="1" colspan="1">0.637</td>
<td align="center" rowspan="1" colspan="1">0.633</td>
<td align="center" rowspan="1" colspan="1">0.760</td>
<td align="center" rowspan="1" colspan="1">0.055</td>
<td align="center" rowspan="1" colspan="1">660</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.655</td>
<td align="center" rowspan="1" colspan="1">0.716</td>
<td align="center" rowspan="1" colspan="1">0.719</td>
<td align="center" rowspan="1" colspan="1">0.755</td>
<td align="center" rowspan="1" colspan="1">0.026</td>
<td align="center" rowspan="1" colspan="1">132</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Drosophila</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.621</td>
<td align="center" rowspan="1" colspan="1">0.728</td>
<td align="center" rowspan="1" colspan="1">0.723</td>
<td align="center" rowspan="1" colspan="1">0.791</td>
<td align="center" rowspan="1" colspan="1">0.034</td>
<td align="center" rowspan="1" colspan="1">180</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.521</td>
<td align="center" rowspan="1" colspan="1">0.554</td>
<td align="center" rowspan="1" colspan="1">0.556</td>
<td align="center" rowspan="1" colspan="1">0.634</td>
<td align="center" rowspan="1" colspan="1">0.029</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.279</td>
<td align="center" rowspan="1" colspan="1">0.610</td>
<td align="center" rowspan="1" colspan="1">0.567</td>
<td align="center" rowspan="1" colspan="1">0.652</td>
<td align="center" rowspan="1" colspan="1">0.094</td>
<td align="center" rowspan="1" colspan="1">30</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.477</td>
<td align="center" rowspan="1" colspan="1">0.484</td>
<td align="center" rowspan="1" colspan="1">0.490</td>
<td align="center" rowspan="1" colspan="1">0.522</td>
<td align="center" rowspan="1" colspan="1">0.017</td>
<td align="center" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.564</td>
<td align="center" rowspan="1" colspan="1">0.611</td>
<td align="center" rowspan="1" colspan="1">0.624</td>
<td align="center" rowspan="1" colspan="1">0.696</td>
<td align="center" rowspan="1" colspan="1">0.042</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.259</td>
<td align="center" rowspan="1" colspan="1">0.481</td>
<td align="center" rowspan="1" colspan="1">0.477</td>
<td align="center" rowspan="1" colspan="1">0.565</td>
<td align="center" rowspan="1" colspan="1">0.061</td>
<td align="center" rowspan="1" colspan="1">30</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.564</td>
<td align="center" rowspan="1" colspan="1">0.585</td>
<td align="center" rowspan="1" colspan="1">0.583</td>
<td align="center" rowspan="1" colspan="1">0.608</td>
<td align="center" rowspan="1" colspan="1">0.016</td>
<td align="center" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td align="left" colspan="7" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td align="left" colspan="7" rowspan="1">Nonamers</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.886</td>
<td align="center" rowspan="1" colspan="1">0.939</td>
<td align="center" rowspan="1" colspan="1">0.938</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1">0.025</td>
<td align="center" rowspan="1" colspan="1">231</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.577</td>
<td align="center" rowspan="1" colspan="1">0.805</td>
<td align="center" rowspan="1" colspan="1">0.801</td>
<td align="center" rowspan="1" colspan="1">0.993</td>
<td align="center" rowspan="1" colspan="1">0.092</td>
<td align="center" rowspan="1" colspan="1">630</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.577</td>
<td align="center" rowspan="1" colspan="1">0.838</td>
<td align="center" rowspan="1" colspan="1">0.839</td>
<td align="center" rowspan="1" colspan="1">0.992</td>
<td align="center" rowspan="1" colspan="1">0.069</td>
<td align="center" rowspan="1" colspan="1">435</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.629</td>
<td align="center" rowspan="1" colspan="1">0.919</td>
<td align="center" rowspan="1" colspan="1">0.852</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1">0.121</td>
<td align="center" rowspan="1" colspan="1">378</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.919</td>
<td align="center" rowspan="1" colspan="1">0.975</td>
<td align="center" rowspan="1" colspan="1">0.961</td>
<td align="center" rowspan="1" colspan="1">0.993</td>
<td align="center" rowspan="1" colspan="1">0.028</td>
<td align="center" rowspan="1" colspan="1">15</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.439</td>
<td align="center" rowspan="1" colspan="1">0.705</td>
<td align="center" rowspan="1" colspan="1">0.747</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1">0.143</td>
<td align="center" rowspan="1" colspan="1">1326</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.439</td>
<td align="center" rowspan="1" colspan="1">0.624</td>
<td align="center" rowspan="1" colspan="1">0.619</td>
<td align="center" rowspan="1" colspan="1">0.746</td>
<td align="center" rowspan="1" colspan="1">0.053</td>
<td align="center" rowspan="1" colspan="1">660</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Anopheles</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.629</td>
<td align="center" rowspan="1" colspan="1">0.689</td>
<td align="center" rowspan="1" colspan="1">0.691</td>
<td align="center" rowspan="1" colspan="1">0.724</td>
<td align="center" rowspan="1" colspan="1">0.024</td>
<td align="center" rowspan="1" colspan="1">132</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Drosophila</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.589</td>
<td align="center" rowspan="1" colspan="1">0.697</td>
<td align="center" rowspan="1" colspan="1">0.694</td>
<td align="center" rowspan="1" colspan="1">0.766</td>
<td align="center" rowspan="1" colspan="1">0.034</td>
<td align="center" rowspan="1" colspan="1">180</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.510</td>
<td align="center" rowspan="1" colspan="1">0.544</td>
<td align="center" rowspan="1" colspan="1">0.545</td>
<td align="center" rowspan="1" colspan="1">0.619</td>
<td align="center" rowspan="1" colspan="1">0.027</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.285</td>
<td align="center" rowspan="1" colspan="1">0.594</td>
<td align="center" rowspan="1" colspan="1">0.553</td>
<td align="center" rowspan="1" colspan="1">0.636</td>
<td align="center" rowspan="1" colspan="1">0.086</td>
<td align="center" rowspan="1" colspan="1">30</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>C. briggsae</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.464</td>
<td align="center" rowspan="1" colspan="1">0.470</td>
<td align="center" rowspan="1" colspan="1">0.475</td>
<td align="center" rowspan="1" colspan="1">0.503</td>
<td align="center" rowspan="1" colspan="1">0.014</td>
<td align="center" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Anopheles</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.555</td>
<td align="center" rowspan="1" colspan="1">0.602</td>
<td align="center" rowspan="1" colspan="1">0.615</td>
<td align="center" rowspan="1" colspan="1">0.685</td>
<td align="center" rowspan="1" colspan="1">0.041</td>
<td align="center" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Drosophila</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.270</td>
<td align="center" rowspan="1" colspan="1">0.475</td>
<td align="center" rowspan="1" colspan="1">0.474</td>
<td align="center" rowspan="1" colspan="1">0.558</td>
<td align="center" rowspan="1" colspan="1">0.058</td>
<td align="center" rowspan="1" colspan="1">30</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. mellifera</italic>
vs.
<italic>Glossina</italic>
</td>
<td align="center" rowspan="1" colspan="1">0.551</td>
<td align="center" rowspan="1" colspan="1">0.572</td>
<td align="center" rowspan="1" colspan="1">0.570</td>
<td align="center" rowspan="1" colspan="1">0.592</td>
<td align="center" rowspan="1" colspan="1">0.014</td>
<td align="center" rowspan="1" colspan="1">6</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>CC values were calculated for the genera
<italic>Anopheles</italic>
,
<italic>Drosophila</italic>
, and
<italic>Glossina</italic>
as well as between these three genera and between two outliers,
<italic>Apis mellifera</italic>
and
<italic>Caenorhabditis elegans</italic>
, and these three genera. For each combination, the minimum, median, mean, and maximum CC values were calculated, as well as the standard deviation and the number of species comparisons.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tab3" orientation="portrait" position="float">
<label>Table 3</label>
<caption>
<p>CC statistics for 5′, 3′ UTRs and introns for k-mer lengths
<italic>k</italic>
 = 7–9 bp between
<italic>A. gambiae</italic>
and
<italic>Drosophila</italic>
.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Comparison</th>
<th align="center" rowspan="1" colspan="1">Region</th>
<th align="center" rowspan="1" colspan="1">k</th>
<th align="center" rowspan="1" colspan="1">Min</th>
<th align="center" rowspan="1" colspan="1">Median</th>
<th align="center" rowspan="1" colspan="1">Mean</th>
<th align="center" rowspan="1" colspan="1">Max</th>
<th align="center" rowspan="1" colspan="1">St. dev.</th>
<th align="center" rowspan="1" colspan="1">
<italic>n</italic>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>p</italic>
value</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family
<sup></sup>
</td>
<td align="center" rowspan="1" colspan="1">5′ UTR</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.692</td>
<td align="center" rowspan="1" colspan="1">0.862</td>
<td align="center" rowspan="1" colspan="1">0.841</td>
<td align="center" rowspan="1" colspan="1">0.975</td>
<td align="center" rowspan="1" colspan="1">0.100</td>
<td align="center" rowspan="1" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. gambiae</italic>
vs.
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">5′ UTR</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.623</td>
<td align="center" rowspan="1" colspan="1">0.734</td>
<td align="center" rowspan="1" colspan="1">0.722</td>
<td align="center" rowspan="1" colspan="1">0.774</td>
<td align="center" rowspan="1" colspan="1">0.050</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">5.1
<italic>e</italic>
−4</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">3′ UTR</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.651</td>
<td align="center" rowspan="1" colspan="1">0.828</td>
<td align="center" rowspan="1" colspan="1">0.809</td>
<td align="center" rowspan="1" colspan="1">0.963</td>
<td align="center" rowspan="1" colspan="1">0.101</td>
<td align="center" rowspan="1" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. gambiae</italic>
vs.
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">3′ UTR</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.524</td>
<td align="center" rowspan="1" colspan="1">0.620</td>
<td align="center" rowspan="1" colspan="1">0.599</td>
<td align="center" rowspan="1" colspan="1">0.644</td>
<td align="center" rowspan="1" colspan="1">0.043</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">6.2
<italic>e</italic>
−8</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family
<sup></sup>
</td>
<td align="center" rowspan="1" colspan="1">Introns</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.759</td>
<td align="center" rowspan="1" colspan="1">0.894</td>
<td align="center" rowspan="1" colspan="1">0.895</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1">0.058</td>
<td align="center" rowspan="1" colspan="1">66</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">5′ UTR</td>
<td align="center" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">0.503</td>
<td align="center" rowspan="1" colspan="1">0.786</td>
<td align="center" rowspan="1" colspan="1">0.737</td>
<td align="center" rowspan="1" colspan="1">0.940</td>
<td align="center" rowspan="1" colspan="1">0.153</td>
<td align="center" rowspan="1" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. gambiae</italic>
vs.
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">5′ UTR</td>
<td align="center" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">0.422</td>
<td align="center" rowspan="1" colspan="1">0.643</td>
<td align="center" rowspan="1" colspan="1">0.620</td>
<td align="center" rowspan="1" colspan="1">0.694</td>
<td align="center" rowspan="1" colspan="1">0.090</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">3′ UTR</td>
<td align="center" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">0.487</td>
<td align="center" rowspan="1" colspan="1">0.705</td>
<td align="center" rowspan="1" colspan="1">0.688</td>
<td align="center" rowspan="1" colspan="1">0.908</td>
<td align="center" rowspan="1" colspan="1">0.125</td>
<td align="center" rowspan="1" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. gambiae</italic>
vs.
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">3′ UTR</td>
<td align="center" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">0.392</td>
<td align="center" rowspan="1" colspan="1">0.513</td>
<td align="center" rowspan="1" colspan="1">0.498</td>
<td align="center" rowspan="1" colspan="1">0.562</td>
<td align="center" rowspan="1" colspan="1">0.055</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">1.2
<italic>e</italic>
−5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">Introns</td>
<td align="center" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">0.392</td>
<td align="center" rowspan="1" colspan="1">0.690</td>
<td align="center" rowspan="1" colspan="1">0.676</td>
<td align="center" rowspan="1" colspan="1">0.981</td>
<td align="center" rowspan="1" colspan="1">0.135</td>
<td align="center" rowspan="1" colspan="1">66</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">5′ UTR</td>
<td align="center" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.280</td>
<td align="center" rowspan="1" colspan="1">0.626</td>
<td align="center" rowspan="1" colspan="1">0.569</td>
<td align="center" rowspan="1" colspan="1">0.854</td>
<td align="center" rowspan="1" colspan="1">0.183</td>
<td align="center" rowspan="1" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. gambiae</italic>
vs.
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">5′ UTR</td>
<td align="center" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.201</td>
<td align="center" rowspan="1" colspan="1">0.453</td>
<td align="center" rowspan="1" colspan="1">0.431</td>
<td align="center" rowspan="1" colspan="1">0.512</td>
<td align="center" rowspan="1" colspan="1">0.104</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">0.023</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">3′ UTR</td>
<td align="center" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.334</td>
<td align="center" rowspan="1" colspan="1">0.526</td>
<td align="center" rowspan="1" colspan="1">0.524</td>
<td align="center" rowspan="1" colspan="1">0.795</td>
<td align="center" rowspan="1" colspan="1">0.122</td>
<td align="center" rowspan="1" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>A. gambiae</italic>
vs.
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">3′ UTR</td>
<td align="center" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.242</td>
<td align="center" rowspan="1" colspan="1">0.360</td>
<td align="center" rowspan="1" colspan="1">0.356</td>
<td align="center" rowspan="1" colspan="1">0.422</td>
<td align="center" rowspan="1" colspan="1">0.056</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">5.9
<italic>e</italic>
−5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Within
<italic>Drosophila</italic>
family</td>
<td align="center" rowspan="1" colspan="1">Introns</td>
<td align="center" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.721</td>
<td align="center" rowspan="1" colspan="1">0.855</td>
<td align="center" rowspan="1" colspan="1">0.854</td>
<td align="center" rowspan="1" colspan="1">0.978</td>
<td align="center" rowspan="1" colspan="1">0.062</td>
<td align="center" rowspan="1" colspan="1">66</td>
<td align="center" rowspan="1" colspan="1">NA</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Minimum, median, mean, and maximum CC values were calculated for the 5′ UTR, 3′ UTR, and intron regions of different
<italic>Drosophila</italic>
species compared to
<italic>A. gambiae</italic>
. The number of comparisons and the
<italic>p</italic>
value are also included.
<sup></sup>
For 5′ and 3′ UTRs, the following
<italic>Drosophila</italic>
species were examined:
<italic>D. ananassae</italic>
,
<italic>erecta</italic>
,
<italic>grimshawi</italic>
,
<italic>melanogaster</italic>
,
<italic>mojavensis</italic>
,
<italic>pseudoobscura</italic>
, and
<italic>simulans</italic>
.
<sup></sup>
For introns, the following
<italic>Drosophila</italic>
species were examined:
<italic>D. ananassae</italic>
,
<italic>erecta</italic>
,
<italic>grimshawi</italic>
,
<italic>melanogaster</italic>
,
<italic>mojavensis</italic>
,
<italic>persimilis</italic>
,
<italic>pseudoobscura</italic>
,
<italic>sechellia</italic>
,
<italic>simulans</italic>
,
<italic>virilis</italic>
,
<italic>willistoni</italic>
, and
<italic>yakuba</italic>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</floats-group>
</pmc>
</record>
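
The "No. of comparisons" column of Table 2 can be checked by simple counting. The single-outlier rows (for example "C. briggsae vs. Anopheles" with 22 comparisons) imply 22 Anopheles, 30 Drosophila, and 6 Glossina species. The sketch below is a minimal check under two assumptions that the record does not state explicitly: that within-group counts are unordered species pairs, n(n-1)/2, and that each "Non-" row covers all pairs among the species outside the named genus. The function and variable names are hypothetical and not taken from the paper.

```python
from math import comb

# Genus sizes inferred from the outlier rows of Table 2
# (one outlier species compared against every species of a genus).
GENUS_SIZES = {"Anopheles": 22, "Drosophila": 30, "Glossina": 6}


def within(n):
    """Number of unordered species pairs inside a group of n species."""
    return comb(n, 2)


def between(n, m):
    """Number of species pairs between two disjoint groups of sizes n and m."""
    return n * m


def table2_counts(sizes):
    """Return the expected comparison count for each Table 2 grouping."""
    counts = {}
    total = sum(sizes.values())
    names = list(sizes)
    for name in names:
        counts[name] = within(sizes[name])
        # Assumption: "Non-X" means all pairs among species not in genus X.
        counts["Non-" + name] = within(total - sizes[name])
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            counts[a + " vs. " + b] = between(sizes[a], sizes[b])
    return counts


if __name__ == "__main__":
    for group, n_pairs in table2_counts(GENUS_SIZES).items():
        print(group, n_pairs)
```

The same counting appears to apply to Table 3: the 7 Drosophila species listed for the UTRs give within(7) = 21 pairs, and the 12 species listed for introns give within(12) = 66 pairs, which matches the n column there.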

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B959 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B959 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021