Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers

Identifieur interne : 000B10 ( Pmc/Corpus ); précédent : 000B09; suivant : 000B11

Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers

Auteurs : Yaron Orenstein ; Ron Shamir

Source :

RBID : PMC:3694677

Abstract

Motivation: Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure in vitro the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism’s genome allow us to measure in vivo the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of k that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is, therefore, to cover all k-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length . However, as probes are double stranded, when a k-mer is included in a probe, its reverse complement k-mer is accounted for as well.

Results: Here, we show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases resulting in a more efficient array, which allows covering more longer sequences; alternatively, additional sequences with redundant k-mers of interest can be added.

Availability: The software is freely available from our website http://acgt.cs.tau.ac.il/shortcake/.

Contact:rshamir@tau.ac.il


Url:
DOI: 10.1093/bioinformatics/btt230
PubMed: 23813011
PubMed Central: 3694677

Links to Exploration step

PMC:3694677

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Design of shortest double-stranded DNA sequences covering all
<italic>k</italic>
-mers with applications to protein-binding microarrays and synthetic enhancers</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23813011</idno>
<idno type="pmc">3694677</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694677</idno>
<idno type="RBID">PMC:3694677</idno>
<idno type="doi">10.1093/bioinformatics/btt230</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000B10</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B10</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Design of shortest double-stranded DNA sequences covering all
<italic>k</italic>
-mers with applications to protein-binding microarrays and synthetic enhancers</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation:</bold>
Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure
<italic>in vitro</italic>
the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism’s genome allow us to measure
<italic>in vivo</italic>
the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all
<italic>k</italic>
-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of
<italic>k</italic>
that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is, therefore, to cover all
<italic>k</italic>
-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length
<inline-formula>
<inline-graphic xlink:href="btt230i1.jpg"></inline-graphic>
</inline-formula>
. However, as probes are double stranded, when a
<italic>k</italic>
-mer is included in a probe, its reverse complement
<italic>k</italic>
-mer is accounted for as well.</p>
<p>
<bold>Results:</bold>
Here, we show how to efficiently create a shortest possible sequence with the property that it contains each
<italic>k</italic>
-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as
<italic>k</italic>
increases resulting in a more efficient array, which allows covering more longer sequences; alternatively, additional sequences with redundant
<italic>k</italic>
-mers of interest can be added.</p>
<p>
<bold>Availability:</bold>
The software is freely available from our website
<ext-link ext-link-type="uri" xlink:href="http://acgt.cs.tau.ac.il/shortcake/">http://acgt.cs.tau.ac.il/shortcake/</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>rshamir@tau.ac.il</email>
</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Berger, M" uniqKey="Berger M">M Berger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edmonds, J" uniqKey="Edmonds J">J Edmonds</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edmonds, J" uniqKey="Edmonds J">J Edmonds</name>
</author>
<author>
<name sortKey="Johnson, E" uniqKey="Johnson E">E Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fleischner, H" uniqKey="Fleischner H">H Fleischner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fordyce, P" uniqKey="Fordyce P">P Fordyce</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jolma, A" uniqKey="Jolma A">A Jolma</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kao, M" uniqKey="Kao M">M Kao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kececioglu, J" uniqKey="Kececioglu J">J Kececioglu</name>
</author>
<author>
<name sortKey="Myers, E" uniqKey="Myers E">E Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuhn, H" uniqKey="Kuhn H">H Kuhn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mintseris, J" uniqKey="Mintseris J">J Mintseris</name>
</author>
<author>
<name sortKey="Eisen, M" uniqKey="Eisen M">M Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nutiu, R" uniqKey="Nutiu R">R Nutiu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orenstein, Y" uniqKey="Orenstein Y">Y Orenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Philippakis, A" uniqKey="Philippakis A">A Philippakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Riesenfeld, S" uniqKey="Riesenfeld S">S Riesenfeld</name>
</author>
<author>
<name sortKey="Pollard, K" uniqKey="Pollard K">K Pollard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, R" uniqKey="Smith R">R Smith</name>
</author>
<author>
<name sortKey="Ahituv, N" uniqKey="Ahituv N">N Ahituv</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="West, D" uniqKey="West D">D West</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23813011</article-id>
<article-id pub-id-type="pmc">3694677</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btt230</article-id>
<article-id pub-id-type="publisher-id">btt230</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany</subject>
<subj-group>
<subject>Original Papers</subject>
<subj-group>
<subject>Gene Regulation and Transcriptomics</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Design of shortest double-stranded DNA sequences covering all
<italic>k</italic>
-mers with applications to protein-binding microarrays and synthetic enhancers</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Orenstein</surname>
<given-names>Yaron</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
<xref ref-type="corresp" rid="btt230-COR1">*</xref>
</contrib>
</contrib-group>
<aff>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel</aff>
<author-notes>
<corresp id="btt230-COR1">*To whom correspondence should be addressed.</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>1</day>
<month>7</month>
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>19</day>
<month>6</month>
<year>2013</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>19</day>
<month>6</month>
<year>2013</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>29</volume>
<issue>13</issue>
<fpage>i71</fpage>
<lpage>i79</lpage>
<permissions>
<copyright-statement>© The Author 2013. Published by Oxford University Press.</copyright-statement>
<copyright-year>2013</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0/">http://creativecommons.org/licenses/by-nc/3.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation:</bold>
Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure
<italic>in vitro</italic>
the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism’s genome allow us to measure
<italic>in vivo</italic>
the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all
<italic>k</italic>
-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of
<italic>k</italic>
that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is, therefore, to cover all
<italic>k</italic>
-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length
<inline-formula>
<inline-graphic xlink:href="btt230i1.jpg"></inline-graphic>
</inline-formula>
. However, as probes are double stranded, when a
<italic>k</italic>
-mer is included in a probe, its reverse complement
<italic>k</italic>
-mer is accounted for as well.</p>
<p>
<bold>Results:</bold>
Here, we show how to efficiently create a shortest possible sequence with the property that it contains each
<italic>k</italic>
-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as
<italic>k</italic>
increases resulting in a more efficient array, which allows covering more longer sequences; alternatively, additional sequences with redundant
<italic>k</italic>
-mers of interest can be added.</p>
<p>
<bold>Availability:</bold>
The software is freely available from our website
<ext-link ext-link-type="uri" xlink:href="http://acgt.cs.tau.ac.il/shortcake/">http://acgt.cs.tau.ac.il/shortcake/</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>rshamir@tau.ac.il</email>
</p>
</abstract>
<counts>
<page-count count="9"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="SEC1">
<title>1 INTRODUCTION</title>
<p>Gene regulation is a central focus of biological research. The main factors that regulate gene expression are transcription factors (TFs). These proteins bind to short DNA sequences, either in promoters or enhancers, and by that encourage or impede gene transcription. TFs bind to different DNA sequences with different affinity and specificity. Understanding TF-binding specificity and its effect on gene expression and the final phenotype is a fundamental goal in the study of gene regulation.</p>
<p>Recent technologies measure the binding intensity of a TF to many DNA sequences [e.g. protein-binding microarray (PBM) (
<xref ref-type="bibr" rid="btt230-B1">Berger
<italic>et al.</italic>
, 2006</xref>
) and MITOMI [
<xref ref-type="bibr" rid="btt230-B5">Fordyce
<italic>et al.</italic>
, 2010</xref>
)]. These technologies synthesize a large set of DNA sequences and measure the binding intensity of the TF to each of those sequences. Some technologies use random DNA sequences (
<xref ref-type="bibr" rid="btt230-B13">Nutiu
<italic>et al.</italic>
, 2011</xref>
). Others use sequences that cover all possible DNA
<italic>k</italic>
-mers, as they provide a complete picture of the binding spectrum (
<xref ref-type="bibr" rid="btt230-B1">Berger
<italic>et al.</italic>
, 2006</xref>
;
<xref ref-type="bibr" rid="btt230-B5">Fordyce
<italic>et al.</italic>
, 2010</xref>
). A similar approach was also used to test binding
<italic>in vivo</italic>
. A recent study used synthesized enhancer oligomers designed to cover all 6mers to test their effect on limb formation in zebrafish (
<xref ref-type="bibr" rid="btt230-B17">Smith and Ahituv, 2012</xref>
).</p>
<p>
<italic>De Bruijn sequences</italic>
are the most compact sequences that cover all
<italic>k</italic>
-mers (
<xref ref-type="bibr" rid="btt230-B1">Berger
<italic>et al.</italic>
, 2006</xref>
;
<xref ref-type="bibr" rid="btt230-B5">Fordyce
<italic>et al.</italic>
, 2010</xref>
). The length of a de Bruijn sequence of order
<italic>k</italic>
over alphabet
<inline-formula>
<inline-graphic xlink:href="btt230i2.jpg"></inline-graphic>
</inline-formula>
is
<inline-formula>
<inline-graphic xlink:href="btt230i3.jpg"></inline-graphic>
</inline-formula>
, where the DNA alphabet is
<inline-formula>
<inline-graphic xlink:href="btt230i4.jpg"></inline-graphic>
</inline-formula>
. Because of the exponential dependency on
<italic>k</italic>
and small space on the experimental device, these technologies are limited to a small value of
<italic>k</italic>
. The most popular technology, PBM, was used in hundreds of experiments to date using arrays with
<italic>k</italic>
= 10. To create
<italic>p</italic>
-long probe sequences, the sequence is split into intervals of length
<italic>p</italic>
with
<italic>k</italic>
− 1 overlap (
<italic>p</italic>
= 36 is used in PBMs).</p>
<p>Despite the universal and high-throughput nature of these technologies, the data produced are still limited. For many TFs, binding depends on >10 DNA positions, usually with six to eight core positions and additional side positions that have a significant contribution (
<xref ref-type="bibr" rid="btt230-B13">Nutiu
<italic>et al.</italic>
, 2011</xref>
;
<xref ref-type="bibr" rid="btt230-B14">Orenstein
<italic>et al.</italic>
, 2013</xref>
). A recent study from the Taipale Laboratory using HT-Selex showed that many TFs have longer motifs that are not covered well by an all 10mer array (
<xref ref-type="bibr" rid="btt230-B6">Jolma
<italic>et al.</italic>
, 2013</xref>
). The RankMotif++ algorithm for PBM data also generates motifs of length
<inline-formula>
<inline-graphic xlink:href="btt230i5.jpg"></inline-graphic>
</inline-formula>
in most cases (
<xref ref-type="bibr" rid="btt230-B2">Chen
<italic>et al.</italic>
, 2007</xref>
). Covering all
<italic>k</italic>
-mers for a greater value of
<italic>k</italic>
will lead to improved understanding of TF binding.</p>
<p>As the probes are double-stranded DNA segments, one can save by using the reverse complementarity of DNA: whenever a
<italic>k</italic>
-mer is included, its reverse complement is included as well, and there is no need to cover it again. This brings up the following question: a sequence
<italic>S</italic>
is called a
<italic>reverse complementary complete sequence</italic>
of order
<italic>k</italic>
(RC complete sequence for short) if for each
<italic>k</italic>
-mer either the
<italic>k</italic>
-mer or its reverse complement are included in
<italic>S</italic>
. Can we construct an optimal (minimum length) RC complete sequence? Theoretically, if for each
<italic>k</italic>
-mer
<italic>T</italic>
the sequence
<italic>S</italic>
includes either
<italic>T</italic>
or its reverse complement but not both, one could save a factor of nearly 2 compared with the length of a de Bruijn sequence.</p>
<p>
<xref ref-type="bibr" rid="btt230-B12">Ministeris and Eisen (2006)</xref>
and
<xref ref-type="bibr" rid="btt230-B15">Philippakis
<italic>et al.</italic>
(2008)</xref>
proposed the use of (regular) de Bruijn sequences for designing probes for PBMs. Philippakis
<italic>et al.</italic>
used linear feedback shift registers to generate a de Bruijn sequence with good coverage of gapped
<italic>k</italic>
-mers. This approach was used for constructing two microarrays that are in use today with
<italic>k</italic>
= 10 (
<xref ref-type="bibr" rid="btt230-B1">Berger
<italic>et al.</italic>
, 2006</xref>
). The idea of exploiting reverse complementarity was raised by
<xref ref-type="bibr" rid="btt230-B12">Ministeris and Eisen (2006)</xref>
, who sketched an algorithm for it without proof. In fact, as we shall show, the algorithm of
<xref ref-type="bibr" rid="btt230-B12">Mintseris and Eisen (2006)</xref>
does not provide an optimal solution for even values of
<italic>k</italic>
. In the context of sequence assembly, Medvedev
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="btt230-B10">Medvedev and Brudno, 2009</xref>
;
<xref ref-type="bibr" rid="btt230-B11">Medvedev
<italic>et al.</italic>
, 2007</xref>
) solved the problem of constructing a minimum length sequence that covers a given set of
<italic>k</italic>
-mers, using reverse complementarity. Although their algorithm can be applied to solve the problem raised in this study, they do not address it directly. When applied to our problem, their algorithm requires
<inline-formula>
<inline-graphic xlink:href="btt230i6.jpg"></inline-graphic>
</inline-formula>
time. As we shall see, our algorithm is much faster.</p>
<p>In this study, we address the problem of constructing an optimal RC complete sequence. We first give a lower bound for the length of such a sequence. We prove that for odd
<italic>k</italic>
, there exists a sequence that achieves the lower bound and show how to construct it in time complexity that is linear in the output sequence length. For odd
<italic>k</italic>
, the algorithm constructs two tours that are reverse complementary to each other and together cover all edges of the de Bruijn graph and is identical to
<xref ref-type="bibr" rid="btt230-B12">Mintseris and Eisen (2006)</xref>
. Then, we show how to adjust the algorithm to handle the case of even
<italic>k</italic>
, achieving a saving factor approaching 2 as
<italic>k</italic>
increases. We give two solutions: a simple near-optimal one requiring linear time and a more complex (
<inline-formula>
<inline-graphic xlink:href="btt230i7.jpg"></inline-graphic>
</inline-formula>
time) solution that guarantees optimality of the resulting sequence. In particular, this implies that the lower bound is not tight for even
<italic>k</italic>
. We implemented the algorithm and we demonstrate the saving it achieves. The produced sequences are nearly half the length compared with a regular de Bruijn sequence.</p>
<p>The article is organized as follows. We first provide formal definitions and preliminaries. We then present a lower bound for the length of an optimal sequence based on
<italic>k</italic>
-mer counts. Then, we present an algorithm that works in linear time on the de Bruijn graph and prove that it solves the problem for odd
<italic>k</italic>
. We conclude by describing the two possible solutions for even
<italic>k</italic>
and report on experimental results with all the algorithms.</p>
</sec>
<sec id="SEC2">
<title>2 PRELIMINARIES</title>
<p>We start with some basic definitions of graphs and sequences. For more details see, e.g.
<xref ref-type="bibr" rid="btt230-B18">West
<italic>et al.</italic>
(2001)</xref>
.</p>
<p>A
<italic>directed graph</italic>
(digraph or simply a graph)
<inline-formula>
<inline-graphic xlink:href="btt230i8.jpg"></inline-graphic>
</inline-formula>
is a set of vertices
<inline-formula>
<inline-graphic xlink:href="btt230i9.jpg"></inline-graphic>
</inline-formula>
and a set of edges
<inline-formula>
<inline-graphic xlink:href="btt230i10.jpg"></inline-graphic>
</inline-formula>
. Each edge is an ordered pair of vertices
<inline-formula>
<inline-graphic xlink:href="btt230i11.jpg"></inline-graphic>
</inline-formula>
, and we say the edge is directed from
<italic>v
<sub>i</sub>
</italic>
to
<italic>v
<sub>j</sub>
</italic>
. The
<italic>indegree</italic>
of vertex
<italic>v</italic>
is the number of edges entering
<italic>v</italic>
. Similarly, the
<italic>outdegree</italic>
is the number of edges outgoing from
<italic>v</italic>
. A vertex is
<italic>balanced</italic>
if its indegree equals its outdegree. A
<italic>path</italic>
in a digraph is a sequence of vertices,
<inline-formula>
<inline-graphic xlink:href="btt230i12.jpg"></inline-graphic>
</inline-formula>
, such that for each
<inline-formula>
<inline-graphic xlink:href="btt230i13.jpg"></inline-graphic>
</inline-formula>
there is an edge
<inline-formula>
<inline-graphic xlink:href="btt230i14.jpg"></inline-graphic>
</inline-formula>
. A
<italic>cycle</italic>
is a path where
<inline-formula>
<inline-graphic xlink:href="btt230i15.jpg"></inline-graphic>
</inline-formula>
. A digraph is
<italic>strongly connected</italic>
if for every pair of vertices
<inline-formula>
<inline-graphic xlink:href="btt230i16.jpg"></inline-graphic>
</inline-formula>
there exists a path from
<italic>u</italic>
to
<italic>v</italic>
and a path from
<italic>v</italic>
to
<italic>u</italic>
.
<italic>A strongly connected component</italic>
in a digraph is a maximal set of vertices that induces a strongly connected subgraph.</p>
<p>An
<italic>Eulerian tour</italic>
through a digraph
<italic>G</italic>
is a cycle that traverses all edges in
<italic>G</italic>
, such that each edge is traversed exactly once. If a digraph contains an Eulerian tour, we call it
<italic>Eulerian</italic>
. A digraph is Eulerian if and only if it is strongly connected and all vertices are balanced (
<xref ref-type="bibr" rid="btt230-B18">West
<italic>et al.</italic>
, 2001</xref>
).</p>
<p>A
<italic>de Bruijn sequence</italic>
of order
<italic>k</italic>
over alphabet Σ is a minimum length sequence that covers each
<italic>k</italic>
-mer over Σ exactly once. For convenience, we define the
<italic>length</italic>
of the sequence as the number of
<italic>k</italic>
-mers in it. Hence, a sequence of length
<italic>t</italic>
contains
<inline-formula>
<inline-graphic xlink:href="btt230i17.jpg"></inline-graphic>
</inline-formula>
characters. A de Bruijn sequence has length
<inline-formula>
<inline-graphic xlink:href="btt230i18.jpg"></inline-graphic>
</inline-formula>
, which is the minimum possible for covering all
<italic>k</italic>
-mers.</p>
<p>Given sequences
<inline-formula>
<inline-graphic xlink:href="btt230i19.jpg"></inline-graphic>
</inline-formula>
over alphabet Σ, the
<italic>overlap</italic>
between
<italic>a</italic>
and
<italic>b</italic>
, denoted
<inline-formula>
<inline-graphic xlink:href="btt230i20.jpg"></inline-graphic>
</inline-formula>
, is the largest suffix of
<italic>a</italic>
that is also a prefix of
<italic>b</italic>
.</p>
<p>A
<italic>de Bruijn graph</italic>
of order
<italic>k</italic>
is a digraph in which for every possible
<italic>k</italic>
-mer
<inline-formula>
<inline-graphic xlink:href="btt230i21.jpg"></inline-graphic>
</inline-formula>
there is a vertex denoted by
<inline-formula>
<inline-graphic xlink:href="btt230i22.jpg"></inline-graphic>
</inline-formula>
. There is an edge from
<italic>u</italic>
to
<italic>v</italic>
if and only if
<inline-formula>
<inline-graphic xlink:href="btt230i23.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i24.jpg"></inline-graphic>
</inline-formula>
, that is,
<inline-formula>
<inline-graphic xlink:href="btt230i25.jpg"></inline-graphic>
</inline-formula>
. Each edge represents a unique
<inline-formula>
<inline-graphic xlink:href="btt230i26.jpg"></inline-graphic>
</inline-formula>
-mer. For example, the edge
<inline-formula>
<inline-graphic xlink:href="btt230i27.jpg"></inline-graphic>
</inline-formula>
above represents
<inline-formula>
<inline-graphic xlink:href="btt230i28.jpg"></inline-graphic>
</inline-formula>
. To distinguish vertices from edges, we will use square brackets for vertices. Hence,
<inline-formula>
<inline-graphic xlink:href="btt230i29.jpg"></inline-graphic>
</inline-formula>
is the edge between
<inline-formula>
<inline-graphic xlink:href="btt230i30.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i31.jpg"></inline-graphic>
</inline-formula>
. Obviously, for each vertex
<italic>v</italic>
the indegree and outdegree are
<inline-formula>
<inline-graphic xlink:href="btt230i32.jpg"></inline-graphic>
</inline-formula>
, and the graph is strongly connected. Thus, a de Bruijn graph is Eulerian. Any Eulerian tour represents a de Bruijn sequence of order
<italic>k</italic>
+ 1. Each edge and vertex in the graph is represented by
<inline-formula>
<inline-graphic xlink:href="btt230i33.jpg"></inline-graphic>
</inline-formula>
bits. Throughout the article, we assume this number of bits is contained in one computer word; hence, we deduce that it takes
<italic>O</italic>
(1) time to find an edge or a vertex.</p>
<p>A
<italic>complementarity</italic>
relation between characters is a symmetric non-reflexive one-to-one relation. The alphabet of DNA is
<inline-formula>
<inline-graphic xlink:href="btt230i34.jpg"></inline-graphic>
</inline-formula>
with the complementarity relation
<inline-formula>
<inline-graphic xlink:href="btt230i35.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i36.jpg"></inline-graphic>
</inline-formula>
. By symmetry also
<inline-formula>
<inline-graphic xlink:href="btt230i37.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i38.jpg"></inline-graphic>
</inline-formula>
. The
<italic>reverse complement</italic>
of sequence
<inline-formula>
<inline-graphic xlink:href="btt230i39.jpg"></inline-graphic>
</inline-formula>
, denoted
<inline-formula>
<inline-graphic xlink:href="btt230i40.jpg"></inline-graphic>
</inline-formula>
, is defined as the sequence obtained by reversing the original sequence and replacing each character by its complement, i.e.
<inline-formula>
<inline-graphic xlink:href="btt230i41.jpg"></inline-graphic>
</inline-formula>
. For example,
<inline-formula>
<inline-graphic xlink:href="btt230i42.jpg"></inline-graphic>
</inline-formula>
=
<italic>TTCG</italic>
. A sequence
<italic>s</italic>
is called a
<italic>palindromic reverse complementary</italic>
sequence or in short a
<italic>palindrome</italic>
, if
<inline-formula>
<inline-graphic xlink:href="btt230i43.jpg"></inline-graphic>
</inline-formula>
. For example,
<italic>ACGT</italic>
is a palindrome. We define a
<italic>reverse complementary complete sequence</italic>
of order
<italic>k</italic>
over alphabet Σ (RC complete sequence for short) as a sequence such that for each
<italic>k</italic>
-mer
<italic>s</italic>
, at least one of
<italic>s</italic>
and
<italic>RC</italic>
(
<italic>s</italic>
) are in the sequence. Note that unlike a regular de Bruijn sequence, the definition of an RC complete sequence does not require minimality. An RC complete sequence is
<italic>optimal</italic>
if it is of minimum length.</p>
</sec>
<sec id="SEC3">
<title>3 RESULTS</title>
<sec id="SEC3.1">
<title>3.1 A lower bound for the length of an RC complete sequence</title>
<p>First, we derive a lower bound for the length of an RC complete sequence from
<italic>k</italic>
-mer counts.
<statement>
<title>Proposition 1</title>
<p>Denote by
<inline-formula>
<inline-graphic xlink:href="btt230i44.jpg"></inline-graphic>
</inline-formula>
the length of an optimal RC complete sequence of order
<italic>k</italic>
.
<disp-formula id="btt230-M1">
<label>(1)</label>
<graphic xlink:href="btt230m1"></graphic>
</disp-formula>
</p>
</statement>
<statement>
<title>Proof</title>
<p>We consider separately the cases of odd and even
<italic>k</italic>
. For odd
<italic>k</italic>
, there are no palindromes, as the middle position in a
<italic>k</italic>
-mer differs from its reverse complement. Each
<italic>k</italic>
-mer must be represented in the sequence by itself or its reverse complement. Thus, a lower bound for the minimum length is half the number of unique
<italic>k</italic>
-mers, which is
<inline-formula>
<inline-graphic xlink:href="btt230i45.jpg"></inline-graphic>
</inline-formula>
. For even
<italic>k</italic>
, some
<italic>k</italic>
-mers are palindromes. For palindromes, the first
<inline-formula>
<inline-graphic xlink:href="btt230i46.jpg"></inline-graphic>
</inline-formula>
characters define the last
<inline-formula>
<inline-graphic xlink:href="btt230i47.jpg"></inline-graphic>
</inline-formula>
characters. Hence, there are exactly
<inline-formula>
<inline-graphic xlink:href="btt230i48.jpg"></inline-graphic>
</inline-formula>
different palindromes. All palindromes must appear at least once in any RC complete sequence, whereas for the non-palindromic
<italic>k</italic>
-mers, either they or their reverse complement must appear in the sequence. Thus, for even
<italic>k</italic>
,
<inline-formula>
<inline-graphic xlink:href="btt230i49.jpg"></inline-graphic>
</inline-formula>
. ▪</p>
<p>We shall show later that
<inline-formula>
<inline-graphic xlink:href="btt230i50.jpg"></inline-graphic>
</inline-formula>
is tight for odd
<italic>k</italic>
, but not for even
<italic>k</italic>
.</p>
</statement>
</p>
</sec>
<sec id="SEC3.2">
<title>3.2 Constructing an optimal RC complete sequence for odd
<italic>k</italic>
</title>
<p>In this section, we prove constructively that for odd
<italic>k</italic>
there exists an RC complete sequence that achieves the lower bound of Proposition 1 and is thus optimal. The proof modifies the Euler tour algorithm (
<xref ref-type="bibr" rid="btt230-B18">West
<italic>et al.</italic>
, 2001</xref>
). The modified algorithm was presented without proof in
<xref ref-type="bibr" rid="btt230-B12">Mintseris and Eisen (2006)</xref>
. The algorithm for generating the sequence will work on the de Bruijn graph of order
<italic>k</italic>
− 1. Every
<italic>k</italic>
-mer is represented in the graph as an edge, the graph is strongly connected and all vertices are balanced. As there are no palindromes of odd length, every edge has a unique reverse complement counterpart that is different from it. This defines a perfect matching
<italic>M</italic>
on the edges of the graph.</p>
<p>Given a directed path
<italic>F</italic>
in the graph, its
<italic>reverse complement path</italic>
is defined as the path
<italic>R</italic>
in which each edge
<inline-formula>
<inline-graphic xlink:href="btt230i51.jpg"></inline-graphic>
</inline-formula>
in
<italic>F</italic>
is replaced by the edge
<inline-formula>
<inline-graphic xlink:href="btt230i52.jpg"></inline-graphic>
</inline-formula>
. For example, for the path
<inline-formula>
<inline-graphic xlink:href="btt230i53.jpg"></inline-graphic>
</inline-formula>
, its reverse complement is
<inline-formula>
<inline-graphic xlink:href="btt230i54.jpg"></inline-graphic>
</inline-formula>
(
<xref ref-type="fig" rid="btt230-F1">Fig. 1</xref>
). We will refer to
<italic>F</italic>
and
<italic>R</italic>
as forward and reverse paths, respectively.
<fig id="btt230-F1" position="float">
<label>Fig. 1.</label>
<caption>
<p>An illustration of forward and reverse paths (top and bottom, respectively). The forward path traverses the edges in their direction. The corresponding reverse path traverses the reverse complementary edges in reverse direction</p>
</caption>
<graphic xlink:href="btt230f1p"></graphic>
</fig>
</p>
<p>The following theorem provides a necessary and sufficient condition for the existence of an RC complete sequence that achieves the lower bound.
<statement>
<title>Theorem 1</title>
<p>For odd k, an RC complete sequence s achieves the lower bound (Proposition 1) if there exist two edge-disjoint paths with no repeating edges, corresponding to s and RC(
<italic>s</italic>
), that together cover all edges of the de Bruijn graph of order
<italic>k</italic>
− 1.</p>
</statement>
<statement>
<title>Proof</title>
<p>
<inline-formula>
<inline-graphic xlink:href="btt230i55.jpg"></inline-graphic>
</inline-formula>
Observe that the lower bound assumes one occurrence of either
<italic>w</italic>
or
<italic>RC</italic>
(
<italic>w</italic>
) but not both in the sequence for each
<italic>k</italic>
-mer
<italic>w</italic>
. Assume an RC complete sequence
<inline-formula>
<inline-graphic xlink:href="btt230i56.jpg"></inline-graphic>
</inline-formula>
achieves the lower bound. Then, because of its minimality, it contains no repeating
<italic>k</italic>
-mers; therefore, it must correspond to a path
<italic>F</italic>
in the de Bruijn graph with no repeating edges. The ordered set of
<italic>k</italic>
-mers in
<inline-formula>
<inline-graphic xlink:href="btt230i57.jpg"></inline-graphic>
</inline-formula>
corresponds to consecutive edges in
<italic>F</italic>
. Note that the reverse complement sequence
<inline-formula>
<inline-graphic xlink:href="btt230i58.jpg"></inline-graphic>
</inline-formula>
is also a path
<italic>R</italic>
in the graph: the
<italic>k</italic>
-mers in
<italic>R</italic>
are the reverse complement of those in
<italic>F</italic>
; therefore, consecutive edges form a path in the graph traversed in reverse order. As each
<italic>k</italic>
-mer or its reverse complement is covered in
<inline-formula>
<inline-graphic xlink:href="btt230i59.jpg"></inline-graphic>
</inline-formula>
, it is also true that each
<italic>k</italic>
-mer or its reverse complement is covered by
<inline-formula>
<inline-graphic xlink:href="btt230i60.jpg"></inline-graphic>
</inline-formula>
, and the two paths
<italic>F</italic>
and
<italic>R</italic>
, corresponding to the two sequences, together cover all edges.</p>
<p>
<inline-formula>
<inline-graphic xlink:href="btt230i61.jpg"></inline-graphic>
</inline-formula>
Suppose there are two edge-disjoint paths
<italic>F</italic>
and
<italic>R</italic>
with no repeated edges that together cover all edges. As they are reverse complement of each other, and together cover all edges, for each
<italic>k</italic>
-mer
<italic>w</italic>
, the sequence
<italic>s</italic>
(corresponding to path
<italic>F</italic>
) must contain either
<italic>w</italic>
or
<italic>RC</italic>
(
<italic>w</italic>
) (otherwise, some edges would have been uncovered). Hence,
<italic>s</italic>
is an RC complete sequence. The same argument holds for
<italic>RC</italic>
(
<italic>s</italic>
) (corresponding to path
<italic>R</italic>
). As each contains exactly half the edges, the length of each of them equals the lower bound ▪.</p>
</statement>
</p>
<p>Before presenting the algorithm for finding an optimal RC complete sequence, we remind the reader of the algorithm for finding an Eulerian cycle in a digraph (
<xref ref-type="bibr" rid="btt230-B4">Fleischner, 1990</xref>
). The algorithm starts from an arbitrary source vertex. Initially all edges are unmarked. It traverses a path of unmarked edges in arbitrary order. Each traversed edge is marked; therefore, no edge is traversed more than once. The algorithm also maintains a set
<italic>A</italic>
of the visited vertices that are still active, i.e. they have outgoing unmarked edges. When the last unmarked edge outgoing from a vertex is traversed, the vertex is removed from
<italic>A</italic>
. If the algorithm reaches a dead end, it starts another traversal from another vertex in
<italic>A</italic>
. A dead end can only be achieved when closing a cycle (i.e. returning to the source vertex), as in any other vertex there is always a free incoming edge and a free outgoing edge (as for every vertex except the source the unmarked outdegree and the unmarked indegree are equal). If not all edges have been traversed,
<italic>A</italic>
is not empty, and the process can start from a new source. In the end, as the graph is strongly connected and all cycles start from visited vertices (except for the initial vertex), the cycles can be joined to form one Eulerian cycle. The running time of the algorithm is linear in the number of vertices and edges.</p>
<p>Algorithm 1 finds an optimal RC complete sequence in a de Bruijn graph of order
<italic>k</italic>
− 1 when
<italic>k</italic>
is odd. The algorithm imitates the Euler path algorithm but maintains both a forward sequence and a reverse complement sequence simultaneously. The collection of cycles traversed so far is kept in
<inline-formula>
<inline-graphic xlink:href="btt230i62.jpg"></inline-graphic>
</inline-formula>
and the corresponding reverse complement cycles set is
<inline-formula>
<inline-graphic xlink:href="btt230i63.jpg"></inline-graphic>
</inline-formula>
.
<statement>
<title>Theorem 2</title>
<p>For odd k, Algorithm 1 returns forward and reverse paths that cover together all edges of the graph and represent two optimal RC complete sequences. The algorithm runs in
<inline-formula>
<inline-graphic xlink:href="btt230i78.jpg"></inline-graphic>
</inline-formula>
time.</p>
</statement>
<statement>
<title>Proof</title>
<p>We prove the theorem using several lemmas. We first show that if the forward path
<italic>F</italic>
reaches a dead end, then so does the reverse path
<italic>R</italic>
, and in that case, a cycle is closed (Lemma 1). Note that each pair
<italic>F</italic>
,
<italic>R</italic>
constructed in Steps 4–7 are reverse complementary paths by the way they are constructed. Then, we show that the cycles in
<inline-formula>
<inline-graphic xlink:href="btt230i79.jpg"></inline-graphic>
</inline-formula>
can be merged into one cycle (Lemma 2). Third, we deduce that a strongly connected component is covered by
<inline-formula>
<inline-graphic xlink:href="btt230i80.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i81.jpg"></inline-graphic>
</inline-formula>
(Lemma 3). Finally, we conclude that
<inline-formula>
<inline-graphic xlink:href="btt230i82.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i83.jpg"></inline-graphic>
</inline-formula>
cover all edges, as there is only one strongly connected component in any de Bruijn graph (Corollary 1). As each edge is traversed once, the paths are of length
<inline-formula>
<inline-graphic xlink:href="btt230i84.jpg"></inline-graphic>
</inline-formula>
and, hence, optimal.</p>
</statement>
<statement>
<title>Lemma 1</title>
<p>If the forward traversal reaches a dead end, then so does the reverse. Both paths close a cycle in this case.</p>
</statement>
<statement>
<title>Proof</title>
<p>Distinguish two cases in which the forward path reaches a dead end:</p>
</statement>
<statement>
<title>Case 1</title>
<p>
<italic>F</italic>
reaches a vertex
<italic>v</italic>
and
<italic>R</italic>
reaches a vertex
<inline-formula>
<inline-graphic xlink:href="btt230i85.jpg"></inline-graphic>
</inline-formula>
, and all outgoing edges from
<italic>v</italic>
were already traversed. We prove that in that case,
<italic>F</italic>
must close a cycle. Assume to the contrary that
<italic>F</italic>
contains no edge outgoing from
<italic>v</italic>
. In that case, all outgoing edges were traversed by
<italic>R</italic>
. Then, all incoming edges must have been traversed by
<italic>R</italic>
as well, as each time
<italic>R</italic>
reached
<italic>v</italic>
, it must have exited it as well. The only exception is if
<italic>v</italic>
is also the first (last added) vertex
<italic>u</italic>
in
<italic>R</italic>
, contradicting our assumption that
<inline-formula>
<inline-graphic xlink:href="btt230i86.jpg"></inline-graphic>
</inline-formula>
. Therefore, all incoming and outgoing edges were covered by
<italic>R</italic>
, contradicting the fact that
<italic>F</italic>
just entered
<italic>v</italic>
. We conclude that
<italic>F</italic>
has an edge outgoing from
<italic>v</italic>
and thus it closed a cycle.</p>
<p>Denote by
<inline-formula>
<inline-graphic xlink:href="btt230i87.jpg"></inline-graphic>
</inline-formula>
the last edge traversed by
<italic>F</italic>
. All edges of the form
<inline-formula>
<inline-graphic xlink:href="btt230i88.jpg"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="btt230i89.jpg"></inline-graphic>
</inline-formula>
, were traversed. Hence, the reverse edges of the form
<inline-formula>
<inline-graphic xlink:href="btt230i90.jpg"></inline-graphic>
</inline-formula>
were traversed as well. The last edge traversed by
<italic>R</italic>
was
<inline-formula>
<inline-graphic xlink:href="btt230i91.jpg"></inline-graphic>
</inline-formula>
, outgoing from the vertex
<inline-formula>
<inline-graphic xlink:href="btt230i92.jpg"></inline-graphic>
</inline-formula>
. All incoming edges to this vertex have already been traversed, as they are the reverse complements of the edges outgoing from
<italic>v</italic>
, which were traversed by
<italic>F</italic>
. Thus,
<italic>R</italic>
reaches a dead end as well.
<italic>R</italic>
closes a cycle because of a symmetrical argument to that made for
<italic>F</italic>
.</p>
</statement>
<statement>
<title>Case 2</title>
<p>
<italic>F</italic>
and
<italic>R</italic>
reach the same vertex
<italic>v</italic>
simultaneously. Denote the incoming edge used by
<italic>F</italic>
<inline-formula>
<inline-graphic xlink:href="btt230i93.jpg"></inline-graphic>
</inline-formula>
. Then, the reverse outgoing edge, which is traversed by
<italic>R</italic>
, is
<inline-formula>
<inline-graphic xlink:href="btt230i94.jpg"></inline-graphic>
</inline-formula>
. From the fact that both reach the vertex simultaneously, we get that
<inline-formula>
<inline-graphic xlink:href="btt230i95.jpg"></inline-graphic>
</inline-formula>
. Hence, in all previous traversals of this vertex
<italic>F</italic>
and
<italic>R</italic>
also reached the vertex simultaneously. Moreover, the forward and reverse paths reach a dead end together at
<italic>v</italic>
. Hence, all incoming and outgoing edges were already traversed, and they are all of the form
<inline-formula>
<inline-graphic xlink:href="btt230i96.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i97.jpg"></inline-graphic>
</inline-formula>
, for all
<inline-formula>
<inline-graphic xlink:href="btt230i98.jpg"></inline-graphic>
</inline-formula>
. Thus, both paths close a cycle ▪.</p>
</statement>
<statement>
<title>Lemma 2</title>
<p>The cycles in
<inline-formula>
<inline-graphic xlink:href="btt230i99.jpg"></inline-graphic>
</inline-formula>
can be merged into one cycle.</p>
</statement>
<statement>
<title>Proof</title>
<p>According to Lemma 1, when
<italic>F</italic>
is added to
<inline-formula>
<inline-graphic xlink:href="btt230i100.jpg"></inline-graphic>
</inline-formula>
, it is a cycle in the graph. Thus,
<inline-formula>
<inline-graphic xlink:href="btt230i101.jpg"></inline-graphic>
</inline-formula>
is a set of cycles. The first cycle starts from an arbitrary vertex, but all other cycles start from a vertex of another cycle in
<inline-formula>
<inline-graphic xlink:href="btt230i102.jpg"></inline-graphic>
</inline-formula>
(denote
<italic>encompassing</italic>
cycle). Thus, each inner cycle can be merged into its encompassing cycle, forming one merged cycle. This is true to all cycles, except for the initial cycle ▪.</p>
</statement>
<statement>
<title>Lemma 3</title>
<p>The merged cycle of
<inline-formula>
<inline-graphic xlink:href="btt230i103.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i104.jpg"></inline-graphic>
</inline-formula>
either cover two strongly connected components separately or one strongly connected component together.</p>
</statement>
<statement>
<title>Proof</title>
<p>Cycles are added to
<inline-formula>
<inline-graphic xlink:href="btt230i105.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i106.jpg"></inline-graphic>
</inline-formula>
as long as there are unmarked edges. If there are no shared vertices between
<inline-formula>
<inline-graphic xlink:href="btt230i107.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i108.jpg"></inline-graphic>
</inline-formula>
, then both sets cover edges of different components. As each set is added edges until all are traversed, they cover two strongly connected components separately. Else, there is at least one shared vertex; thus, they cover the same component. The component is strongly connected, as no edges are left to traverse ▪.</p>
</statement>
<statement>
<title>Corollary 1</title>
<p>
<inline-formula>
<inline-graphic xlink:href="btt230i109.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i110.jpg"></inline-graphic>
</inline-formula>
cover all edges of a de Bruijn graph.</p>
</statement>
<statement>
<title>Proof</title>
<p>Following Lemma 3, as there is only one strongly connected component in a de Bruijn graph,
<inline-formula>
<inline-graphic xlink:href="btt230i111.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i112.jpg"></inline-graphic>
</inline-formula>
cover it together ▪.</p>
<p>This completes the proof of Theorem 2 ▪.</p>
</statement>
<boxed-text id="btt230-INLINEBOX1" position="float">
<caption>
<title>
<bold>Algorithm 1</bold>
. Find forward and reverse paths that cover all edges in a de Bruijn graph
<inline-formula>
<inline-graphic xlink:href="btt230i64.jpg"></inline-graphic>
</inline-formula>
of even order
<italic>k</italic>
− 1.</title>
</caption>
<p>
<list list-type="order">
<list-item>
<p>Initially all edges are unmarked,
<inline-formula>
<inline-graphic xlink:href="btt230i65.jpg"></inline-graphic>
</inline-formula>
, and
<inline-formula>
<inline-graphic xlink:href="btt230i66.jpg"></inline-graphic>
</inline-formula>
, an arbitrary vertex.</p>
</list-item>
<list-item>
<p>Although
<inline-formula>
<inline-graphic xlink:href="btt230i67.jpg"></inline-graphic>
</inline-formula>
do</p>
</list-item>
<list-item>
<p>
<inline-formula>
<inline-graphic xlink:href="btt230i68.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p> Pick any starting vertex
<inline-formula>
<inline-graphic xlink:href="btt230i69.jpg"></inline-graphic>
</inline-formula>
from
<italic>A</italic>
.</p>
</list-item>
<list-item>
<p> Although there exists an unmarked edge
<inline-formula>
<inline-graphic xlink:href="btt230i70.jpg"></inline-graphic>
</inline-formula>
outgoing from
<italic>v</italic>
do</p>
</list-item>
<list-item>
<p>  Append
<italic>e</italic>
to
<italic>F</italic>
. Prepend
<italic>RC</italic>
(
<italic>e</italic>
) to
<italic>R</italic>
.</p>
</list-item>
<list-item>
<p>  Mark
<italic>e</italic>
and
<italic>RC</italic>
(
<italic>e</italic>
).</p>
</list-item>
<list-item>
<p>  Set
<inline-formula>
<inline-graphic xlink:href="btt230i71.jpg"></inline-graphic>
</inline-formula>
;
<inline-formula>
<inline-graphic xlink:href="btt230i72.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p> Remove
<italic>v</italic>
from
<italic>A</italic>
.</p>
</list-item>
<list-item>
<p>If
<inline-formula>
<inline-graphic xlink:href="btt230i73.jpg"></inline-graphic>
</inline-formula>
, add
<italic>F</italic>
to
<inline-formula>
<inline-graphic xlink:href="btt230i74.jpg"></inline-graphic>
</inline-formula>
; add
<italic>R</italic>
to
<inline-formula>
<inline-graphic xlink:href="btt230i75.jpg"></inline-graphic>
</inline-formula>
;</p>
</list-item>
<list-item>
<p>Merge the cycles in
<inline-formula>
<inline-graphic xlink:href="btt230i76.jpg"></inline-graphic>
</inline-formula>
to obtain a single forward path.</p>
</list-item>
</list>
</p>
<p>   Do the same for
<inline-formula>
<inline-graphic xlink:href="btt230i77.jpg"></inline-graphic>
</inline-formula>
.</p>
</boxed-text>
</p>
</sec>
<sec id="SEC3.3">
<title>3.3 Two solutions for the case of even
<bold>
<italic>k</italic>
</bold>
</title>
<p>Algorithm 1 cannot be applied when
<italic>k</italic>
is even. A palindrome is represented by one edge in the de Bruijn graph (like any other
<italic>k</italic>
-mer). The algorithm must traverse both an edge and its reverse complement edge on the forward and reverse paths; therefore, for a palindromic edge, both paths should use the same edge, which is impossible.</p>
<p>One possible way to rectify the problem is by adding one more copy of each palindromic edge to the de Bruijn graph. Note that in the resulting (multi-) graph, the number of edges is exactly twice the lower bound. Adding the parallel edges would solve the problem discussed earlier in the text, but it will make some vertices unbalanced; therefore, the resulting graph is not Eulerian. Such a graph cannot be represented as a union of two reverse complementary edge-disjoint paths.</p>
<p>A more aggressive augmentation that overcomes this difficulty is adding a
<italic>cycle</italic>
for every palindromic edge. This would preserve the balance of all vertices and the strong connectivity as well. If, in addition, the added non-palindromic edges have a perfect matching between reverse complementary edges, the algorithm can be applied.</p>
<p>We present two possible augmentations. One is simple, based on the ideas aforementioned, and near-optimal; the other is optimal but requires a more complex augmentation.</p>
<sec id="SEC3.3.1">
<title>3.3.1 A simple near-optimal augmentation</title>
<p>In this approach, for each palindromic edge, we add to the de Bruijn graph all possible cyclic shifts of it. More formally, let
<inline-formula>
<inline-graphic xlink:href="btt230i113.jpg"></inline-graphic>
</inline-formula>
. For the palindrome
<inline-formula>
<inline-graphic xlink:href="btt230i114.jpg"></inline-graphic>
</inline-formula>
, we add k edges corresponding to all possible cyclic shifts of e. Obviously, as these edges form a cycle, all vertices remain balanced. In fact, this cycle contains two edges that are palindromes,
<inline-formula>
<inline-graphic xlink:href="btt230i115.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i116.jpg"></inline-graphic>
</inline-formula>
; therefore, only one cycle is added for both, and the cycle doubles both palindromic edges. It is easy to see that the remaining
<inline-formula>
<inline-graphic xlink:href="btt230i117.jpg"></inline-graphic>
</inline-formula>
edges are in fact l − 1 matching pairs of reverse complementary edges. For each edge that represents the cyclic shift starting at position i, for
<inline-formula>
<inline-graphic xlink:href="btt230i118.jpg"></inline-graphic>
</inline-formula>
, the matching edge starts at
<inline-formula>
<inline-graphic xlink:href="btt230i119.jpg"></inline-graphic>
</inline-formula>
. Hence, a perfect matching exists after adding the new cycles. In total, during the edge augmentation process, for each pair of palindromic
<italic>k</italic>
-mers, we add k edges. For example, for the palindromes ACGT and GTAC, we add ACGT, CGTA, GTAC and TACG (
<xref ref-type="fig" rid="btt230-F2">Fig. 2</xref>
). The added edges CGTA and TACG match each other. The added palindromes match the original edges in the graph. The resulting augmented graph contains
<inline-formula>
<inline-graphic xlink:href="btt230i120.jpg"></inline-graphic>
</inline-formula>
edges, where the first term is the number of edges in the original de Bruijn graph, and the second is k for each pair of palindromes.
<fig id="btt230-F2" position="float">
<label>Fig. 2.</label>
<caption>
<p>A cycle and edge matching. For the pair of palindromes ACGT and GTAC, all cyclic shifts of these palindromes are added once (dashed edges). In the matching, palindromic edges in the original cycle are paired with their added copies (encircled by small red ovals). Other non-palindromic added edges are paired (encircled by a large red oval)</p>
</caption>
<graphic xlink:href="btt230f2p"></graphic>
</fig>
</p>
<p>In some cases, the number of added edges can be reduced. If the palindrome
<inline-formula>
<inline-graphic xlink:href="btt230i121.jpg"></inline-graphic>
</inline-formula>
is periodic, then the number of cyclic shifts needed to return to the original
<italic>k</italic>
-mer is the length of the period. For example, the period of (ATAT … T) and (TATA … A) is 2. Only two edges suffice in this case, the edges (ATAT … T) and (TATA … A). This also applies to (CGCG … G) and (GCGC … C). Therefore, each two periodic palindromes that are a cyclic shift of each other require an addition of a number of edges equal to the length of their period. Hence, a smaller augmented graph and a shorter RC complete sequence can be obtained by considering the different possible periods, which can only be of even length, as each period is a palindrome.</p>
<p>Denote by
<inline-formula>
<inline-graphic xlink:href="btt230i122.jpg"></inline-graphic>
</inline-formula>
the set of even integers that divide
<italic>k</italic>
, and by
<inline-formula>
<inline-graphic xlink:href="btt230i123.jpg"></inline-graphic>
</inline-formula>
the exact number of additional edges.
<statement>
<title>Theorem 3</title>
<p>
<disp-formula id="btt230-M2">
<label>(2)</label>
<graphic xlink:href="btt230m2"></graphic>
</disp-formula>
</p>
</statement>
<statement>
<title>Proof</title>
<p>All
<italic>k</italic>
-mer palindromes are divided to pairs, which are cyclic shifts of each other. For each pair, all distinct cyclic shifts are added. The number of shifts is equal to the length of the period of the
<italic>k</italic>
-mer. The periods can only be even, as the periodic sequences are palindromes by themselves. The number of
<italic>i</italic>
-periodic palindromes is
<inline-formula>
<inline-graphic xlink:href="btt230i124.jpg"></inline-graphic>
</inline-formula>
. These contain shorter periods, for which edges have already been counted. Thus,
<inline-formula>
<inline-graphic xlink:href="btt230i125.jpg"></inline-graphic>
</inline-formula>
is subtracted, where
<italic>j</italic>
is the maximum even integer that divides
<inline-formula>
<inline-graphic xlink:href="btt230i126.jpg"></inline-graphic>
</inline-formula>
. The number of edges added for each pair of
<italic>i</italic>
-periodic palindromes is
<italic>i</italic>
▪.</p>
</statement>
<statement>
<title>Theorem 4</title>
<p>Running Algorithm 1 on the augmented graph produces forward and reverse paths that together cover all edges of the graph and represent two RC complete sequences.</p>
</statement>
<statement>
<title>Proof</title>
<p>Algorithm 1 can be run on graphs that satisfy the following properties:
<list list-type="roman-lower">
<list-item>
<p>The graph is strongly connected.</p>
</list-item>
<list-item>
<p>All vertices are balanced.</p>
</list-item>
<list-item>
<p>There exists a perfect matching of the edges, such that each pair of edges represent a
<italic>k</italic>
-mer and its reverse complement.</p>
</list-item>
</list>
</p>
</statement>
</p>
<p>The original de Bruijn graph of order
<italic>k</italic>
satisfies (1) and (2), and there exists a perfect matching for all non-palindromic
<italic>k</italic>
-mers in it. Added edges cannot disturb the connectivity. The addition of cycles preserves the balance. Each added palindromic
<italic>k</italic>
-mer matched the edge representing the same
<italic>k</italic>
-mer in the original graph. As discussed earlier in the text, the added non-palindromic edges form a perfect matching. Thus, Algorithm 1 can be run on the augmented graph. According to Theorem 2, it produces a forward and reverse path that together covers all edges of the augmented graph.</p>
<p>Each
<italic>k</italic>
-mer is represented in the augmented graph as an edge. All edges are covered together by the forward and reverse paths. For each path and for each
<italic>k</italic>
-mer, either it or its reverse complement is covered by the path. Thus, the paths represent RC complete sequences ▪.</p>
<p>Algorithm 1 produces two sequences, forward and reverse, each of which is an RC complete sequence (
<xref ref-type="fig" rid="btt230-F3">Fig. 3</xref>
). The length of the produced sequences is the number of edges divided by two. For each pair of palindromic edges, at most
<italic>k</italic>
edges were added, and by Theorem 3 exactly
<inline-formula>
<inline-graphic xlink:href="btt230i127.jpg"></inline-graphic>
</inline-formula>
edges were added in total. Hence, the length of the sequence is
<inline-formula>
<inline-graphic xlink:href="btt230i128.jpg"></inline-graphic>
</inline-formula>
, which is bounded by
<inline-formula>
<inline-graphic xlink:href="btt230i129.jpg"></inline-graphic>
</inline-formula>
. This is an addition of
<inline-formula>
<inline-graphic xlink:href="btt230i130.jpg"></inline-graphic>
</inline-formula>
characters, where
<italic>L</italic>
denotes the lower bound in Proposition 1 for an RC complete sequence of even order
<italic>k</italic>
.
<fig id="btt230-F3" position="float">
<label>Fig. 3.</label>
<caption>
<p>An augmented de Bruijn graph of order 1 and an example of forward and reverse paths in it. The dashed edges are added edges. The blue and brown paths represent the forward and reverse paths, respectively. Numbers on edges are the order of the edges in the forward path. The sequences are ACCGAATGCT and AGCATTCGGT for forward and reverse paths, respectively</p>
</caption>
<graphic xlink:href="btt230f3p"></graphic>
</fig>
</p>
</sec>
<sec id="SEC3.3.2">
<title>3.3.2 An optimal augmentation</title>
<p>We now present another augmentation that has higher time complexity but leads to an optimal RC complete sequence. As before, starting from the de Bruijn graph
<inline-formula>
<inline-graphic xlink:href="btt230i131.jpg"></inline-graphic>
</inline-formula>
, all palindromic edges are doubled, resulting in a graph
<inline-formula>
<inline-graphic xlink:href="btt230i132.jpg"></inline-graphic>
</inline-formula>
. We temporarily disregard the reverse complementarity matching constraints. As a result of the edge doubling, there are unbalanced vertices in
<inline-formula>
<inline-graphic xlink:href="btt230i133.jpg"></inline-graphic>
</inline-formula>
. We rectify this by adding short paths between unbalanced vertices. By adding paths of minimum total length, we will obtain a third graph
<inline-formula>
<inline-graphic xlink:href="btt230i134.jpg"></inline-graphic>
</inline-formula>
in which all degrees are balanced and it has minimum number of edges. Finding an optimal set of edges
<inline-formula>
<inline-graphic xlink:href="btt230i135.jpg"></inline-graphic>
</inline-formula>
can be done by solving a maximum weight-matching problem on a related graph. In fact the problem is equivalent to the Chinese postman problem (
<xref ref-type="bibr" rid="btt230-B3">Edmonds and Johnson, 1973</xref>
) [the Chinese postman problem is used in
<xref ref-type="bibr" rid="btt230-B10">Medvedev and Brudno (2009)</xref>
and
<xref ref-type="bibr" rid="btt230-B11">Medvedev
<italic>et al.</italic>
(2007)</xref>
and is also mentioned in
<xref ref-type="bibr" rid="btt230-B12">Mintseris and Eisen (2006)</xref>
as a solution on the original de Bruijn graph]. We shall later show that G
<sup>2</sup>
can be modified to satisfy the reverse complementarity matching requirement without losing optimality. Hence, applying Algorithm 1 on it will produce an optimal RC complete sequence.</p>
<p>Finding an optimal set of edges
<inline-formula>
<inline-graphic xlink:href="btt230i136.jpg"></inline-graphic>
</inline-formula>
is done by solving a maximum weight-matching problem in a bipartite graph, where vertices with greater indegree than outdegree constitute one part, and the vertices with greater outdegree than indegree are the other. The edge weights are
<italic>k</italic>
minus the number of characters on the path from one vertex to the other. More formally, let
<inline-formula>
<inline-graphic xlink:href="btt230i137.jpg"></inline-graphic>
</inline-formula>
(
<inline-formula>
<inline-graphic xlink:href="btt230i138.jpg"></inline-graphic>
</inline-formula>
) be the set of vertices with indegree greater (smaller) than outdegree in
<inline-formula>
<inline-graphic xlink:href="btt230i139.jpg"></inline-graphic>
</inline-formula>
. For
<inline-formula>
<inline-graphic xlink:href="btt230i140.jpg"></inline-graphic>
</inline-formula>
, there are
<inline-formula>
<inline-graphic xlink:href="btt230i141.jpg"></inline-graphic>
</inline-formula>
vertices in
<inline-formula>
<inline-graphic xlink:href="btt230i142.jpg"></inline-graphic>
</inline-formula>
of the form
<inline-formula>
<inline-graphic xlink:href="btt230i143.jpg"></inline-graphic>
</inline-formula>
and the same number of vertices in
<inline-formula>
<inline-graphic xlink:href="btt230i144.jpg"></inline-graphic>
</inline-formula>
of the form
<inline-formula>
<inline-graphic xlink:href="btt230i145.jpg"></inline-graphic>
</inline-formula>
[note that
<inline-formula>
<inline-graphic xlink:href="btt230i146.jpg"></inline-graphic>
</inline-formula>
palindromes of period 2 are already balanced (e.g. ATA … T)]. We define a complete bipartite graph
<inline-formula>
<inline-graphic xlink:href="btt230i147.jpg"></inline-graphic>
</inline-formula>
, where the weight of edge
<inline-formula>
<inline-graphic xlink:href="btt230i148.jpg"></inline-graphic>
</inline-formula>
is the maximum overlap between the suffix of
<italic>u</italic>
and the prefix of
<italic>v</italic>
(i.e.
<inline-formula>
<inline-graphic xlink:href="btt230i149.jpg"></inline-graphic>
</inline-formula>
). The length of the shortest path
<inline-formula>
<inline-graphic xlink:href="btt230i150.jpg"></inline-graphic>
</inline-formula>
between
<italic>u</italic>
and
<italic>v</italic>
is
<inline-formula>
<inline-graphic xlink:href="btt230i151.jpg"></inline-graphic>
</inline-formula>
(
<xref ref-type="fig" rid="btt230-F4">Fig. 4</xref>
). We are looking for a maximum weight matching in
<italic>H</italic>
. The procedure is summarized in Algorithm 2, Steps 1–5.
<boxed-text position="float">
<caption>
<title>
<bold>Algorithm 2.</bold>
Find an optimal augmentation for a de Bruijn graph
<inline-formula>
<inline-graphic xlink:href="btt230i153.jpg"></inline-graphic>
</inline-formula>
of odd order.</title>
</caption>
<p>1. Add to
<italic>G</italic>
the set
<inline-formula>
<inline-graphic xlink:href="btt230i154.jpg"></inline-graphic>
</inline-formula>
of palindromic edges.</p>
<p> The resulting (multi-)graph is
<inline-formula>
<inline-graphic xlink:href="btt230i155.jpg"></inline-graphic>
</inline-formula>
.</p>
<p>2. Define
<inline-formula>
<inline-graphic xlink:href="btt230i156.jpg"></inline-graphic>
</inline-formula>
<disp-formula>
<graphic xlink:href="btt230um1"></graphic>
</disp-formula>
</p>
<p>3. Define a complete bipartite graph
<inline-formula>
<inline-graphic xlink:href="btt230i157.jpg"></inline-graphic>
</inline-formula>
</p>
<p> with edge weights
<inline-formula>
<inline-graphic xlink:href="btt230i158.jpg"></inline-graphic>
</inline-formula>
.</p>
<p>4. Find a maximum weight-matching
<italic>M</italic>
in
<italic>H</italic>
.</p>
<p>5. Define
<inline-formula>
<inline-graphic xlink:href="btt230i159.jpg"></inline-graphic>
</inline-formula>
</p>
<p> where
<inline-formula>
<inline-graphic xlink:href="btt230i160.jpg"></inline-graphic>
</inline-formula>
.</p>
<p>6. Modify
<italic>M</italic>
, so that each cycle in the graph
<inline-formula>
<inline-graphic xlink:href="btt230i161.jpg"></inline-graphic>
</inline-formula>
</p>
<p> contains exactly two palindromic edges (Lemma 6).</p>
</boxed-text>
</p>
<p>The graph
<italic>G</italic>
<sup>2</sup>
produced in Step 5 of Algorithm 2 is strongly connected with all vertices balanced, but it is not guaranteed to satisfy the third property of Theorem 4, i.e. having a perfect matching among reverse complementary edges, which is needed to apply Algorithm 1. We now prove that it can be modified to satisfy this property without losing optimality. In fact, as
<inline-formula>
<inline-graphic xlink:href="btt230i162.jpg"></inline-graphic>
</inline-formula>
has a perfect matching, we only need to prove this property on the added edges
<inline-formula>
<inline-graphic xlink:href="btt230i163.jpg"></inline-graphic>
</inline-formula>
. Once this is done, Algorithm 1 can be applied to produce two reverse complementary paths that cover all edges.
<fig id="btt230-F4" position="float">
<label>Fig. 4.</label>
<caption>
<p>The bipartite graph for matching unbalanced vertices (Algorithm 2). On the top are the vertices with greater indegree, and on the bottom are the vertices with greater outdegree. Weights on the edges are the maximum overlap between the vertices’ sequences. Only the edges out of one vertex are drawn (the graph is a complete bipartite graph). Note that only unbalanced vertices corresponding to
<inline-formula>
<inline-graphic xlink:href="btt230i152.jpg"></inline-graphic>
</inline-formula>
-long prefixes and suffixes of palindromes are included</p>
</caption>
<graphic xlink:href="btt230f4p"></graphic>
</fig>
<fig id="btt230-F5" position="float">
<label>Fig. 5.</label>
<caption>
<p>Breaking down cycles with more than two palindromes. Left: Palindrome overlaps in a cycle found by the maximum matching. The rectangles at the ends indicate overlap between contiguous palindromes. Right: Partition into two cycles, one containing only the palindromes
<italic>X</italic>
and
<italic>Y</italic>
with a maximum overlap
<italic>y</italic>
. As
<inline-formula>
<inline-graphic xlink:href="btt230i180.jpg"></inline-graphic>
</inline-formula>
, the partition does not decrease the total contribution of the cycles to the weighted matching (Lemma 6)</p>
</caption>
<graphic xlink:href="btt230f5p"></graphic>
</fig>
</p>
<p>To establish Algorithm 2, we prove several lemmas:
<statement>
<title>Lemma 4</title>
<p>The shortest path from palindrome A to the palindrome B is the reverse complementary of the shortest path from B to A.</p>
</statement>
<statement>
<title>Proof</title>
<p>Denote
<inline-formula>
<inline-graphic xlink:href="btt230i164.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i165.jpg"></inline-graphic>
</inline-formula>
two palindromes. Let
<inline-formula>
<inline-graphic xlink:href="btt230i166.jpg"></inline-graphic>
</inline-formula>
for any
<inline-formula>
<inline-graphic xlink:href="btt230i167.jpg"></inline-graphic>
</inline-formula>
be an edge in the shortest path from
<italic>A</italic>
to
<italic>B</italic>
. Its reverse complement is
<inline-formula>
<inline-graphic xlink:href="btt230i168.jpg"></inline-graphic>
</inline-formula>
, which, as
<inline-formula>
<inline-graphic xlink:href="btt230i169.jpg"></inline-graphic>
</inline-formula>
are palindromes, which is the same as
<inline-formula>
<inline-graphic xlink:href="btt230i170.jpg"></inline-graphic>
</inline-formula>
, an edge in the shortest path from
<italic>B</italic>
to
<italic>A</italic>
▪.</p>
</statement>
<statement>
<title>Lemma 5</title>
<p>No cycle in
<inline-formula>
<inline-graphic xlink:href="btt230i171.jpg"></inline-graphic>
</inline-formula>
contains a single palindrome.</p>
</statement>
<statement>
<title>Proof</title>
<p>Suppose there exists a cycle containing only one palindrome. The shortest path to return to the palindrome is
<italic>t</italic>
cyclic shifts of the palindrome where
<italic>t</italic>
is the length of its period. Let
<inline-formula>
<inline-graphic xlink:href="btt230i172.jpg"></inline-graphic>
</inline-formula>
be the palindrome. Its cyclic shift
<inline-formula>
<inline-graphic xlink:href="btt230i173.jpg"></inline-graphic>
</inline-formula>
is another palindrome. Thus, the cycle includes more than one palindrome ▪.</p>
</statement>
<statement>
<title>Lemma 6</title>
<p>Every cycle in
<inline-formula>
<inline-graphic xlink:href="btt230i174.jpg"></inline-graphic>
</inline-formula>
can be decomposed into cycles containing exactly two palindromes each, without decreasing the total weight of the matching.</p>
</statement>
<statement>
<title>Proof</title>
<p>The proof is by induction on
<italic>n</italic>
, the number of palindromes in the cycle. For the induction base,
<italic>n</italic>
= 1 is impossible by Lemma 5, and
<italic>n</italic>
= 2 is trivially true. Induction step, for
<inline-formula>
<inline-graphic xlink:href="btt230i175.jpg"></inline-graphic>
</inline-formula>
, denote by
<italic>X</italic>
,
<italic>Y</italic>
,
<italic>Z</italic>
and
<italic>W</italic>
palindromes in the cycle, where
<italic>W</italic>
,
<italic>X</italic>
,
<italic>Y</italic>
and
<italic>Z</italic>
appear in this order in the cycle. Let
<inline-formula>
<inline-graphic xlink:href="btt230i176.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="btt230i177.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="btt230i178.jpg"></inline-graphic>
</inline-formula>
and let
<italic>w</italic>
be the sum of overlaps of all palindromes between
<italic>Z</italic>
and
<italic>W</italic>
(inclusive). In case
<italic>n</italic>
= 3,
<italic>Z</italic>
=
<italic>W</italic>
and
<italic>w</italic>
= 0. Without loss of generality, let
<italic>y</italic>
be a maximum overlap. The total sum of overlaps is
<inline-formula>
<inline-graphic xlink:href="btt230i179.jpg"></inline-graphic>
</inline-formula>
(
<xref ref-type="fig" rid="btt230-F5">Fig. 5</xref>
).</p>
<p>Remove
<italic>X</italic>
and
<italic>Y</italic>
and form a cycle of these two palindromes. As
<inline-formula>
<inline-graphic xlink:href="btt230i181.jpg"></inline-graphic>
</inline-formula>
are palindromes,
<inline-formula>
<inline-graphic xlink:href="btt230i182.jpg"></inline-graphic>
</inline-formula>
; therefore, the contribution of this cycle to the matching is
<inline-formula>
<inline-graphic xlink:href="btt230i183.jpg"></inline-graphic>
</inline-formula>
. The total overlap of the remaining cycle is
<italic>w</italic>
plus the overlap between
<italic>W</italic>
and
<italic>Z</italic>
, which is at least
<inline-formula>
<inline-graphic xlink:href="btt230i184.jpg"></inline-graphic>
</inline-formula>
. To see this, denote by
<inline-formula>
<inline-graphic xlink:href="btt230i185.jpg"></inline-graphic>
</inline-formula>
the
<italic>i</italic>
-long prefix of string
<italic>X</italic>
, and denote by
<inline-formula>
<inline-graphic xlink:href="btt230i186.jpg"></inline-graphic>
</inline-formula>
the
<italic>i</italic>
-long suffix of
<italic>X</italic>
. If
<inline-formula>
<inline-graphic xlink:href="btt230i187.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="btt230i188.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="btt230i188a.jpg"></inline-graphic>
</inline-formula>
, where the first, second and fourth equalities follow from the overlap assumptions and the second, third and fourth use the palindrome property. If
<inline-formula>
<inline-graphic xlink:href="btt230i189.jpg"></inline-graphic>
</inline-formula>
, similarly
<inline-formula>
<inline-graphic xlink:href="btt230i190.jpg"></inline-graphic>
</inline-formula>
. Hence,
<inline-formula>
<inline-graphic xlink:href="btt230i191.jpg"></inline-graphic>
</inline-formula>
. The total weight of the two cycles in the new matching is at least
<inline-formula>
<inline-graphic xlink:href="btt230i192.jpg"></inline-graphic>
</inline-formula>
. Hence, the difference between the new matching and the previous one is at least
<inline-formula>
<inline-graphic xlink:href="btt230i193.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="btt230i193a.jpg"></inline-graphic>
</inline-formula>
, where the last inequality follows by the choice of
<italic>y</italic>
as a maximum overlap.</p>
<p>The remaining cycle has
<italic>n</italic>
− 2 palindromes, and by the induction step, it is breakable to cycles of size two ▪.</p>
</statement>
<statement>
<title>Proposition 2</title>
<p>There exists a maximum weight matching in which all the added edges form reverse complementary pairs. Any maximum weight matching can be modified to such matching.</p>
</statement>
<statement>
<title>Proof</title>
<p>Consider the graph
<italic>G</italic>
<sup>2</sup>
produced in Step 5 of Algorithm 2. If
<inline-formula>
<inline-graphic xlink:href="btt230i194.jpg"></inline-graphic>
</inline-formula>
contains cycles of more than two palindromes, by Lemma 6, they can be decomposed into cycles of two palindromes. The new matching is of the same size, and for each cycle with exactly two palindromic edges, the remaining edges match in reverse complementary pairs (Lemma 4) ▪.</p>
<p>The maximum weight-matching problem, also known as the assignment problem (
<xref ref-type="bibr" rid="btt230-B18">West
<italic>et al.</italic>
, 2001</xref>
), can be solved by the Hungarian method in
<inline-formula>
<inline-graphic xlink:href="btt230i195.jpg"></inline-graphic>
</inline-formula>
time (
<xref ref-type="bibr" rid="btt230-B9">Kuhn, 2006</xref>
). As
<inline-formula>
<inline-graphic xlink:href="btt230i196.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="btt230i197.jpg"></inline-graphic>
</inline-formula>
, the running time is
<inline-formula>
<inline-graphic xlink:href="btt230i198.jpg"></inline-graphic>
</inline-formula>
. An improvement to this algorithm (
<xref ref-type="bibr" rid="btt230-B7">Kao
<italic>et al.</italic>
, 1997</xref>
), when the edge weights are integers, runs in
<inline-formula>
<inline-graphic xlink:href="btt230i199.jpg"></inline-graphic>
</inline-formula>
time, where
<italic>N</italic>
is the largest edge weight. In our case
<italic>N</italic>
=
<italic>k</italic>
, which gives
<inline-formula>
<inline-graphic xlink:href="btt230i200.jpg"></inline-graphic>
</inline-formula>
running time. The post-processing of the matching (Lemma 6) requires finding two palindromes with maximum overlap. This can be done in total time linear in the number of palindromes, as overlap lengths are integers in the range of 0 to
<italic>k</italic>
, and thus can be sorted using count sort. Hence, we conclude</p>
</statement>
<statement>
<title>Theorem 5</title>
<p>An optimal RC complete sequence for even k can be produced in time
<inline-formula>
<inline-graphic xlink:href="btt230i201.jpg"></inline-graphic>
</inline-formula>
.</p>
<p>Summarizing Theorems 2 and 5 we obtain</p>
</statement>
<statement>
<title>Theorem 6</title>
<p>For every value of k, an optimal RC complete sequence can be obtained in time polynomial in the size of a de Bruijn graph of order
<italic>k</italic>
− 1.</p>
</statement>
</p>
</sec>
</sec>
</sec>
<sec id="SEC4">
<title>4 EXPERIMENTAL RESULTS</title>
<p>
<xref ref-type="table" rid="btt230-T1">Table 1</xref>
shows the results of the two algorithms for even
<italic>k</italic>
. As we can see, the sequence obtained by the algorithm is of length nearly half that of the original de Bruijn sequence. For example, for
<italic>k</italic>
= 12, the minimum length is within 0.15 per cent of
<inline-formula>
<inline-graphic xlink:href="btt230i202.jpg"></inline-graphic>
</inline-formula>
and within
<inline-formula>
<inline-graphic xlink:href="btt230i203.jpg"></inline-graphic>
</inline-formula>
characters from the theoretical lower bound.
<table-wrap id="btt230-T1" position="float">
<label>Table 1.</label>
<caption>
<p>Length of reverse complementary de Bruijn sequences produced by the two algorithms for even
<italic>k</italic>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">
<italic>k</italic>
</th>
<th rowspan="1" colspan="1">2</th>
<th rowspan="1" colspan="1">4</th>
<th rowspan="1" colspan="1">6</th>
<th rowspan="1" colspan="1">8</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">12</th>
<th rowspan="1" colspan="1">14</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Original</td>
<td rowspan="1" colspan="1">16</td>
<td rowspan="1" colspan="1">256</td>
<td rowspan="1" colspan="1">4096</td>
<td rowspan="1" colspan="1">65 536</td>
<td rowspan="1" colspan="1">1 048 576</td>
<td rowspan="1" colspan="1">16 777 216</td>
<td rowspan="1" colspan="1">268 435 456</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Lower bound</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">136</td>
<td rowspan="1" colspan="1">2080</td>
<td rowspan="1" colspan="1">32 896</td>
<td rowspan="1" colspan="1">524 800</td>
<td rowspan="1" colspan="1">8 390 656</td>
<td rowspan="1" colspan="1">134 225 920</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Algorithm 1</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">142</td>
<td rowspan="1" colspan="1">2140</td>
<td rowspan="1" colspan="1">33 262</td>
<td rowspan="1" colspan="1">526 840</td>
<td rowspan="1" colspan="1">8 400 808</td>
<td rowspan="1" colspan="1">134 275 060</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Optimal</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">142</td>
<td rowspan="1" colspan="1">2140</td>
<td rowspan="1" colspan="1">33 262</td>
<td rowspan="1" colspan="1">526 816</td>
<td rowspan="1" colspan="1">8 400 772</td>
<td rowspan="1" colspan="1">134 274 844</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Saving factor</td>
<td rowspan="1" colspan="1">1.6</td>
<td rowspan="1" colspan="1">1.8</td>
<td rowspan="1" colspan="1">1.91</td>
<td rowspan="1" colspan="1">1.97</td>
<td rowspan="1" colspan="1">1.990</td>
<td rowspan="1" colspan="1">1.997</td>
<td rowspan="1" colspan="1">1.999</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="btt230-TF1">
<p>
<italic>Note</italic>
: The top row is the length of a regular de Bruijn sequence that does not exploit complementarity. The next row contains the theoretical lower bound on RC complete sequence length (Proposition 1). The next two rows are the lengths of the sequence computed by the two algorithms of
<xref ref-type="sec" rid="SEC3.3.1">Section 3.3.1</xref>
and
<xref ref-type="sec" rid="SEC3.3.2">3.3.2</xref>
. The saving factor is the ratio between the original sequence length and length of the optimal RC complete sequence. Note that the lower bound is not tight.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>
<xref ref-type="table" rid="btt230-T2">Table 2</xref>
lists the number of probes of length
<italic>p</italic>
needed to cover all
<italic>k</italic>
-mers, by cutting an optimal RC complete sequence to
<italic>p</italic>
-long probes with overlaps of
<italic>k</italic>
− 1. As we can see, the saved factor in using the RC complete sequence is roughly 2. A comparison to the
<xref ref-type="table" rid="btt230-T1">Table 1</xref>
of (
<xref ref-type="bibr" rid="btt230-B12">Mintseris and Eisen, 2006</xref>
) shows that the sequence produced in (
<xref ref-type="bibr" rid="btt230-B12">Mintseris and Eisen, 2006</xref>
) is sub-optimal.
<table-wrap id="btt230-T2" position="float">
<label>Table 2.</label>
<caption>
<p>Number of probes needed to cover all
<italic>k</italic>
-mers as a function of probe length and
<italic>k</italic>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">k</th>
<th rowspan="1" colspan="1">6</th>
<th rowspan="1" colspan="1">7</th>
<th rowspan="1" colspan="1">8</th>
<th rowspan="1" colspan="1">9</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">11</th>
<th rowspan="1" colspan="1">12</th>
<th rowspan="1" colspan="1">13</th>
<th rowspan="1" colspan="1">14</th>
<th rowspan="1" colspan="1">14-DB</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">25</td>
<td rowspan="1" colspan="1">107</td>
<td rowspan="1" colspan="1">432</td>
<td rowspan="1" colspan="1">1848</td>
<td rowspan="1" colspan="1">7711</td>
<td rowspan="1" colspan="1">32 926</td>
<td rowspan="1" colspan="1">139 811</td>
<td rowspan="1" colspan="1">600 056</td>
<td rowspan="1" colspan="1">2 581 111</td>
<td rowspan="1" colspan="1">11 189 571</td>
<td rowspan="1" colspan="1">22 369 622</td>
</tr>
<tr>
<td rowspan="1" colspan="1">30</td>
<td rowspan="1" colspan="1">86</td>
<td rowspan="1" colspan="1">342</td>
<td rowspan="1" colspan="1">1447</td>
<td rowspan="1" colspan="1">5958</td>
<td rowspan="1" colspan="1">25 087</td>
<td rowspan="1" colspan="1">104 858</td>
<td rowspan="1" colspan="1">442 146</td>
<td rowspan="1" colspan="1">1 864 136</td>
<td rowspan="1" colspan="1">7 898 521</td>
<td rowspan="1" colspan="1">15 790 321</td>
</tr>
<tr>
<td rowspan="1" colspan="1">35</td>
<td rowspan="1" colspan="1">72</td>
<td rowspan="1" colspan="1">283</td>
<td rowspan="1" colspan="1">1188</td>
<td rowspan="1" colspan="1">4855</td>
<td rowspan="1" colspan="1">20 263</td>
<td rowspan="1" colspan="1">83 887</td>
<td rowspan="1" colspan="1">350 033</td>
<td rowspan="1" colspan="1">1 458 889</td>
<td rowspan="1" colspan="1">6 103 402</td>
<td rowspan="1" colspan="1">12 201 612</td>
</tr>
<tr>
<td rowspan="1" colspan="1">40</td>
<td rowspan="1" colspan="1">62</td>
<td rowspan="1" colspan="1">241</td>
<td rowspan="1" colspan="1">1008</td>
<td rowspan="1" colspan="1">4096</td>
<td rowspan="1" colspan="1">16 995</td>
<td rowspan="1" colspan="1">69 906</td>
<td rowspan="1" colspan="1">289 682</td>
<td rowspan="1" colspan="1">1 198 373</td>
<td rowspan="1" colspan="1">4 973 143</td>
<td rowspan="1" colspan="1">9 942 054</td>
</tr>
<tr>
<td rowspan="1" colspan="1">45</td>
<td rowspan="1" colspan="1">54</td>
<td rowspan="1" colspan="1">211</td>
<td rowspan="1" colspan="1">876</td>
<td rowspan="1" colspan="1">3543</td>
<td rowspan="1" colspan="1">14 634</td>
<td rowspan="1" colspan="1">59 919</td>
<td rowspan="1" colspan="1">247 082</td>
<td rowspan="1" colspan="1">1 016 801</td>
<td rowspan="1" colspan="1">4 196 089</td>
<td rowspan="1" colspan="1">8 388 608</td>
</tr>
<tr>
<td rowspan="1" colspan="1">50</td>
<td rowspan="1" colspan="1">48</td>
<td rowspan="1" colspan="1">187</td>
<td rowspan="1" colspan="1">774</td>
<td rowspan="1" colspan="1">3121</td>
<td rowspan="1" colspan="1">12 850</td>
<td rowspan="1" colspan="1">52 429</td>
<td rowspan="1" colspan="1">215 405</td>
<td rowspan="1" colspan="1">883 012</td>
<td rowspan="1" colspan="1">3 629 050</td>
<td rowspan="1" colspan="1">7 255 013</td>
</tr>
<tr>
<td rowspan="1" colspan="1">55</td>
<td rowspan="1" colspan="1">43</td>
<td rowspan="1" colspan="1">168</td>
<td rowspan="1" colspan="1">693</td>
<td rowspan="1" colspan="1">2789</td>
<td rowspan="1" colspan="1">11 453</td>
<td rowspan="1" colspan="1">46 604</td>
<td rowspan="1" colspan="1">190 927</td>
<td rowspan="1" colspan="1">780 336</td>
<td rowspan="1" colspan="1">3 197 021</td>
<td rowspan="1" colspan="1">6 391 321</td>
</tr>
<tr>
<td rowspan="1" colspan="1">60</td>
<td rowspan="1" colspan="1">39</td>
<td rowspan="1" colspan="1">152</td>
<td rowspan="1" colspan="1">628</td>
<td rowspan="1" colspan="1">2521</td>
<td rowspan="1" colspan="1">10 330</td>
<td rowspan="1" colspan="1">41 944</td>
<td rowspan="1" colspan="1">171 445</td>
<td rowspan="1" colspan="1">699 051</td>
<td rowspan="1" colspan="1">2 856 912</td>
<td rowspan="1" colspan="1">5 711 393</td>
</tr>
<tr>
<td rowspan="1" colspan="1">65</td>
<td rowspan="1" colspan="1">36</td>
<td rowspan="1" colspan="1">139</td>
<td rowspan="1" colspan="1">574</td>
<td rowspan="1" colspan="1">2300</td>
<td rowspan="1" colspan="1">9408</td>
<td rowspan="1" colspan="1">38 131</td>
<td rowspan="1" colspan="1">155 570</td>
<td rowspan="1" colspan="1">633 103</td>
<td rowspan="1" colspan="1">2 582 209</td>
<td rowspan="1" colspan="1">5 162 221</td>
</tr>
<tr>
<td rowspan="1" colspan="1">70</td>
<td rowspan="1" colspan="1">33</td>
<td rowspan="1" colspan="1">128</td>
<td rowspan="1" colspan="1">528</td>
<td rowspan="1" colspan="1">2115</td>
<td rowspan="1" colspan="1">8637</td>
<td rowspan="1" colspan="1">34 953</td>
<td rowspan="1" colspan="1">142 386</td>
<td rowspan="1" colspan="1">578 525</td>
<td rowspan="1" colspan="1">2 355 700</td>
<td rowspan="1" colspan="1">4 709 394</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="btt230-TF2">
<p>
<italic>Note</italic>
: The table contains the number of probes obtained by cutting an optimal RC complete sequence to short segments with overlaps. Left column: probe length; top row:
<italic>k</italic>
. Right column: number of probes needed when using a regular de Bruijn sequence for
<italic>k</italic>
= 14.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>Running times: The simple near-optimal algorithm runs in time roughly linear in
<inline-formula>
<inline-graphic xlink:href="btt230i204.jpg"></inline-graphic>
</inline-formula>
. For example, for
<italic>k</italic>
= 8, 10 and 12 the running times are 1.5, 26 and 445 s, respectively. The optimal algorithm requires 5, 126 and 2937 s, respectively.</p>
</sec>
<sec id="SEC5">
<title>5 SUMMARY AND DISCUSSION</title>
<p>In this article, we studied the problem of constructing a minimum length sequence that covers each
<italic>k</italic>
-mer or its reverse complement at least once. The problem has applications in construction of dense double-stranded probe arrays for
<italic>in vitro</italic>
measuring of protein–DNA binding (
<xref ref-type="bibr" rid="btt230-B1">Berger
<italic>et al.</italic>
, 2006</xref>
;
<xref ref-type="bibr" rid="btt230-B5">Fordyce
<italic>et al.</italic>
, 2010</xref>
), and for design of synthetic enhancers for
<italic>in vivo</italic>
developmental studies (
<xref ref-type="bibr" rid="btt230-B17">Smith and Ahituv, 2012</xref>
). For the case of odd
<italic>k</italic>
, we provided a proof that a simple modification of the Eulerian tour algorithm applied to the de Bruijn graph of order
<italic>k</italic>
− 1 gives an optimal solution. The algorithm requires linear time in the output sequence length, and it cuts the sequence length in half compared with using a regular de Bruijn sequence.</p>
<p>The problem is a bit more involved for even
<italic>k</italic>
, and here we provided two algorithms, a linear time near-optimal algorithm and a more complex polynomial algorithm that produces an optimal sequence. The length of the sequence produced by the optimal algorithm is slightly shorter, and both algorithms nearly halve the total length of the sequence.</p>
<p>The following related problem was studied by Medvedev
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="btt230-B10">Medvedev and Brudno, 2009</xref>
;
<xref ref-type="bibr" rid="btt230-B11">Medvedev
<italic>et al.</italic>
, 2007</xref>
): what is the minimum length sequence that contains a given set of
<italic>k</italic>
-mers? Their solution is based on bidirected graphs, which are similar to de Bruijn graphs, with the difference that a
<italic>k</italic>
-mer and its reverse complement are represented by the same vertex, and the edges represent the possible ways that double-stranded strings can overlap. These graphs were originally conceived by
<xref ref-type="bibr" rid="btt230-B8">Kececioglu and Myers (1995)</xref>
and actually discovered earlier by
<xref ref-type="bibr" rid="btt230-B19">Edmonds (1967)</xref>
. Medvedev
<italic>et al.</italic>
stated, without proof, that an Eulerian path can be found in a bidirected graph in the same way as in a regular de Bruijn graph (Lemma 1), but they did not consider explicitly the problem of covering all
<italic>k</italic>
-mers and did not make the distinction between even and odd
<italic>k</italic>
. In fact, some vertices in a bidirected graph of odd order (when edges represent
<italic>k</italic>
-mers of even length) are unbalanced, and thus an Eulerian tour does not exist. Although their method can be applied to our problem, it is slower than ours: they require
<inline-formula>
<inline-graphic xlink:href="btt230i205.jpg"></inline-graphic>
</inline-formula>
, whereas our algorithms requires
<inline-formula>
<inline-graphic xlink:href="btt230i206.jpg"></inline-graphic>
</inline-formula>
for odd
<italic>k</italic>
and
<inline-formula>
<inline-graphic xlink:href="btt230i207.jpg"></inline-graphic>
</inline-formula>
for even
<italic>k</italic>
.</p>
<p>Beyond the theoretical interest, our results are applicable to current (
<xref ref-type="bibr" rid="btt230-B1">Berger
<italic>et al.</italic>
, 2006</xref>
;
<xref ref-type="bibr" rid="btt230-B5">Fordyce
<italic>et al.</italic>
, 2010</xref>
;
<xref ref-type="bibr" rid="btt230-B17">Smith and Ahituv, 2012</xref>
) and future technologies that require complete coverage of double-stranded DNA
<italic>k</italic>
-mers. In PBM, although it is desirable to have redundancy in covering
<italic>k</italic>
-mers, space on the arrays is limited. By essentially halving the needed sequence length, space is freed on the array to select additional redundant probes with desired properties. Similarly, in designing synthetic enhancer sequences, by using shorter sequences, experiments can be simplified.</p>
<p>In current technologies, the de Bruijn (or RC complete) sequence is cut into probes of length
<italic>p</italic>
with overlap
<italic>k</italic>
− 1 (
<xref ref-type="table" rid="btt230-T2">Table 2</xref>
). There is no constraint that forces these probes to come from a single sequence. A variant of the problem we studied is as follows: what is the minimum number of double-stranded DNA probe sequences of length
<italic>p</italic>
that together cover all
<italic>k</italic>
-mers? As our solution for an RC complete sequence of even
<italic>k</italic>
covers, a few
<italic>k</italic>
-mers more than once and direct design of probe sequences of length
<italic>p</italic>
might reduce the number of probes needed to cover all
<italic>k</italic>
-mers.</p>
<p>A heuristic solution to that problem was recently proposed by Riesenfeld and Pollard (
<xref ref-type="bibr" rid="btt230-B16">Riesenfeld and Pollard, 2012</xref>
). They studied the following problem: given
<italic>k</italic>
and
<italic>m</italic>
, design a set of
<italic>m</italic>
double-stranded DNA probes (of equal or almost equal length, denoted as
<inline-formula>
<inline-graphic xlink:href="btt230i208.jpg"></inline-graphic>
</inline-formula>
) that together cover all
<italic>k</italic>
-mers. Their algorithm repeatedly searches for disjoint
<inline-formula>
<inline-graphic xlink:href="btt230i209.jpg"></inline-graphic>
</inline-formula>
-long paths between unbalanced vertices. After removal of all such paths, it finds two reverse-complementary cycles. One cycle is cut into probes (with overlaps of
<italic>k</italic>
− 1) of length
<inline-formula>
<inline-graphic xlink:href="btt230i210.jpg"></inline-graphic>
</inline-formula>
or
<inline-formula>
<inline-graphic xlink:href="btt230i211.jpg"></inline-graphic>
</inline-formula>
. If the program terminates, an optimal set of oligomers is found; however, there is no theoretical guarantee that it will terminate. In our tests, for
<italic>k</italic>
= 6, their program terminates in a few seconds, whereas for
<italic>k</italic>
= 8, it takes >1 h and for
<italic>k</italic>
= 10 > 2 weeks. For some values of
<italic>m,</italic>
the produced probes are not of equal length. A modest reduction in the number of oligomers is obtainable compared with our design: for example, for
<italic>k</italic>
= 6 and probe length 15, the algorithm of Riesenfeld and Pollard produced 208 oligomers compared with 210 in our design. For greater values of
<italic>k,</italic>
the running time was already prohibitive (for
<italic>k</italic>
= 12, it kept running for >1 month), and thus we could not test the performance for these values. Our algorithm, on the other hand, produces an output for values of
<inline-formula>
<inline-graphic xlink:href="btt230i212.jpg"></inline-graphic>
</inline-formula>
in just a few seconds, whereas for
<italic>k</italic>
= 12, the linear algorithm takes <10 min and the optimal <1 h. The time is polynomial (or even linear) in the output sequence size, independent of probe length or the number of oligomers.</p>
<p>Our study raises several additional open questions. First, following (
<xref ref-type="bibr" rid="btt230-B15">Philippakis
<italic>et al.</italic>
, 2008</xref>
), can one design an optimal RC complete sequence with improved coverage of gapped
<italic>k</italic>
-mers? Second, it is known that the number of distinct de Bruijn sequences is
<inline-formula>
<inline-graphic xlink:href="btt230i213.jpg"></inline-graphic>
</inline-formula>
. What is the number of different optimal RC complete sequences? Third, can one construct an optimal RC complete sequence for even
<italic>k</italic>
in linear time? Fourth, is there a closed formula for the length of an optimal RC complete of even order?</p>
<p>
<italic>Funding</italic>
: This study was supported in part by the
<funding-source>Israel Science Foundation</funding-source>
(grant no.
<award-id>802/08</award-id>
), and by the
<funding-source>I-CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation</funding-source>
(grant no.
<award-id>41/11</award-id>
). Y.O. was supported in part by a fellowship from the Edmond J. Safra
<funding-source>Center for Bioinformatics at Tel Aviv University</funding-source>
and by a Dan David PhD Fellowship.</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</sec>
</body>
<back>
<ref-list>
<title>REFERENCES</title>
<ref id="btt230-B1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Berger</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities</article-title>
<source>Nat. Biotechnol.</source>
<year>2006</year>
<volume>24</volume>
<fpage>1429</fpage>
<lpage>1435</lpage>
<pub-id pub-id-type="pmid">16998473</pub-id>
</element-citation>
</ref>
<ref id="btt230-B2">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Rankmotif++: a motif-search algorithm that accounts for relative ranks of
<italic>k</italic>
-mers in binding transcription factors</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>i72</fpage>
<lpage>i79</lpage>
<pub-id pub-id-type="pmid">17646348</pub-id>
</element-citation>
</ref>
<ref id="btt230-B19">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Edmonds</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>An introduction to matching</article-title>
<source>Notes of Engineering Summer Conference</source>
<year>1967</year>
<publisher-loc>Ann Arbor</publisher-loc>
<publisher-name>University of Michigan</publisher-name>
</element-citation>
</ref>
<ref id="btt230-B3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edmonds</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Matching, Euler tours and the Chinese postman</article-title>
<source>Math. Program.</source>
<year>1973</year>
<volume>5</volume>
<fpage>88</fpage>
<lpage>124</lpage>
</element-citation>
</ref>
<ref id="btt230-B4">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Fleischner</surname>
<given-names>H</given-names>
</name>
</person-group>
<source>Eulerian Graphs and Related Topics</source>
<year>1990</year>
<volume>Vol. 1</volume>
<publisher-loc>Amsterdam and New York</publisher-loc>
<publisher-name>North-Holland</publisher-name>
</element-citation>
</ref>
<ref id="btt230-B5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fordyce</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis</article-title>
<source>Nat. Biotechnol.</source>
<year>2010</year>
<volume>28</volume>
<fpage>970</fpage>
<lpage>975</lpage>
<pub-id pub-id-type="pmid">20802496</pub-id>
</element-citation>
</ref>
<ref id="btt230-B6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jolma</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>DNA-binding specificities of human transcription factors</article-title>
<source>Cell</source>
<year>2013</year>
<volume>152</volume>
<fpage>327</fpage>
<pub-id pub-id-type="pmid">23332764</pub-id>
</element-citation>
</ref>
<ref id="btt230-B7">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kao</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>All-cavity maximum matchings</article-title>
<source>Algorithms Comput.</source>
<year>1997</year>
<volume>1350</volume>
<fpage>364</fpage>
<lpage>373</lpage>
</element-citation>
</ref>
<ref id="btt230-B8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kececioglu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Combinatorial algorithms for DNA sequence assembly</article-title>
<source>Algorithmica</source>
<year>1995</year>
<volume>13</volume>
<fpage>7</fpage>
<lpage>51</lpage>
</element-citation>
</ref>
<ref id="btt230-B9">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuhn</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>The Hungarian method for the assignment problem</article-title>
<source>Naval Res. Logist. Q.</source>
<year>2006</year>
<volume>2</volume>
<fpage>83</fpage>
<lpage>97</lpage>
</element-citation>
</ref>
<ref id="btt230-B10">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Brudno</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Maximum likelihood genome assembly</article-title>
<source>J. Comput. Biol.</source>
<year>2009</year>
<volume>16</volume>
<fpage>1101</fpage>
<lpage>1116</lpage>
<pub-id pub-id-type="pmid">19645596</pub-id>
</element-citation>
</ref>
<ref id="btt230-B11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Computability of models for sequence assembly</article-title>
<source>Algorithms Bioinform.</source>
<year>2007</year>
<fpage>289</fpage>
<lpage>301</lpage>
</element-citation>
</ref>
<ref id="btt230-B12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mintseris</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Design of a combinatorial DNA microarray for protein-DNA interaction studies</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>429</fpage>
<pub-id pub-id-type="pmid">17018151</pub-id>
</element-citation>
</ref>
<ref id="btt230-B13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nutiu</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument</article-title>
<source>Nat. Biotechnol.</source>
<year>2011</year>
<volume>29</volume>
<fpage>659</fpage>
<lpage>664</lpage>
<pub-id pub-id-type="pmid">21706015</pub-id>
</element-citation>
</ref>
<ref id="btt230-B14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Orenstein</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Rap: Accurate and fast motif finding based on protein-binding microarray data</article-title>
<source>J. Comput. Biol.</source>
<year>2013</year>
<comment>[Epub ahead of print, March 6 2013]</comment>
</element-citation>
</ref>
<ref id="btt230-B15">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Philippakis</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Design of compact, universal DNA microarrays for protein binding microarray experiments</article-title>
<source>J. Comput. Biol.</source>
<year>2008</year>
<volume>15</volume>
<fpage>655</fpage>
<lpage>665</lpage>
<pub-id pub-id-type="pmid">18651798</pub-id>
</element-citation>
</ref>
<ref id="btt230-B16">
<element-citation publication-type="webpage">
<person-group person-group-type="author">
<name>
<surname>Riesenfeld</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pollard</surname>
<given-names>K</given-names>
</name>
</person-group>
<year>2012</year>
<comment>Computing MRCC libraries and related types of DNA oligomer libraries.
<ext-link ext-link-type="uri" xlink:href="https://github.com/sriesenfeld/MRCC-Libraries">https://github.com/sriesenfeld/MRCC-Libraries</ext-link>
(1 April 2013, date last accessed)</comment>
</element-citation>
</ref>
<ref id="btt230-B17">
<element-citation publication-type="webpage">
<person-group person-group-type="author">
<name>
<surname>Smith</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Ahituv</surname>
<given-names>N</given-names>
</name>
</person-group>
<year>2012</year>
<comment>Deciphering the vertebrate regulatory code using short synthetic enhancers in vivo.
<ext-link ext-link-type="uri" xlink:href="http://zendev.ucsf.edu/projectview.php?project=6mer">http://zendev.ucsf.edu/projectview.php?project=6mer</ext-link>
(1 April 2013, date last accessed)</comment>
</element-citation>
</ref>
<ref id="btt230-B18">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>West</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<source>Introduction to Graph Theory</source>
<year>2001</year>
<volume>Vol. 2</volume>
<publisher-loc>Upper Saddle River, NJ</publisher-loc>
<publisher-name>Prentice Hall</publisher-name>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B10 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B10 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3694677
   |texte=   Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:23813011" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021