Cyberinfrastructure exploration server

Localizing triplet periodicity in DNA and cDNA sequences

Internal identifier: 000481 (Pmc/Corpus); previous: 000480; next: 000482

Authors: Liya Wang; Lincoln D. Stein

Source: BMC Bioinformatics, 2010

RBID : PMC:2992068

Abstract

Background

The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) because coding exons consist of a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in that of non-coding segments. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size limits the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.

Results

Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that: (1) the Modified Wavelet Transform (MWT) defines the boundary of a TP region better than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: larger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with a 6 bp MWT.

Conclusions

The MWT provides more precise TP boundaries than the STFT, and the boundaries can be further refined by using a larger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect splice junctions in fully spliced mRNA sequences.


Url:
DOI: 10.1186/1471-2105-11-550
PubMed: 21059240
PubMed Central: 2992068

Links to Exploration step

PMC:2992068

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Localizing triplet periodicity in DNA and cDNA sequences</title>
<author>
<name sortKey="Wang, Liya" sort="Wang, Liya" uniqKey="Wang L" first="Liya" last="Wang">Liya Wang</name>
<affiliation>
<nlm:aff id="I1">Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY, 11724, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stein, Lincoln D" sort="Stein, Lincoln D" uniqKey="Stein L" first="Lincoln D" last="Stein">Lincoln D. Stein</name>
<affiliation>
<nlm:aff id="I1">Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY, 11724, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Ontario Institute for Cancer Research, 101 College St., Suite 800, Toronto, ON, M5G0A3, Canada</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21059240</idno>
<idno type="pmc">2992068</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2992068</idno>
<idno type="RBID">PMC:2992068</idno>
<idno type="doi">10.1186/1471-2105-11-550</idno>
<date when="2010">2010</date>
<idno type="wicri:Area/Pmc/Corpus">000481</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Localizing triplet periodicity in DNA and cDNA sequences</title>
<author>
<name sortKey="Wang, Liya" sort="Wang, Liya" uniqKey="Wang L" first="Liya" last="Wang">Liya Wang</name>
<affiliation>
<nlm:aff id="I1">Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY, 11724, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stein, Lincoln D" sort="Stein, Lincoln D" uniqKey="Stein L" first="Lincoln D" last="Stein">Lincoln D. Stein</name>
<affiliation>
<nlm:aff id="I1">Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY, 11724, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Ontario Institute for Cancer Research, 101 College St., Suite 800, Toronto, ON, M5G0A3, Canada</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) because coding exons consist of a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in that of non-coding segments. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size limits the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism
<italic>C. elegans</italic>
.</p>
</sec>
<sec>
<title>Results</title>
<p>Using both simulated TP signals and the real
<italic>C. elegans </italic>
sequence F56F11 as an example, we demonstrate that: (1) the Modified Wavelet Transform (MWT) defines the boundary of a TP region better than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: larger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with a 6 bp MWT.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The MWT provides more precise TP boundaries than the STFT, and the boundaries can be further refined by using a larger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect splice junctions in fully spliced mRNA sequences.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsonis, Aa" uniqKey="Tsonis A">AA Tsonis</name>
</author>
<author>
<name sortKey="Elsner, Jb" uniqKey="Elsner J">JB Elsner</name>
</author>
<author>
<name sortKey="Tsonis, Pa" uniqKey="Tsonis P">PA Tsonis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Anastassiou, D" uniqKey="Anastassiou D">D Anastassiou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tiwari, S" uniqKey="Tiwari S">S Tiwari</name>
</author>
<author>
<name sortKey="Ramachandran, S" uniqKey="Ramachandran S">S Ramachandran</name>
</author>
<author>
<name sortKey="Bhattacharya, A" uniqKey="Bhattacharya A">A Bhattacharya</name>
</author>
<author>
<name sortKey="Bhattacharya, S" uniqKey="Bhattacharya S">S Bhattacharya</name>
</author>
<author>
<name sortKey="Ramaswamy, R" uniqKey="Ramaswamy R">R Ramaswamy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yan, M" uniqKey="Yan M">M Yan</name>
</author>
<author>
<name sortKey="Lin, Zs" uniqKey="Lin Z">ZS Lin</name>
</author>
<author>
<name sortKey="Zhang, Ct" uniqKey="Zhang C">CT Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mena Chalco, Jp" uniqKey="Mena Chalco J">JP Mena-Chalco</name>
</author>
<author>
<name sortKey="Carrer, H" uniqKey="Carrer H">H Carrer</name>
</author>
<author>
<name sortKey="Zana, Y" uniqKey="Zana Y">Y Zana</name>
</author>
<author>
<name sortKey="Cesar, Rm" uniqKey="Cesar R">RM Cesar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="George, Tp" uniqKey="George T">TP George</name>
</author>
<author>
<name sortKey="Thomas, T" uniqKey="Thomas T">T Thomas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stanke, M" uniqKey="Stanke M">M Stanke</name>
</author>
<author>
<name sortKey="Waack, S" uniqKey="Waack S">S Waack</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liew, Awc" uniqKey="Liew A">AWC Liew</name>
</author>
<author>
<name sortKey="Yan, H" uniqKey="Yan H">H Yan</name>
</author>
<author>
<name sortKey="Yang, Ms" uniqKey="Yang M">MS Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Daubechies, I" uniqKey="Daubechies I">I Daubechies</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tuqan, J" uniqKey="Tuqan J">J Tuqan</name>
</author>
<author>
<name sortKey="Rushdi, A" uniqKey="Rushdi A">A Rushdi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, Yt" uniqKey="Chan Y">YT Chan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Black, Dl" uniqKey="Black D">DL Black</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fairbrother, Wg" uniqKey="Fairbrother W">WG Fairbrother</name>
</author>
<author>
<name sortKey="Yeh, Rf" uniqKey="Yeh R">RF Yeh</name>
</author>
<author>
<name sortKey="Sharp, Pa" uniqKey="Sharp P">PA Sharp</name>
</author>
<author>
<name sortKey="Burge, Cb" uniqKey="Burge C">CB Burge</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lewis, R" uniqKey="Lewis R">R Lewis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Okamura, K" uniqKey="Okamura K">K Okamura</name>
</author>
<author>
<name sortKey="Feuk, L" uniqKey="Feuk L">L Feuk</name>
</author>
<author>
<name sortKey="Marques Bonet, T" uniqKey="Marques Bonet T">T Marques-Bonet</name>
</author>
<author>
<name sortKey="Navarro, A" uniqKey="Navarro A">A Navarro</name>
</author>
<author>
<name sortKey="Scherer, Sw" uniqKey="Scherer S">SW Scherer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
<author>
<name sortKey="Sugnet, Cw" uniqKey="Sugnet C">CW Sugnet</name>
</author>
<author>
<name sortKey="Furey, Ts" uniqKey="Furey T">TS Furey</name>
</author>
<author>
<name sortKey="Roskin, Km" uniqKey="Roskin K">KM Roskin</name>
</author>
<author>
<name sortKey="Pringle, Th" uniqKey="Pringle T">TH Pringle</name>
</author>
<author>
<name sortKey="Zahler, Am" uniqKey="Zahler A">AM Zahler</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Epps, J" uniqKey="Epps J">J Epps</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gutierrez, G" uniqKey="Gutierrez G">G Gutierrez</name>
</author>
<author>
<name sortKey="Oliver, Jl" uniqKey="Oliver J">JL Oliver</name>
</author>
<author>
<name sortKey="Marin, A" uniqKey="Marin A">A Marin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sanchez, J" uniqKey="Sanchez J">J Sanchez</name>
</author>
<author>
<name sortKey="Lopez Villasenor, I" uniqKey="Lopez Villasenor I">I Lopez-Villasenor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pickrell, Jk" uniqKey="Pickrell J">JK Pickrell</name>
</author>
<author>
<name sortKey="Marioni, Jc" uniqKey="Marioni J">JC Marioni</name>
</author>
<author>
<name sortKey="Pai, Aa" uniqKey="Pai A">AA Pai</name>
</author>
<author>
<name sortKey="Degner, Jf" uniqKey="Degner J">JF Degner</name>
</author>
<author>
<name sortKey="Engelhardt, Be" uniqKey="Engelhardt B">BE Engelhardt</name>
</author>
<author>
<name sortKey="Nkadori, E" uniqKey="Nkadori E">E Nkadori</name>
</author>
<author>
<name sortKey="Veyrieras, Jb" uniqKey="Veyrieras J">JB Veyrieras</name>
</author>
<author>
<name sortKey="Stephens, M" uniqKey="Stephens M">M Stephens</name>
</author>
<author>
<name sortKey="Gilad, Y" uniqKey="Gilad Y">Y Gilad</name>
</author>
<author>
<name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21059240</article-id>
<article-id pub-id-type="pmc">2992068</article-id>
<article-id pub-id-type="publisher-id">1471-2105-11-550</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-11-550</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Localizing triplet periodicity in DNA and cDNA sequences</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Wang</surname>
<given-names>Liya</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>wangli@cshl.edu</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Stein</surname>
<given-names>Lincoln D</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>lincoln.stein@gmail.com</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY, 11724, USA</aff>
<aff id="I2">
<label>2</label>
Ontario Institute for Cancer Research, 101 College St., Suite 800, Toronto, ON, M5G0A3, Canada</aff>
<pub-date pub-type="collection">
<year>2010</year>
</pub-date>
<pub-date pub-type="epub">
<day>8</day>
<month>11</month>
<year>2010</year>
</pub-date>
<volume>11</volume>
<fpage>550</fpage>
<lpage>550</lpage>
<history>
<date date-type="received">
<day>3</day>
<month>6</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>8</day>
<month>11</month>
<year>2010</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2010 Wang and Stein; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2010</copyright-year>
<copyright-holder>Wang and Stein; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/11/550"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) because coding exons consist of a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in that of non-coding segments. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size limits the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism
<italic>C. elegans</italic>
.</p>
</sec>
<sec>
<title>Results</title>
<p>Using both simulated TP signals and the real
<italic>C. elegans </italic>
sequence F56F11 as an example, we demonstrate that: (1) the Modified Wavelet Transform (MWT) defines the boundary of a TP region better than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: larger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with a 6 bp MWT.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The MWT provides more precise TP boundaries than the STFT, and the boundaries can be further refined by using a larger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect splice junctions in fully spliced mRNA sequences.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Most protein-coding genes in eukaryotic cells are composed of alternating introns and exons. With the exception of the extreme 5' and 3' ends of the first and last exons, which contain untranslated regions (UTRs), exons encode protein sequence. These are the so-called "coding designated segments" or CDS. Any combination of three nucleotide bases forms a codon, which encodes one amino acid residue (except for the three stop codons). CDS regions are known to exhibit a periodic organization of three bases, a triplet periodicity (TP) [
<xref ref-type="bibr" rid="B1">1</xref>
], which is generally considered a good preliminary indicator of protein-coding exon locations. TP-based approaches for the prediction of CDS [
<xref ref-type="bibr" rid="B2">2</xref>
-
<xref ref-type="bibr" rid="B6">6</xref>
] are model-independent since they do not need to be trained. On the other hand, model-based methods, such as Markov chains [
<xref ref-type="bibr" rid="B7">7</xref>
], tend to be more precise because they train a supervised classifier on a database of genomic information from previously known organisms, though it is usually not trivial to construct negative samples for training purposes. However, when sequenced organisms have coding segments that are not represented in the currently available databases, TP-based methods may complement model-based systems.</p>
<p>TP signals can be detected by applying a windowed Fourier Transform, also called the Short Time Fourier Transform (STFT), along DNA sequences [
<xref ref-type="bibr" rid="B2">2</xref>
]. However, the ability of the STFT to identify the precise boundaries of the TP signal is limited by its requirement of an arbitrarily chosen window size over which the spectrum is calculated. It has been shown that the choice of different window lengths directly affects the prediction accuracy [
<xref ref-type="bibr" rid="B8">8</xref>
]. The window size problem of the STFT results from the resolution tradeoff between the time and frequency domains: the FT is applied to the whole signal, from time zero to the end, and so has high resolution in the frequency domain but none in the time domain, whereas the STFT (or windowed FT) gains resolution in the time domain at the cost of some resolution in the frequency domain. A natural improvement on the STFT is the wavelet transform (WT), which allows one to balance resolution at any time and frequency [
<xref ref-type="bibr" rid="B9">9</xref>
]. However, direct application of the WT is limited by the fact that coding regions with TP present the same frequency (1/3) at different scales [
<xref ref-type="bibr" rid="B5">5</xref>
]. For this reason, the ability of the WT to automatically capture different frequencies is not appropriate here. Here we use a Modified Wavelet Transform (MWT) algorithm to show: (1) how the TP boundary can be defined with greater accuracy; (2) how the TP profile can be used to infer the splice junctions in mature mRNA sequences; (3) how ancient frame-shift mutations may explain the loss of TP signals in coding regions; and (4) how the 6 bp periodicity in some intronic and intergenic regions can be identified and corrected in order to reduce false positive identifications of coding regions.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>DNA sequence data and simulated sequence data</title>
<p>For a biological test case, we used the sequenced contig F56F11 of
<italic>C. elegans</italic>
, a 43 kbp sequence containing six protein-coding genes and two non-coding genes. The 8000 bp-long gene F56F11.4, which starts at contig position 7021 (GenBank accession number
<ext-link ext-link-type="gen" xlink:href="AF0099922">AF0099922</ext-link>
, positions 7021-15020) has been used extensively as a test case for TP detection techniques [
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
]. This gene contains five CDS of length greater than 100 bp located at positions 928-1039, 2528-2857, 4114-4377, 5465-5644, and 7255-7605 relative to the start of the gene (a short alternatively-spliced sixth exon at the extreme 5' end contains a coding region that is too short to be analyzed by these techniques).</p>
<p>In addition to the real test case, we used the following simulated sequences of length 900 with exact periodicities (p = 3, 6 or 9).</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2105-11-550-i1" overflow="scroll">
<mml:mrow>
<mml:mtext>u</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>n</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mtext>if</mml:mtext>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>&lt;</mml:mo>
<mml:mn>300</mml:mn>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>or</mml:mtext>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>></mml:mo>
<mml:mn>600</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mtext>elseif</mml:mtext>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>300</mml:mn>
<mml:mo>+</mml:mo>
<mml:mtext>m</mml:mtext>
<mml:mo>⋅</mml:mo>
<mml:mtext>p</mml:mtext>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow></mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>otherwise</mml:mtext>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where m is a non-negative integer, and the different values of p simulate patterns with a period of 3, 6 or 9 from base position 300 to 600.</p>
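<p>The simulated test signal of equation (1) can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name simulated_signal, its keyword arguments, and the use of zero-based positions n = 0, ..., 899 are assumptions made here.</p>
<preformat><![CDATA[
```python
def simulated_signal(p, length=900, start=300, stop=600):
    """Build u[n] as in equation (1).

    u[n] = 1 when n = start + m*p (m a non-negative integer) and
    start <= n <= stop; u[n] = 0 everywhere else.
    Zero-based indexing is an assumption of this sketch.
    """
    u = [0] * length
    for n in range(start, min(stop, length - 1) + 1, p):
        u[n] = 1
    return u


for p in (3, 6, 9):
    u = simulated_signal(p)
    print("p =", p, "number of ones =", sum(u))
```
]]></preformat>
<p>Under these assumptions the periodic block contains 101, 51 and 34 ones for p = 3, 6 and 9 respectively, and the signal is zero outside positions 300 to 600.</p>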
</sec>
<sec>
<title>Short time Fourier transform</title>
<p>In the Results section, we compare the performance of the STFT and the MWT. Before performing either transform, the DNA sequence is numerically mapped to four binary indicator sequences for the bases A, C, G, and T. If a base type (e.g. A) is present at position k of the DNA sequence, we put a '1' at position k of that base's binary indicator sequence; otherwise, we put a '0' there. For example, AGTCA becomes the four binary strings 10001 for the A bases, 00010 for the C bases, 01000 for G and 00100 for T. The four binary indicator sequences have the same length as the original DNA sequence. We use u
<sub>A</sub>
, u
<sub>C</sub>
, u
<sub>G</sub>
, and u
<sub>T </sub>
to represent these four binary sequences, and apply the Fourier Transform to each of them separately to obtain a new sequence U[k] with the same length as the original.</p>
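<p>The mapping to binary indicator sequences described above can be sketched as follows. The function name indicator_sequences is hypothetical, and the treatment of characters other than A, C, G and T (they yield 0 in all four sequences) is an assumption of this sketch rather than something stated in the text.</p>
<preformat><![CDATA[
```python
def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per base.

    Position k of the sequence for base X holds 1 if dna[k] == X, else 0.
    Characters other than A, C, G, T give 0 in every sequence (assumption).
    """
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}


seqs = indicator_sequences("AGTCA")
for base in "ACGT":
    print(base, "".join(str(bit) for bit in seqs[base]))
```
]]></preformat>
<p>For AGTCA this prints 10001, 00010, 01000 and 00100 for A, C, G and T, matching the example in the text.</p>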
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M2" name="1471-2105-11-550-i2" overflow="scroll">
<mml:mrow>
<mml:mrow>
<mml:mtext>U</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>k</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext>0</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>N</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mtext>1</mml:mtext>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mtext>u[n]e</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mtext>j</mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mtext>2</mml:mtext>
<mml:mi>π</mml:mi>
</mml:mrow>
<mml:mtext>N</mml:mtext>
</mml:mfrac>
<mml:mtext>kn</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mstyle>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mtext>k</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext>0,1,</mml:mtext>
<mml:mn>...</mml:mn>
<mml:mtext>N</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mtext>1</mml:mtext>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>The sequence U[k] provides a measure of the frequency content at 'frequency' k, which corresponds to an underlying period of N/k samples. The power spectral density (PSD) is defined as,</p>
<p>
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M3" name="1471-2105-11-550-i3" overflow="scroll">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mtext>PSD</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>k</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mtext>U</mml:mtext>
<mml:mtext>A</mml:mtext>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>k</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mtext>U</mml:mtext>
<mml:mtext>C</mml:mtext>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>k</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mtext>U</mml:mtext>
<mml:mtext>G</mml:mtext>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>k</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mtext>U</mml:mtext>
<mml:mtext>T</mml:mtext>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtext>k</mml:mtext>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>We choose a window size N that is a multiple of p and slide the window across the DNA sequence. The PSD[N/p] value given by the STFT is assigned to the base located at the center of the sliding window. The TP property of a DNA sequence implies that PSD[N/3] (p = 3) values peak in coding regions. Thus a plot of PSD[N/3] against base position distinguishes CDS from non-coding regions. Hereafter we refer to PSD[N/3] at each base position simply as PSD.</p>
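<p>A minimal sketch of this sliding-window procedure, using only the Python standard library, is given below. It evaluates equations (2) and (3) at the single index k = N/p for each window and assigns the value to the window's central base. The function name sliding_psd, the rectangular (unweighted) window, the rule center = start + N//2, and the toy sequences are all assumptions of this sketch and are not taken from the paper.</p>
<preformat><![CDATA[
```python
import cmath


def sliding_psd(dna, window, p=3):
    """Return {center position: PSD[N/p]} for every full window of size N.

    PSD[N/p] is the sum over bases A, C, G, T of |U_X[N/p]|^2, where U_X is
    the DFT of the binary indicator sequence of base X inside the window.
    """
    if window % p != 0:
        raise ValueError("window size must be a multiple of p")
    dna = dna.upper()
    indicators = [[1 if ch == base else 0 for ch in dna] for base in "ACGT"]
    k = window // p
    # DFT kernel exp(-j*2*pi*k*n/N) evaluated once for k = N/p.
    kernel = [cmath.exp(-2j * cmath.pi * k * n / window) for n in range(window)]
    half = window // 2
    profile = {}
    for start in range(len(dna) - window + 1):
        total = 0.0
        for seq in indicators:
            coeff = sum(seq[start + n] * kernel[n] for n in range(window))
            total += abs(coeff) ** 2
        profile[start + half] = total
    return profile


periodic = "ATG" * 40   # toy sequence with exact period 3
flat = "A" * 120        # toy sequence with no period-3 structure
print(max(sliding_psd(periodic, window=30).values()))
print(max(sliding_psd(flat, window=30).values()))
```
]]></preformat>
<p>For the exactly periodic toy sequence every window gives a PSD of about 300 (ten occurrences of each of three bases, each contributing 10 squared), while the homopolymer gives a value that is zero up to rounding error. Real coding sequence is of course far less regular than this toy input.</p>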
</sec>
<sec>
<title>Modified wavelet transform</title>
<p>A continuous wavelet transform of a continuous, square-integrable function u(
<italic>x</italic>
) at a scale a > 0 and position k is expressed by the following integral.</p>
<p>
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M4" name="1471-2105-11-550-i4" overflow="scroll">
<mml:mrow>
<mml:mtext>U</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mtext>a</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>k</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mtext>1</mml:mtext>
<mml:mrow>
<mml:msqrt>
<mml:mtext>a</mml:mtext>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:mrow>
<mml:mo>∫</mml:mo>
<mml:mrow>
<mml:mtext>u</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtext>x</mml:mtext>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mi>ψ</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mtext>x</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mtext>k</mml:mtext>
</mml:mrow>
<mml:mtext>a</mml:mtext>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>dx</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where * denotes the complex conjugate. Let t = (x - k)/a; then ψ(t), called the wavelet function, is a continuous function in both the time domain and the frequency domain. One such wavelet function, the complex Morlet wavelet, is defined by the following equation [
<xref ref-type="bibr" rid="B11">11</xref>
].</p>
<p>
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M5" name="1471-2105-11-550-i5" overflow="scroll">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>ψ</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtext>t</mml:mtext>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mtext>t</mml:mtext>
<mml:mtext>2</mml:mtext>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mtext>cos</mml:mtext>
<mml:msub>
<mml:mi>ω</mml:mi>
<mml:mtext>0</mml:mtext>
</mml:msub>
<mml:mtext>t</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>jsin</mml:mtext>
<mml:msub>
<mml:mi>ω</mml:mi>
<mml:mtext>0</mml:mtext>
</mml:msub>
<mml:mtext>t</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mtext>t</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mtext>j</mml:mtext>
<mml:msub>
<mml:mi>ω</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mtext>t</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Where ω
<sub>0 </sub>
is called the basic frequency of the Morlet wavelet function. To construct the MWT, the scale parameter a is added to the Morlet wavelet function, which becomes:</p>
<p>
<disp-formula id="bmcM6">
<label>(6)</label>
<mml:math id="M6" name="1471-2105-11-550-i6" overflow="scroll">
<mml:mrow>
<mml:mrow>
<mml:mi>ψ</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtext>t</mml:mtext>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mtext>t</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mtext>ja</mml:mtext>
<mml:msub>
<mml:mi>ω</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mtext>t</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Such modification ensures that the frequency part (
<inline-formula>
<mml:math id="M7" name="1471-2105-11-550-i7" overflow="scroll">
<mml:mrow>
<mml:msup>
<mml:mtext>e</mml:mtext>
<mml:mrow>
<mml:mtext>ja</mml:mtext>
<mml:msub>
<mml:mi>ω</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mtext>t</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
) is independent of the scale parameter a, since t = (x - k)/a. For practical reasons, the length (or vanishing moments) of the wavelet analysis function (N) is taken as 1200 unless otherwise specified: a longer length gives higher precision but takes longer to compute. Now if ω
<sub>0 </sub>
is taken as N/b, the MWT can precisely capture a periodicity of b bases. The normalization factor
<inline-formula>
<mml:math id="M8" name="1471-2105-11-550-i8" overflow="scroll">
<mml:mrow>
<mml:mrow>
<mml:mtext>1</mml:mtext>
<mml:mo>/</mml:mo>
<mml:msqrt>
<mml:mtext>a</mml:mtext>
</mml:msqrt>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
in equation (4) is critical to ensure a constant norm in the space
<italic>L</italic>
<sup>2</sup>
(
<italic>R</italic>
) of square integrable functions, which in turn ensures equal areas under the MWT for two equal-amplitude Fourier components in the power spectrum. Without the normalization factor, MWT is similar to the modified Gabor Transform [
<xref ref-type="bibr" rid="B5">5</xref>
]. PSD(a, b) can be calculated by equation (3) after obtaining U by equation (4), capturing a periodicity of b bases under scale a. Unless otherwise specified, b is chosen as 3 to capture the TP signal and a is 5. Below, we also refer to PSD(a, b) simply as PSD for each base position.</p>
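<p>The idea behind equations (4) to (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the four 0/1 nucleotide indicator tracks, the function names, and the parameter sigma (the width of the Gaussian envelope in bases) are all hypothetical, and sigma is not the scale parameter a. The oscillating factor is written with an angular frequency of 2π/b per base, which is what capturing a periodicity of b bases requires.</p>
<preformat><![CDATA[
```python
import cmath
import math


def gaussian_wave_kernel(b=3, sigma=30.0):
    """Kernel exp(-t^2/2) * exp(j*2*pi*x/b) / sqrt(sigma), with t = x / sigma.

    sigma (envelope width in bases) is a hypothetical parameter of this
    sketch; it does not reproduce the paper's handling of the scale a.
    """
    half = int(4 * sigma)  # truncate the Gaussian at four widths
    return {
        x: math.exp(-0.5 * (x / sigma) ** 2)
        * cmath.exp(2j * math.pi * x / b)
        / math.sqrt(sigma)
        for x in range(-half, half + 1)
    }


def windowed_psd(seq, b=3, sigma=30.0):
    """Per-position power at period b, summed over A, C, G, T indicator tracks."""
    kernel = gaussian_wave_kernel(b, sigma)
    n = len(seq)
    psd = [0.0] * n
    for base in "ACGT":
        # Assumed encoding: u(x) = 1 where seq[x] == base, else 0.
        ones = [i for i, c in enumerate(seq) if c == base]
        for k in range(n):
            # U(k) = sum_x u(x) * conj(psi(x - k)); zeros of u(x) add nothing.
            total = sum(
                kernel[x - k].conjugate() for x in ones if (x - k) in kernel
            )
            psd[k] += abs(total) ** 2
    return psd
```
]]></preformat>
<p>Calling windowed_psd on a sequence that contains a perfectly periodic stretch flanked by random bases should give a much higher value inside the periodic stretch than in the flanks. The double loop is quadratic in the sequence length, so this form is suitable only for short test sequences.</p>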
</sec>
</sec>
<sec>
<title>Results</title>
<sec>
<title>The effect of the scale parameter</title>
<p>Using the simulated data with triplet periodicity (p = 3 in equation (1)), we compared the performance of MWT (Figure
<xref ref-type="fig" rid="F1">1a</xref>
) and STFT (Figure
<xref ref-type="fig" rid="F1">1b</xref>
). Given that the TP in the simulated signal starts at position 300 and ends at position 600, the MWT generally produces a sharper boundary between coding and non-coding segments than the STFT, and thus better accuracy if we choose a cut-off PSD threshold of 0.2 (regions with PSD > 0.2 are scored as coding regions). The amplitude of the PSD curve depends strongly on the choice of window size in the STFT algorithm and of the scale parameter in the MWT algorithm. If we again choose a cut-off threshold of 0.2, it can be seen from Figure
<xref ref-type="fig" rid="F1">1</xref>
that the STFT-defined boundary is markedly sensitive to changes in window size, whereas the MWT-defined boundary is more robust to the choice of scale parameter. Furthermore, because the MWT-generated PSD curve is sharp, the choice of cut-off has less effect on the inferred location of the protein-coding boundary in MWT than in STFT.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>PSD plot of simulated data with p = 3</bold>
. (a) Comparison of different scales for MWT; (b) Comparison of different window sizes for STFT. PSD values are normalized by dividing by the maximal value under scale 2.5 for MWT in (a), or by the maximal value under window size 180 in (b).</p>
</caption>
<graphic xlink:href="1471-2105-11-550-1"></graphic>
</fig>
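<p>The cut-off rule described above (positions with normalized PSD greater than 0.2 are scored as coding) can be sketched as follows. The function name and the half-open interval convention are assumptions made for illustration, and the input is taken to be already normalized.</p>
<preformat><![CDATA[
```python
def regions_above(psd_norm, cutoff=0.2):
    """Half-open (start, end) index pairs where the normalized PSD exceeds cutoff."""
    regions = []
    start = None
    for i, value in enumerate(psd_norm):
        if value > cutoff and start is None:
            start = i  # entering a candidate coding region
        elif value <= cutoff and start is not None:
            regions.append((start, i))  # leaving the region
            start = None
    if start is not None:
        # The region runs to the end of the profile.
        regions.append((start, len(psd_norm)))
    return regions
```
]]></preformat>
<p>With a sharp PSD curve, moving the cut-off up or down shifts the returned interval ends only slightly, which is the robustness property discussed for MWT.</p>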
<p>From Figure
<xref ref-type="fig" rid="F1">1(a)</xref>
, it can be seen that MWT with larger scale parameters gives a sharper boundary but a lower peak amplitude, making it less tolerant to noise. For real DNA sequences, larger scale parameters (equivalent to a smaller window size) will result in more noise in introns and intergenic regions, since smoothing is performed over a smaller region. A hybrid approach can be taken by using a smaller scale parameter to locate candidate coding regions and then refining the boundaries of these regions by re-running the MWT with a larger value of scale. Combining this technique with the well-known GT-AG rule, which governs the great majority of Type I eukaryotic splice junctions [
<xref ref-type="bibr" rid="B12">12</xref>
], and open reading frame identification, might enable precise identification of the CDS boundaries.</p>
</sec>
<sec>
<title>TP signal in a real biological sequence</title>
<p>Figure
<xref ref-type="fig" rid="F2">2</xref>
shows the PSD plot across gene F56F11.4 using MWT. It can be seen that five exon regions are correctly identified, although the first (leftmost) peak is relatively weak because of the short exon length (112 bp).</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>PSD plot of sequence F56F11.4</bold>
. The vertical lines are drawn on the splice junctions.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-2"></graphic>
</fig>
<p>After removing the introns of gene F56F11.4, we merged the exons and plotted PSD under two different choices of scale parameter (Figure
<xref ref-type="fig" rid="F3">3</xref>
). The plots show a dramatic increase in PSD at the transition between non-coding and coding sequence. At a scale of 1.25, the PSD plot can clearly distinguish the 5' and 3' UTRs from the coding region. However, at a scale of 5, the PSD plot gives a better indication of the TP boundary on both sides, separating the 5' UTR from the first exon and the last exon from the 3' UTR. In practice, larger values of the scale parameter have a higher resolution for revealing details hidden within the broad PSD peak obtained under smaller scales. In Figure
<xref ref-type="fig" rid="F3">3</xref>
, the horizontal line represents the coding region with vertical lines marking the boundaries of individual exons.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>PSD plot of sequence F56F11.4 without introns</bold>
. PSD plot at two different scales, 5 and 1.25 for MWT. The line segments on the bottom show the splice junctions.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-3"></graphic>
</fig>
<p>More interestingly, Figure
<xref ref-type="fig" rid="F3">3</xref>
shows that, under the larger value of scale (a = 5), individual exons can be distinguished from each other even when they are merged together as they would appear in fully-spliced mRNA. The exceptions are the second and last exons, each of which contains two distinct PSD peaks (Figure
<xref ref-type="fig" rid="F2">2</xref>
and Figure
<xref ref-type="fig" rid="F3">3</xref>
).</p>
<p>To confirm that the diminution of the TP signal at splice junctions is a general phenomenon, we extracted 4655 adjacent
<italic>C. elegans </italic>
exon pairs from WormBase version WS211. Because of the length limitation on TP determination, we restricted the pairs to those in which the exons on both sides of the splice site were greater than 150 bp; we also eliminated pairs involving the first or last exon of each gene in order to avoid possible effects of UTRs. The average and standard error of the 4655 PSD plots across +/- 200 bp around the splicing site (position zero) are shown in Figure
<xref ref-type="fig" rid="F4">4</xref>
. This confirms that the drop of TP signal around the splicing site is a general phenomenon. This finding suggests that the TP signal could potentially be used to infer the splicing site from fully spliced mRNA sequences, and might be a useful criterion to enhance the accuracy of software that aligns cDNA sequence to the genome.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>PSD plot of two adjacent exons</bold>
. Solid line shows the average of PSD values of 4655 adjacent exon pairs each longer than 150 bp. Dashed line shows plot with +/- standard errors. The length of the wavelet analysis function (N) is taken as 512 and scale parameter a is taken as 5.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-4"></graphic>
</fig>
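<p>The averaging behind Figure 4 amounts to a column-wise mean and standard error over PSD profiles aligned on the splice site. A minimal sketch follows, assuming each profile is a list of PSD values covering the same window around position zero; the function name and the use of the sample standard deviation are assumptions.</p>
<preformat><![CDATA[
```python
import math
import statistics


def mean_and_sem(profiles):
    """Column-wise mean and standard error across equally long PSD profiles."""
    n = len(profiles)
    columns = list(zip(*profiles))  # one tuple per aligned base position
    means = [statistics.fmean(col) for col in columns]
    # Standard error = sample standard deviation / sqrt(number of profiles).
    sems = [statistics.stdev(col) / math.sqrt(n) for col in columns]
    return means, sems
```
]]></preformat>
<p>Plotting the means together with the means plus and minus the standard errors gives curves of the kind shown as solid and dashed lines.</p>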
<p>The most straightforward explanation for this result is that the diminution of TP signal around splice junctions is due to the different TP patterns between adjacent exons. The presence of splicing cis-regulatory elements in the exon, such as exonic splice enhancers [
<xref ref-type="bibr" rid="B13">13</xref>
], which locally distort the pattern of evolutionary constraint, might contribute to the diminution. However, the valleys are still observed after removing up to 90 bp on each side of the splice junction (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S1), suggesting that these elements, if present, are not the dominating factors.</p>
</sec>
<sec>
<title>The effect of 'frame-shift' mutations</title>
<p>A frame-shift mutation is a genetic mutation caused by the insertion or deletion of 3n + 1 or 3n + 2 nucleotides in the coding region of a DNA sequence. Such mutations change the downstream codons, which in turn changes the final protein product [
<xref ref-type="bibr" rid="B14">14</xref>
,
<xref ref-type="bibr" rid="B15">15</xref>
]. Such mutations will break the TP pattern within the exon by altering the phase of the codon bias around the mutation point. Here we examined whether the TP signal can be used to detect frame-shift mutations and vice versa.</p>
<p>Using the PSD peak corresponding to the third exon of gene F56F11.4, we showed the effects of base deletions at various positions in Figure
<xref ref-type="fig" rid="F5">5</xref>
, and the effects of different base insertions after a C nucleotide deletion in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S2. Figure
<xref ref-type="fig" rid="F5">5</xref>
shows that losing three consecutive nucleotides only slightly changes the amplitude of the peak. However, a one-base deletion within a certain range of positions can dramatically reduce the amplitude of the peak. A subsequent compensatory two-base deletion, by contrast, restores the peak (also shown in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S3). On the other hand, the TP signal is not sensitive to single nucleotide variations (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S2). Here the PSD plots are generated by re-calculating TP after a mutation at a site and recording the maximal peak height. In Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Table S1, we provide a summary of likely evolutionary frame-shifts for 45 randomly picked human exons.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Amplitude of PSD peak under various base deletions</bold>
. The third PSD peak in Figure 2 is examined to compare the amplitude under a 3-base deletion, a 1-base deletion, and a 1-base deletion followed by a 2-base deletion at various base positions.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-5"></graphic>
</fig>
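<p>The deletion experiment can be sketched as follows. This is an illustration under assumptions, not the procedure used for Figure 5: the score is the whole-segment power at frequency 1/3 summed over four nucleotide indicator tracks rather than the maximal MWT peak height, and both function names are hypothetical.</p>
<preformat><![CDATA[
```python
import cmath
import math


def tp_power(seq, period=3):
    """Power at frequency 1/period over the whole segment, summed over A, C, G, T."""
    total = 0.0
    for base in "ACGT":
        s = sum(
            cmath.exp(-2j * math.pi * i / period)
            for i, c in enumerate(seq)
            if c == base
        )
        total += abs(s) ** 2
    return total


def deletion_scan(seq, n_del, positions, score=tp_power):
    """Score the sequence after deleting n_del bases starting at each position."""
    return {p: score(seq[:p] + seq[p + n_del:]) for p in positions}
```
]]></preformat>
<p>On a perfectly periodic toy sequence, a 3-base deletion leaves the score almost unchanged, a 1-base deletion in the middle lowers it sharply because the two halves fall out of phase, and a later 2-base deletion brings most of the sequence back into phase.</p>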
<p>During our examination of F56F11.4, we noted that the last exon of the gene is the longest one, but does not have the strongest PSD peak, as would be expected since the PSD signal is cumulative. We suspected that this might reflect an ancient frameshift mutation (or more than one insertion/deletion). To investigate this possibility, we introduced a series of one- and two-base deletions at various positions around the last exon of F56F11.4. The PSD plot in Figure
<xref ref-type="fig" rid="F6">6</xref>
shows that a two-base deletion around position 7550 greatly enhances the TP signal (also shown in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S4). Comparing Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S4 with Figure
<xref ref-type="fig" rid="F2">2</xref>
, it can be seen that the amplitude of the last PSD peak is greatly increased. In Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S5, we show that the TP signal can be enhanced for an exon of the adjacent gene F56F11.3 by using combinations of one- and two-base deletions at two different base positions.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Amplitude of PSD peak under various base deletions</bold>
. The fifth PSD peak in Figure 2 is examined to compare the amplitude under a 1-base deletion or a 2-base deletion at various positions.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-6"></graphic>
</fig>
<p>Interestingly, nucleotide alignments between the
<italic>C. elegans </italic>
genome and related nematode species [
<xref ref-type="bibr" rid="B16">16</xref>
] detect indels around these regions (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S4 and Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S5), providing independent support for ancient frameshift mutations at these sites.</p>
</sec>
<sec>
<title>The effect of 6 bp periodicity</title>
<p>It has been pointed out that, when the Fourier Transform is used, a p bp periodicity peak will be observed in the power spectrum even if the sequence only has 2p bp periodicity [
<xref ref-type="bibr" rid="B17">17</xref>
]. This is an inherent characteristic of FT and MWT since, for example, if there is a peak at k = N/6, there will be a peak at k = N/3 using equation (2) or (4). This drawback needs to be corrected when either FT or MWT is applied to the identification of DNA triplet periodicity.</p>
<p>To show that 6 bp or 9 bp periodicity can be captured by MWT with b = 3 for TP, we generated three simulated sequences with p = 3, 6, or 9 using equation (1). Figure
<xref ref-type="fig" rid="F7">7</xref>
shows the PSD signal captured by MWT with b = 3. The amplitude of the peak drops for larger p since we fixed the length of the sequence tested for periodicity. In theory, the amplitude of the peak for the same length sequence with only 6 bp or 9 bp periodicity will be 1/4 and 1/9 of that for 3 bp periodicity. For example, the number of repeats of 6 bp is 1/2 that of TP (e.g. sequence
<underline>100</underline>
100
<underline>100</underline>
100 has 4 repeats of 100 while sequence
<underline>100000</underline>
100000 has only 2 repeats of 100000) given the same sequence length, and the ratio becomes 1/4 when the power of 2 is taken when computing the power spectrum (equation (3)). The simulated signal indeed shows an approximate peak height of 0.25 (TP), 0.062 (6 bp), and 0.027 (9 bp). By setting b = 6 instead of 3, MWT will capture the 6 bp periodicity instead of 3 bp periodicity. The PSD plot for a periodicity of 6 of the simulated data is shown in Figure
<xref ref-type="fig" rid="F7">7</xref>
(red dash-dotted line, labelled 6 bp
<sub>2</sub>
); note that the peaks have almost the same amplitude as that generated by MWT with b = 3.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>PSD plot of different periodicities</bold>
. PSD plot of simulated signal with p = 3 (dotted line), p = 6 (solid line) and p = 9 (dashed line) generated by MWT(b = 3), and PSD plot of simulated signal with p = 6 (red dash-dotted line) generated by MWT(b = 6).</p>
</caption>
<graphic xlink:href="1471-2105-11-550-7"></graphic>
</fig>
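<p>The 1/4 and 1/9 ratios can be checked numerically with impulse trains like the 100100 and 100000100000 examples. The sketch below is an illustration only: it assumes a single 0/1 track with a 1 at every p-th position and evaluates the plain Fourier power at frequency 1/3 over a fixed length, which is not necessarily how equation (1) generates the simulated signal.</p>
<preformat><![CDATA[
```python
import cmath
import math


def power_at_period_three(period, length=1800):
    """Fourier power at frequency 1/3 for a track with a 1 at every period-th base."""
    # Only positions holding a 1 contribute, so sum over multiples of `period`.
    total = sum(
        cmath.exp(-2j * math.pi * x / 3) for x in range(0, length, period)
    )
    return abs(total) ** 2


ratio_6bp = power_at_period_three(6) / power_at_period_three(3)
ratio_9bp = power_at_period_three(9) / power_at_period_three(3)
```
]]></preformat>
<p>Every multiple of 6 or 9 is also a multiple of 3, so all impulses add in phase at frequency 1/3; halving or thirding their number scales the amplitude by 1/2 or 1/3 and the power by 1/4 or 1/9.</p>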
<p>Figure
<xref ref-type="fig" rid="F8">8</xref>
demonstrates an extreme case within the
<italic>C. elegans </italic>
sequence F56F11. There is a strong PSD peak (red solid line) for TP around bases 25650-25850, but this region is not defined as a CDS in the NCBI annotations. The PSD plot generated by MWT with b = 6 (dashed line) shows a very strong peak at the same location; after subtracting the b = 6 signal from the b = 3 signal, this false positive TP peak is eliminated.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>Effect of 6 bp periodicity</bold>
. PSD plot of sequence F56F11 of
<italic>C. elegans </italic>
generated by MWT(b = 3, TP) and by MWT(b = 6, 6 bp). Scale of 2.5 is used.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-8"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>Discussion</title>
<p>Several hypotheses have been advanced to explain the origin of the TP in coding sequence [
<xref ref-type="bibr" rid="B10">10</xref>
,
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B19">19</xref>
]. The simplest explanation is codon bias, which increases the probability that nucleotide triplets will appear in the same phase, but the matter is far from settled. Furthermore, it remains unclear why some CDS do not have an apparent TP signal, and why some non-CDS segments have strong TP signals. Our analysis provides useful insights into TP properties from both the algorithmic and biological perspectives. It shows that MWT can delimit coding boundaries more precisely than STFT, and that a simple procedure of introducing artificial frame-shift mutations into protein-coding candidate regions can recover signal that was lost through presumptive ancient frame-shift mutations. While this step might end up identifying regions that have lost coding capability during evolution, the identification of ancient coding sequences may still be of interest for comparative genomics.</p>
<p>In and of itself, the TP property is inadequate for gene prediction. However, it may be a useful adjunct to other techniques, particularly in the identification of protein-coding genes that are unusual in one way or another. For example, a recent study [
<xref ref-type="bibr" rid="B20">20</xref>
] showed that massive RNA-sequencing of 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals can uncover highly tissue-specific protein-coding exons that were not predicted by conventional model-based gene prediction algorithms. One can speculate that some of these exons remained undiscovered because they depart from the expectations of the model. Protein-coding segment prediction methods based on the model-independent TP property would provide a valuable complement to the conventional gene prediction algorithms.</p>
<p>Using simulated and real sequence data, we demonstrated that the MWT scale parameter can be reduced to give more robust prediction of broad TP regions or increased to provide higher resolution of the TP boundary. One way to achieve both is to run MWT with a smaller scale to identify TP regions and then re-run the algorithm with larger values of the scale. Although it is not clear whether the edges of TP regions are completely consistent with the edges of protein-coding regions (exons), TP edges could be further refined using prior knowledge, such as the GT-AG rule, if done carefully.</p>
<p>A drawback of MWT is that the signal obtained for a periodicity of 3 overlaps with the 6 and 9 bp periodicity, which might be contained in introns and intergenic regions, especially when these regions are very long. Using simulated data, we showed that the 6 bp effect can be estimated by MWT with b = 6 and then subtracted from the b = 3 profile. After setting negative peaks to zero, the PSD plot of F56F11.4 (Figure
<xref ref-type="fig" rid="F9">9</xref>
) shows that noise is suppressed and real exon peaks are retained. Though not shown, the subtraction also removed the large false positive peak around bases 25650-25850 of sequence F56F11. Alternatively, the 6 bp effect can be used as a control to confirm true positives instead of being subtracted directly. It is worth mentioning that a 9 bp periodicity can cause false positives as well, but the chance is much lower, given that a sequence of the same length with 9 bp periodicity contributes only 1/9 of the amplitude of one with TP. The subtraction of 6 bp and 9 bp periodicity effects could be substantial for species with much longer introns, as shown in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S6.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption>
<p>
<bold>PSD plot of sequence F56F11.4 after subtracting 6 bp effect. </bold>
After subtraction, negative values are changed to zero. Scale of 2.5 is used for MWT.</p>
</caption>
<graphic xlink:href="1471-2105-11-550-9"></graphic>
</fig>
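<p>The correction used for Figure 9 (subtract the b = 6 profile from the b = 3 profile, then set negative values to zero) can be sketched as follows. The function name is hypothetical, and the sketch assumes the two profiles are on the same scale and have the same length; any rescaling applied before the subtraction is not specified here.</p>
<preformat><![CDATA[
```python
def subtract_six_bp(psd_tp, psd_six):
    """Subtract the 6 bp profile from the TP profile and clip negatives to zero."""
    return [max(tp - six, 0.0) for tp, six in zip(psd_tp, psd_six)]
```
]]></preformat>
<p>Positions where the 6 bp signal is at least as strong as the TP signal drop to zero, which is how a peak driven by 6 bp periodicity is suppressed while peaks with little 6 bp signal are retained.</p>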
<p>The most interesting result stemming from this analysis is the diminishing TP signal around the splice junction in mature mRNA sequences. The valley remains after deleting up to 90 bp of sequence immediately upstream and downstream of the junction. This implies that an extended region surrounding the splice site is under a different set of evolutionary constraints, which reduces the TP signal. An alternative hypothesis is that areas of weak TP signal are favoured sites for intron birth. It is interesting to note that Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S4 shows that the last exon of gene F56F11.4 contains two different TP patterns even after making artificial frame-shift mutations. Perhaps this region once had an intron and lost it; alternatively this region might have an increased probability of acquiring an intron over the course of future evolution. In support of the first hypothesis, we note that there is a 657 bp deletion at the site of the TP valley in
<italic>C. elegans </italic>
relative to
<italic>P. pacificus </italic>
(Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Figure S4), suggesting the presence of an intron in the common ancestor of these two nematodes.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>MWT is a promising method for capturing triplet periodicity in DNA sequence. Artificial 'frame-shift' mutations and correction for the effect of the 6 bp periodicity signal could further improve the prediction. We also hypothesize that the TP property of exons might carry evolutionary evidence about frame-shift mutations and the separation of exons by introns.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>LW carried out the work and drafted the manuscript with LDS. Both authors have read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>Supporting materials</bold>
. This file contains Figures S1 to S6 and Table S1.</p>
</caption>
<media xlink:href="1471-2105-11-550-S1.PDF" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We would like to thank Dr. Chunlao Tang, Sharon Wei and Andrew Olson for helpful discussions with LW. This research was supported in part by WormBase grant P41 HG02223 from the National Institutes of Health and iPlant grant #EF-0735191 from the National Science Foundation Plant Cyberinfrastructure program to LDS.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Tsonis</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Elsner</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Tsonis</surname>
<given-names>PA</given-names>
</name>
<article-title>Periodicity in DNA coding sequences: implications in gene evolution</article-title>
<source>J Theor Biol</source>
<year>1991</year>
<volume>151</volume>
<issue>3</issue>
<fpage>323</fpage>
<lpage>331</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-5193(05)80381-9</pub-id>
<pub-id pub-id-type="pmid">1943144</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Anastassiou</surname>
<given-names>D</given-names>
</name>
<article-title>Frequency-domain analysis of biomolecular sequences</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<issue>12</issue>
<fpage>1073</fpage>
<lpage>1081</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.12.1073</pub-id>
<pub-id pub-id-type="pmid">11159326</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Tiwari</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ramachandran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bhattacharya</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bhattacharya</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ramaswamy</surname>
<given-names>R</given-names>
</name>
<article-title>Prediction of probable genes by Fourier analysis of genomic sequences</article-title>
<source>Computer Applications in the Biosciences</source>
<year>1997</year>
<volume>13</volume>
<issue>3</issue>
<fpage>263</fpage>
<lpage>270</lpage>
<pub-id pub-id-type="pmid">9183531</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Yan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>ZS</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>CT</given-names>
</name>
<article-title>A new Fourier transform approach for protein coding measure based on the format of the Z curve</article-title>
<source>Bioinformatics</source>
<year>1998</year>
<volume>14</volume>
<issue>8</issue>
<fpage>685</fpage>
<lpage>690</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/14.8.685</pub-id>
<pub-id pub-id-type="pmid">9789094</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Mena-Chalco</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Carrer</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Zana</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Cesar</surname>
<given-names>RM</given-names>
</name>
<article-title>Identification of protein coding regions using the modified Gabor-wavelet transform</article-title>
<source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
<year>2008</year>
<volume>5</volume>
<issue>2</issue>
<fpage>198</fpage>
<lpage>207</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2007.70259</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>George</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>T</given-names>
</name>
<article-title>Discrete wavelet transform de-noising in eukaryotic gene splicing</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 1</issue>
<fpage>S50</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-11-S1-S50</pub-id>
<pub-id pub-id-type="pmid">20122225</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Stanke</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Waack</surname>
<given-names>S</given-names>
</name>
<article-title>Gene prediction with a hidden Markov model and a new intron submodel</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>Suppl 2</issue>
<fpage>ii215</fpage>
<lpage>225</lpage>
<pub-id pub-id-type="pmid">14534192</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Liew</surname>
<given-names>AWC</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>MS</given-names>
</name>
<article-title>Pattern recognition techniques for the emerging field of bioinformatics: A review</article-title>
<source>Pattern Recognition</source>
<year>2005</year>
<volume>38</volume>
<issue>11</issue>
<fpage>2055</fpage>
<lpage>2073</lpage>
<pub-id pub-id-type="doi">10.1016/j.patcog.2005.02.019</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="book">
<name>
<surname>Daubechies</surname>
<given-names>I</given-names>
</name>
<source>Ten lectures on wavelets</source>
<year>1992</year>
<publisher-name>Philadelphia, Pa.: Society for Industrial and Applied Mathematics</publisher-name>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Tuqan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rushdi</surname>
<given-names>A</given-names>
</name>
<article-title>A DSP Approach for Finding the Codon Bias in DNA Sequences</article-title>
<source>IEEE Journal of Selected Topics in Signal Processing</source>
<year>2008</year>
<volume>2</volume>
<issue>3</issue>
<fpage>343</fpage>
<lpage>356</lpage>
<pub-id pub-id-type="doi">10.1109/JSTSP.2008.923851</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="book">
<name>
<surname>Chan</surname>
<given-names>YT</given-names>
</name>
<source>Wavelet basics</source>
<year>1995</year>
<publisher-name>Boston: Kluwer Academic Publishers</publisher-name>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Black</surname>
<given-names>DL</given-names>
</name>
<article-title>Mechanisms of alternative pre-messenger RNA splicing</article-title>
<source>Annual Review of Biochemistry</source>
<year>2003</year>
<volume>72</volume>
<fpage>291</fpage>
<lpage>336</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.biochem.72.121801.161720</pub-id>
<pub-id pub-id-type="pmid">12626338</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Fairbrother</surname>
<given-names>WG</given-names>
</name>
<name>
<surname>Yeh</surname>
<given-names>RF</given-names>
</name>
<name>
<surname>Sharp</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Burge</surname>
<given-names>CB</given-names>
</name>
<article-title>Predictive identification of exonic splicing enhancers in human genes</article-title>
<source>Science</source>
<year>2002</year>
<volume>297</volume>
<issue>5583</issue>
<fpage>1007</fpage>
<lpage>1013</lpage>
<pub-id pub-id-type="doi">10.1126/science.1073774</pub-id>
<pub-id pub-id-type="pmid">12114529</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="book">
<name>
<surname>Lewis</surname>
<given-names>R</given-names>
</name>
<source>Human genetics: concepts and applications</source>
<year>2005</year>
<edition>6</edition>
<publisher-name>Boston: McGraw-Hill</publisher-name>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Okamura</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Feuk</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Marques-Bonet</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Scherer</surname>
<given-names>SW</given-names>
</name>
<article-title>Frequent appearance of novel protein-coding sequences by frameshift translation</article-title>
<source>Genomics</source>
<year>2006</year>
<volume>88</volume>
<issue>6</issue>
<fpage>690</fpage>
<lpage>697</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2006.06.009</pub-id>
<pub-id pub-id-type="pmid">16890400</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Sugnet</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Furey</surname>
<given-names>TS</given-names>
</name>
<name>
<surname>Roskin</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Pringle</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Zahler</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<article-title>The human genome browser at UCSC</article-title>
<source>Genome Research</source>
<year>2002</year>
<volume>12</volume>
<issue>6</issue>
<fpage>996</fpage>
<lpage>1006</lpage>
<pub-id pub-id-type="pmid">12045153</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="other">
<name>
<surname>Epps</surname>
<given-names>J</given-names>
</name>
<article-title>A hybrid technique for the periodicity characterization of genomic sequence data</article-title>
<source>EURASIP J Bioinform Syst Biol</source>
<year>2009</year>
<fpage>924601</fpage>
<pub-id pub-id-type="pmid">19365578</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Gutierrez</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Oliver</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Marin</surname>
<given-names>A</given-names>
</name>
<article-title>On the origin of the periodicity of three in protein coding DNA sequences</article-title>
<source>J Theor Biol</source>
<year>1994</year>
<volume>167</volume>
<issue>4</issue>
<fpage>413</fpage>
<lpage>414</lpage>
<pub-id pub-id-type="doi">10.1006/jtbi.1994.1080</pub-id>
<pub-id pub-id-type="pmid">8207954</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Sanchez</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lopez-Villasenor</surname>
<given-names>I</given-names>
</name>
<article-title>A simple model to explain three-base periodicity in coding DNA</article-title>
<source>FEBS Lett</source>
<year>2006</year>
<volume>580</volume>
<issue>27</issue>
<fpage>6413</fpage>
<lpage>6422</lpage>
<pub-id pub-id-type="doi">10.1016/j.febslet.2006.10.056</pub-id>
<pub-id pub-id-type="pmid">17097640</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Pickrell</surname>
<given-names>JK</given-names>
</name>
<name>
<surname>Marioni</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Pai</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Degner</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Engelhardt</surname>
<given-names>BE</given-names>
</name>
<name>
<surname>Nkadori</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Veyrieras</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gilad</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Pritchard</surname>
<given-names>JK</given-names>
</name>
<article-title>Understanding mechanisms underlying human gene expression variation with RNA sequencing</article-title>
<source>Nature</source>
<year>2010</year>
<volume>464</volume>
<issue>7289</issue>
<fpage>768</fpage>
<lpage>772</lpage>
<pub-id pub-id-type="doi">10.1038/nature08872</pub-id>
<pub-id pub-id-type="pmid">20220758</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000481 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000481 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:2992068
   |texte=   Localizing triplet periodicity in DNA and cDNA sequences
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:21059240" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CyberinfraV1

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024