<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">LEAping to conclusions: A computational reanalysis of late embryogenesis abundant proteins and their possible roles</title>
<author><name sortKey="Wise, Michael J" sort="Wise, Michael J" uniqKey="Wise M" first="Michael J" last="Wise">Michael J. Wise</name>
<affiliation><nlm:aff id="I1">Department of Genetics, Cambridge University, Cambridge, U.K.</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">14583099</idno>
<idno type="pmc">280651</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC280651</idno>
<idno type="RBID">PMC:280651</idno>
<idno type="doi">10.1186/1471-2105-4-52</idno>
<date when="2003">2003</date>
<idno type="wicri:Area/Pmc/Corpus">000957</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">LEAping to conclusions: A computational reanalysis of late embryogenesis abundant proteins and their possible roles</title>
<author><name sortKey="Wise, Michael J" sort="Wise, Michael J" uniqKey="Wise M" first="Michael J" last="Wise">Michael J. Wise</name>
<affiliation><nlm:aff id="I1">Department of Genetics, Cambridge University, Cambridge, U.K.</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins, originally found in plants but now being found in non-plant species. Their precise function is unknown, though considerable evidence suggests that LEA proteins are involved in desiccation resistance. Using a number of statistically-based bioinformatics tools, the classification of a large set of LEA proteins, covering all Groups, is reexamined together with some previous findings. Searches based on peptide composition return proteins with similar composition to different LEA Groups; keyword clustering is then applied to reveal keywords and phrases suggestive of the Groups' properties.</p>
</sec>
<sec><title>Results</title>
<p>Previous research has suggested that glycine is characteristic of LEA proteins, but it is only highly over-represented in Groups 1 and 2, while alanine, thought characteristic of Group 2, is over-represented in Groups 3, 4 and 6 but under-represented in Groups 1 and 2. However, for LEA Groups 1, 2 and 3 it is shown that glutamine is very significantly over-represented, while cysteine, phenylalanine, isoleucine, leucine and tryptophan are significantly under-represented. There is also evidence that the Group 4 LEA proteins are more appropriately redistributed to Group 2 and Group 3. Similarly, Group 5 is better found among the Group 3 LEA proteins.</p>
</sec>
<sec><title>Conclusions</title>
<p>There is evidence that Group 2 and Group 3 LEA proteins, though distinct, might be related. This relationship is also evident in the overlapping sets of keywords for the two Groups, emphasising alpha-helical structure and, at a larger scale, filaments, all of which fits well with experimental evidence that proteins from both Groups are natively unstructured, but become structured under stress conditions. The keywords support localisation of LEA proteins both in the nucleus and associated with the cytoskeleton, and a mode of action similar to chaperones, perhaps the cold shock chaperones, via a role in DNA-binding. In general, non-globular and low-complexity proteins, such as the LEA proteins, pose particular challenges in determining their functions and modes of action. Rather than masking off and ignoring low-complexity domains, novel tools and tool combinations are needed which are capable of analysing such proteins in their entirety.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title>BMC Bioinformatics</journal-title>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">14583099</article-id>
<article-id pub-id-type="pmc">280651</article-id>
<article-id pub-id-type="publisher-id">1471-2105-4-52</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-4-52</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>LEAping to conclusions: A computational reanalysis of late embryogenesis abundant proteins and their possible roles</article-title>
</title-group>
<contrib-group><contrib id="A1" corresp="yes" contrib-type="author"><name><surname>Wise</surname>
<given-names>Michael J</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>mw263@cam.ac.uk</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
Department of Genetics, Cambridge University, Cambridge, U.K.</aff>
<pub-date pub-type="collection"><year>2003</year>
</pub-date>
<pub-date pub-type="epub"><day>29</day>
<month>10</month>
<year>2003</year>
</pub-date>
<volume>4</volume>
<fpage>52</fpage>
<lpage>52</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/4/52"></ext-link>
<history><date date-type="received"><day>29</day>
<month>5</month>
<year>2003</year>
</date>
<date date-type="accepted"><day>29</day>
<month>10</month>
<year>2003</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2003 Wise; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</copyright-statement>
<copyright-year>2003</copyright-year>
<copyright-holder>Wise; licensee BioMed Central Ltd.</copyright-holder>
</permissions>
<abstract><sec><title>Background</title>
<p>The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins, originally found in plants but now being found in non-plant species. Their precise function is unknown, though considerable evidence suggests that LEA proteins are involved in desiccation resistance. Using a number of statistically-based bioinformatics tools, the classification of a large set of LEA proteins, covering all Groups, is reexamined together with some previous findings. Searches based on peptide composition return proteins with similar composition to different LEA Groups; keyword clustering is then applied to reveal keywords and phrases suggestive of the Groups' properties.</p>
</sec>
<sec><title>Results</title>
<p>Previous research has suggested that glycine is characteristic of LEA proteins, but it is only highly over-represented in Groups 1 and 2, while alanine, thought characteristic of Group 2, is over-represented in Groups 3, 4 and 6 but under-represented in Groups 1 and 2. However, for LEA Groups 1, 2 and 3 it is shown that glutamine is very significantly over-represented, while cysteine, phenylalanine, isoleucine, leucine and tryptophan are significantly under-represented. There is also evidence that the Group 4 LEA proteins are more appropriately redistributed to Group 2 and Group 3. Similarly, Group 5 is better found among the Group 3 LEA proteins.</p>
</sec>
<sec><title>Conclusions</title>
<p>There is evidence that Group 2 and Group 3 LEA proteins, though distinct, might be related. This relationship is also evident in the overlapping sets of keywords for the two Groups, emphasising alpha-helical structure and, at a larger scale, filaments, all of which fits well with experimental evidence that proteins from both Groups are natively unstructured, but become structured under stress conditions. The keywords support localisation of LEA proteins both in the nucleus and associated with the cytoskeleton, and a mode of action similar to chaperones, perhaps the cold shock chaperones, via a role in DNA-binding. In general, non-globular and low-complexity proteins, such as the LEA proteins, pose particular challenges in determining their functions and modes of action. Rather than masking off and ignoring low-complexity domains, novel tools and tool combinations are needed which are capable of analysing such proteins in their entirety.</p>
</sec>
</abstract>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins whose precise function is unknown. While considerable evidence suggests that LEA proteins are involved in desiccation resistance, a variety of mechanisms for achieving this end have been proposed including protecting cellular structures from the effects of water loss by retention of water, sequestration of ions, direct protection of other proteins or membranes, or renaturation of unfolded proteins [<xref ref-type="bibr" rid="B1">1</xref>
-<xref ref-type="bibr" rid="B4">4</xref>
]. LEA proteins are primarily found in plants, where they were originally found in seeds [<xref ref-type="bibr" rid="B5">5</xref>
-<xref ref-type="bibr" rid="B7">7</xref>
], and later in other plant tissues. In addition, a number of putative LEA genes have been found in non-plant species, including the eubacteria <italic>Haemophilus influenzae </italic>
and <italic>Bacillus subtilis </italic>
[<xref ref-type="bibr" rid="B8">8</xref>
], the extremophile <italic>Deinococcus radiodurans </italic>
[<xref ref-type="bibr" rid="B9">9</xref>
] and the nematodes <italic>Caenorhabditis elegans </italic>
and <italic>Aphelenchus avenae </italic>
[<xref ref-type="bibr" rid="B10">10</xref>
]. Most of the literature to date on LEA proteins has been in the form of reports on individual LEA proteins with general surveys appearing some time ago [<xref ref-type="bibr" rid="B1">1</xref>
,<xref ref-type="bibr" rid="B11">11</xref>
,<xref ref-type="bibr" rid="B12">12</xref>
]. The somewhat more recent survey by Close [<xref ref-type="bibr" rid="B13">13</xref>
] of Group 2 LEA proteins also includes a discussion of predicted secondary structure for this Group.</p>
<p>LEA proteins are generally grouped on the basis of their similarity to prototypical LEA proteins from the cotton plant <italic>Gossypium hirsutum</italic>
. In the Dure naming scheme, LEA protein groups are named after particular <italic>G. hirsutum </italic>
cDNA clones, resulting in Group names such as D7, D11, D19, D95 and D113. Many authors since Dure, however, use an assignment to Groups originating with [<xref ref-type="bibr" rid="B12">12</xref>
], though revised (and to some extent contradictory) assignments also appear in [<xref ref-type="bibr" rid="B3">3</xref>
] and [<xref ref-type="bibr" rid="B4">4</xref>
]. There is, however, a consensus only for three LEA protein groups: Group 1 (D19), Group 2 (also known as dehydrins, D11) and Group 3 (D7). Other LEA protein groups from [<xref ref-type="bibr" rid="B12">12</xref>
] are Group 4 (D113), Group 5 (D29) and Group 6 (D34). Four of the LEA protein groups are also represented by Pfam [<xref ref-type="bibr" rid="B14">14</xref>
] domain families:</p>
<p>• Small Hydrophilic Plant Seed Protein (PF00477) – Group 1</p>
<p>• Dehydrin (PF00257) – Group 2</p>
<p>• LEA (PF02987) – Group 3</p>
<p>• LEA-1 (PF03760) – Group 4</p>
<p>In addition, there are groups which do not appear in the Bray [<xref ref-type="bibr" rid="B1">1</xref>
] scheme: Lea5 (D73) and Lea14 (D95) [<xref ref-type="bibr" rid="B15">15</xref>
], although both are represented by Pfam families: Lea5(D73) by LEA-3, PF03242, and Lea14(D95) by LEA-2, PF03168.</p>
<p>Previous work, using just amino acid percentage composition and the Kyte-Doolittle hydrophobicity metric, found that LEA proteins are characterised by a preponderance of hydrophilic amino acids together with high glycine content, resulting in their characterisation as "hydrophilins" [<xref ref-type="bibr" rid="B16">16</xref>
]. Certain LEA protein Groups are also said to be rich in alanine, but deficient in cysteine and tryptophan [<xref ref-type="bibr" rid="B3">3</xref>
,<xref ref-type="bibr" rid="B4">4</xref>
].</p>
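<p>As a rough illustration of the two measures mentioned above, the following Python sketch computes the percentage amino acid composition of a sequence and its mean Kyte-Doolittle hydropathy. It is not the code used in the cited work: the function names and the toy peptide are invented for this example, and averaging over the whole sequence (rather than over a sliding window) is an assumption made for simplicity.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def percent_composition(seq):
    """Return {residue: percentage} over the standard residues in seq."""
    counts = Counter(r for r in seq.upper() if r in KYTE_DOOLITTLE)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {r: 100.0 * n / total for r, n in counts.items()}


def mean_hydropathy(seq):
    """Return the mean Kyte-Doolittle value over the standard residues.

    Negative values indicate a predominantly hydrophilic sequence.
    """
    values = [KYTE_DOOLITTLE[r] for r in seq.upper() if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard residues")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Invented toy peptide, NOT a real LEA sequence.
    toy = "MGGKEEGQQGYGGT"
    comp = percent_composition(toy)
    print("glycine %:", round(comp.get("G", 0.0), 1))
    print("mean hydropathy:", round(mean_hydropathy(toy), 2))
```
]]></preformat>
<p>A glycine-rich, strongly hydrophilic sequence would show a high glycine percentage together with a clearly negative mean hydropathy; the cut-offs used to call a protein a "hydrophilin" are defined in the cited work and are not reproduced here.</p>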
<p>However, a significant, though often overlooked, feature of LEA proteins is that the majority are low complexity proteins. This is amply demonstrated through the use of the low complexity sequence demarcation tool, 0j.py [<xref ref-type="bibr" rid="B17">17</xref>
], which was applied first to all the sequences above 40 aa in SwissProt and SpTrEMBL (also called Swall) and then to a database of 112 LEA proteins, which will be described shortly. The sequences in the large database returned a median score of 3, with 13% having a score of 0 and 32% a score greater than 3; a low score implies that the protein has high sequence complexity. By contrast, the LEA sequences had a median score of 11.5, and 80% returned a score greater than 3 (equivalent to a p-value of 1.1 × 10<sup>-25</sup>
).</p>
<p>Low complexity sequences pose a particular problem for local alignment tools such as BLAST, which owe much of their discriminative power to scoring schemes based on the extreme value distribution [<xref ref-type="bibr" rid="B18">18</xref>
]. For example, [<xref ref-type="bibr" rid="B19">19</xref>
] compares the efficacy of both BLAST and FASTA with an implementation of the Smith-Waterman algorithm, each both with and without the use of scoring schemes based on the extreme-value distribution. The benefit of having statistically based scoring schemes is conclusively demonstrated [<xref ref-type="bibr" rid="B19">19</xref>
]. However, it is well known that low complexity sequences prejudice extreme value distribution based statistical scoring [<xref ref-type="bibr" rid="B20">20</xref>
]. The standard way of dealing with low complexity regions in the context of database searches is to mask these off in the query sequence using applications such as SEG [<xref ref-type="bibr" rid="B21">21</xref>
]. When SEG was run across the set of 112 LEA proteins, 11 high complexity sequences were returned unaltered; the remainder were masked to a greater or lesser extent, with 57 having between 30% and 71% of their amino acids masked. The first effect of masking is to reduce the number of amino acids available for alignment. The second effect is to produce an asymmetry: because only the query sequence is masked, not the target (i.e. database) sequences, the result obtained for an alignment between a masked query and a target sequence depends on which sequence is the query and which is the target.</p>
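<p>The asymmetry described above can be illustrated with a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: <monospace>toy_mask</monospace> is a crude stand-in that masks homopolymer runs and is not the SEG algorithm, <monospace>identity_score</monospace> is an ungapped identity count rather than a BLAST score, and both sequences are invented.</p>
<preformat preformat-type="code"><![CDATA[
```python
import re


def toy_mask(seq, min_run=3):
    """Replace homopolymer runs of at least min_run residues with 'X'.

    A crude stand-in for a low-complexity filter; this is NOT SEG.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "X" * len(m.group(0)), seq)


def identity_score(query, target):
    """Count identical positions in an ungapped, equal-length comparison.

    Masked positions ('X') never count as matches.
    """
    if len(query) != len(target):
        raise ValueError("sequences must have equal length")
    return sum(1 for q, t in zip(query, target) if q == t and q != "X")


if __name__ == "__main__":
    # Two invented sequences; only the first contains a maskable run.
    a = "MKAAAAGT"
    b = "MKAAGAGT"

    # Only the query is masked, as in a typical database search.
    print(identity_score(toy_mask(a), b))  # a as query: prints 4
    print(identity_score(toy_mask(b), a))  # b as query: prints 7
```
]]></preformat>
<p>The same pair of sequences yields two different scores depending on which one plays the role of the masked query, which is the asymmetry the text refers to.</p>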
<p>The aims of this resurvey were therefore twofold. The first aim was to create a sizable set of the LEA proteins spanning all the Groups and then, using a number of software tools to lessen the impact of low sequence complexity, to reexamine the classification of this diverse set of proteins. In the light of this process, the previous findings are reviewed and expanded. Secondly, searches based on peptide composition were used to reveal proteins with similar composition to different LEA Groups; keyword clustering was then applied to the lists of search hits to suggest keywords and phrases indicative of the Groups' functions. These are the starting point for current and future experimental work.</p>
</sec>
<sec><title>Results</title>
<sec><title>The Rules Induced by Supervised Learning Application</title>
<p>The input to the supervised machine learning application, Ripper, for each LEA protein was therefore 13 values (3 hydrophobicity, 3 predicted secondary structure and 7 amino acid class) plus the Group to which the protein had been assigned. The output was a set of rules for classifying putative LEA proteins into Groups based on the 13 values. When working on real-world (i.e. noisy) data, all rule induction algorithms attempt to balance accuracy (correct predictions) with conciseness; at one extreme one could have 100% accuracy by creating a rule for each input protein, while at the other extreme one can achieve maximum conciseness by having a single rule predicting the largest output category, which would in this case mean categorising every input LEA protein as Group 2. Ripper was run several times until the error on the input set was minimised. Extra conditions were then added by hand to the rules to deal with the misclassified proteins until no further rules could be added without generating other misclassifications. The final rule set, which appears in Table <xref ref-type="table" rid="T11">11</xref>
, should be understood as operating in a top-down, "if ... else if" manner.</p>
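<p>The top-down reading of such a rule set can be sketched in Python as an ordered list in which the first matching rule decides the Group and a default applies when none matches. The rules below are hypothetical placeholders, not the contents of Table 11: the feature names <monospace>helix</monospace> and <monospace>gly_frac</monospace>, the glycine condition and the direction of each comparison are all assumptions made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Rule = Tuple[str, Callable[[Features], bool]]

# HYPOTHETICAL rules for illustration only; the real conditions are those
# of the published rule table and are not reproduced here.
RULES: List[Rule] = [
    # Assumed form of a helix-content condition.
    ("Group 3", lambda f: f["helix"] > 0.34),
    # Entirely invented condition, included to show rule ordering.
    ("Group 2", lambda f: f["gly_frac"] > 0.15),
]

# Category assigned when every rule has failed.
DEFAULT_GROUP = "Group 4"


def classify(features: Features,
             rules: List[Rule] = RULES,
             default: str = DEFAULT_GROUP) -> str:
    """Apply rules top-down; the first rule whose condition holds wins."""
    for label, condition in rules:
        if condition(features):
            return label
    return default


if __name__ == "__main__":
    # Invented feature vectors, not values for any real protein.
    print(classify({"helix": 0.60, "gly_frac": 0.05}))
    print(classify({"helix": 0.10, "gly_frac": 0.20}))
    print(classify({"helix": 0.10, "gly_frac": 0.05}))
```
]]></preformat>
<p>Because evaluation stops at the first match, the order of the rules is part of the classifier: a protein satisfying several conditions is assigned by whichever of them appears first.</p>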
<p>The reader will have noticed that the table of Group 2 LEA proteins (Table <xref ref-type="table" rid="T2">2</xref>
) has been partitioned into three subsets; these correspond to the three rules under which Group 2 proteins are classified using the above rule-set. The rules have been labelled 2a, 2b and 2c. Notice that the Group 2 LEA proteins induced by cold stress are predominantly characterised by Rules 2b and 2c (particularly 2c), while the Group 2 proteins which have been shown not to be up-regulated by cold stress and all the canonical LEA proteins are encompassed by Rule 2a.</p>
<table-wrap position="float" id="T1"><label>Table 1</label>
<caption><p>LEA Protein Group 1 (D19) Exemplar(s): LE19_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="left">ID</td>
<td align="left">Species</td>
<td align="left">Tissue</td>
<td align="center">Expression</td>
<td align="left">Pep</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">EM1_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td></td>
<td align="center">4</td>
<td align="left">PF00477_hmm; L194_HORVU (1e-67)</td>
</tr>
<tr><td align="left">EM1_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_ma</td>
</tr>
<tr><td align="left">EM2_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_hmm, EMP1_ORYSA (2e-41)</td>
</tr>
<tr><td align="left">EM6_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_ma; EMB1_DAUCA (9e-38)</td>
</tr>
<tr><td align="left">EMB1_DAUCA</td>
<td align="left">DAUCA</td>
<td align="left">Seed</td>
<td align="left"><bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">4</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">EMB5_MAIZE</td>
<td align="left">MAIZE</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">EMP1_ORYSA</td>
<td align="left">ORYSA</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">L193_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Seed</td>
<td align="left">ABA, notCold, <bold>Canon</bold>
, Mannitol</td>
<td align="left">1(3)</td>
<td align="center">4</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">L194_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Seed</td>
<td align="left">ABA, notCold, <bold>Canon</bold>
, Mannitol</td>
<td align="left">1(4)</td>
<td align="center">4</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">L19A_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Seed</td>
<td align="left">ABA, notCold, <bold>Canon</bold>
, Mannitol, Salt</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">L19B_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Seed</td>
<td align="left">ABA, notCold, <bold>Canon</bold>
, Mannitol, Salt</td>
<td align="left">1(1)</td>
<td align="center">6</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">LE10_HELAN</td>
<td align="left">HELAN</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
, Mannitol</td>
<td></td>
<td align="center">4</td>
<td align="left">PF00477_hmm</td>
</tr>
<tr><td align="left">LE19_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td></td>
<td align="center">4</td>
<td align="left">PF00477_ma</td>
</tr>
<tr><td align="left">SEEP_RAPSA</td>
<td align="left">RAPSA</td>
<td align="left">Seed</td>
<td align="left"><bold>Canon</bold>
</td>
<td align="left">1(1)</td>
<td align="center">3,6 (280)</td>
<td align="left">PF00477_hmm PF00477_hmm</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>LEA Group 1 proteins, with the exemplar being LE19_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) whether the LEA Group 1 motif is detected using agrep and the number of times it is found, 6) the superfamilies/stand-alone clusters in which the protein is found and 7) other evidence for accepting the protein as LEA Group 1.</p>
</table-wrap-foot>
</table-wrap>
<p>Four of the proteins would appear to have been misclassified: LE11_HELAN and LE25_LYCES are generally considered to be Group 4 (D113) based on the assignment in Dure (1993), but have here been assigned to Group 3 on the basis of their high predicted percentage alpha-helical content (0.6 and 0.56 versus a threshold of 0.34). While some care needs to be taken because Group 4 is the default category when all other rules have failed, the three Group 4 proteins covered by the default rule all have predicted percentage loop content greater than or equal to 0.25, while the two classified as Group 3 have loop content less than or equal to 0.12. In other words, there would appear to be other grounds for suspecting that LE11_HELAN and LE25_LYCES are not in the same Group as the three proteins assigned to the default, Group 4.</p>
<p>The other apparently misclassified proteins are LE29_GOSHI and Q93Y63, which are classified by Bray (1993) as Group 5, but which have been classified as Group 3 here. This is in line with recent reclassifications of Group 5 (D29) LEA proteins as Group 3 [<xref ref-type="bibr" rid="B4">4</xref>
,<xref ref-type="bibr" rid="B3">3</xref>
], although the Group is retained as a separate entity in [<xref ref-type="bibr" rid="B22">22</xref>
]. Members of the former Group 5 have the same domain composition as Group 3 LEA proteins, but with additional copies of those domains.</p>
<p>The classification rules described above were applied to the set of uncharacterised LEA proteins. As a result, O24439 is predicted to be a member of the first set of Group 2 LEA proteins by Rule 1, while Q9S7S3 is predicted to be in the Lea5/D73 Group by Rule 6 and O81483 is predicted to be Group 6 by Rule 8.</p>
<table-wrap position="float" id="T2"><label>Table 2</label>
<caption><p>LEA Protein Group 2 (D11) Exemplar(s): DH11_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="center">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">Pep</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">DH11_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">2Y(2), 2S(9), k(4)</td>
<td align="center">1</td>
<td align="left">PF00257_ma</td>
</tr>
<tr><td align="left">DH14_LYCES</td>
<td align="left">LYCES</td>
<td align="left">Root, Stem, Leaf</td>
<td align="left">ABA, Salt, notCold</td>
<td align="center">2Y(1), 2K(1), 2S(5)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA (5e-17)</td>
</tr>
<tr><td align="left">DH15_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Root</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(8)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA (1e-37)</td>
</tr>
<tr><td align="left">DH18_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Leaf, Stem</td>
<td align="left">ABA, Desc, notCold</td>
<td align="center">2Y(2), 2K(2), 2S(5)</td>
<td align="center">1,3</td>
<td align="left">PF00257_ma; EC40_DAUCA (3e-25)</td>
</tr>
<tr><td align="left">DH1B_ORYSA</td>
<td align="left">ORYSA</td>
<td align="left">Seed, Shoot</td>
<td align="left">ABA, <bold>Canon</bold>
, Salt</td>
<td align="center">2Y(2), 2K(2), 2S(8)</td>
<td align="center">1,10</td>
<td align="left">PF00257_ma</td>
</tr>
<tr><td align="left">DH1C_ORYSA</td>
<td align="left">ORYSA</td>
<td align="left">Seed, Shoot</td>
<td align="left">ABA, <bold>Canon</bold>
, Salt</td>
<td align="center">2Y(2), 2K(2), 2S(8)</td>
<td align="center">1,10</td>
<td align="left">PF00257_hmm</td>
</tr>
<tr><td align="left">DH1_MAIZE</td>
<td align="left">MAIZE</td>
<td align="left">Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA (4e-47)</td>
</tr>
<tr><td align="left">DH25_ORYSA</td>
<td align="left">ORYSA</td>
<td align="left">Callus</td>
<td align="left">ABA, Desc, notCold</td>
<td align="center">2Y(2), 2K(2), 2S(9), k(3)</td>
<td align="center">1,8</td>
<td align="left">PF00257_ma; DHLE_RAPSA (2e-26)</td>
</tr>
<tr><td align="left">DHA_CRAPL</td>
<td align="left">CRAPL</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td align="center">2K(1), 2S(7)</td>
<td align="center">10</td>
<td align="left">PF00257_ma; DHLE_RAPSA (8e-20)</td>
</tr>
<tr><td align="left">DHB_CRAPL</td>
<td align="left">CRAPL</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(8)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA</td>
</tr>
<tr><td align="left">DHLE_RAPSA</td>
<td align="left">RAPSA</td>
<td align="left">Seed</td>
<td align="left"><bold>Canon</bold>
</td>
<td align="center">2Y(3), 2K(1), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_ma</td>
</tr>
<tr><td align="left">DHN1_PEA</td>
<td align="left">PEA</td>
<td align="left">Shoot, Cotyledon</td>
<td align="left">ABA, Desc, notCanon</td>
<td align="center">2Y(3), 2K(2)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA (1e-17)</td>
</tr>
<tr><td align="left">EC40_DAUCA</td>
<td align="left">DAUCA</td>
<td align="left">Seed, Embryo cells</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">2Y(3), 2K(1), 2S(6)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm</td>
</tr>
<tr><td align="left">O22623</td>
<td align="left">VACCO</td>
<td align="left">Floral buds, Leaf</td>
<td align="left">Cold</td>
<td align="center">None</td>
<td align="center">1,3</td>
<td align="left">DH1B_ORYSA (1e-5)</td>
</tr>
<tr><td align="left">O65216</td>
<td align="left">WHEAT</td>
<td align="left">Leaf, Root Crown, Seed</td>
<td align="left">ABA, Cold, Desc</td>
<td align="center">2K(6)</td>
<td align="center">1,3</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (8e-18)</td>
</tr>
<tr><td align="left">P93701</td>
<td align="left">VIGUN</td>
<td align="left">Leaf</td>
<td align="left">ABA, notCold, Desc, Salt</td>
<td align="center">2Y(2), 2K(1)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; EC40_DAUCA (6e-23)</td>
</tr>
<tr><td align="left">Q39937</td>
<td align="left">HELAN</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(3), 2K(1), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; EC40_DAUCA (1e-38)</td>
</tr>
<tr><td align="left">Q39938</td>
<td align="left">HELAN</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td align="center">2K(2), 2S(5), k(3)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (6e-21)</td>
</tr>
<tr><td align="left">Q40331</td>
<td align="left">MEDFA</td>
<td align="left">Callus</td>
<td align="left">notABA, Cold, notDesc</td>
<td align="center">2K(1)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (1e-7)</td>
</tr>
<tr><td align="left">Q40968</td>
<td align="left">PRUPE</td>
<td align="left">Bark</td>
<td align="left">Cold, Desc</td>
<td align="center">2Y(2), 2K(4)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; EC40_DAUCA (3e-13)</td>
</tr>
<tr><td align="left">Q41306</td>
<td align="left">SOLCO</td>
<td align="left">Leaf, Stem</td>
<td align="left">ABA, Cold</td>
<td align="center">2Y(2), 2K(2), 2S(7), k(3)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DHLE_RAPSA (9e-29)</td>
</tr>
<tr><td align="left">Q41451</td>
<td align="left">SOLTU</td>
<td align="left">Leaf, Stem</td>
<td align="left">ABA, Cold</td>
<td align="center">2Y(2), 2K(2), 2S(7), k(3)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DHLE_RAPSA (2e-29)</td>
</tr>
<tr><td align="left">Q9SBI7</td>
<td align="left">HORVU</td>
<td align="left">Seedling</td>
<td align="left">Desc, ABA, Cold</td>
<td align="center">2K(9)</td>
<td align="center">1,3 (295)</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (5e-16)</td>
</tr>
<tr><td align="left">Q9SPL8</td>
<td align="left">VIGUN</td>
<td align="left">Seed</td>
<td align="left">Cold</td>
<td align="center">2Y(2), 2K(1)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; EC40_DAUCA (1.8e-25)</td>
</tr>
<tr><td align="left">Q9ZTR2</td>
<td align="left">HORVU</td>
<td align="left">Seedling</td>
<td align="left">Desc, notABA, notCold</td>
<td align="center">2Y(2), 2K(3), 2S(9)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (4e-33)</td>
</tr>
<tr><td align="left">Q9ZTR3</td>
<td align="left">HORVU</td>
<td align="left">Seedling</td>
<td align="left">Desc, ABA, notCold</td>
<td align="center">2Y(1), 2K(2), 2S(8)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (2e-44)</td>
</tr>
<tr><td align="left">Q9ZTR4</td>
<td align="left">HORVU</td>
<td align="left">Seedling</td>
<td align="left">Desc, ABA, notCold</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (1e-47)</td>
</tr>
<tr><td align="left">Q9ZTR5</td>
<td align="left">HORVU</td>
<td align="left">Seedling</td>
<td align="left">Desc, ABA, notCold</td>
<td align="center">2Y(2), 2K(3), 2S(9)</td>
<td align="center">1,9</td>
<td align="left">PF00257_hmm; EC40_DAUCA (1e-43)</td>
</tr>
<tr><td align="left">COR4_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Root, Leaf, Crown</td>
<td align="left">ABA, Cold, Desc</td>
<td align="center">2K(1), 2S(9), k(7)</td>
<td align="center">3</td>
<td align="left">PF00257_ma; EC40_DAUCA (1e-7)</td>
</tr>
<tr><td align="left">CS12_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">Cold, notABA, notDesc</td>
<td align="center">2K(6)</td>
<td align="center">1,3</td>
<td align="left">PF00257_ma; EC40_DAUCA (8e-17)</td>
</tr>
<tr><td align="left">CS66_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">Cold, notABA, notDesc</td>
<td align="center">2K(5)</td>
<td align="center">1,3</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (6e-17)</td>
</tr>
<tr><td align="left">DH14_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Leaf, Stem, Root, Seed, Flower</td>
<td align="left">ABA, Desc, notCanon, Cold</td>
<td align="center">2K(2), 2S(7), k(10)</td>
<td align="center">3</td>
<td align="left">PF00257_ma; EC40_DAUCA (2e-10)</td>
</tr>
<tr><td align="left">DH1D_ORYSA</td>
<td align="left">ORYSA</td>
<td align="left">Seed, Shoot</td>
<td align="left">ABA, Salt</td>
<td align="center">2Y(1), 2K(2), 2S(4)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA (9e-43)</td>
</tr>
<tr><td align="left">DH1_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1B_ORYSA (2e-35)</td>
</tr>
<tr><td align="left">DH21_ORYSA</td>
<td align="left">ORYSA</td>
<td align="left">Seed</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (3e-50)</td>
</tr>
<tr><td align="left">DH2_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (4e-35)</td>
</tr>
<tr><td align="left">DH3_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_ma; DH1C_ORYSA (7e-45)</td>
</tr>
<tr><td align="left">DH4_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">2Y(1), 2K(2), 2S(7)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (2e-38)</td>
</tr>
<tr><td align="left">DH47_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Leaf, Stem, Seed</td>
<td align="left">ABA, Cold, Desc, notCanon</td>
<td align="center">2K(3), 2S(7), k(4)</td>
<td align="center">3</td>
<td align="left">PF00257_hmm; DHLE_RAPSA (2e-12)</td>
</tr>
<tr><td align="left">DHX2_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Leaf, Stem</td>
<td align="left">Cold, weak ABA, weak Desc</td>
<td align="center">2K(3)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm</td>
</tr>
<tr><td align="left">O64939</td>
<td align="left">LOPEL</td>
<td align="left">Root</td>
<td align="left">Salt</td>
<td align="center">2K(6)</td>
<td align="center">1,3,9</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (7e-16)</td>
</tr>
<tr><td align="left">Q41347</td>
<td align="left">STELP</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc, PEG</td>
<td align="center">2K(1), 2S(5), 2S(6), k(3)</td>
<td align="center">8</td>
<td align="left">PF00257_hmm; DHLE_RAPSA (4e-8)</td>
</tr>
<tr><td align="left">Q42409</td>
<td align="left">TRITU</td>
<td align="left">Root, Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">2K(2)</td>
<td align="center">1</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (6e-16)</td>
</tr>
<tr><td align="left">Q43488</td>
<td align="left">HORVU</td>
<td align="left">Leaf</td>
<td align="left">notABA, Cold, Desc</td>
<td align="center">2K(1), 2S(9), k(10)</td>
<td align="center">3</td>
<td align="left">PF00257_hmm; DH1B_ORYSA (4e-9)</td>
</tr>
<tr><td align="left">DH10_ARATH</td>
<td align="left">ARATH</td>
<td align="left">Leaf, Stem, Root, Seed, Flower</td>
<td align="left">weak ABA, weak Desc, notCanon, Cold</td>
<td align="center">2K(2), 2S(7), k(11)</td>
<td align="center">3</td>
<td align="left">PF00257_ma; EC40_DAUCA (4e-9)</td>
</tr>
<tr><td align="left">O04232</td>
<td align="left">SOLTU</td>
<td align="left">Tuber</td>
<td align="left">Cold</td>
<td align="center">2K(2), 2S(9), k(9)</td>
<td align="center">3</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (3e-11)</td>
</tr>
<tr><td align="left">O48622</td>
<td align="left">SPIOL</td>
<td align="left">Shoot</td>
<td align="left">Cold, Desc</td>
<td align="center">2K(4), k(3)</td>
<td align="center">3 (280)</td>
<td align="left">EC40_DAUCA (1e-6)</td>
</tr>
<tr><td align="left">Q41091</td>
<td align="left">PONTR</td>
<td align="left">Leaf</td>
<td align="left">Cold, notSalt, notDesc</td>
<td align="center">2S(5), k(8)</td>
<td align="center">3</td>
<td align="left">DH1C_ORYSA (1e-8)</td>
</tr>
<tr><td align="left">Q9XEL3</td>
<td align="left">PICGL</td>
<td align="left">Bud, Stem</td>
<td align="left">ABA, Cold, Desc</td>
<td align="center">2K(3), 2S(8), k(12)</td>
<td align="center">3</td>
<td align="left">PF00257_hmm; DH1C_ORYSA (2e-11)</td>
</tr>
<tr><td align="left">Q9ZR21</td>
<td align="left">CITUN</td>
<td align="left">Leaf</td>
<td align="left">Cold</td>
<td align="center">2S(5), k(9)</td>
<td align="center">3</td>
<td align="left">DH1C_ORYSA (1e-9)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>LEA Group 2 proteins, with the exemplar being DH11_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) whether any of the LEA Group 2 motifs 2Y, 2K or 2S, or poly-lysine stutters, are detected using agrep and the number of times each is found, 6) the superfamilies/stand-alone clusters in which the protein is found and 7) other evidence for accepting the protein as LEA Group 2.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec><title>Results from the POPP Analysis of LEA Proteins by Group</title>
<p>Table <xref ref-type="table" rid="T12">12</xref>
 lists a selection of the most significant peptides that result from placing the sequences of each Group into a separate database and applying popp_create.py to each database. A negative p-value indicates significant under-representation. Some care must be taken in interpreting the probabilities generated by the binomial distribution statistic, because larger datasets give rise to much more significant p-values. For that reason, only p-values below a threshold are considered here, where the threshold is determined from the mean below-threshold log-probability value (i.e. the average log probability for p-values less than 0.05) across the respective datasets. For these purposes, p-values below 0.05 are said to be significant, while those below the dataset mean value for each Group are described as highly significant. If just the first three, more hydrophilic Groups are considered, the list of highly significant peptides found in all three Groups is: -C, -F, +GE, -I, -L, -N, +Q and -W, where '+' before a peptide indicates over-representation and '-' indicates under-representation. In all three Groups, charged/polar residues feature prominently; K is very highly over-represented in Groups 2 and 3, and moderately so (9.7 × 10<sup>-6</sup>
) in Group 1. Group 1 also shows highly significant over-representation of R. Similarly, E is highly over-represented in Groups 1 and 3, but is not highly over-represented in Group 2 (4.9 × 10<sup>-13</sup>
). Of the other characteristics, glycine is highly over-represented in Group 1 and Group 2. However, in Group 3 glycine is found only marginally more often than expected by chance (p-value 0.012). Overall, the description of these Groups as hydrophilins is not completely borne out: they are indeed characterised by hydrophilic residues, but glycine is highly over-represented in only two of the three Groups.</p>
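<p>The signed p-value convention and the two significance levels described above can be illustrated with a minimal Python sketch. It is not the code of popp_create.py: the function names, the use of a one-tailed binomial probability, and the choice to compute the cutoff from a single list of p-values are all assumptions made for illustration.</p>

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of k successes in n trials (1 > p > 0)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def signed_p_value(observed, n, p_expected):
    """One-tailed binomial probability, negated when the peptide is under-represented."""
    if observed >= n * p_expected:
        ks, sign = range(observed, n + 1), 1.0    # upper tail: over-representation
    else:
        ks, sign = range(0, observed + 1), -1.0   # lower tail: under-representation
    tail = sum(math.exp(log_binom_pmf(k, n, p_expected)) for k in ks)
    return sign * min(tail, 1.0)

def mean_log_cutoff(signed_p_values, alpha=0.05):
    """Cutoff from the mean log10 of the p-values whose magnitude is below alpha.
    Exact zeros are skipped only because log10(0) is undefined (an assumption)."""
    logs = [math.log10(abs(p)) for p in signed_p_values if alpha > abs(p) > 0]
    return 10 ** (sum(logs) / len(logs)) if logs else alpha

def classify(p, cutoff, alpha=0.05):
    """Label a signed p-value by comparing its magnitude with alpha and the cutoff."""
    magnitude = abs(p)
    if magnitude >= alpha:
        return "not significant"
    return "highly significant" if cutoff > magnitude else "significant"
```

<p>In this sketch a peptide is labelled highly significant only when the magnitude of its p-value falls below the mean-log cutoff, which mirrors the reason given in the text for not comparing raw p-values between datasets of different sizes.</p>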
<p>The list of highly significant peptides confirms the previous finding that cysteine is lacking in Group 1, 2, 3 and 4 LEA proteins [<xref ref-type="bibr" rid="B3">3</xref>
]. In the current dataset, 86 of the 112 sequences have no cysteine residues at all, 17 have just one, six have two, and one sequence each has three, four and five cysteine residues. Similarly for tryptophan: 91 sequences have no tryptophan residues, 12 have one, seven have two, one sequence has three and one has four.</p>
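<p>A tally of this kind can be produced with a few lines of Python. The sketch below is illustrative only: the helper name and the toy sequences are hypothetical and are not taken from the dataset, while the two dictionaries simply restate the counts reported above so that their totals can be checked against the 112 sequences.</p>

```python
from collections import Counter

def residue_count_histogram(sequences, residue):
    """Map 'occurrences of residue in a sequence' to 'number of such sequences'."""
    return Counter(seq.upper().count(residue.upper()) for seq in sequences)

# Tallies reported in the text for the 112-sequence dataset.
reported_cys = {0: 86, 1: 17, 2: 6, 3: 1, 4: 1, 5: 1}
reported_trp = {0: 91, 1: 12, 2: 7, 3: 1, 4: 1}

# Hypothetical toy sequences, for illustration only.
toy = ["MKGEQ", "MCKKG", "GCCEQ", "MWKKE"]
print(residue_count_histogram(toy, "C"))
print(sum(reported_cys.values()), sum(reported_trp.values()))
```

<p>Both reported tallies sum to 112, which agrees with the stated size of the dataset.</p>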
<p>Another previous finding is that Group 2 LEA proteins are rich in glycine or alanine and proline [<xref ref-type="bibr" rid="B4">4</xref>
]. As noted above, G is highly significant for this Group (in fact extremely so, with a p-value of 0). On the other hand, A and P are under-represented, with signed p-values of -1.1 × 10<sup>-14</sup>
 and -1.3 × 10<sup>-7</sup>
 respectively. However, A is highly significant in Groups 3, 4, 5 and 6, which accords with the prediction that Groups 3, 5 and 6 have higher helical secondary structure content. The same is seen, for example, in the alanine-rich, alpha-helical antifreeze protein (ANPA_PSEAM) from winter flounder, PDB code 1wfb. From Table <xref ref-type="table" rid="T12">12</xref>
 it is evident that the highly significant peptides from Group 4 have disjoint overlaps with Group 2 (+GH, -V) and Group 3 (+A, +AA). Finally, if the four major Groups 1, 2, 3 and 6 are considered, the peptides that are highly significant in all four Groups are: -C, -F, -I, -L, +Q, and -W.</p>
</sec>
<sec><title>Results from Clustering LEA Protein Probability Profiles</title>
<p>Recalling that the aim of unsupervised machine learning is to cluster the input data so that related objects are grouped together while dissimilar objects fall into different clusters, a POPP vector was created for each LEA protein sequence, including the three members of the Uncharacterised set. The clustering application, popp_cmp.py, was then used to cluster the vectors, with the significance threshold set at 0.05. Bearing in mind that POPPs are not constrained to be in any particular cluster, and that the clusters can appear in any number of families and superfamilies, there is a remarkable level of agreement between the membership of the superfamilies and the Groups, both those derived from the literature and those observed in the supervised learning experiments discussed above.</p>
<p>In Tables <xref ref-type="table" rid="T1">1</xref>
 to <xref ref-type="table" rid="T9">9</xref>
, the column labelled SF lists the superfamilies in which each POPP has been placed. Because cluster, family and superfamily identifiers are created and numbered automatically, the specific numbers will bear no relation to LEA Group numbers; instead, what is significant are the sets of POPPs that appear in the same superfamily (i.e. share a superfamily identifier). Where an identifier appears in brackets, the corresponding POPP appears in a free-standing cluster, i.e. a cluster which is not sufficiently similar to any other cluster for it to have been included in a family. Table <xref ref-type="table" rid="T13">13</xref>
 lists, for each superfamily, the LEA Group it represents and the peptides making up the consensus POPP for the corresponding anchor family.</p>
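<p>The way the SF column is read can be sketched in Python as follows. The helper names are hypothetical, and the parsing rule (bracketed numbers are stand-alone clusters, all other numbers are superfamilies) is an assumption based on the description above; the sample entries are copied from the SF columns of the tables.</p>

```python
import re
from collections import defaultdict

def parse_sf(cell):
    """Split an SF cell into (superfamily ids, stand-alone cluster ids)."""
    standalone = [int(x) for x in re.findall(r"\((\d+)\)", cell)]
    unbracketed = re.sub(r"\(\d+\)", " ", cell)   # drop the bracketed ids
    superfamilies = [int(x) for x in re.findall(r"\d+", unbracketed)]
    return superfamilies, standalone

def group_by_identifier(sf_column):
    """Collect the proteins that share each superfamily or stand-alone cluster id."""
    by_superfamily, by_cluster = defaultdict(set), defaultdict(set)
    for protein, cell in sf_column.items():
        superfamilies, standalone = parse_sf(cell)
        for sf in superfamilies:
            by_superfamily[sf].add(protein)
        for cluster in standalone:
            by_cluster[cluster].add(protein)
    return by_superfamily, by_cluster

# SF entries as they appear in the tables (annotated with each protein's Group).
sf_column = {
    "Q9SBI7": "1,3 (295)",   # Group 2
    "O24439": "(295)",       # Uncharacterised
    "LE11_HELAN": "2",       # Group 4
    "LE25_LYCES": "2",       # Group 4
    "LE29_GOSHI": "2",       # Group 5
    "PM1_SOYBN": "1",        # Group 4
}
by_superfamily, by_cluster = group_by_identifier(sf_column)
print(sorted(by_cluster[295]))
print(sorted(by_superfamily[2]))
```

<p>Grouping by shared identifier, rather than by the numeric value itself, reflects the point that the automatically assigned numbers bear no relation to LEA Group numbers.</p>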
<p>Scanning the superfamily column (labelled SF) in Tables <xref ref-type="table" rid="T1">1</xref>
,<xref ref-type="table" rid="T2">2</xref>
,<xref ref-type="table" rid="T3">3</xref>
,<xref ref-type="table" rid="T4">4</xref>
,<xref ref-type="table" rid="T5">5</xref>
,<xref ref-type="table" rid="T6">6</xref>
,<xref ref-type="table" rid="T7">7</xref>
,<xref ref-type="table" rid="T8">8</xref>
,<xref ref-type="table" rid="T9">9</xref>
 a number of observations can be made:</p>
<table-wrap position="float" id="T3"><label>Table 3</label>
<caption><p>LEA Protein Group 3 (D7) Exemplar(s): LE7_GOSHI, LE76_BRANA</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="left">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">Pep</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">DRPF_CRAPL</td>
<td align="left">CRAPL</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td align="center">3(1), k(3)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; LE76_BRANA (3e-12)</td>
</tr>
<tr><td align="left">EDC8_DAUCA</td>
<td align="left">DAUCA</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">3(5)</td>
<td align="center">2</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">LE76_BRANA</td>
<td align="left">BRANA</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">3(5)</td>
<td align="center">5</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">LE7_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">3(2)</td>
<td align="center">2</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">LEA1_HORVU</td>
<td align="left">HORVU</td>
<td align="left">Aleurone</td>
<td align="left">ABA, Cold, Desc, <bold>Canon</bold>
, Salt</td>
<td align="center">3(7)</td>
<td align="center">2</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">LEA3_MAIZE</td>
<td align="left">MAIZE</td>
<td align="left">Seed, Leaf, Shoot</td>
<td align="left">ABA, Desc, <bold>Canon</bold>
</td>
<td align="center">3(4)</td>
<td align="center">2</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">LEA3_WHEAT</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">ABA, Desc</td>
<td align="center">3(7)</td>
<td align="center">2</td>
<td align="left">PF02987_hmm; LEA1_HORVU (1e-100)</td>
</tr>
<tr><td align="left">LED3_DAUCA</td>
<td align="left">DAUCA</td>
<td align="left">Seed</td>
<td align="left"><bold>Canon</bold>
</td>
<td align="center">3(4)</td>
<td align="center">2</td>
<td align="left">LE7_GOSHI (3e-27)</td>
</tr>
<tr><td align="left">O49816</td>
<td align="left">CICAR</td>
<td align="left">Mesocotyl</td>
<td align="left">notABA, notCold, Desc, Salt</td>
<td align="center">3(5)</td>
<td align="center">5</td>
<td align="left">PF02987_ma; LE76_BRANA (1e-46)</td>
</tr>
<tr><td align="left">O49817</td>
<td align="left">CICAR</td>
<td align="left">Mesocotyl</td>
<td align="left">notABA, notCold, Desc, Salt</td>
<td align="center">3(4)</td>
<td align="center">5</td>
<td align="left">PF02987_ma; LE76_BRANA (5e-36)</td>
</tr>
<tr><td align="left">Q03967</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">Desc, <bold>Canon</bold>
</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">Q06540</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">Cold, notABA, notDesc, notSalt</td>
<td align="center">2S(3)</td>
<td align="center">2</td>
<td align="left">Q39660 (6e-8); DRPF_CRAPL (5e-5)</td>
</tr>
<tr><td align="left">Q39058</td>
<td align="left">ARATH</td>
<td align="left">Shoot</td>
<td align="left">ABA, Cold, notDesc</td>
<td></td>
<td align="center">2</td>
<td align="left">Q39873 (7e-8)</td>
</tr>
<tr><td align="left">Q39660</td>
<td align="left">CHLVU</td>
<td align="left">Whole cells</td>
<td align="left">Cold</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_ma; LE76_BRANA (2e-5)</td>
</tr>
<tr><td align="left">Q39873</td>
<td align="left">SOYBN</td>
<td align="left">Seed, Leaf, Root</td>
<td align="left">ABA, <bold>Canon</bold>
, Salt</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_ma; EDC8_DAUCA (8e-37)</td>
</tr>
<tr><td align="left">Q40696</td>
<td align="left">ORYSA</td>
<td align="left">Root</td>
<td align="left">ABA, Salt</td>
<td align="center">3(5)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; LEA1_HORVU (2e-51)</td>
</tr>
<tr><td align="left">Q40709</td>
<td align="left">ORYSA</td>
<td align="left">Shoot</td>
<td align="left">notABA, Cold, Mannitol</td>
<td align="center">3(3)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; LEA3_MAIZE (2e-44)</td>
</tr>
<tr><td align="left">Q40869</td>
<td align="left">PICGL</td>
<td align="left">Embryo</td>
<td align="left">ABA</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_ma; LEA1_HORVU (9e-14)</td>
</tr>
<tr><td align="left">Q40929</td>
<td align="left">PSEMZ</td>
<td align="left">Seed</td>
<td align="left">Cold, <bold>Canon</bold>
</td>
<td align="center">3(1)</td>
<td align="center">5</td>
<td align="left">PF02987_ma; LE76_BRANA (2e-23)</td>
</tr>
<tr><td align="left">Q41060</td>
<td align="left">PEA</td>
<td align="left">Seed</td>
<td align="left">notABA, Sucrose, notCanon</td>
<td align="center">3(1)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; LE76_BRANA (8e-14)</td>
</tr>
<tr><td align="left">Q41154</td>
<td align="left">RICFL</td>
<td align="left">Thalli</td>
<td align="left">ABA, Desc</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_ma; EDC8_DAUCA (1e-31)</td>
</tr>
<tr><td align="left">Q41213</td>
<td align="left">BRANA</td>
<td align="left">Shoot, Seed</td>
<td align="left">notABA, Cold, notDesc, notCanon</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_hmm; EDC8_DAUCA (2e-5)</td>
</tr>
<tr><td align="left">Q42386</td>
<td align="left">BRANA</td>
<td align="left">Leaf</td>
<td align="left">notABA, Cold</td>
<td></td>
<td align="center">2</td>
<td align="left">PF02987_hmm; EDC8_DAUCA (5e-6)</td>
</tr>
<tr><td align="left">Q42512</td>
<td align="left">ARATH</td>
<td align="left">Shoot</td>
<td align="left">ABA, Cold, Desc</td>
<td></td>
<td align="center">2</td>
<td align="left">LEA3_MAIZE (6e-5)</td>
</tr>
<tr><td align="left">Q95V77</td>
<td align="left">APHAV</td>
<td align="left">Whole animal</td>
<td align="left">Desc</td>
<td align="center">3(1)</td>
<td align="center">2</td>
<td align="left">LEA1_HORVU (1e-13)</td>
</tr>
<tr><td align="left">Q96246</td>
<td align="left">ARATH</td>
<td align="left">Seed, immature silique</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">3(1)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; EDC8_DAUCA (1e-73)</td>
</tr>
<tr><td align="left">Q9M4T9</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">ABA, Cold</td>
<td align="center">3(1)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; LEA1_HORVU (1e-26)</td>
</tr>
<tr><td align="left">Q9SDV6</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">Cold, notABA, notDesc, notSalt</td>
<td align="center">2S(3)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; Q39873 (2e-4)</td>
</tr>
<tr><td align="left">Q9XET0</td>
<td align="left">SOYBN</td>
<td align="left">Seed</td>
<td align="left"><bold>Canon</bold>
</td>
<td align="center">3(4)</td>
<td align="center">5</td>
<td align="left">PF02987_ma; LE7_GOSHI (2e-28)</td>
</tr>
<tr><td align="left">Q9XFD0</td>
<td align="left">WHEAT</td>
<td align="left">Shoot</td>
<td align="left">ABA, Cold</td>
<td align="center">3(4)</td>
<td align="center">2</td>
<td align="left">PF02987_ma; LEA1_HORVU (4e-56)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>LEA Group 3 proteins, with the exemplars being LE7_GOSHI and LE76_BRANA. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) whether the Group 3 motif or poly-lysine stutters are detected using agrep and the number of times each is found, 6) the superfamilies/stand-alone clusters in which the protein is found and 7) other evidence for accepting the protein as LEA Group 3. Note the presence in two cases of the 2S (i.e. poly-serine) motif.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T4"><label>Table 4</label>
<caption><p>LEA Protein Group 4 (D113) Exemplar(s): LE13_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">Pep</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">LE11_HELAN</td>
<td align="left">HELAN</td>
<td align="left">Seed, Shoot</td>
<td align="left">ABA, Desc, <bold>Canon</bold>
</td>
<td></td>
<td align="center">2</td>
<td align="left">PF03760_hmm; PM1_SOYBN (7e-27)</td>
</tr>
<tr><td align="left">LE13_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td></td>
<td align="center">9</td>
<td align="left">PF03760_hmm; PM1_SOYBN (1e-19)</td>
</tr>
<tr><td align="left">LE25_LYCES</td>
<td align="left">LYCES</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td></td>
<td align="center">2</td>
<td align="left">PF03760_hmm; PM1_SOYBN (9e-18)</td>
</tr>
<tr><td align="left">O24442</td>
<td align="left">PHAVU</td>
<td align="left">Root, Embryo</td>
<td align="left">ABA, Desc, <bold>Canon</bold>
</td>
<td></td>
<td align="center">1</td>
<td align="left">PF03760_hmm; PM1_SOYBN (2e-32)</td>
</tr>
<tr><td align="left">PM1_SOYBN</td>
<td align="left">SOYBN</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">2Y(1)</td>
<td align="center">1</td>
<td align="left">PF03760_ma</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>LEA Group 4 proteins, with the exemplar being LE13_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) whether any of the LEA Group 1, 2 or 3 motifs or poly-lysine stutters are detected using agrep and the number of times each is found, 6) the superfamilies/stand-alone clusters in which the protein is found and 7) other evidence for accepting the protein as LEA Group 4. Note the presence in one case of the 2Y motif.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T5"><label>Table 5</label>
<caption><p>LEA Protein Group 5 (D29) Exemplar(s): LE29_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">Pep</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">LE29_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">k(3)</td>
<td align="center">2</td>
<td align="left">PF02987_ma</td>
</tr>
<tr><td align="left">Q93Y63</td>
<td align="left">MORBO</td>
<td align="left">Cortical parenchymal cells</td>
<td align="left">ABA, Cold, Desc</td>
<td align="center">3(2)</td>
<td align="center">2</td>
<td align="left">LE29_GOSHI (2e-26)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>LEA Group 5 proteins, with the exemplar being LE29_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) whether any of the LEA Group 1, 2 or 3 motifs or poly-lysine stutters are detected using agrep and the number of times each is found, 6) the superfamilies/stand-alone clusters in which the protein is found and 7) other evidence for accepting the protein as LEA Group 5. Note the presence of the Group 3 motif and poly-lysine.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T6"><label>Table 6</label>
<caption><p>LEA Protein Group 6 (D34) Exemplar(s): LE34_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">LE34_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Seed</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">7</td>
<td></td>
</tr>
<tr><td align="left">Q41850</td>
<td align="left">MAIZE</td>
<td align="left">Embryo, Leaf</td>
<td align="left">ABA, Desc, <bold>Canon</bold>
</td>
<td align="center">7</td>
<td align="left">LE34_GOSHI (6e-61)</td>
</tr>
<tr><td align="left">Q43424</td>
<td align="left">DAUCA</td>
<td align="left">Embryo</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">7</td>
<td align="left">LE34_GOSHI (4e-75)</td>
</tr>
<tr><td align="left">Q96245</td>
<td align="left">ARATH</td>
<td align="left">Seed</td>
<td align="left"><bold>Canon</bold>
</td>
<td align="center">7</td>
<td align="left">LE34_GOSHI (2e-77)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>LEA Group 6 proteins, with the exemplar being LE34_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) the superfamilies/stand-alone clusters in which the protein is found and 6) other evidence for accepting the protein as LEA Group 6. None of the LEA Group 1, 2 or 3 motifs match these protein sequences.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T7"><label>Table 7</label>
<caption><p>LEA Protein Group Lea5 (D73) Exemplar(s): LE5A_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">Pep</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">LE5A_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Leaf</td>
<td align="left">Desc</td>
<td align="center">2S(4)</td>
<td align="center">(299)</td>
<td align="left">PF03242_hmm</td>
</tr>
<tr><td align="left">LE5D_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Leaf</td>
<td align="left">Desc</td>
<td align="center">2S(4)</td>
<td align="center">(299)</td>
<td align="left">PF03242_ma</td>
</tr>
<tr><td align="left">Q39644</td>
<td align="left">CITSI</td>
<td align="left">Leaf, Ovule</td>
<td align="left">Salt, notCold</td>
<td></td>
<td></td>
<td align="left">PF03242_hmm; LE5D_GOSHI (2.4e-46)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Lea5/D73 proteins – currently not part of any numbering scheme for LEA proteins – with the exemplar being LE5A_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) whether any of the LEA Group 1, 2 or 3 motifs or poly-lysine stutters are detected using agrep and the number of times each is found, 6) the superfamilies/stand-alone clusters in which the protein is found and 7) other evidence for accepting the protein as LEA Group Lea5/D73. Note the presence in two cases of the 2S motif (poly-serine stutter). Note also that two of the three proteins are found in a single, stand-alone cluster containing just the pair of proteins, while the other sequence is not found in any cluster.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T8"><label>Table 8</label>
<caption><p>LEA Protein Group Lea14 (D95) Exemplar(s): LE14_GOSHI</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">SF</td>
<td align="center">Evidence</td>
</tr>
</thead>
<tbody><tr><td align="left">DRPD_CRAPL</td>
<td align="left">CRAPL</td>
<td align="left">Leaf</td>
<td align="left">ABA, Desc</td>
<td></td>
<td align="left">PF03168_ma; LE14_SOYBN (2e-52)</td>
</tr>
<tr><td align="left">LE14_GOSHI</td>
<td align="left">GOSHI</td>
<td align="left">Leaf</td>
<td align="left">Desc</td>
<td align="center">(297)</td>
<td align="left">PF03168_ma; LE14_SOYBN (2e-64)</td>
</tr>
<tr><td align="left">LE14_SOYBN</td>
<td align="left">SOYBN</td>
<td align="left">Leaf</td>
<td align="left">ABA, <bold>Canon</bold>
</td>
<td align="center">(297)</td>
<td align="left">PF03168_ma</td>
</tr>
<tr><td align="left">Q40159</td>
<td align="left">LYCES</td>
<td align="left">Root</td>
<td align="left">notABA, notDesc, possible osmotic stress</td>
<td align="center">3</td>
<td align="left">PF03168_ma; LE14_SOYBN (8e-53)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Lea14/D95 proteins, with the exemplar being LE14_GOSHI. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, 5) the superfamilies/stand-alone clusters in which the protein is found and 6) other evidence for accepting the protein as LEA Group Lea14/D95. None of the LEA Group 1, 2 or 3 motifs match these protein sequences. Note that two of the four proteins are found in a single, stand-alone cluster containing just the pair of proteins, one protein is found clustered with Group 2 LEA proteins in SF 3, while the remaining sequence is not found in any cluster.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T9"><label>Table 9</label>
<caption><p>Uncharacterised LEA Proteins</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">ID</td>
<td align="left">Species</td>
<td align="center">Tissue</td>
<td align="center">Expression</td>
<td align="center">SF</td>
</tr>
</thead>
<tbody><tr><td align="left">O24439</td>
<td align="left">PHAVU</td>
<td align="left">Root, Stem, Embryo</td>
<td align="left">ABA, Desc, <bold>Canon</bold>
</td>
<td align="center">(295)</td>
</tr>
<tr><td align="left">O81483</td>
<td align="left">ARATH</td>
<td align="left">Seed</td>
<td align="left">notABA, notDesc, notSalt, <bold>Canon</bold>
</td>
<td align="center">(279)</td>
</tr>
<tr><td align="left">Q9S7S3</td>
<td align="left">ARATH</td>
<td align="left">Seed</td>
<td align="left">notABA, Cold, notDesc, notSalt, <bold>Canon</bold>
</td>
<td align="center">(279)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Currently uncharacterised proteins which have expression patterns that are literally late embryogenesis, but which have no similarity to any of the previously described proteins. The columns are: 1) the protein identifier, 2) a code for the species (see Table <xref ref-type="table" rid="T10">10</xref>
), 3) the tissue(s) in which it has been found, 4) the conditions that give rise (or fail to give rise) to the expression of the gene, and 5) the stand-alone clusters in which the protein is found. Note that one pair of proteins cluster only with each other, while the third is found in a stand-alone cluster together with a Group 2 LEA protein.</p>
</table-wrap-foot>
</table-wrap>
<p>• The Group 4 LEA proteins are split between superfamilies covering Group 2 LEA proteins (PM1_SOYBN, LE13_GOSHI, O24442) and superfamilies comprising Group 3 LEA proteins (LE11_HELAN, LE25_LYCES).</p>
<p>• The two Group 5 proteins, LE29_GOSHI and Q93Y63, are clustered among the Group 3 LEA proteins (Superfamily 2).</p>
<p>• The Group 1 LEA proteins are split across two superfamilies: clusters involving EM1_ARATH, EMB1_DAUCA, L193_HORVU, LE19_GOSHI, L194_HORVU and LE10_HELAN appear in Superfamily 4, while clusters involving EM1_WHEAT, EM2_WHEAT, EMB5_MAIZE, EMP1_ORYSA, L19A_HORVU, L19B_HORVU, EM6_ARATH and SEEP_RAPSA are found in a different superfamily, Superfamily 6.</p>
<p>• The Group 2 LEA proteins are split across five superfamilies. Looking at the consensus POPPs of the corresponding anchor families one notices that all the superfamilies have peptides from the 2K motif, while Superfamily 8 and Superfamily 10 have peptides from 2S. None of the anchor families have peptides from the 2Y motif, but they are present in other Families in Superfamily 1 (data not shown).</p>
<p>• Two of the Uncharacterised canonical LEA proteins, O81483 and Q9S7S3, cluster only with each other, while the third in this set, O24439, clusters with a Group 2 LEA protein, Q9SBI7. This situation persists even when the clustering thresholds are lowered to the point where significant numbers of Group 3 LEA proteins are found clustered with Group 2 LEA proteins. Furthermore, it is worth noting that the clustering of O24439 with Q9SBI7 is free-standing, i.e. not in a superfamily, which suggests that the relationship (supported by the supervised machine-learning rules) is a distant one.</p>
</sec>
<sec><title>Results from Keyword Clustering of POPP Search Hits</title>
<p>Table <xref ref-type="table" rid="T14">14</xref>
 summarises some of the keywords and phrases associated with each superfamily (and hence Group) through the application of the Protein Annotators' Assistant to the sets of hits returned by popp_search.py when given as queries the consensus POPPs for each anchor family. Lea5 and Lea14 are each represented by the consensus POPP of the single cluster representing that Group. For compactness, only the most significant, distinct keywords are listed.</p>
<table-wrap position="float" id="T10"><label>Table 10</label>
<caption><p>Mapping from SwissProt Species Codes to Species Names Used in LEA Protein Group Tables</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="left">Code</td>
<td align="left">Species</td>
</tr>
</thead>
<tbody><tr><td align="left">APHAV</td>
<td align="left"><italic>Aphelenchus avenae</italic>
</td>
</tr>
<tr><td align="left">ARATH</td>
<td align="left"><italic>Arabidopsis thaliana</italic>
</td>
</tr>
<tr><td align="left">BRANA</td>
<td align="left"><italic>Brassica napus</italic>
</td>
</tr>
<tr><td align="left">CHLVU</td>
<td align="left"><italic>Chlorella vulgaris</italic>
</td>
</tr>
<tr><td align="left">CICAR</td>
<td align="left"><italic>Cicer arietinum</italic>
</td>
</tr>
<tr><td align="left">CITSI</td>
<td align="left"><italic>Citrus sinensis</italic>
</td>
</tr>
<tr><td align="left">CITUN</td>
<td align="left"><italic>Citrus unshiu</italic>
</td>
</tr>
<tr><td align="left">CRAPL</td>
<td align="left"><italic>Craterostigma plantagineum</italic>
</td>
</tr>
<tr><td align="left">DAUCA</td>
<td align="left"><italic>Daucus carota</italic>
</td>
</tr>
<tr><td align="left">GOSHI</td>
<td align="left"><italic>Gossypium hirsutum</italic>
</td>
</tr>
<tr><td align="left">HELAN</td>
<td align="left"><italic>Helianthus annuus</italic>
</td>
</tr>
<tr><td align="left">HORVU</td>
<td align="left"><italic>Hordeum vulgare</italic>
</td>
</tr>
<tr><td align="left">LOPEL</td>
<td align="left"><italic>Lophopyrum elongatum</italic>
</td>
</tr>
<tr><td align="left">LYCES</td>
<td align="left"><italic>Lycopersicon esculentum</italic>
</td>
</tr>
<tr><td align="left">MAIZE</td>
<td align="left"><italic>Zea mays</italic>
</td>
</tr>
<tr><td align="left">MEDFA</td>
<td align="left"><italic>Medicago falcata</italic>
</td>
</tr>
<tr><td align="left">MORBO</td>
<td align="left"><italic>Morus bombycis</italic>
</td>
</tr>
<tr><td align="left">ORYSA</td>
<td align="left"><italic>Oryza sativa</italic>
</td>
</tr>
<tr><td align="left">PEA</td>
<td align="left"><italic>Pisum sativum</italic>
</td>
</tr>
<tr><td align="left">PHAVU</td>
<td align="left"><italic>Phaseolus vulgaris</italic>
</td>
</tr>
<tr><td align="left">PICGL</td>
<td align="left"><italic>Picea glauca</italic>
</td>
</tr>
<tr><td align="left">PONTR</td>
<td align="left"><italic>Poncirus trifoliata</italic>
</td>
</tr>
<tr><td align="left">PRUPE</td>
<td align="left"><italic>Prunus persica</italic>
</td>
</tr>
<tr><td align="left">PSEMZ</td>
<td align="left"><italic>Pseudotsuga menziesii</italic>
</td>
</tr>
<tr><td align="left">RAPSA</td>
<td align="left"><italic>Raphanus sativus</italic>
</td>
</tr>
<tr><td align="left">RICFL</td>
<td align="left"><italic>Riccia fluitans</italic>
</td>
</tr>
<tr><td align="left">SOLCO</td>
<td align="left"><italic>Solanum commersonii</italic>
</td>
</tr>
<tr><td align="left">SOLTU</td>
<td align="left"><italic>Solanum tuberosum</italic>
</td>
</tr>
<tr><td align="left">SOYBN</td>
<td align="left"><italic>Glycine max</italic>
</td>
</tr>
<tr><td align="left">SPIOL</td>
<td align="left"><italic>Spinacia oleracea</italic>
</td>
</tr>
<tr><td align="left">STELP</td>
<td align="left"><italic>Stellaria longipes</italic>
</td>
</tr>
<tr><td align="left">TRITU</td>
<td align="left"><italic>Triticum turgidum subsp. durum</italic>
</td>
</tr>
<tr><td align="left">VACCO</td>
<td align="left"><italic>Vaccinium corymbosum</italic>
</td>
</tr>
<tr><td align="left">VIGUN</td>
<td align="left"><italic>Vigna unguiculata</italic>
</td>
</tr>
<tr><td align="left">WHEAT</td>
<td align="left"><italic>Triticum aestivum</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>The codes are those used in forming SwissProt protein identifiers. With a small number of exceptions for the most common species such as PEA, WHEAT and MAIZE, identifiers are generally made up of the first three letters of the genus name followed by the first two letters of the species name.</p>
</table-wrap-foot>
</table-wrap>
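<p>The naming convention described in the footnote to Table 10 can be sketched as follows. This is a minimal illustration rather than the SwissProt procedure itself: the function name species_code and the EXCEPTIONS dictionary are hypothetical, and the exceptions listed are only those rows of Table 10 that do not follow the three-plus-two rule.</p>

```python
# Illustrative sketch only. EXCEPTIONS holds the rows of Table 10 that do not
# follow the "first three letters of genus + first two of species" rule; it is
# not the complete SwissProt exception list.
EXCEPTIONS = {
    "Pisum sativum": "PEA",
    "Triticum aestivum": "WHEAT",
    "Zea mays": "MAIZE",
    "Glycine max": "SOYBN",
    "Stellaria longipes": "STELP",
    "Pseudotsuga menziesii": "PSEMZ",
}


def species_code(species_name: str) -> str:
    """Return a SwissProt-style species code for a binomial species name."""
    if species_name in EXCEPTIONS:
        return EXCEPTIONS[species_name]
    # Only the genus and species epithet are used; any subspecies is ignored.
    genus, species = species_name.split()[:2]
    return (genus[:3] + species[:2]).upper()


if __name__ == "__main__":
    for name in ("Arabidopsis thaliana", "Gossypium hirsutum", "Zea mays"):
        print(name, species_code(name))
```

<p>Under these assumptions the default rule reproduces codes such as ARATH and GOSHI, while the common-species codes are looked up directly.</p>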
<table-wrap position="float" id="T11"><label>Table 11</label>
<caption><p>LEA classification rule set induced by supervised learning</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">Group</td>
<td align="left">Rule</td>
</tr>
</thead>
<tbody><tr><td align="center">2a</td>
<td align="left">H ≤ 0.15 and aromatic ≥ 0.077 and min_hyph ≤ -1.97 and charged ≤ 0.42</td>
</tr>
<tr><td align="center">2b</td>
<td align="left">L ≥ 0.23 and H ≤ 0.3 and ave_hyph ≥ -1.233 and ave_hyph ≤ -0.978</td>
</tr>
<tr><td align="center">2c</td>
<td align="left">aromatic ≥ 0.077 and min_hyph ≤ -2.743 and charged ≥ 0.4</td>
</tr>
<tr><td align="center">3</td>
<td align="left">H ≥ 0.34</td>
</tr>
<tr><td align="center">1</td>
<td align="left">E ≥ 0.02 and ave_hyph ≤ -1.241</td>
</tr>
<tr><td align="center">LE5</td>
<td align="left">max_hyph ≥ 1.0 and ave_hyph ≤ -0.3</td>
</tr>
<tr><td align="center">LE14</td>
<td align="left">aliphatic ≥ 0.25</td>
</tr>
<tr><td align="center">6</td>
<td align="left">H ≥ 0.25 and max_hyph ≥ 0.5</td>
</tr>
<tr><td align="center">4</td>
<td align="left">Otherwise</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>The rules are to be applied in a top-down, if .. else if, manner, so, for example, if the percentage of predicted helical conformation (expressed as a number in the range 0 .. 1.0) is greater than or equal to 0.34 then the protein is classified as a Group 3 LEA protein, but only if each of the rules above has failed, e.g. because the percentage of aromatic residues is less than 0.077 in the Group 2 rules. min_hyph and max_hyph are, respectively, the values of minimum and maximum hydrophobicity windows, while ave_hyph is the average across all the hydrophobicity windows. H, E and L refer, respectively, to the percentage composition of amino acids that are found by ProteinPredict (in four-state mode) to be alpha-helical, beta-sheet or loop.</p>
</table-wrap-foot>
</table-wrap>
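<p>The top-down, first-match-wins reading of the rules in Table 11 can be sketched as an ordered rule list. This is a minimal illustration, not the Ripper output format or the author's code: the names RULES, OPS and classify are hypothetical, the thresholds are copied from Table 11, and the feature values are assumed to be supplied as fractions in the range 0 to 1.0 (or hydrophobicity scores) under the same names as in the table.</p>

```python
import operator

# Comparison operators are named rather than written as symbols so that the
# rules can be stored as plain data.
OPS = {"le": operator.le, "ge": operator.ge}

# (label, [(feature, comparison, threshold), ...]) in the order of Table 11.
RULES = [
    ("2a", [("H", "le", 0.15), ("aromatic", "ge", 0.077),
            ("min_hyph", "le", -1.97), ("charged", "le", 0.42)]),
    ("2b", [("L", "ge", 0.23), ("H", "le", 0.3),
            ("ave_hyph", "ge", -1.233), ("ave_hyph", "le", -0.978)]),
    ("2c", [("aromatic", "ge", 0.077), ("min_hyph", "le", -2.743),
            ("charged", "ge", 0.4)]),
    ("3", [("H", "ge", 0.34)]),
    ("1", [("E", "ge", 0.02), ("ave_hyph", "le", -1.241)]),
    ("LE5", [("max_hyph", "ge", 1.0), ("ave_hyph", "le", -0.3)]),
    ("LE14", [("aliphatic", "ge", 0.25)]),
    ("6", [("H", "ge", 0.25), ("max_hyph", "ge", 0.5)]),
]
DEFAULT_LABEL = "4"  # the "Otherwise" row of Table 11


def classify(features: dict) -> str:
    """Return the label of the first rule whose conditions all hold."""
    for label, conditions in RULES:
        if all(OPS[op](features[name], threshold)
               for name, op, threshold in conditions):
            return label
    return DEFAULT_LABEL
```

<p>Because the rules are tried in order, a protein that satisfies both the Group 3 rule and the Group 6 rule is labelled Group 3, which is the behaviour the table footnote describes.</p>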
<table-wrap position="float" id="T12"><label>Table 12</label>
<caption><p>Highly significantly over- and under-represented peptides across LEA Protein Groups</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">Grp</td>
<td align="center">Threshold Pr</td>
<td align="center">Representation</td>
<td align="center">Sample of Significant Peptides (negative p-values indicate under-representation)</td>
</tr>
</thead>
<tbody><tr><td align="center">1</td>
<td align="center">3.9e-07</td>
<td align="center">over</td>
<td align="center">G (2.6e-49), E (7.9e-36), Q (2.6e-10), R (4.2e-08), GG (5.8e-47), KGG (1.6e-41), EMG (9.5e-33), QMG (2.1e-19)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">I (-3.7e-16), V (-1.1e-14), P (-1.7e-14), F (-7.4e-14), N (-1.3e-12), L (-4.2e-11), C (-6.6e-11), W (-1.7e-09)</td>
</tr>
<tr><td align="center">2</td>
<td align="center">1.1e-10</td>
<td align="center">over</td>
<td align="center">G (0), TG (0), H (7.4e-291), GG (6.4e-178), T (6.8e-122), K (1.4e-59), Q (8.7e-28), HG (5.6e-187), KLP (2.8e-170), EK (1.1e-120), YG (4.7e-101), SSS (2.0e-43)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">L (-5.6e-123), F (-2.2e-81), I (-1.1e-52), V (-9.8e-47), N (-3.7e-43), R (-3.8e-39), C (-1.0e-36), W (-1.7e-35), S (-5.2e-25)</td>
</tr>
<tr><td align="center">3</td>
<td align="center">4.3e-08</td>
<td align="center">over</td>
<td align="center">A (3.6e-246), K (7.3e-140), T (3.2e-48), E (2.9e-37), Q (7.8e-32), KD (1.6e-93), AKD (1.6e-83), AKE (2.4e-45), KDY (2.1e-46), EK (1.8e-45)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">L (-1.2e-89), I (-3.9e-61), P (-2.8e-51), F (-6.1e-35), W (-8.3e-27), C (-4.2e-19), N (-1.5e-14), R (-2.3e-12)</td>
</tr>
<tr><td align="center">4</td>
<td align="center">4.1e-04</td>
<td align="center">over</td>
<td align="center">TG (1.9e-17), G (4.7e-14), T (9.5e-12), GH (8.7e-09), A (7.8e-08), AKA (1.9e-07), EK (8.6e-07), AA (3.8e-05)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">L (-5.1e-19), I (-4.8e-09), F (-1.4e-08), V (-5.6e-06), C (-4.6e-05), W (-1.8e-04), S (-8.5e-04)</td>
</tr>
<tr><td align="center">5</td>
<td align="center">6.2e-04</td>
<td align="center">over</td>
<td align="center">AKE (1.2e-17), K (9.2e-13), A (4.1e-10), E (1.0e-08), EK (2.7e-05)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">L (-5.2e-11), P (-4.3e-08), I (-1.5e-06)</td>
</tr>
<tr><td align="center">6</td>
<td align="center">8.4e-04</td>
<td align="center">over</td>
<td align="center">A (6.5e-30), AA (3.1e-18), AT (1.7e-08), AE (1.8e-07), QS (2.2e-06), GV (3.7e-06), GG (8.6e-06), Q (4.1e-05), V (1.6e-04), QSA (2e-13)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">L (-2.2e-10), F (-8.9e-09), C (-7.9e-07), Y (-6.0e-06), I (-3.5e-05), K (-3.3e-04), W (-3.0e-04)</td>
</tr>
<tr><td align="center">Lea5</td>
<td align="center">1.2e-03</td>
<td align="center">over</td>
<td align="center">A (4.4e-05), GA (4.1e-05), GY (8.8e-05), SS (1.3e-04), R (2.6e-04), S (6.9e-04)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">Q (-4.9e-04)</td>
</tr>
<tr><td align="center">Lea14</td>
<td align="center">1.3e-03</td>
<td align="center">over</td>
<td align="center">IP (1.1e-07), D (7.7e-05), K (3.3e-04), I (1.2e-03)</td>
</tr>
<tr><td></td>
<td></td>
<td align="center">under</td>
<td align="center">R (-4.1e-06), Q (-8.0e-06), F (-3.3e-03)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>The table lists a sample of the peptides that are highly over-represented or highly under-represented when popp_create.py is applied to each group of LEA protein sequences taken as a whole, i.e. peptides whose probabilities are more stringent than the thresholds listed in the second column. The different thresholds arise from differences in the numbers of sequences, and hence in amino acid counts, for each Group.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T13"><label>Table 13</label>
<caption><p>Consensus POPPs for the anchor families of each superfamily</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">Group</td>
<td align="center">SF</td>
<td align="center">Anchor Family Consensus POPP</td>
</tr>
</thead>
<tbody><tr><td align="center">1</td>
<td align="center">4</td>
<td align="center">+E, +G, +EG, +GE, +GG, +KG, +QE, +RK, +GGE, +KGG</td>
</tr>
<tr><td align="center">1</td>
<td align="center">6</td>
<td align="center">+E, +G, +DE, +EG, +ES, +GG, +GQ, +RE, +RK, +ARE, +DES, +REG</td>
</tr>
<tr><td align="center">2</td>
<td align="center">1</td>
<td align="center">+G, -L, +EK, +GG, +GT, +EKL, +IKE, +KEK, +KIK, +KKG, +KLP, +LPG</td>
</tr>
<tr><td align="center">2</td>
<td align="center">3</td>
<td align="center">-F, -I, -L, -R, -W, +DK, +EK, +KK, +KL, +LP, +TH, +EKK, +KEK, +KLP, +LPG</td>
</tr>
<tr><td align="center">2</td>
<td align="center">8</td>
<td align="center">+EK, +SS, +EKI, +KEK, +KIK, +SSS</td>
</tr>
<tr><td align="center">2</td>
<td align="center">9</td>
<td align="center">-F, +G, -I, -L, -V, +AG, +EK, +GG, +GH, +GT, +TA, +TG, +GGT, +GTG, +TAG, +TGG</td>
</tr>
<tr><td align="center">2</td>
<td align="center">10</td>
<td align="center">-F, +G, -I, +AG, +EK, +GG, +GQ, +KE, +SS, +EKL, +GAG, +IKE, +KEK, +KLP, +LPG, +SSS</td>
</tr>
<tr><td align="center">3</td>
<td align="center">2</td>
<td align="center">+A, -C, +E, -F, -I, +K, -L, -P, +AE, +AK, +EK, +ET, +GE, +GK, +KE, +AAE, +AKD, +EKA</td>
</tr>
<tr><td align="center">3</td>
<td align="center">5</td>
<td align="center">+A, -I, +K, -L, -P, +Q, +T, -V, +AA, +AQ, +EK, +KE, +KT, +QA, +QQ, +QS, +QT, +TQ, +AAK, +AQA, +EKT, +QAA, +TQQ</td>
</tr>
<tr><td align="center">6</td>
<td align="center">7</td>
<td align="center">+A, -F, -L, +AA, +AE, +MQ, +QS, +VA, +AAA, +GVA, +QSA, +SAA</td>
</tr>
<tr><td align="center">Lea5</td>
<td align="center">299</td>
<td align="center">+A, +R, +S, +AM, +GA, +GY, +RP, +SF, +SS, +YS</td>
</tr>
<tr><td align="center">Lea14</td>
<td align="center">297</td>
<td align="center">+D, -R, +AS, +IP, +KV, +VS, +TIP</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Clusters, families and superfamilies closely mirror the structure of the LEA Groups, with the exception of Group 4 and Group 5. Against each LEA Group are listed the superfamilies that contain proteins from that Group (column 2) and the peptides forming the consensus POPP of the anchor (i.e. most typical) family in the superfamily. '+' before a peptide indicates significant over-representation; '-' indicates significant under-representation.</p>
</table-wrap-foot>
</table-wrap>
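<p>The '+' and '-' notation used for the consensus POPPs in Table 13 lends itself to a simple comparison of two profiles. The sketch below parses a consensus POPP into signed peptides and reports the fraction of signed peptides two profiles share. The function names parse_popp and shared_fraction are hypothetical, and the Jaccard overlap is used purely for illustration; it is an assumption and not the scoring scheme of popp_search.py.</p>

```python
def parse_popp(text: str) -> dict:
    """Map each peptide to +1 (over-represented) or -1 (under-represented)."""
    popp = {}
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        sign, peptide = token[0], token[1:]
        if sign not in "+-":
            raise ValueError("peptide without a leading sign: " + token)
        popp[peptide] = 1 if sign == "+" else -1
    return popp


def shared_fraction(a: dict, b: dict) -> float:
    """Jaccard overlap of signed peptides (illustrative measure only)."""
    items_a, items_b = set(a.items()), set(b.items())
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


# Consensus POPPs copied from Table 13.
SF4 = parse_popp("+E, +G, +EG, +GE, +GG, +KG, +QE, +RK, +GGE, +KGG")
SF6 = parse_popp("+E, +G, +DE, +EG, +ES, +GG, +GQ, +RE, +RK, +ARE, +DES, +REG")
SF2 = parse_popp("+A, -C, +E, -F, -I, +K, -L, -P, +AE, +AK, +EK, +ET, "
                 "+GE, +GK, +KE, +AAE, +AKD, +EKA")

if __name__ == "__main__":
    print("SF4 vs SF6:", shared_fraction(SF4, SF6))
    print("SF4 vs SF2:", shared_fraction(SF4, SF2))
```

<p>For the two Group 1 anchor families (Superfamilies 4 and 6), five signed peptides are shared (E, G, EG, GG and RK) out of 17 distinct ones, a larger overlap than Superfamily 4 has with the Group 3 anchor family of Superfamily 2.</p>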
<table-wrap position="float" id="T14"><label>Table 14</label>
<caption><p>Keywords/Phrases for each Group and Superfamily</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">Group</td>
<td align="center">SF</td>
<td align="left">Principal Keywords/Phrases</td>
</tr>
</thead>
<tbody><tr><td align="center">1</td>
<td align="center">4</td>
<td align="left">histone H4, chromosomal protein, nuclear protein, DNA binding</td>
</tr>
<tr><td align="center">1</td>
<td align="center">6</td>
<td align="left">dsRNA binding, DNA gyrase, breakage, CLP, ATP binding</td>
</tr>
<tr><td align="center">2</td>
<td align="center">1</td>
<td align="left">break, ATP binding, DNA topoisomerase, protein biosynthesis, topoisomerase, repair</td>
</tr>
<tr><td align="center">2</td>
<td align="center">3</td>
<td align="left">coiled, coil, nuclear protein, caldesmon, histone H1, chaperone, tropomyosin filament, break, DNA topoisomerase</td>
</tr>
<tr><td align="center">2</td>
<td align="center">8</td>
<td align="left">DNA topoisomerase, nuclear protein, HMG box, coiled coil</td>
</tr>
<tr><td align="center">2</td>
<td align="center">9</td>
<td align="left">transcriptional inhibition, glycosyl hydrolase, nuclear protein</td>
</tr>
<tr><td align="center">2</td>
<td align="center">10</td>
<td align="left">nuclear protein, DNA binding, transcription regulation, intermediate filament, keratin, chaperone, homeobox, coiled coil, HMG box domain, cytoskeletal</td>
</tr>
<tr><td align="center">3</td>
<td align="center">2</td>
<td align="left">chaperone, coiled coil, tropomyosin, stress, filament, phosphorylation, caldesmon, elongation factor, neurofilament, actin binding, cytoskeleton, rotamase</td>
</tr>
<tr><td align="center">3</td>
<td align="center">5</td>
<td align="left">coiled coil, histone H1, filament, nuclear protein, neurofilament, flagella, HAMP domain, synuclein, DNA binding, hsp70</td>
</tr>
<tr><td align="center">6</td>
<td align="center">7</td>
<td align="left">groel protein, nuclear protein, histone H1, chaperonin, DNA binding, HAMP domain, synuclein, transcription regulation</td>
</tr>
<tr><td align="center">Group</td>
<td align="center">Cluster</td>
<td align="left">Principal Keywords/Phrases</td>
</tr>
<tr><td align="center">Lea5</td>
<td align="center">299</td>
<td align="left">DNA binding, transcription regulation, nuclear protein, gata, zinc finger, homeobox</td>
</tr>
<tr><td align="center">Lea14</td>
<td align="center">297</td>
<td align="left">esterase, gapdh, chaperone protein DNA, glycoprotein</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>For each superfamily, the consensus POPP (Table <xref ref-type="table" rid="T13">13</xref>
) has been set as a query against a database of POPP vectors representing SwissProt. For each query, the protein hits, excluding LEA proteins, were submitted to the Protein Annotators' Assistant, which returns a list of keywords and phrases shared by sets of the submitted proteins. A sample of the most prominent are listed against the Group/Class and the corresponding superfamilies/clusters. Matches based on shared biases in peptide composition should not be read as actual functions that the search hits share with the LEA proteins; rather, they can indicate shared mechanisms or structural elements.</p>
</table-wrap-foot>
</table-wrap>
<p>When scanning Table <xref ref-type="table" rid="T14">14</xref>
 it is worth bearing in mind that matches based on shared biases in peptide composition should not be read as actual functions that the search hits share with the LEA proteins; rather, they can indicate shared mechanisms or structural elements. In this, POPP searching is similar in spirit to testing a sequence against the motifs in the PROSITE database [<xref ref-type="bibr" rid="B23">23</xref>
] or against the fingerprints in the PRINTS database [<xref ref-type="bibr" rid="B24">24</xref>
]. The difference, in principle, is that motifs and fingerprints can be seen as a conjunction of gapped or ungapped patterns and are relatively long, while POPPs are a disjunction of short patterns which are distinguished by being significantly over- or under-represented.</p>
</sec>
</sec>
<sec><title>Discussion</title>
<p>As mentioned in the Introduction, one source of confusion in the coverage to date of LEA proteins has been the overlapping and sometimes contradictory assignments to Groups. For example, if [<xref ref-type="bibr" rid="B12">12</xref>
] is taken as a starting point, [<xref ref-type="bibr" rid="B3">3</xref>
] differs from the former by coalescing the proteins corresponding to LEA protein Group 6 and Lea14 into a single Group (which in that paper is called Group 5); Lea5 is not found in any Group in this scheme. On the other hand, in [<xref ref-type="bibr" rid="B4">4</xref>
], the Group 4 of [<xref ref-type="bibr" rid="B12">12</xref>
] has been renamed Group 5, while the Group labelled Lea14 in this study is called Group 4. There is agreement, however, on the first three Groups. Given the new findings on this sizable sample taken from the spectrum of LEA proteins, it is now possible to revisit the different LEA Groups.</p>
<p>Group 1 LEA proteins are strongly hydrophilic and each cluster has the peptides E and RK over-represented (not found in any other Group). The phrase DNA binding appears in various guises connected with this group (Table <xref ref-type="table" rid="T14">14</xref>
). As can be seen from the respective entries in Table <xref ref-type="table" rid="T13">13</xref>
, consensus POPPs for the two superfamilies representing Group 1 LEA proteins are in fact very similar. In addition, from the input data used for the supervised machine learning experiments (not shown) it is noted that the members of Superfamily 4 generally have a higher percentage of charged amino acids than Superfamily 6 (and some of the highest percentages overall). The LEA proteins covered by Superfamily 4 also include those with repeats of the Group 1 motif.</p>
<p>Analysing the Group 2 LEA proteins exposes a difficulty with the methodology of retrospective reanalysis: the data that would be required to settle questions of group membership are often not available from the original publications. However, Group 2 appears to split into three subgroups, labelled 2a, 2b and 2c, with the line of demarcation being between those Group 2 LEA proteins which are cold-tolerant and those which are sensitive to cold stress. The split is evident in the three rules proposed by the classification engine Ripper. Subgroup 2a has low predicted helix content and a medium to high percentage of aromatic residues, while Subgroup 2b has high predicted loop content. All three subgroups are hydrophilic, but the third and smallest subgroup, Subgroup 2c, is very hydrophilic. The eight proteins which were found not to be up-regulated by cold stress are in 2a, while all members of 2c are up-regulated by cold stress. The proteins in Subgroup 2a, and in particular the proteins not up-regulated by cold stress, are covered by Superfamily 1 and Superfamily 9. All the members of Subgroup 2c have poly-lysine stutters (versus 5 in the much larger Subgroup 2b and 4 in Subgroup 2a), and most of those with poly-lysine stutters are found to be cold tolerant; for the remainder, data on cold tolerance have not been presented. In general, tolerance of cold is found associated with Superfamily 3.</p>
<p>The entire Group is characterised by an over-representation of either H or SSS (often both); O24442 and PM1_SOYBN (from Group 4, though arguably Group 2) and EM1_ARATH also have an over-representation of H, while Q06540 and Q9SDV6 from Group 3 and LE5A_GOSHI and LE5D_GOSHI from Lea5 have poly-serine stutters of at least 3 aa. The poly-serine stutters are all the more remarkable when one notes that serine by itself is highly under-represented. PM1_SOYBN also matches the 2Y motif corresponding to the Close (1997) Y-segment, which accounts in part for its presence in Superfamily 3. Fourteen of the 22 Subgroup 2a proteins have an over-representation of GNP or YGN, corresponding to the Y-segment of [<xref ref-type="bibr" rid="B13">13</xref>
], which suggests that Subgroup 2a is distinct from the other Subgroups. On the other hand, K is over-represented in all six of the Subgroup 2c LEA proteins and 9 out of 16 Subgroup 2b LEA proteins. (It is also over-represented in 20 of 23 Group 3 LEA proteins and the subset of Group 1 LEA proteins discussed above.) The suggestion, therefore, is that while most Subgroup 2a LEA proteins have a Close (1997) K-segment, it is less significant than those of the Subgroup 2b and 2c LEA proteins (cf. DH11_GOSHI and DH14_LYCES versus DH47_ARATH), which suggests a role in cold stress resistance. It is therefore likely that many of the Subgroup 2b proteins which are found in Superfamily 1 but which are not specifically cold induced, such as DH1_HORVU and DH21_HORVU, might in fact also be induced by cold stress. The association of the K-segment with cold tolerance has been noted by other researchers [<xref ref-type="bibr" rid="B13">13</xref>
]. Finally, as mentioned above, the non-ABA-dependent protein Q40159, characterised by sequence similarity and the classification rules as Lea14, ends up clustered with Group 2 LEA proteins associated with resistance to cold stress, in particular DH14_ARATH, but also DH10_ARATH, DH47_ARATH and O04232, so the role of this protein, which is induced neither by ABA nor by desiccation stress, might in fact be related to cold-stress resistance.</p>
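<p>The poly-serine and poly-lysine stutters discussed above are simply runs of a single residue, so they can be located with a regular expression. The following is a minimal sketch: the function name find_stutters and the example sequence are hypothetical, and the minimum run length of 3 is taken from the poly-serine description (the same default is assumed here for poly-lysine).</p>

```python
import re


def find_stutters(sequence: str, residue: str, min_len: int = 3) -> list:
    """Return (start, run) pairs for runs of `residue` of at least `min_len`.

    Start positions are zero-based offsets into `sequence`.
    """
    pattern = re.compile(re.escape(residue) + "{" + str(min_len) + ",}")
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    toy = "MASSSKGEKKKKLPGSSSSSH"  # made-up sequence, not a real LEA protein
    print(find_stutters(toy, "S"))
    print(find_stutters(toy, "K"))
```

<p>Runs shorter than the minimum length are ignored, so isolated serines or lysines do not count as stutters.</p>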
<p>The picture with the Group 3 LEA proteins is rather more straightforward, with a crisp rule encompassing all members of this group, namely that they have high helix content. The similarity of members of this Group is also borne out by the fact that they are all clustered in a single superfamily. It should be noted, however, that Group 4 LEA proteins LE11_HELAN and LE25_LYCES are clustered in families within Superfamily 3, which mirrors what was observed with rules induced by Ripper.</p>
<p>Looking at the major LEA Groups it is interesting to note that when the threshold scores for adding to a cluster and for merging two clusters are reduced (see [<xref ref-type="bibr" rid="B25">25</xref>
] for more details), the Group 1 and Group 6 LEA proteins remain distinct with unique superfamilies being created for each, but clusters representing Group 2 and Group 3 LEA proteins merge into a single superfamily. This is a little less surprising when one notes the number of Group 3 proteins that are also up-regulated by cold stress. In addition, if the number of mismatches allowed for the Group 3 motif TAQAAKEKAXE is increased by 1, to 5, the set of matching LEA proteins includes the Group 2 LEA proteins DH1_HORVU, DH1D_ORYSA and DH2_HORVU, Group 4 LEA proteins LE13_GOSHI and LE25_LYCES and Group 5 LEA protein LE29_GOSHI. Moreover, the Group 3 LEA protein DRPF_CRAPL has a poly-lysine stutter, while the Group 3 LEA proteins Q06540 and Q9SDV6 have poly-serine stutters (i.e. the 2S motif). Taken together, it would appear that Group 2 and Group 3 LEA proteins might be related. K is over-represented across both Groups, while L and, generally, I are both under-represented, suggesting a connection with charged amino acids. The connection with Group 2 LEA proteins is, perhaps, less surprising if one considers the association of the (Group 2) K-segment with cold tolerance (noted above), the fact that many Group 3 LEA proteins are associated with cold tolerance (see Table <xref ref-type="table" rid="T3">3</xref>
), and that the K-segment consensus for gymnosperms differs in up to six places from the canonical K-segment (noted in [<xref ref-type="bibr" rid="B13">13</xref>
]), versus the five that were allowed in the motif-search described above.</p>
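<p>The mismatch-tolerant motif search referred to above can be sketched as a sliding-window comparison. This is an illustration under stated assumptions, not the search tool used in the study: the function name motif_hits and the toy sequence are hypothetical, and X in the motif is assumed to match any residue without counting as a mismatch.</p>

```python
def motif_hits(sequence: str, motif: str, max_mismatches: int,
               wildcard: str = "X") -> list:
    """Return (start, window, mismatches) for every window within tolerance.

    Start positions are zero-based; wildcard positions never count as
    mismatches.
    """
    hits = []
    width = len(motif)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1 for m, s in zip(motif, window) if m != wildcard and m != s
        )
        if max_mismatches >= mismatches:
            hits.append((start, window, mismatches))
    return hits


if __name__ == "__main__":
    group3_motif = "TAQAAKEKAXE"
    toy = "MKTAQAAKDKAGEGG"  # made-up sequence, not a real LEA protein
    print(motif_hits(toy, group3_motif, max_mismatches=1))
```

<p>Raising max_mismatches admits progressively more distant matches, which is why the choice of tolerance determines whether proteins from other Groups appear among the hits.</p>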
<p>Returning to Table <xref ref-type="table" rid="T14">14</xref>
, there is considerable overlap across the sets of keywords, particularly across Group 2 and Group 3 LEA proteins. A remarkable, and seemingly paradoxical, recent result has been the demonstration that a nematode Group 3 LEA protein, AavLEA1 (Q95V77), is unstructured in the native state, but then becomes structured on desiccation, showing significant alpha-helical content and possible coiled-coil structures [<xref ref-type="bibr" rid="B26">26</xref>
]. In other words, the consistent prediction of high alpha-helical content for Group 3 LEA proteins appears to be borne out, but only in response to desiccation stress. Coiled coil is one of the phrases evident from the keyword analysis of Group 3 LEA proteins; it is also characteristic of Group 2. The keyword filament and related keywords such as keratin and neurofilament are also prominent in the list, mirroring a suggestion in [<xref ref-type="bibr" rid="B26">26</xref>
] that the coiled coils might form larger structures related to intermediate filaments, which would provide mechanical support to plant cells undergoing desiccation stress. The conundrum of some keywords being associated with the cytoskeleton while others are nuclear has already been noted via localisation experiments reported in [<xref ref-type="bibr" rid="B13">13</xref>
], at least for the Group 2 LEA proteins. Table <xref ref-type="table" rid="T14">14</xref>
 would suggest that the observation is generally true. A number of other themes are also apparent in the list of keywords and phrases: DNA binding, stress and chaperone activity. While dealing with stress, particularly cold stress, has long been associated with LEA proteins, mechanisms suggested by the keywords "DNA binding" and "chaperone" require experimental verification.</p>
<p>Turning to the Group 4 LEA proteins, as noted in discussion of the supervised classification experiments, two of the five Group 4 LEA proteins, LE11_HELAN and LE25_LYCES, are subsumed into Group 3. In the unsupervised clustering, those same proteins are also subsumed in Group 3, while the remainder, PM1_SOYBN, LE13_GOSHI and O24442, appear in Group 2. Even when the probability threshold is made more stringent – 0.005 – the five putative Group 4 LEA proteins do not cluster separately. In addition, as was noted above, PM1_SOYBN has a hit against the 2Y motif, while LE11_HELAN and LE25_LYCES each have hits against the Group 3 motif once the number of allowed mismatches is increased by 1 (a level which still leaves out some acknowledged Group 3 LEA proteins). In other words, there is mounting evidence that Group 4 should not be considered as a separate Group, but that its members be absorbed into Group 2 and Group 3. This stands in apparent contrast to the evidence from sequence alignments which suggests that the five members of this group should remain together. However, the weight to be given to this evidence must be tempered by the knowledge that each of these is a low complexity protein and numbers of the amino acids will need to be masked: PM1_SOYBN (15.6% masked), LE25_LYCES (18.2%), O24442 (28.6%), LE13_GOSHI (47.3%) and LE11_HELAN (53.8%). The effect of this is that when LE13_GOSHI is run as a BLAST query with SEG masking in place, the only hits returned (at p-value of 0.79) are Group 2 LEA sequences, Q39876 and Q39805. 
While on balance the Group 4 proteins are best reassigned to Group 2 and Group 3, it is also arguable, on the basis of motif hits and the weak alignment evidence, that the Group 4 LEA proteins form a link between the Group 2 and Group 3 LEA proteins, particularly PM1_SOYBN, which matches the Group 3 motif twice at the N terminal and the 2Y motif from Group 2 at the C terminal; LE13_GOSHI and LE25_LYCES have their Group 3 motif matches also at the N terminal.</p>
<p>A similar line of reasoning, in this case supported by other authors [<xref ref-type="bibr" rid="B3">3</xref>
,<xref ref-type="bibr" rid="B4">4</xref>
], applies to the former Group 5 (D29) LEA proteins, which were folded into the Group 3 LEA proteins by both the supervised and unsupervised algorithms.</p>
<p>By contrast, it is proposed in [<xref ref-type="bibr" rid="B3">3</xref>
] that proteins corresponding to LEA protein Group 6 and Lea14 form a single Group (which in that paper is called Group 5), while Lea5 LEA proteins are not mentioned. In this study, all three groups appear at the top of the list of average hydrophobicity scores (either just over 0 or just below it, with Lea14 > Group 6 > Lea5). They also gather at the bottom of the list for percentage polar residues. On the other hand, Group 6 proteins are just behind Group 3 in predicted helix content, with Lea5 and Lea14 some way below, while in the Lea5 Group, long loop segments are evident. Group 6 proteins have an over-representation of both MQ and AAA, while the three Lea5 LEA proteins have an over-representation of A and R. By contrast, the Lea14 LEA proteins have an over-representation of IP and an under-representation of R. The three groups are sufficiently different for crisp classification rules to have been created, although the rules must be treated with caution due to the small numbers of examples on which they are based. In addition, the clusters involving Group 6 LEA proteins persist even when cluster-merging thresholds are lowered or significance thresholds made less stringent. At the same time, the Lea5 and Lea14 proteins form independent clusters, neither of which merges with Group 6.</p>
</sec>
<sec><title>Conclusions</title>
<p>The study of a carefully selected set of 112 LEA protein sequences has revealed a number of aspects of these proteins, which can be summarised in the following conclusions:</p>
<p>• There is a high level of agreement between the different machine learning methods on the one hand, and the previous assignments on the other. However, given the previous contradictory revisions and current findings a new scheme for naming groups of LEA proteins is proposed, based on Classes. In particular, while it is generally accepted that the former LEA Group 5 is not distinct from Class III, the balance of evidence is that the members of former Group 4 are more appropriately housed in Class II and Class III.</p>
<p>• There is evidence from overlapping motifs, overlapping POPP clusters, from the split of former LEA Group 4 and from similarities in the modes of induction related to cold stress that Class II and Class III LEA proteins, though distinct, might be related, perhaps through the LEA Class II K-segment motif, which mirrors the Class III motif. The major difference between Class II and Class III is that the former contains different combinations of three motifs/domains, while Class III often has multiple instances of the one motif/domain.</p>
<p>• In the same way that not all sequence alignment hits are necessarily relevant, it is possible that not all the keywords will turn out to be relevant. However, there is confirmation in the keywords concerning subcellular localisation, which see LEA proteins being associated with the cytoskeleton, the cytoplasm and the nucleus (though these are unlikely to apply to the same protein). Nevertheless, each possibility has been noted for dehydrins [<xref ref-type="bibr" rid="B13">13</xref>
].</p>
<p>• Keywords related to chaperones and to DNA-binding are also present, suggesting a role similar to the DNA-binding cold-shock proteins found in bacteria, but also in eukaryotes, e.g. DBPA_HUMAN (P16989). DBPA_HUMAN is found both in the nucleus and in cytosol. However, such suggestions await experimental verification.</p>
<p>• Keywords emphasising alpha-helical structure (coiled coil) and, at a larger scale, filaments also support the recent finding that Class III LEA proteins show high alpha helical content, and possibly coiled-coil structures, except that this occurs under conditions of desiccation stress; the protein has no defined structure in its native state [<xref ref-type="bibr" rid="B26">26</xref>
]. High alpha helical content is also consistent with the over-representation of alanine, particularly in Class III and Class IV (former Group 6) LEA proteins.</p>
<p>• Apart from the near total lack of cysteine and tryptophan, the study has found that isoleucine, leucine and phenylalanine are highly under-represented across the four major Classes, while glutamine is highly over-represented. Glutamate and lysine are highly over-represented in two of the first three LEA Classes, and moderately in the third, so the description of these as hydrophilins [<xref ref-type="bibr" rid="B16">16</xref>
] is borne out.</p>
<p>• Glycine is highly over-represented in Class I and overwhelmingly so in Class II, but only in line with chance in Class III, which is consistent with the first two Classes having the highest predicted loop content, particularly Class II LEA proteins. The high proportion of predicted loop content is supported by the observation that at least one dehydrin has no defined structure in its native state [<xref ref-type="bibr" rid="B27">27</xref>
]. However, as with the Class III LEA proteins, Class II LEA proteins acquire alpha-helical content under stress conditions, e.g. application of sodium dodecyl sulfate (SDS) [<xref ref-type="bibr" rid="B28">28</xref>
].</p>
<p>In general, non-globular and, particularly, low-complexity proteins such as the LEA proteins pose special challenges in determining their functions and modes of action. Therefore, rather than relying solely on evidence from sequence alignments, a combination of data sources can be used, particularly software tools less affected by such unusual proteins. Further work involves expanding the analysis to examine the large number of putative LEA proteins found in genomic sequences, particularly from non-plant species.</p>
</sec>
<sec sec-type="methods"><title>Methods</title>
<sec><title>Defining a LEA Protein for this Study</title>
<p>There are two parts to a working definition of what constitutes a LEA protein. The first is that a LEA protein is a plant protein which has no – or at most limited – expression in the stages up to and including maturation of the ovule, and sharply rising expression post-abscission, peaking at desiccation, with expression disappearing at germination [<xref ref-type="bibr" rid="B7">7</xref>
]. In other words, LEA proteins are characterised in the first instance by raised levels of expression in mature seeds, with expression disappearing at germination. However, proteins homologous to LEA proteins have also been found in other plant tissues, so although they are not involved in embryogenesis, let alone late in embryogenesis, they too are now considered to be LEA proteins. The latter set are characterised by sharply raised expression due to desiccation, raised salinity, cold or induction by abscisic acid (ABA), followed by a sharp decline in expression once the stress condition has been removed [<xref ref-type="bibr" rid="B29">29</xref>
]. As a result, where the distinction is useful, the former set of LEA proteins will be termed "canonical LEA" proteins in this study.</p>
<p>Unfortunately, sharply raised expression under conditions such as desiccation or cold stress is not sufficient to unambiguously characterise a protein as a LEA protein, because plants use a number of metabolic pathways to respond to such abiotic stresses and there are a number of other protein families which are induced under similar conditions. For example, the <italic>Arabidopsis thaliana </italic>
gene <italic>RD22 </italic>
(RD22_ARATH) is expressed in the early and middle stages of seed maturation, but is also induced by desiccation, salinity or application of ABA [<xref ref-type="bibr" rid="B30">30</xref>
]. Similarly, the gene <italic>PCC13-62 </italic>
(DRPE_CRAPL) is up-regulated in the leaves of the resurrection plant, <italic>Craterostigma plantagineum</italic>
, by desiccation or the application of ABA [<xref ref-type="bibr" rid="B31">31</xref>
]. Neither of these has any sequence similarity to LEA proteins.</p>
<p>On the other hand, sequence similarity to canonical LEA proteins, by itself, is also not sufficient to accurately classify all putative non-canonical LEA proteins because there are several proteins with significant similarity to canonical LEA proteins which are not expressed under conditions typical of LEA proteins. Examples are: Q06431 (BP8 protein) – which is among the "seed" proteins underpinning Pfam family PF02987 – and Q43430 (Dehydrin cognate), which is found among the proteins recovered by the Hidden Markov Model for Pfam family PF00257. In the case of Q39846 (labelled as: LEA Protein) there is some evidence of similarity to Group 3 LEA proteins via BLAST hits to Q41060 and EDC8_DAUCA, but the level and timing of expression is such that [<xref ref-type="bibr" rid="B32">32</xref>
] concludes: "Since the GmPM4 proteins do not appear to fulfil the biochemical properties of LEA proteins, their messages are not very abundant in mature seeds and will not express in water-stressed seedlings, we suggest that the physiological roles of GmPM4 protein might differ from those of the LEA proteins, i.e. desiccation protection." (pg 489). However, the most striking case of this problem is the putative LEA protein DHX1_ARATH, which has been classified as a D11 (Group 2) LEA protein in the Dure survey [<xref ref-type="bibr" rid="B11">11</xref>
], but which is only expressed constitutively, i.e. not as a stress response nor late in embryogenesis. In a related manner, the protein O48672, superficially a Group 2 LEA protein, is largely constitutively expressed although there is some increased expression due to cold stress.</p>
<p>The problem of interpreting purely sequence-based data becomes more acute for the putative LEA proteins found in non-plant species, e.g. the LEA Group 3 motif found on avian developmental gene <italic>px19 </italic>
[<xref ref-type="bibr" rid="B33">33</xref>
]. As a second example, while no claim is made that gene <italic>gvpQ </italic>
of <italic>Bacillus megaterium </italic>
is a LEA protein – it is thought to be a negative regulator of gas vesicle synthesis – the corresponding sequence, O68678, is annotated as a Group 3 LEA protein by Pfam and is one of the sequences used in the multiple alignment that defines the Pfam family PF02987. In other words, significant sequence similarity to known (and in particular canonical) LEA proteins might indicate homology, but once the functions have diverged doubts can arise – proteins with different functions, arising perhaps due to paralogy, face different conservation pressures. Automated classification studies, also known as machine learning, require a strict notion of which objects are members of the categories under study (the "universe of discourse") and which are not. Therefore, a conservative strategy in building a database of sequences for categorisation experiments is to only accept proteins that have related functions or, as a surrogate, related mRNA expression patterns when the functions are not known. In the latter case there is the assumption that proteins which have expression patterns unrelated to LEA proteins will turn out to have different functions.</p>
<p>In summary, to ensure that only true members of the set of LEA proteins are used in this study, a LEA protein is either a canonical LEA protein or one whose expression is sharply up-regulated by desiccation, salinity, cold or exogenous ABA and which has sequence similarity to canonical LEA proteins.</p>
</sec>
<sec><title>Obtaining the Sequences</title>
<p>The sequences were drawn in the first instance from the SwissProt and SpTrEMBL databases (containing between them around 700,000 proteins) using the SRS sequence retrieval system [<xref ref-type="bibr" rid="B34">34</xref>
]. Because different authors have, over time, used different words to describe LEA proteins, a number of keywords were used to extract the sequences from the databases, including: "LEA", "small hydrophilic plant seed", "late embryogenesis abundant", "dehydrin" and "seed maturation". A second source of LEA protein sequences was the set of hits revealed by BLAST similarity searches using other LEA proteins as search queries. However, irrespective of the path by which a putative sequence was uncovered, as discussed above there also needed to be evidence of expression of the protein under conditions associated with LEA proteins, as revealed in the cited literature. In other words, the literature corresponding to the sequence had to be examined for evidence, typically via Northern blots, of expression patterns conforming to the definition outlined above; in order to have confidence in the provenance of the hits, putative LEA proteins unsupported by expression evidence were passed over.</p>
</sec>
<sec><title>Assignment to Historical Groups</title>
<p>The LEA proteins were initially assigned to a Group based on a number of criteria. The first is an assessment by the authors and/or inclusion in the 1993 survey by Dure. A second is whether the protein is covered by one of the Pfam families listed above. Finally, BLAST was used to determine if there are any close hits against one or other canonical LEA protein or, in default, to known members of a Group. (Given the problems outlined earlier, low complexity sequence masking was not used for this.)</p>
<p>Members of each LEA protein Group are listed in Tables numbered from 1 to 9. The first two columns in the tables are the protein's SwissProt/SpTrEMBL identifier and the species from which the protein was taken, represented by a SwissProt species code. (A mapping from the SwissProt codes to the species names can be found in Table <xref ref-type="table" rid="T10">10</xref>
.) This is followed by the tissues used for the expression evaluation and a list of the conditions that give rise (or fail to give rise) to the expression of the gene. The possible conditions are: ABA (application of abscisic acid to aerial parts of the plants), Cold, Desc (desiccation) and Salt. As mentioned above, the descriptor canonical (abbreviated Canon) is used to indicate that high levels of the mRNA are to be found in dry seeds, i.e. the protein is literally late embryogenesis abundant. The appearance of 'not' before any of these descriptors indicates that expression has been tested for this condition and no significant expression was seen. For example, notDesc indicates that there was no significant increase in the expression of the corresponding gene under conditions of desiccation stress.</p>
<p>For LEA protein Groups 1, 2 and 3, consensus sequence motifs have been reported [<xref ref-type="bibr" rid="B3">3</xref>
]: GGQTRREQLGEEGYSQMGRK (Group 1), DEYGNP and EKKGIMDKIKEKLPG (Group 2, patterns 2Y and 2K, in the nomenclature of [<xref ref-type="bibr" rid="B13">13</xref>
]) and TAQAAKEKAXE (Group 3). Because these are consensus sequences, matching them against any particular protein sequence implies accepting a certain number of insertions, deletions or substitutions. Using an implementation of the string searching application, Agrep [<xref ref-type="bibr" rid="B35">35</xref>
], each consensus peptide was tested against the LEA protein sequences, allowing up to 5, 2, 4 and 4 mismatches, respectively, for the four consensus patterns. In addition, Group 2 LEA proteins generally have a poly-serine stutter. If a consensus peptide matches without exceeding the stated maximum number of amino acid mismatches, or a poly-serine stutter is found (which is labelled 2S after [<xref ref-type="bibr" rid="B13">13</xref>
]), it is noted in the fifth column, with the number of repetitions noted in brackets (or the length of the poly-serine stutter, which must be at least 4aa). While the 2S segment is highly characteristic of Group 2 LEA proteins (occurring in 36 of the 50 sequences in the set used in this study, versus an expected count of 1.98 sequences – corresponding to a probability of 1.7 × 10<sup>-39</sup>
), it was noticed that poly-lysine stutters with a length of at least 3aa are also relatively common, although the stutters are generally not contiguous. The label k(N), with N in the range 3 to 11, is the sum of the lengths of the poly-lysine stutters, assuming a minimum of 3aa. Of the set of Group 2 LEA proteins, 16 have poly-lysine stutters totalling at least 3aa (versus an expected count of 4.93, corresponding to a probability of 1.5 × 10<sup>-5</sup>
). The application 0j.py [<xref ref-type="bibr" rid="B17">17</xref>
] was used to find the poly-serine and poly-lysine stutters. The list of hits against the different sequence motifs is followed by a column labelled SF (short for SuperFamily). This will be discussed in the section below on automated clustering of the LEA proteins. The final column in the tables, labelled Evidence, lists evidence supporting the protein's inclusion in the particular Group, beyond the articles cited in the SwissProt record. If the protein is included in a Pfam family, the family's identifier is listed, followed by either '_ml' or '_hmm'. The suffix '_ml' is used to indicate that the protein has been included in the edited multiple-sequence (or "seed") alignment that forms the basis for the family. The proteins annotated with '_hmm' are those recovered by the hidden Markov model that has been trained from the multiple sequence alignment (called by Pfam the "full" family). This is somewhat weaker evidence than the curated multiple alignment. Finally, if a SwissProt or SpTrEMBL identifier is shown, it is followed by a p-value and represents the closest match found by BLAST (without masking) from among the canonical LEA proteins in that Group or, in default, to a protein that in turn matches a canonical LEA protein.</p>
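<p>The motif and stutter searches described above can be illustrated with a minimal sketch. It is not Agrep or 0j.py: the function names are hypothetical, the approximate matching uses a standard dynamic-programming edit distance in which a match may start anywhere in the sequence, and treating the X in the Group 3 consensus as a wildcard is an assumption. The motifs and mismatch limits are those quoted in the text.</p>

```python
import re

# Consensus motifs and mismatch limits as quoted in the text.
MOTIFS = {
    "Group 1": ("GGQTRREQLGEEGYSQMGRK", 5),
    "2Y": ("DEYGNP", 2),
    "2K": ("EKKGIMDKIKEKLPG", 4),
    "Group 3": ("TAQAAKEKAXE", 4),
}

def best_approx_match(pattern, text):
    """Return (distance, end) for the best approximate occurrence of
    pattern in text; end is the 1-based position where it finishes."""
    m = len(pattern)
    # prev[i]: edit distance between pattern[:i] and the best substring
    # of text ending at the previous position.
    prev = list(range(m + 1))
    best = (prev[m], 0)
    for j, residue in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            # Assumption: 'X' in a consensus matches any residue.
            same = pattern[i - 1] == residue or pattern[i - 1] == "X"
            cur[i] = min(prev[i - 1] + (0 if same else 1),  # match or substitution
                         prev[i] + 1,      # extra residue in the sequence
                         cur[i - 1] + 1)   # motif residue missing from the sequence
        if best[0] > cur[m]:
            best = (cur[m], j)
        prev = cur
    return best

def motif_hits(seq):
    """Names of the motifs found in seq within their mismatch limits."""
    return [name for name, (motif, limit) in MOTIFS.items()
            if limit >= best_approx_match(motif, seq)[0]]

def stutters(seq, residue="S", min_len=4):
    """Lengths of all runs of residue that are at least min_len long."""
    pattern = "%s{%d,}" % (re.escape(residue), min_len)
    return [len(run.group()) for run in re.finditer(pattern, seq)]
```

<p>Under these assumptions, stutters(seq) gives the lengths of candidate 2S segments, and sum(stutters(seq, "K", 3)) corresponds to the N in the k(N) label.</p>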
<p>The tables of sequences by Group are:</p>
<p>1 LEA protein Group 1 (D19) Exemplar: LE19_GOSHI</p>
<p>2 LEA protein Group 2 (D11) Exemplar: DH11_GOSHI</p>
<p>The set of Group 2 LEA proteins is subdivided into three parts. The reasons for this are canvassed below.</p>
<p>3 LEA protein Group 3 (D7) Exemplars: LE7_GOSHI, LE76_BRANA</p>
<p>4 LEA protein Group 4 (D113) Exemplar: LE13_GOSHI</p>
<p>5 LEA protein Group 5 (D29) Exemplar: LE29_GOSHI</p>
<p>6 LEA protein Group 6 (D34) Exemplar: LE34_GOSHI</p>
<p>7 LEA protein Group Lea5 (D73) Exemplar: LE5A_GOSHI</p>
<p>8 LEA protein Group Lea14 (D95) Exemplar: LE14_GOSHI</p>
<p>9 Uncharacterised LEA proteins</p>
<p>Three proteins were uncovered which are canonical LEA proteins but for which little or no similarity exists with known LEA protein sequences. One of this group also has expression levels due to ABA or desiccation/cold stress which closely follow the patterns viewed as characteristic of LEA proteins.</p>
</sec>
<sec><title>Machine Learning Applied to the LEA Protein Sequence Sets</title>
<p>Machine learning software takes a set of descriptions of objects, in this case proteins, and brings related ones together to form groups. There are two basic sorts of machine-learning algorithms: supervised and unsupervised learning [<xref ref-type="bibr" rid="B36">36</xref>
,<xref ref-type="bibr" rid="B37">37</xref>
]. Both sorts have been employed in this study. Supervised algorithms are given values for an array of features, such as maximum hydrophobicity or percentage composition of aliphatic residues, and an output class, e.g. Group 1, Group 2, etc. Rules are then induced which categorise each of the input examples into one of the set of output classes. The aim of the rule induction process is to minimise miscategorisation. In unsupervised machine learning (also known as "classification" or "data mining"), similar objects are clustered based on a metric, e.g. sequence similarity score. The aim is to maximise scores between members of clusters, while minimising inter-cluster scores.</p>
</sec>
<sec><title>Supervised Machine Learning Applied to LEA proteins – Ripper</title>
<p>From the surveys listed above, different protein properties have been used to characterise the various LEA protein Groups. The most commonly noted are hydrophilicity and predicted secondary structure. To these have now been added percentage composition by amino-acid class, i.e. acidic, basic, aliphatic, etc. Scores summarising these attributes, calculated from the protein sequences, formed the input to the supervised learning application Ripper [<xref ref-type="bibr" rid="B38">38</xref>
].</p>
<sec><title>Hydrophilicity</title>
<p>The EMBOSS [<xref ref-type="bibr" rid="B39">39</xref>
] application Pepinfo was used to calculate hydrophobicity values based on the method of Kyte and Doolittle. A larger window, 21aa versus the default 9aa, was used at each amino acid in order to favour larger structures over smaller ones. That is, an average hydrophobicity value was calculated at each amino acid based on the hydrophobicity values of that amino acid, the previous 10 and the following 10. Three values were returned for each sequence: the minimum and maximum windows together with the average across all the windows. The ranges of these values were, respectively: -3.21 .. 0, -0.73 .. 2.25 and -1.70 .. 0.07; negative hydrophobicity values indicate hydrophilicity.</p>
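<p>The windowed calculation can be sketched as follows. This is an illustration, not the Pepinfo implementation: the function names are hypothetical, the scale is the published Kyte and Doolittle hydropathy scale, and the handling of sequence ends (only complete 21aa windows are scored) is an assumption.</p>

```python
# Kyte-Doolittle hydropathy scale (positive values are hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_means(seq, window=21):
    """Mean hydropathy of each complete window, centred on each residue
    that has window // 2 neighbours on both sides (window assumed odd)."""
    half = window // 2
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i - half:i + half + 1]) / window
            for i in range(half, len(seq) - half)]

def hydropathy_summary(seq, window=21):
    """Return (minimum window, maximum window, average over all windows)."""
    means = window_means(seq, window)
    if not means:
        raise ValueError("sequence is shorter than the window")
    return min(means), max(means), sum(means) / len(means)
```

<p>The three values returned by hydropathy_summary correspond to the three hydrophobicity features described above; a real analysis would also need a policy for non-standard residue codes, which this sketch does not attempt.</p>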
</sec>
<sec><title>Predicted Secondary Structure Percentage Composition</title>
<p>No structures have been determined for any of the LEA proteins, so all analyses of structure for these proteins have been done on the basis of predictions based on the amino acid sequence. In this study, four-state predictions were obtained for each amino acid in the LEA proteins using PHDsec from the ProteinPredict server [<xref ref-type="bibr" rid="B40">40</xref>
,<xref ref-type="bibr" rid="B41">41</xref>
]. PHDsec takes a neural network approach. The ProteinPredict server returns two predictions for each amino acid: a three-state prediction (H/E/L) together with a value indicating the degree of confidence in that prediction, and a more stringent, four-state prediction, in which a '.' is recorded when none of H, E or L proves significant. The four-state predictions used in this study were converted to percentage composition values (e.g. the count of H predictions divided by the protein length), which minimises effects due to differences in length across the sequences. However, before the percentage composition values were calculated, some preprocessing was done to remove possible prediction artefacts, in particular predicted features encompassing a single amino acid, though beta-sheets spanning just one amino acid could be beta-turns. Remembering that values must be in the range 0 .. 1.0, the ranges of values for H, E and L were, respectively: 0 .. 0.85, 0 .. 0.17 and 0.04 .. 0.60.</p>
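<p>A minimal sketch of this conversion is given below, assuming the four-state prediction is available as a string over the characters H, E, L and '.'. The function names are hypothetical, and resetting single-residue features to '.' is one plausible reading of the preprocessing step; the states to be filtered are left as a parameter because the text notes that single-residue E predictions could be genuine beta-turns.</p>

```python
from itertools import groupby

def drop_single_residue_features(pred, states="HEL"):
    """Replace any run of length one in the given states with '.'."""
    out = []
    for state, run in groupby(pred):
        length = len(list(run))
        if state in states and length == 1:
            out.append(".")  # treated as a possible prediction artefact
        else:
            out.append(state * length)
    return "".join(out)

def structure_composition(pred, states="HEL"):
    """Fraction of residues predicted in each state, after filtering."""
    cleaned = drop_single_residue_features(pred, states)
    return {state: cleaned.count(state) / len(cleaned) for state in states}
```

<p>Because each count is divided by the full protein length, the three fractions need not sum to 1; the remainder is the share of residues left unassigned ('.').</p>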
<p>A number of alternative secondary structure prediction servers were tried, including NPS@ secondary structure consensus server [<xref ref-type="bibr" rid="B42">42</xref>
], Prof, which combines different classifiers with a neural network [<xref ref-type="bibr" rid="B43">43</xref>
,<xref ref-type="bibr" rid="B44">44</xref>
] and SAM-T02 which uses Hidden Markov Model methods [<xref ref-type="bibr" rid="B45">45</xref>
,<xref ref-type="bibr" rid="B46">46</xref>
]. It is worth noting that all secondary structure predictors have been trained on the relatively small number of distinct globular proteins for which structures have been determined, typically from X-ray or NMR data. Bearing in mind that most of the LEA proteins have low sequence complexity and are probably not globular, any predictions need to be viewed a little skeptically. In addition, three-state predictors have the problem that coil or loop is the default category, so it will tend to be over-predicted. Building a consensus of such values might therefore compound the problem. For example, when Prof was used to examine the Group 1 LEA protein EM1_ARATH, 150 of the 152 amino acids were labelled as coil. For the same protein, NPS@ gave a percentage of 25.7% for helix and 67.8% for coil, PHDsec in its three-state mode returned 26.3% helix and 53.3% coil, while SAM-T02 returned 34.2% helix and 65.8% coil. By contrast, the PHDsec four-state mode gave 11.2% helix and 23.7% loop. The four-state prediction returned by PHDsec is more conservative and therefore was used for this study. In addition, use of percentage composition values should average out any point inaccuracies.</p>
</sec>
<sec><title>Amino Acid Class Percentage Composition</title>
<p>While issues of biases in the peptide composition of LEA proteins will be more fully explored using unsupervised machine learning, it was believed that a general classification could provide added detail to that afforded by the hydrophobicity values. The amino acid types and the ranges in their values are: Aliphatic (0.03 .. 0.29), Aromatic (0.01 .. 0.15), Non-polar (0.32 .. 0.59), Polar (0.41 .. 0.68), Charged (0.19 .. 0.52), Basic (0.08 .. 0.28) and Acidic (0.07 .. 0.28). The only point to note in the membership of the different sets is that the set of Aromatic residues includes histidine, as well as phenylalanine, tryptophan and tyrosine.</p>
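<p>The class composition features can be sketched as below. Only the Aromatic set is stated in the text; the other memberships are assumptions chosen for illustration, with Polar and Non-polar taken as complementary sets, which is consistent with the reported ranges. The names are hypothetical.</p>

```python
# Residue classes. Only AROMATIC is given in the text; the remaining
# memberships are illustrative assumptions.
CLASSES = {
    "Aliphatic": set("ILV"),
    "Aromatic": set("FHWY"),
    "Non-polar": set("ACFGILMPVWY"),
    "Polar": set("DEHKNQRST"),
    "Charged": set("DEHKR"),
    "Basic": set("HKR"),
    "Acidic": set("DE"),
}

def class_composition(seq):
    """Fraction of residues in seq that fall into each class."""
    length = len(seq)
    return {name: sum(1 for aa in seq if aa in members) / length
            for name, members in CLASSES.items()}
```

<p>The classes overlap (histidine, for instance, counts as Aromatic, Polar, Charged and Basic here), so the fractions are separate features and do not sum to 1.</p>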
</sec>
</sec>
<sec><title>Unsupervised Machine Learning Applied to LEA Proteins – The POPPs</title>
<p>The method of choice for most biologists faced with protein sequence data is to compare their sequences against those in a protein database such as SwissProt using the Smith-Waterman algorithm, e.g. Scanps [<xref ref-type="bibr" rid="B47">47</xref>
] or approximations to the Smith-Waterman algorithm, such as BLAST [<xref ref-type="bibr" rid="B48">48</xref>
]. The POPPs suite of tools [<xref ref-type="bibr" rid="B25">25</xref>
], available under license from the author, employs an alternative approach, based on comparisons of sets of peptides that are "unusual" in the proteins under comparison.</p>
<sec><title>Significant LEA Protein Peptides</title>
<p>The first application in the suite is called popp_create.py. Given one or more sequences or files of sequences popp_create.py compares the distributions of peptides of length 1aa – 3aa (typically), found in the individual sequences or across files of sequences, versus their distributions across a suitably large database (currently SwissProt plus SpTrEMBL, also called Swall). A single-sided binomial distribution statistic is used to produce a list of those peptides that are either significantly over-represented in the samples versus the database or significantly under-represented, both with respect to a user-specified threshold p-value. Peptides whose absolute probability is greater than the threshold are not reported. This list, called a Protein or Oligonucleotide Probability Profile, or "POPP", can provide useful information about the sorts of peptides that are characteristic of the sequence or group of sequences. Sequences corresponding to the different Groups were placed into separate databases and popp_create.py was then applied to each database.</p>
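<p>The idea behind the profile can be sketched with a one-sided binomial test. This is not the popp_create.py implementation: the function names, the dictionary representation and the way background frequencies are supplied (as the frequency of each peptide among database peptides of the same length, strictly between 0 and 1) are all assumptions made for illustration.</p>

```python
from math import exp, lgamma, log

def binom_pmf(i, n, p):
    """Binomial probability of exactly i successes in n trials, computed
    in log space; p must lie strictly between 0 and 1."""
    log_coeff = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coeff + i * log(p) + (n - i) * log(1.0 - p))

def binom_upper(k, n, p):
    """Probability of k or more successes (over-representation)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def binom_lower(k, n, p):
    """Probability of k or fewer successes (under-representation)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

def count_peptides(seqs, max_len=3):
    """Count overlapping peptides of length 1..max_len, and the number
    of peptide positions available at each length."""
    counts, totals = {}, {}
    for seq in seqs:
        for size in range(1, max_len + 1):
            positions = max(len(seq) - size + 1, 0)
            totals[size] = totals.get(size, 0) + positions
            for start in range(positions):
                pep = seq[start:start + size]
                counts[pep] = counts.get(pep, 0) + 1
    return counts, totals

def profile(seqs, background, threshold=0.005, max_len=3):
    """Map each significant peptide to ('over' or 'under', probability).
    Peptides whose probability exceeds the threshold are not reported."""
    counts, totals = count_peptides(seqs, max_len)
    result = {}
    for pep, freq in background.items():
        n = totals.get(len(pep), 0)
        if n == 0:
            continue
        k = counts.get(pep, 0)
        over, under = binom_upper(k, n, freq), binom_lower(k, n, freq)
        if threshold >= over:
            result[pep] = ("over", over)
        elif threshold >= under:
            result[pep] = ("under", under)
    return result
```

<p>As a rough check of the tail calculation, binom_upper(36, 50, 1.98 / 50) is of the order of 10<sup>-39</sup>, in line with the probability quoted earlier for the 2S segment among the Group 2 LEA proteins.</p>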
</sec>
<sec><title>Clustering LEA proteins</title>
<p>An alternative output format available to popp_create.py is the creation of a POPP vector for each input sequence. POPP vectors contain the same information as the profiles but in a compressed form; the profiles are formatted for inspection by users while the vectors are used by the second component of The POPPs, popp_cmp.py. popp_cmp.py applies a clustering algorithm to the POPP vectors so that related proteins are formed into groups around a consensus POPP, i.e. a POPP composed of those peptides that are significantly under- or over-represented in all the component POPPs. Details of the algorithm can be found in [<xref ref-type="bibr" rid="B25">25</xref>
]. However, from the user's point of view an important feature is that POPP vectors are not forced to belong to a single cluster but can appear in any cluster where this is appropriate.</p>
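<p>The notion of a consensus POPP can be illustrated with a short sketch. It does not reproduce the clustering algorithm of popp_cmp.py; it only shows the consensus step, with each POPP represented, hypothetically, as a mapping from peptide to the direction of its bias. Requiring the direction to agree across all component POPPs is an assumption.</p>

```python
def consensus_popp(popps):
    """Peptides that are significant, in the same direction, in every
    component POPP. Each POPP maps a peptide to 'over' or 'under'."""
    if not popps:
        return {}
    first, rest = popps[0], popps[1:]
    return {pep: direction for pep, direction in first.items()
            if all(other.get(pep) == direction for other in rest)}
```

<p>A peptide that is over-represented in one component POPP but absent from another, or under-represented there, drops out of the consensus.</p>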
<p>The same clustering algorithms are also used to perform meta-clustering. That is, the consensus POPPs found in the first pass are themselves clustered into families. Furthermore, if the various families are sufficiently similar, groups of families are brought together into superfamilies, which are distinguished by the fact that each family in a superfamily shares at least one cluster with at least one of the other families. The most highly connected (i.e. most representative) family is selected as the "anchor" of its superfamily.</p>
<p>In the context of the current investigations, the application popp_create.py was used to create a POPP vector for each of the LEA protein sequences – Group 1 to Group 6, plus Groups Lea5 and Lea14 – together with the Uncharacterised set. The application popp_cmp.py was then used to cluster the POPP vectors; the results are discussed below.</p>
</sec>
</sec>
<sec><title>Keyword Clustering Applied to Sets of Related POPPs Vectors</title>
<p>When POPPs are gathered into clusters, families and superfamilies a consensus POPP is also reported. The consensus POPP contains the peptides that are significantly under- or over-represented in all the POPPs making up the cluster, family or superfamily. Another POPP analysis tool, popp_search.py, can then be used to search a POPP-vector database (in this case created from SwissProt) for proteins related to a query sequence by similar biases in their peptide compositions. Searches were undertaken based on the consensus POPPs from the anchor family in each superfamily. In the final step of this process, ignoring the hits against the sequences forming the consensus (i.e. search) POPPs, the remaining hits were submitted to the protein keyword clustering application, Protein Annotators' Assistant [<xref ref-type="bibr" rid="B49">49</xref>
,<xref ref-type="bibr" rid="B50">50</xref>
]. This web-based application takes a list of SwissProt identifiers or accession numbers and returns a list of keywords or phrases that characterise subsets of the input proteins, automating a process that is typically done by hand, e.g. from BLAST hits.</p>
</sec>
</sec>
<sec><title>Additional Material</title>
<p>Additional material can be found by unzipping the <xref ref-type="supplementary-material" rid="S1">Additional file: 1</xref>
. The resulting web pages list the data used in the experiments and the outputs that resulted, in particular from the unsupervised machine learning experiments using The POPPs suite.</p>
<table-wrap position="float" id="T15"><label>Table 15</label>
<caption><p>Comparison of New LEA Protein Classes with Previous Group Classifications</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><td align="center">Class</td>
<td align="center">Baker/Dure</td>
<td align="center">Bray 1994</td>
<td align="center">Bray 2000</td>
<td align="center">Cuming</td>
<td align="center">Comments</td>
</tr>
</thead>
<tbody><tr><td align="center">I</td>
<td align="center">D19</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">1</td>
<td></td>
</tr>
<tr><td align="center">IIa</td>
<td align="center">D11</td>
<td align="center">2</td>
<td align="center">2</td>
<td align="center">2</td>
<td align="center">Includes some Group 4 (D113); Subgroup 2a from Rules</td>
</tr>
<tr><td align="center">IIb</td>
<td align="center">D11</td>
<td align="center">2</td>
<td align="center">2</td>
<td align="center">2</td>
<td align="center">Subgroups 2b and 2c from the Rules</td>
</tr>
<tr><td align="center">III</td>
<td align="center">D7</td>
<td align="center">3</td>
<td align="center">3</td>
<td align="center">3</td>
<td align="center">Includes Group 5 (D29) and remainder of Group 4 (D113)</td>
</tr>
<tr><td align="center">IV</td>
<td align="center">D34</td>
<td align="center">6</td>
<td align="center">-</td>
<td align="center">5</td>
<td></td>
</tr>
<tr><td align="center">V</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">Lea5/D73 – Named in [<xref ref-type="bibr" rid="B15">15</xref>
]</td>
</tr>
<tr><td align="center">VI</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">4</td>
<td align="center">5</td>
<td align="center">Lea14/D95 – Named in [<xref ref-type="bibr" rid="B15">15</xref>]</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>A proposed LEA Class numbering scheme (column 1), encompassing all the Groups listed above with the exception of Group 4 and Group 5, is compared with past numbering schemes from: Baker/Dure (column 2), Bray 1994 (column 3), Bray 2000 (column 4) and Cuming (column 5).</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1"><caption><title>Additional file 1</title>
</caption>
<media xlink:href="1471-2105-4-52-S1.zip" mimetype="application" mime-subtype="x-zip-compressed"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><ack><sec><title>Acknowledgements</title>
<p>I would like to thank Dr Alan Tunnacliffe, Institute of Biotechnology, Cambridge University, for making me aware of the LEA proteins, and for making extremely useful comments on the results of the investigations that I have undertaken on them. This paper has also benefited greatly from his comments, and from the comments of the reviewers. I would also like to acknowledge the generous support for my Fellowship provided by Bristol-Myers Squibb.</p>
</sec>
</ack>
<ref-list><ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bray</surname>
<given-names>EA</given-names>
</name>
</person-group>
<article-title>Molecular Responses to Water Deficit</article-title>
<source>Plant Physiol</source>
<year>1993</year>
<volume>103</volume>
<fpage>1035</fpage>
<lpage>1040</lpage>
<pub-id pub-id-type="pmid">12231998</pub-id>
</citation>
</ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ingram</surname>
<given-names>J</given-names>
</name>
<name><surname>Bartels</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>The Molecular Basis of Dehydration Tolerance in Plants</article-title>
<source>Annu Rev Plant Physiol Plant Mol Biol</source>
<year>1996</year>
<volume>47</volume>
<fpage>377</fpage>
<lpage>403</lpage>
<pub-id pub-id-type="pmid">15012294</pub-id>
<pub-id pub-id-type="doi">10.1146/annurev.arplant.47.1.377</pub-id>
</citation>
</ref>
<ref id="B3"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Cuming</surname>
<given-names>AC</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Peter R. Shewry and Rod Casey</surname>
</name>
</person-group>
<article-title>LEA Proteins</article-title>
<source>In Seed Proteins</source>
<year>1999</year>
<publisher-name>Kluwer Academic Publishers</publisher-name>
<fpage>753</fpage>
<lpage>780</lpage>
</citation>
</ref>
<ref id="B4"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Bray</surname>
<given-names>EA</given-names>
</name>
<name><surname>Bailey-Serres</surname>
<given-names>J</given-names>
</name>
<name><surname>Weretilnyk</surname>
<given-names>E</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Bob B. Buchanan, Wilhelm Gruissem and Russell L. Jones</surname>
</name>
</person-group>
<article-title>Responses to Abiotic Stress</article-title>
<source>In Biochemistry and Molecular Biology of Plants</source>
<year>2000</year>
<publisher-name>American Society of Plant Physiologists</publisher-name>
<fpage>1158</fpage>
<lpage>1203</lpage>
</citation>
</ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baker</surname>
<given-names>J</given-names>
</name>
<name><surname>Steele</surname>
<given-names>C</given-names>
</name>
<name><surname>Dure</surname>
<given-names>L</given-names>
<suffix>III</suffix>
</name>
</person-group>
<article-title>Sequence and Characterization of 6 <italic>Lea </italic>
Proteins and their Genes from Cotton</article-title>
<source>Plant Mol Biol</source>
<year>1988</year>
<volume>11</volume>
<fpage>277</fpage>
<lpage>291</lpage>
</citation>
</ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dure</surname>
<given-names>L</given-names>
<suffix>III</suffix>
</name>
<name><surname>Crouch</surname>
<given-names>M</given-names>
</name>
<name><surname>Harada</surname>
<given-names>J</given-names>
</name>
<name><surname>Ho</surname>
<given-names>T.-HD</given-names>
</name>
<name><surname>Mundy</surname>
<given-names>J</given-names>
</name>
<name><surname>Quatrano</surname>
<given-names>R</given-names>
</name>
<name><surname>Thomas</surname>
<given-names>T</given-names>
</name>
<name><surname>Sung</surname>
<given-names>ZR</given-names>
</name>
</person-group>
<article-title>Common Amino Acid Sequence Domains among the LEA Proteins of Higher Plants</article-title>
<source>Plant Mol Biol</source>
<year>1989</year>
<volume>12</volume>
<fpage>475</fpage>
<lpage>486</lpage>
</citation>
</ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hughes</surname>
<given-names>DW</given-names>
</name>
<name><surname>Galau</surname>
<given-names>GA</given-names>
</name>
</person-group>
<article-title>Temporally Modular Gene Expression During Cotyledon Development</article-title>
<source>Genes Dev</source>
<year>1989</year>
<volume>3</volume>
<fpage>358</fpage>
<lpage>369</lpage>
<pub-id pub-id-type="pmid">2721959</pub-id>
</citation>
</ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stacy</surname>
<given-names>RAP</given-names>
</name>
<name><surname>Aalen</surname>
<given-names>RB</given-names>
</name>
</person-group>
<article-title>Identification of Sequence Homology Between the Internal Hydrophilic Repeated Motifs in Group 1 Late-Embryogenesis-Abundant Proteins in Plants and Hydrophilic Repeats of the General Stress Protein GsiB of <italic>Bacillus subtilis</italic></article-title>
<source>Planta</source>
<year>1998</year>
<volume>206</volume>
<fpage>476</fpage>
<lpage>478</lpage>
<pub-id pub-id-type="pmid">9763714</pub-id>
<pub-id pub-id-type="doi">10.1007/s004250050424</pub-id>
</citation>
</ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Makarova</surname>
<given-names>KS</given-names>
</name>
<name><surname>Aravind</surname>
<given-names>L</given-names>
</name>
<name><surname>Wolf</surname>
<given-names>YI</given-names>
</name>
<name><surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
<name><surname>Minton</surname>
<given-names>KW</given-names>
</name>
<name><surname>Koonin</surname>
<given-names>EV</given-names>
</name>
<name><surname>Daly</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Genome of the Extremely Radiation-Resistant Bacterium <italic>Deinococcus radiodurans </italic>
Viewed from the Perspective of Comparative Genomics</article-title>
<source>Microbiol Mol Biol Rev</source>
<year>2001</year>
<volume>65</volume>
<fpage>44</fpage>
<lpage>79</lpage>
<pub-id pub-id-type="pmid">11238985</pub-id>
<pub-id pub-id-type="doi">10.1128/MMBR.65.1.44-79.2001</pub-id>
</citation>
</ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Browne</surname>
<given-names>J</given-names>
</name>
<name><surname>Tunnacliffe</surname>
<given-names>A</given-names>
</name>
<name><surname>Burnell</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Plant Desiccation Gene Found in a Nematode</article-title>
<source>Nature</source>
<year>2002</year>
<volume>416</volume>
<fpage>38</fpage>
<pub-id pub-id-type="pmid">11882885</pub-id>
<pub-id pub-id-type="doi">10.1038/416038a</pub-id>
</citation>
</ref>
<ref id="B11"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Dure III</surname>
<given-names>L</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Timothy J. Close and Elizabeth A. Bray</surname>
</name>
</person-group>
<article-title>Structural Motifs in LEA Proteins</article-title>
<source>In Plant Responses to Cellular Dehydration During Environmental Stress</source>
<year>1993</year>
<publisher-name>American Society of Plant Physiologists</publisher-name>
<fpage>91</fpage>
<lpage>103</lpage>
</citation>
</ref>
<ref id="B12"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Bray</surname>
<given-names>EA</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Amarjit S. Basra</surname>
</name>
</person-group>
<article-title>Alterations in Gene Expression in Response to Water Deficit</article-title>
<source>In Stress-Induced Gene Expression in Plants</source>
<year>1994</year>
<publisher-name>Harwood Academic</publisher-name>
<fpage>1</fpage>
<lpage>23</lpage>
</citation>
</ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Close</surname>
<given-names>TJ</given-names>
</name>
</person-group>
<article-title>Dehydrins: A Commonalty in the Response of Plants to Dehydration and Low Temperature</article-title>
<source>Physiol Plant</source>
<year>1997</year>
<volume>100</volume>
<fpage>291</fpage>
<lpage>296</lpage>
<pub-id pub-id-type="doi">10.1034/j.1399-3054.1997.1000210.x</pub-id>
</citation>
</ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
<name><surname>Birney</surname>
<given-names>E</given-names>
</name>
<name><surname>Cerruti</surname>
<given-names>L</given-names>
</name>
<name><surname>Durbin</surname>
<given-names>R</given-names>
</name>
<name><surname>Etwiller</surname>
<given-names>L</given-names>
</name>
<name><surname>Eddy</surname>
<given-names>S</given-names>
</name>
<name><surname>Griffiths-Jones</surname>
<given-names>S</given-names>
</name>
<name><surname>Howe</surname>
<given-names>KL</given-names>
</name>
<name><surname>Marshall</surname>
<given-names>M</given-names>
</name>
<name><surname>Sonnhammer</surname>
<given-names>ELL</given-names>
</name>
</person-group>
<article-title>The Pfam Protein Families Database</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<fpage>276</fpage>
<lpage>280</lpage>
<pub-id pub-id-type="pmid">11752314</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/30.1.276</pub-id>
</citation>
</ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Galau</surname>
<given-names>GA</given-names>
</name>
<name><surname>Wang</surname>
<given-names>HY.-C</given-names>
</name>
<name><surname>Hughes</surname>
<given-names>DW</given-names>
</name>
</person-group>
<article-title>Cotton <italic>Lea5 </italic>
and <italic>Lea14 </italic>
Encode Atypical Late Embryogenesis-Abundant Proteins</article-title>
<source>Plant Physiol</source>
<year>1993</year>
<volume>101</volume>
<fpage>695</fpage>
<lpage>696</lpage>
<pub-id pub-id-type="pmid">8278514</pub-id>
<pub-id pub-id-type="doi">10.1104/pp.101.2.695</pub-id>
</citation>
</ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garay-Arroyo</surname>
<given-names>A</given-names>
</name>
<name><surname>Colmenero-Flores</surname>
<given-names>JM</given-names>
</name>
<name><surname>Garciarrubio</surname>
<given-names>A</given-names>
</name>
<name><surname>Covarrubias</surname>
<given-names>AA</given-names>
</name>
</person-group>
<article-title>Highly Hydrophilic Proteins in Prokaryotes and Eukaryotes are Common during Conditions of Water Deficit</article-title>
<source>J Biol Chem</source>
<year>2000</year>
<volume>275</volume>
<fpage>5668</fpage>
<lpage>5674</lpage>
<pub-id pub-id-type="pmid">10681550</pub-id>
<pub-id pub-id-type="doi">10.1074/jbc.275.8.5668</pub-id>
</citation>
</ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wise</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>0j.py: A Software Tool for Low Complexity Proteins and Protein Domains</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>Suppl17</volume>
<fpage>288</fpage>
<lpage>295</lpage>
</citation>
</ref>
<ref id="B18"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Gish</surname>
<given-names>W</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Russell F. Doolittle</surname>
</name>
</person-group>
<article-title>Local Alignment Statistics</article-title>
<source>In Computer Methods for Macromolecular Sequence Analysis</source>
<year>1996</year>
<publisher-name>Academic Press</publisher-name>
<fpage>460</fpage>
<lpage>480</lpage>
</citation>
</ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brenner</surname>
<given-names>SE</given-names>
</name>
<name><surname>Chothia</surname>
<given-names>C</given-names>
</name>
<name><surname>Hubbard</surname>
<given-names>TJP</given-names>
</name>
</person-group>
<article-title>Assessing Sequence Comparison Methods with Reliable Structurally Identified Distant Evolutionary Relationships</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>1998</year>
<volume>95</volume>
<fpage>6073</fpage>
<lpage>6078</lpage>
<pub-id pub-id-type="pmid">9600919</pub-id>
<pub-id pub-id-type="doi">10.1073/pnas.95.11.6073</pub-id>
</citation>
</ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Boguski</surname>
<given-names>MS</given-names>
</name>
<name><surname>Gish</surname>
<given-names>W</given-names>
</name>
<name><surname>Wootton</surname>
<given-names>JC</given-names>
</name>
</person-group>
<article-title>Issues in Searching Molecular Sequence Databases</article-title>
<source>Nat Genet</source>
<year>1994</year>
<volume>6</volume>
<fpage>119</fpage>
<lpage>129</lpage>
<pub-id pub-id-type="pmid">8162065</pub-id>
</citation>
</ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wootton</surname>
<given-names>JC</given-names>
</name>
<name><surname>Federhen</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases</article-title>
<source>Comput Chem</source>
<year>1993</year>
<volume>17</volume>
<fpage>149</fpage>
<lpage>163</lpage>
<pub-id pub-id-type="doi">10.1016/0097-8485(93)85006-X</pub-id>
</citation>
</ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dure</surname>
<given-names>L</given-names>
<suffix>III</suffix>
</name>
</person-group>
<article-title>Occurrence of a Repeating 11-mer Amino Acid Sequence Motif in Diverse Organisms</article-title>
<source>Protein Pept Lett</source>
<year>2001</year>
<volume>8</volume>
<fpage>115</fpage>
<lpage>122</lpage>
</citation>
</ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Falquet</surname>
<given-names>L</given-names>
</name>
<name><surname>Pagni</surname>
<given-names>M</given-names>
</name>
<name><surname>Bucher</surname>
<given-names>P</given-names>
</name>
<name><surname>Hulo</surname>
<given-names>N</given-names>
</name>
<name><surname>Sigrist</surname>
<given-names>CJA</given-names>
</name>
<name><surname>Hofmann</surname>
<given-names>K</given-names>
</name>
<name><surname>Bairoch</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>The PROSITE Database, its Status in 2002</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<fpage>235</fpage>
<lpage>238</lpage>
<pub-id pub-id-type="pmid">11752303</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/30.1.235</pub-id>
</citation>
</ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Attwood</surname>
<given-names>TK</given-names>
</name>
<name><surname>Bradley</surname>
<given-names>P</given-names>
</name>
<name><surname>Flower</surname>
<given-names>DR</given-names>
</name>
<name><surname>Gaulton</surname>
<given-names>A</given-names>
</name>
<name><surname>Maudling</surname>
<given-names>N</given-names>
</name>
<name><surname>Mitchell</surname>
<given-names>AL</given-names>
</name>
<name><surname>Moulton</surname>
<given-names>G</given-names>
</name>
<name><surname>Nordle</surname>
<given-names>A</given-names>
</name>
<name><surname>Paine</surname>
<given-names>K</given-names>
</name>
<name><surname>Taylor</surname>
<given-names>P</given-names>
</name>
<name><surname>Uddin</surname>
<given-names>A</given-names>
</name>
<name><surname>Zygouri</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>PRINTS and its Automatic Supplement, prePRINTS</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>400</fpage>
<lpage>402</lpage>
<pub-id pub-id-type="pmid">12520033</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkg030</pub-id>
</citation>
</ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wise</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>The POPPs: Clustering and Searching Using Peptide Probability Profiles</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>Suppl18</volume>
<fpage>38</fpage>
<lpage>45</lpage>
</citation>
</ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goyal</surname>
<given-names>K</given-names>
</name>
<name><surname>Tisi</surname>
<given-names>L</given-names>
</name>
<name><surname>Basran</surname>
<given-names>A</given-names>
</name>
<name><surname>Browne</surname>
<given-names>J</given-names>
</name>
<name><surname>Burnell</surname>
<given-names>A</given-names>
</name>
<name><surname>Zurdo</surname>
<given-names>J</given-names>
</name>
<name><surname>Tunnacliffe</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Transition from Natively Unfolded to Folded State Induced by Desiccation in an Anhydrobiotic Nematode Protein</article-title>
<source>J Biol Chem</source>
<year>2003</year>
<volume>278</volume>
<fpage>12977</fpage>
<lpage>12984</lpage>
<pub-id pub-id-type="pmid">12569097</pub-id>
<pub-id pub-id-type="doi">10.1074/jbc.M212007200</pub-id>
</citation>
</ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lisse</surname>
<given-names>T</given-names>
</name>
<name><surname>Bartels</surname>
<given-names>D</given-names>
</name>
<name><surname>Kalbitzer</surname>
<given-names>HR</given-names>
</name>
<name><surname>Jaenicke</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>The Recombinant Dehydrin-Like Desiccation Stress Protein from the Resurrection Plant <italic>Craterostigma plantagineum </italic>
Displays No Defined Three-Dimensional Structure in Its Native State</article-title>
<source>Biol Chem</source>
<year>1996</year>
<volume>377</volume>
<fpage>555</fpage>
<lpage>561</lpage>
<pub-id pub-id-type="pmid">9067253</pub-id>
</citation>
</ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ismail</surname>
<given-names>AM</given-names>
</name>
<name><surname>Hall</surname>
<given-names>AE</given-names>
</name>
<name><surname>Close</surname>
<given-names>TJ</given-names>
</name>
</person-group>
<article-title>Purification and Partial Characterization of a Dehydrin Involved in Chilling Tolerance during Seedling Emergence of Cowpea</article-title>
<source>Plant Physiol</source>
<year>1999</year>
<volume>120</volume>
<fpage>237</fpage>
<lpage>244</lpage>
<pub-id pub-id-type="pmid">10318701</pub-id>
<pub-id pub-id-type="doi">10.1104/pp.120.1.237</pub-id>
</citation>
</ref>
<ref id="B29"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Berge</surname>
<given-names>SK</given-names>
</name>
<name><surname>Bartholomew</surname>
<given-names>DM</given-names>
</name>
<name><surname>Quatrano</surname>
<given-names>RS</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Robert Goldberg</surname>
</name>
</person-group>
<article-title>Control of the Expression of Wheat Embryo Genes by Abscisic Acid</article-title>
<source>In The Molecular Basis of Plant Development</source>
<year>1989</year>
<publisher-name>Alan R. Liss</publisher-name>
<fpage>193</fpage>
<lpage>201</lpage>
</citation>
</ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamaguchi-Shinozaki</surname>
<given-names>K</given-names>
</name>
<name><surname>Shinozaki</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>The Plant Hormone Abscisic Acid Mediates the Drought-Induced Expression but not the Seed-Specific Expression of <italic>rd22</italic>, a Gene Responsive to Dehydration Stress in <italic>Arabidopsis thaliana</italic></article-title>
<source>Mol Gen Genet</source>
<year>1993</year>
<volume>238</volume>
<fpage>17</fpage>
<lpage>25</lpage>
</citation>
</ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bartels</surname>
<given-names>D</given-names>
</name>
<name><surname>Schneider</surname>
<given-names>K</given-names>
</name>
<name><surname>Terstappen</surname>
<given-names>G</given-names>
</name>
<name><surname>Piatkowski</surname>
<given-names>D</given-names>
</name>
<name><surname>Salamini</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Molecular Cloning of Abscisic Acid-Modulated Genes which are Induced during Desiccation of the Resurrection Plant <italic>Craterostigma plantagineum</italic></article-title>
<source>Planta</source>
<year>1990</year>
<volume>181</volume>
<fpage>27</fpage>
<lpage>34</lpage>
</citation>
</ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hsing</surname>
<given-names>YC</given-names>
</name>
<name><surname>Tsou</surname>
<given-names>C</given-names>
</name>
<name><surname>Hsu</surname>
<given-names>T</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Z</given-names>
</name>
<name><surname>Hsieh</surname>
<given-names>K</given-names>
</name>
<name><surname>Hsieh</surname>
<given-names>J</given-names>
</name>
<name><surname>Chow</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Tissue and Stage-Specific Expression of a Soybean (<italic>Glycine max </italic>
L.) Seed Maturation, Biotinylated Protein</article-title>
<source>Plant Mol Biol</source>
<year>1998</year>
<volume>38</volume>
<fpage>481</fpage>
<lpage>490</lpage>
<pub-id pub-id-type="pmid">9747855</pub-id>
<pub-id pub-id-type="doi">10.1023/A:1006079926339</pub-id>
</citation>
</ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Niu</surname>
<given-names>S</given-names>
</name>
<name><surname>Antin</surname>
<given-names>PB</given-names>
</name>
<name><surname>Morkin</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Cloning and Sequencing of a Developmentally Regulated Avian mRNA Containing the LEA Motif Found in Plant Seed Proteins</article-title>
<source>Gene</source>
<year>1996</year>
<volume>175</volume>
<fpage>187</fpage>
<lpage>191</lpage>
<pub-id pub-id-type="pmid">8917097</pub-id>
<pub-id pub-id-type="doi">10.1016/0378-1119(96)00146-1</pub-id>
</citation>
</ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zdobnov</surname>
<given-names>EM</given-names>
</name>
<name><surname>Lopez</surname>
<given-names>R</given-names>
</name>
<name><surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<name><surname>Etzold</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>The EBI SRS Server – Recent Developments</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>368</fpage>
<lpage>373</lpage>
<pub-id pub-id-type="pmid">11847095</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.2.368</pub-id>
</citation>
</ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname>
<given-names>S</given-names>
</name>
<name><surname>Manber</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Fast Text Searching Allowing Errors</article-title>
<source>Commun ACM</source>
<year>1992</year>
<volume>35</volume>
<fpage>83</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="doi">10.1145/135239.135244</pub-id>
</citation>
</ref>
<ref id="B36"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Shavlik</surname>
<given-names>JW</given-names>
</name>
<name><surname>Dietterich</surname>
<given-names>TG</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Jude W. Shavlik and Thomas G. Dietterich</surname>
</name>
</person-group>
<article-title>General Aspects of Machine Learning</article-title>
<source>In Readings in Machine Learning</source>
<year>1990</year>
<publisher-name>Morgan Kaufmann</publisher-name>
<fpage>1</fpage>
<lpage>10</lpage>
</citation>
</ref>
<ref id="B37"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Mitchell</surname>
<given-names>TM</given-names>
</name>
</person-group>
<source>Machine Learning</source>
<year>1997</year>
<publisher-name>McGraw Hill</publisher-name>
</citation>
</ref>
<ref id="B38"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Cohen</surname>
<given-names>WW</given-names>
</name>
</person-group>
<article-title>Fast Effective Rule Induction</article-title>
<source>In Twelfth International Conference on Machine Learning: July 9–12, 1995</source>
<year>1995</year>
<publisher-name>Lake Tahoe, U.S.A. Morgan Kaufmann</publisher-name>
<fpage>115</fpage>
<lpage>123</lpage>
</citation>
</ref>
<ref id="B39"><citation citation-type="other"><article-title>EMBOSS</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.hgmp.mrc.ac.uk/Software/EMBOSS/"></ext-link>
</citation>
</ref>
<ref id="B40"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Rost</surname>
<given-names>B</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Russell F. Doolittle</surname>
</name>
</person-group>
<article-title>PHD: Predicting 1D Protein Structure by Profile Based Neural Networks</article-title>
<source>In Methods in Enzymology 266</source>
<year>1996</year>
<publisher-name>Academic Press</publisher-name>
<fpage>525</fpage>
<lpage>539</lpage>
</citation>
</ref>
<ref id="B41"><citation citation-type="other"><article-title>PredictProtein</article-title>
<ext-link ext-link-type="uri" xlink:href="http://cubic.bioc.columbia.edu/predictprotein"></ext-link>
</citation>
</ref>
<ref id="B42"><citation citation-type="other"><article-title>NPS@ (Network Protein Sequence @nalysis) Server</article-title>
<ext-link ext-link-type="uri" xlink:href="http://npsa-pbil.ibcp.fr/"></ext-link>
</citation>
</ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ouali</surname>
<given-names>M</given-names>
</name>
<name><surname>King</surname>
<given-names>RD</given-names>
</name>
</person-group>
<article-title>Cascaded Multiple Classifiers for Secondary Structure Prediction</article-title>
<source>Protein Sci</source>
<year>2000</year>
<volume>9</volume>
<fpage>1162</fpage>
<lpage>1176</lpage>
<pub-id pub-id-type="pmid">10892809</pub-id>
</citation>
</ref>
<ref id="B44"><citation citation-type="other"><article-title>PROF – Secondary Structure Prediction System</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.aber.ac.uk/~phiwww/prof/"></ext-link>
</citation>
</ref>
<ref id="B45"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Karplus</surname>
<given-names>K</given-names>
</name>
<name><surname>Karchin</surname>
<given-names>R</given-names>
</name>
<name><surname>Draper</surname>
<given-names>J</given-names>
</name>
<name><surname>Casper</surname>
<given-names>J</given-names>
</name>
<name><surname>Mandel-Gutfreund</surname>
<given-names>Y</given-names>
</name>
<name><surname>Diekhans</surname>
<given-names>M</given-names>
</name>
<name><surname>Hughey</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Combining Local-Structure, Fold-Recognition, and new-Fold Methods for Protein Structure Prediction</article-title>
<source>Proteins</source>
<year>2003</year>
</citation>
</ref>
<ref id="B46"><citation citation-type="other"><article-title>HMM-based Protein Structure Prediction, SAM-T02</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.soe.ucsc.edu/research/compbio/SAM_T02/T02-query.html"></ext-link>
</citation>
</ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barton</surname>
<given-names>GJ</given-names>
</name>
</person-group>
<article-title>An Efficient Algorithm to Locate all Locally Optimal Alignments between Two Sequences Allowing for Gaps</article-title>
<source>CABIOS</source>
<year>1993</year>
<volume>9</volume>
<fpage>729</fpage>
<lpage>734</lpage>
<pub-id pub-id-type="pmid">8143160</pub-id>
</citation>
</ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Gish</surname>
<given-names>W</given-names>
</name>
<name><surname>Miller</surname>
<given-names>W</given-names>
</name>
<name><surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Basic Local Alignment Search Tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
<pub-id pub-id-type="doi">10.1006/jmbi.1990.9999</pub-id>
</citation>
</ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wise</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Protein Annotators' Assistant</article-title>
<source>Trends Biochem Sci</source>
<year>2000</year>
<volume>25</volume>
<fpage>252</fpage>
<lpage>253</lpage>
<pub-id pub-id-type="pmid">10782098</pub-id>
<pub-id pub-id-type="doi">10.1016/S0968-0004(00)01554-1</pub-id>
</citation>
</ref>
<ref id="B50"><citation citation-type="other"><article-title>Protein Annotators' Assistant</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/paa"></ext-link>
</citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Bois/explor/OrangerV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0009579 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0009579 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Bois
   |area=    OrangerV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.25.
Data generation: Sat Dec 3 17:11:04 2016. Site generation: Wed Mar 6 18:18:32 2024
