Exploration server on the relations between France and Australia

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information is therefore not validated.
***** Access problem to record *****

Internal identifier: 002272 (Pmc/Corpus); previous: 002271; next: 002273 ***** probable XML problem with record *****



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems</title>
<author>
<name sortKey="Le Cao, Kim Anh" sort="Le Cao, Kim Anh" uniqKey="Le Cao K" first="Kim-Anh" last="Lê Cao">Kim-Anh Lê Cao</name>
<affiliation>
<nlm:aff id="I1">Queensland Facility for Advanced Bioinformatics, University of Queensland, 4072 St Lucia, QLD, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Boitard, Simon" sort="Boitard, Simon" uniqKey="Boitard S" first="Simon" last="Boitard">Simon Boitard</name>
<affiliation>
<nlm:aff id="I2">UMR444 Laboratoire de Génétique Cellulaire, INRA, BP 52627, F-31326 Castanet Tolosan, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Besse, Philippe" sort="Besse, Philippe" uniqKey="Besse P" first="Philippe" last="Besse">Philippe Besse</name>
<affiliation>
<nlm:aff id="I3">Institut de Mathématiques de Toulouse, Université de Toulouse et CNRS (UMR 5219), F-31062 Toulouse, France</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21693065</idno>
<idno type="pmc">3133555</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3133555</idno>
<idno type="RBID">PMC:3133555</idno>
<idno type="doi">10.1186/1471-2105-12-253</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">002272</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">002272</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems</title>
<author>
<name sortKey="Le Cao, Kim Anh" sort="Le Cao, Kim Anh" uniqKey="Le Cao K" first="Kim-Anh" last="Lê Cao">Kim-Anh Lê Cao</name>
<affiliation>
<nlm:aff id="I1">Queensland Facility for Advanced Bioinformatics, University of Queensland, 4072 St Lucia, QLD, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Boitard, Simon" sort="Boitard, Simon" uniqKey="Boitard S" first="Simon" last="Boitard">Simon Boitard</name>
<affiliation>
<nlm:aff id="I2">UMR444 Laboratoire de Génétique Cellulaire, INRA, BP 52627, F-31326 Castanet Tolosan, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Besse, Philippe" sort="Besse, Philippe" uniqKey="Besse P" first="Philippe" last="Besse">Philippe Besse</name>
<affiliation>
<nlm:aff id="I3">Institut de Mathématiques de Toulouse, Université de Toulouse et CNRS (UMR 5219), F-31062 Toulouse, France</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.</p>
</sec>
<sec>
<title>Results</title>
<p>A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the
<monospace>R package mixOmics</monospace>
, which is dedicated to the analysis of large biological data sets.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Golub, T" uniqKey="Golub T">T Golub</name>
</author>
<author>
<name sortKey="Slonim, D" uniqKey="Slonim D">D Slonim</name>
</author>
<author>
<name sortKey="Tamayo, P" uniqKey="Tamayo P">P Tamayo</name>
</author>
<author>
<name sortKey="Huard, C" uniqKey="Huard C">C Huard</name>
</author>
<author>
<name sortKey="Gaasenbeek, M" uniqKey="Gaasenbeek M">M Gaasenbeek</name>
</author>
<author>
<name sortKey="Mesirov, J" uniqKey="Mesirov J">J Mesirov</name>
</author>
<author>
<name sortKey="Coller, H" uniqKey="Coller H">H Coller</name>
</author>
<author>
<name sortKey="Loh, M" uniqKey="Loh M">M Loh</name>
</author>
<author>
<name sortKey="Downing, J" uniqKey="Downing J">J Downing</name>
</author>
<author>
<name sortKey="Caligiuri, M" uniqKey="Caligiuri M">M Caligiuri</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dudoit, S" uniqKey="Dudoit S">S Dudoit</name>
</author>
<author>
<name sortKey="Fridlyand, J" uniqKey="Fridlyand J">J Fridlyand</name>
</author>
<author>
<name sortKey="Speed, T" uniqKey="Speed T">T Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guyon, I" uniqKey="Guyon I">I Guyon</name>
</author>
<author>
<name sortKey="Elisseefi, A" uniqKey="Elisseefi A">A Elisseefi</name>
</author>
<author>
<name sortKey="Kaelbling, L" uniqKey="Kaelbling L">L Kaelbling</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ashburner, M" uniqKey="Ashburner M">M Ashburner</name>
</author>
<author>
<name sortKey="Ball, C" uniqKey="Ball C">C Ball</name>
</author>
<author>
<name sortKey="Blake, J" uniqKey="Blake J">J Blake</name>
</author>
<author>
<name sortKey="Botstein, D" uniqKey="Botstein D">D Botstein</name>
</author>
<author>
<name sortKey="Butler, H" uniqKey="Butler H">H Butler</name>
</author>
<author>
<name sortKey="Cherry, J" uniqKey="Cherry J">J Cherry</name>
</author>
<author>
<name sortKey="Davis, A" uniqKey="Davis A">A Davis</name>
</author>
<author>
<name sortKey="Dolinski, K" uniqKey="Dolinski K">K Dolinski</name>
</author>
<author>
<name sortKey="Dwight, S" uniqKey="Dwight S">S Dwight</name>
</author>
<author>
<name sortKey="Eppig, J" uniqKey="Eppig J">J Eppig</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Bonnet, A" uniqKey="Bonnet A">A Bonnet</name>
</author>
<author>
<name sortKey="Gadat, S" uniqKey="Gadat S">S Gadat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Breiman, L" uniqKey="Breiman L">L Breiman</name>
</author>
<author>
<name sortKey="Friedman, J" uniqKey="Friedman J">J Friedman</name>
</author>
<author>
<name sortKey="Olshen, R" uniqKey="Olshen R">R Olshen</name>
</author>
<author>
<name sortKey="Stone, C" uniqKey="Stone C">C Stone</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vapnik, Vn" uniqKey="Vapnik V">VN Vapnik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Breiman, L" uniqKey="Breiman L">L Breiman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author>
<name sortKey="Narasimhan, B" uniqKey="Narasimhan B">B Narasimhan</name>
</author>
<author>
<name sortKey="Chu, G" uniqKey="Chu G">G Chu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Goncalves, O" uniqKey="Goncalves O">O Goncalves</name>
</author>
<author>
<name sortKey="Besse, P" uniqKey="Besse P">P Besse</name>
</author>
<author>
<name sortKey="Gadat, S" uniqKey="Gadat S">S Gadat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bair, E" uniqKey="Bair E">E Bair</name>
</author>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author>
<name sortKey="Paul, D" uniqKey="Paul D">D Paul</name>
</author>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jombart, T" uniqKey="Jombart T">T Jombart</name>
</author>
<author>
<name sortKey="Devillard, S" uniqKey="Devillard S">S Devillard</name>
</author>
<author>
<name sortKey="Balloux, F" uniqKey="Balloux F">F Balloux</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wold, H" uniqKey="Wold H">H Wold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Antoniadis, A" uniqKey="Antoniadis A">A Antoniadis</name>
</author>
<author>
<name sortKey="Lambert Lacroix, S" uniqKey="Lambert Lacroix S">S Lambert-Lacroix</name>
</author>
<author>
<name sortKey="Leblanc, F" uniqKey="Leblanc F">F Leblanc</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boulesteix, A" uniqKey="Boulesteix A">A Boulesteix</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dai, J" uniqKey="Dai J">J Dai</name>
</author>
<author>
<name sortKey="Lieu, L" uniqKey="Lieu L">L Lieu</name>
</author>
<author>
<name sortKey="Rocke, D" uniqKey="Rocke D">D Rocke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hoerl, A" uniqKey="Hoerl A">A Hoerl</name>
</author>
<author>
<name sortKey="Kennard, R" uniqKey="Kennard R">R Kennard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zou, H" uniqKey="Zou H">H Zou</name>
</author>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jolliffe, I" uniqKey="Jolliffe I">I Jolliffe</name>
</author>
<author>
<name sortKey="Trendafilov, N" uniqKey="Trendafilov N">N Trendafilov</name>
</author>
<author>
<name sortKey="Uddin, M" uniqKey="Uddin M">M Uddin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shen, H" uniqKey="Shen H">H Shen</name>
</author>
<author>
<name sortKey="Huang, Jz" uniqKey="Huang J">JZ Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Waaijenborg, S" uniqKey="Waaijenborg S">S Waaijenborg</name>
</author>
<author>
<name sortKey="De Witt Hamer, V" uniqKey="De Witt Hamer V">V de Witt Hamer</name>
</author>
<author>
<name sortKey="Philip, C" uniqKey="Philip C">C Philip</name>
</author>
<author>
<name sortKey="Zwinderman, A" uniqKey="Zwinderman A">A Zwinderman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parkhomenko, E" uniqKey="Parkhomenko E">E Parkhomenko</name>
</author>
<author>
<name sortKey="Tritchler, D" uniqKey="Tritchler D">D Tritchler</name>
</author>
<author>
<name sortKey="Beyene, J" uniqKey="Beyene J">J Beyene</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Witten, D" uniqKey="Witten D">D Witten</name>
</author>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Rossouw, D" uniqKey="Rossouw D">D Rossouw</name>
</author>
<author>
<name sortKey="Robert Granie, C" uniqKey="Robert Granie C">C Robert-Granié</name>
</author>
<author>
<name sortKey="Besse, P" uniqKey="Besse P">P Besse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Martin, P" uniqKey="Martin P">P Martin</name>
</author>
<author>
<name sortKey="Robert Granie, C" uniqKey="Robert Granie C">C Robert-Granié</name>
</author>
<author>
<name sortKey="Besse, P" uniqKey="Besse P">P Besse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chun, H" uniqKey="Chun H">H Chun</name>
</author>
<author>
<name sortKey="Kele, S" uniqKey="Kele S">S Keleş</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Pan, W" uniqKey="Pan W">W Pan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Pan, W" uniqKey="Pan W">W Pan</name>
</author>
<author>
<name sortKey="Park, S" uniqKey="Park S">S Park</name>
</author>
<author>
<name sortKey="Han, X" uniqKey="Han X">X Han</name>
</author>
<author>
<name sortKey="Miller, L" uniqKey="Miller L">L Miller</name>
</author>
<author>
<name sortKey="Hall, J" uniqKey="Hall J">J Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chung, D" uniqKey="Chung D">D Chung</name>
</author>
<author>
<name sortKey="Keles, S" uniqKey="Keles S">S Keles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marx, B" uniqKey="Marx B">B Marx</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ding, B" uniqKey="Ding B">B Ding</name>
</author>
<author>
<name sortKey="Gentleman, R" uniqKey="Gentleman R">R Gentleman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fort, G" uniqKey="Fort G">G Fort</name>
</author>
<author>
<name sortKey="Lambert Lacroix, S" uniqKey="Lambert Lacroix S">S Lambert-Lacroix</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, X" uniqKey="Zhou X">X Zhou</name>
</author>
<author>
<name sortKey="Tuck, D" uniqKey="Tuck D">D Tuck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, T" uniqKey="Yang T">T Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K Liu</name>
</author>
<author>
<name sortKey="Xu, C" uniqKey="Xu C">C Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barker, M" uniqKey="Barker M">M Barker</name>
</author>
<author>
<name sortKey="Rayens, W" uniqKey="Rayens W">W Rayens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tan, Y" uniqKey="Tan Y">Y Tan</name>
</author>
<author>
<name sortKey="Shi, L" uniqKey="Shi L">L Shi</name>
</author>
<author>
<name sortKey="Tong, W" uniqKey="Tong W">W Tong</name>
</author>
<author>
<name sortKey="Gene Hwang, G" uniqKey="Gene Hwang G">G Gene Hwang</name>
</author>
<author>
<name sortKey="Wang, C" uniqKey="Wang C">C Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meinshausen, N" uniqKey="Meinshausen N">N Meinshausen</name>
</author>
<author>
<name sortKey="Buhlmann, P" uniqKey="Buhlmann P">P Bühlmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bach, F" uniqKey="Bach F">F Bach</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ahdesm Ki, M" uniqKey="Ahdesm Ki M">M Ahdesmäki</name>
</author>
<author>
<name sortKey="Strimmer, K" uniqKey="Strimmer K">K Strimmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Gonzalez, I" uniqKey="Gonzalez I">I González</name>
</author>
<author>
<name sortKey="Dejean, S" uniqKey="Dejean S">S Déjean</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Khan, J" uniqKey="Khan J">J Khan</name>
</author>
<author>
<name sortKey="Wei, Js" uniqKey="Wei J">JS Wei</name>
</author>
<author>
<name sortKey="Ringner, M" uniqKey="Ringner M">M Ringnér</name>
</author>
<author>
<name sortKey="Saal, Lh" uniqKey="Saal L">LH Saal</name>
</author>
<author>
<name sortKey="Ladanyi, M" uniqKey="Ladanyi M">M Ladanyi</name>
</author>
<author>
<name sortKey="Westermann, F" uniqKey="Westermann F">F Westermann</name>
</author>
<author>
<name sortKey="Berthold, F" uniqKey="Berthold F">F Berthold</name>
</author>
<author>
<name sortKey="Schwab, M" uniqKey="Schwab M">M Schwab</name>
</author>
<author>
<name sortKey="Antonescu, Cr" uniqKey="Antonescu C">CR Antonescu</name>
</author>
<author>
<name sortKey="Peterson, C" uniqKey="Peterson C">C Peterson</name>
</author>
<author>
<name sortKey="Meltzer, Ps" uniqKey="Meltzer P">PS Meltzer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pomeroy, S" uniqKey="Pomeroy S">S Pomeroy</name>
</author>
<author>
<name sortKey="Tamayo, P" uniqKey="Tamayo P">P Tamayo</name>
</author>
<author>
<name sortKey="Gaasenbeek, M" uniqKey="Gaasenbeek M">M Gaasenbeek</name>
</author>
<author>
<name sortKey="Sturla, L" uniqKey="Sturla L">L Sturla</name>
</author>
<author>
<name sortKey="Angelo, M" uniqKey="Angelo M">M Angelo</name>
</author>
<author>
<name sortKey="Mclaughlin, M" uniqKey="Mclaughlin M">M McLaughlin</name>
</author>
<author>
<name sortKey="Kim, J" uniqKey="Kim J">J Kim</name>
</author>
<author>
<name sortKey="Goumnerova, L" uniqKey="Goumnerova L">L Goumnerova</name>
</author>
<author>
<name sortKey="Black, P" uniqKey="Black P">P Black</name>
</author>
<author>
<name sortKey="Lau, C" uniqKey="Lau C">C Lau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ramaswamy, S" uniqKey="Ramaswamy S">S Ramaswamy</name>
</author>
<author>
<name sortKey="Tamayo, P" uniqKey="Tamayo P">P Tamayo</name>
</author>
<author>
<name sortKey="Rifkin, R" uniqKey="Rifkin R">R Rifkin</name>
</author>
<author>
<name sortKey="Mukherjee, S" uniqKey="Mukherjee S">S Mukherjee</name>
</author>
<author>
<name sortKey="Yeang, C" uniqKey="Yeang C">C Yeang</name>
</author>
<author>
<name sortKey="Angelo, M" uniqKey="Angelo M">M Angelo</name>
</author>
<author>
<name sortKey="Ladd, C" uniqKey="Ladd C">C Ladd</name>
</author>
<author>
<name sortKey="Reich, M" uniqKey="Reich M">M Reich</name>
</author>
<author>
<name sortKey="Latulippe, E" uniqKey="Latulippe E">E Latulippe</name>
</author>
<author>
<name sortKey="Mesirov, J" uniqKey="Mesirov J">J Mesirov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yeung, K" uniqKey="Yeung K">K Yeung</name>
</author>
<author>
<name sortKey="Burmgarner, R" uniqKey="Burmgarner R">R Burmgarner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jakobsson, M" uniqKey="Jakobsson M">M Jakobsson</name>
</author>
<author>
<name sortKey="Scholz, S" uniqKey="Scholz S">S Scholz</name>
</author>
<author>
<name sortKey="Scheet, P" uniqKey="Scheet P">P Scheet</name>
</author>
<author>
<name sortKey="Gibbs, J" uniqKey="Gibbs J">J Gibbs</name>
</author>
<author>
<name sortKey="Vanliere, J" uniqKey="Vanliere J">J VanLiere</name>
</author>
<author>
<name sortKey="Fung, H" uniqKey="Fung H">H Fung</name>
</author>
<author>
<name sortKey="Szpiech, Z" uniqKey="Szpiech Z">Z Szpiech</name>
</author>
<author>
<name sortKey="Degnan, J" uniqKey="Degnan J">J Degnan</name>
</author>
<author>
<name sortKey="Wang, K" uniqKey="Wang K">K Wang</name>
</author>
<author>
<name sortKey="Guerreiro, R" uniqKey="Guerreiro R">R Guerreiro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guyon, I" uniqKey="Guyon I">I Guyon</name>
</author>
<author>
<name sortKey="Weston, J" uniqKey="Weston J">J Weston</name>
</author>
<author>
<name sortKey="Barnhill, S" uniqKey="Barnhill S">S Barnhill</name>
</author>
<author>
<name sortKey="Vapnik, V" uniqKey="Vapnik V">V Vapnik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Chabrier, P" uniqKey="Chabrier P">P Chabrier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nguyen, D" uniqKey="Nguyen D">D Nguyen</name>
</author>
<author>
<name sortKey="Rocke, D" uniqKey="Rocke D">D Rocke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boulesteix, A" uniqKey="Boulesteix A">A Boulesteix</name>
</author>
<author>
<name sortKey="Strimmer, K" uniqKey="Strimmer K">K Strimmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hoskuldsson, A" uniqKey="Hoskuldsson A">A Höskuldsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wold, S" uniqKey="Wold S">S Wold</name>
</author>
<author>
<name sortKey="Sjostrom, M" uniqKey="Sjostrom M">M Sjöström</name>
</author>
<author>
<name sortKey="Eriksson, L" uniqKey="Eriksson L">L Eriksson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chih Yu Wang, C" uniqKey="Chih Yu Wang C">C Chih-Yu Wang</name>
</author>
<author>
<name sortKey="Chiang, C" uniqKey="Chiang C">C Chiang</name>
</author>
<author>
<name sortKey="Shueng Tsong Young, S" uniqKey="Shueng Tsong Young S">S Shueng-Tsong Young</name>
</author>
<author>
<name sortKey="Chiang, H" uniqKey="Chiang H">H Chiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nguyen, D" uniqKey="Nguyen D">D Nguyen</name>
</author>
<author>
<name sortKey="Rocke, D" uniqKey="Rocke D">D Rocke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le Cao, Ka" uniqKey="Le Cao K">KA Lê Cao</name>
</author>
<author>
<name sortKey="Meugnier, E" uniqKey="Meugnier E">E Meugnier</name>
</author>
<author>
<name sortKey="Mclachlan, G" uniqKey="Mclachlan G">G McLachlan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qiao, X" uniqKey="Qiao X">X Qiao</name>
</author>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21693065</article-id>
<article-id pub-id-type="pmc">3133555</article-id>
<article-id pub-id-type="publisher-id">1471-2105-12-253</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-12-253</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Lê Cao</surname>
<given-names>Kim-Anh</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>k.lecao@uq.edu.au</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Boitard</surname>
<given-names>Simon</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>simon.boitard@toulouse.inra.fr</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Besse</surname>
<given-names>Philippe</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>philippe.besse@math.univ-toulouse.fr</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Queensland Facility for Advanced Bioinformatics, University of Queensland, 4072 St Lucia, QLD, Australia</aff>
<aff id="I2">
<label>2</label>
UMR444 Laboratoire de Génétique Cellulaire, INRA, BP 52627, F-31326 Castanet Tolosan, France</aff>
<aff id="I3">
<label>3</label>
Institut de Mathématiques de Toulouse, Université de Toulouse et CNRS (UMR 5219), F-31062 Toulouse, France</aff>
<pub-date pub-type="collection">
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>22</day>
<month>6</month>
<year>2011</year>
</pub-date>
<volume>12</volume>
<fpage>253</fpage>
<lpage>253</lpage>
<history>
<date date-type="received">
<day>3</day>
<month>11</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>22</day>
<month>6</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2011 Lê Cao et al; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2011</copyright-year>
<copyright-holder>Lê Cao et al; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/12/253"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.</p>
</sec>
<sec>
<title>Results</title>
<p>A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the
<monospace>R package mixOmics</monospace>
, which is dedicated to the analysis of large biological data sets.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>High throughput technologies, such as microarrays or single nucleotide polymorphisms (SNPs), are seen as having great potential to provide new insights into cell biology and biological pathways, or to assess population genetic structure. The microarray technique has mostly been used to further delineate cancer subgroups or to identify candidate genes for cancer prognosis and therapeutic targeting. To this aim, various classification techniques have been applied to analyze and understand gene expression data resulting from DNA microarrays ([
<xref ref-type="bibr" rid="B1">1</xref>
-
<xref ref-type="bibr" rid="B3">3</xref>
], to cite only a few). Genome-wide association studies using SNPs aim to identify genetic variants related to complex traits. Thousands of SNPs are genotyped for a small number of phenotypes with genomic information, and clustering methods such as Bayesian cluster analysis and multidimensional scaling have previously been applied to infer population structure [
<xref ref-type="bibr" rid="B4">4</xref>
].</p>
<sec>
<title>Variable selection</title>
<p>As these high throughput data are characterized by thousands of variables (genes, SNPs) and a small number of samples (the microarrays or the patients), they often imply a high degree of multicollinearity, and, as a result, lead to severely ill-conditioned problems. In a supervised classification framework, one solution is to reduce the dimensionality of the data either by performing feature selection, or by introducing artificial variables that summarize most of the information. For this purpose, many approaches have been proposed in the microarray literature. Commonly used statistical tests such as t- or F-tests are often sensitive to highly correlated variables, which might be neglected in the variable selection. These tests may also discard variables that would be useful to distinguish classes that are difficult to classify [
<xref ref-type="bibr" rid="B5">5</xref>
]. Machine Learning approaches, such as Classification and Regression Trees (CART, [
<xref ref-type="bibr" rid="B6">6</xref>
]), Support Vector Machines (SVM, [
<xref ref-type="bibr" rid="B7">7</xref>
]) do not necessarily require variable selection for predictive purposes. However, in the case of high-dimensional data sets, the results are often difficult to interpret given the large number of variables. To circumvent this problem, several authors have developed wrapper and embedded approaches for microarray data: Random Forests (RF, [
<xref ref-type="bibr" rid="B8">8</xref>
]), Recursive Feature Elimination (RFE, [
<xref ref-type="bibr" rid="B3">3</xref>
]), Nearest Shrunken Centroids (NSC, [
<xref ref-type="bibr" rid="B9">9</xref>
]), and more recently Optimal Feature Weighting (OFW, [
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
]). Other approaches have also been used for exploratory purposes, to give more insight into biological studies. This is the case for Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA, see [
]). Other approaches were also used for exploratory purposes and to give more insight into biological studies. This is the case of Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA, see [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
] for a supervised version), Partial Least Squares Regression (PLS, [
<xref ref-type="bibr" rid="B13">13</xref>
], see also [
<xref ref-type="bibr" rid="B14">14</xref>
-
<xref ref-type="bibr" rid="B16">16</xref>
] for discrimination purposes), to explain most of the variance/covariance structure of the data using linear combinations of the original variables. LDA has often been shown to produce the best classification results. However, it has numerical limitations. In particular, for large data sets with too many correlated predictors, LDA uses too many parameters that are estimated with a high variance. There is therefore a need to either regularize LDA or introduce sparsity in LDA to obtain a parsimonious model. Another limitation of the approaches cited above is the lack of interpretability when dealing with a large number of variables.</p>
<p>Numerous
<italic>sparse </italic>
versions have therefore been proposed for feature selection purposes. They adapt well-known ideas from the regression context by introducing penalties into the model. For example, a
<italic>l</italic>
<sub>2 </sub>
norm penalty leads to Ridge regression [
<xref ref-type="bibr" rid="B17">17</xref>
] to regularize singular, non-invertible matrices. Penalties of type
<italic>l</italic>
<sub>1 </sub>
norm, also called Lasso [
<xref ref-type="bibr" rid="B18">18</xref>
], or
<italic>l</italic>
<sub>0 </sub>
norm, were also proposed to perform feature selection, as well as a combination of
<italic>l</italic>
<sub>1 </sub>
and
<italic>l</italic>
<sub>2 </sub>
penalties [
<xref ref-type="bibr" rid="B19">19</xref>
]. These penalties (
<italic>l</italic>
<sub>1 </sub>
and/or
<italic>l</italic>
<sub>2</sub>
) were applied to the variable weight vectors in order to select the relevant variables in PCA [
<xref ref-type="bibr" rid="B20">20</xref>
,
<xref ref-type="bibr" rid="B21">21</xref>
] and more recently in Canonical Correlation Analysis [
<xref ref-type="bibr" rid="B22">22</xref>
-
<xref ref-type="bibr" rid="B24">24</xref>
] and in PLS [
<xref ref-type="bibr" rid="B25">25</xref>
-
<xref ref-type="bibr" rid="B27">27</xref>
]. [
<xref ref-type="bibr" rid="B28">28</xref>
,
<xref ref-type="bibr" rid="B29">29</xref>
] also proposed penalized versions of PLS for binary classification problems. Recently, [
<xref ref-type="bibr" rid="B30">30</xref>
] extended the SPLS from [
<xref ref-type="bibr" rid="B27">27</xref>
] to multiclass classification problems and demonstrated that both SPLSDA and SPLS incorporated into a generalized linear model framework (SGPLS) improved classification accuracy compared to classical PLS [
<xref ref-type="bibr" rid="B31">31</xref>
-
<xref ref-type="bibr" rid="B33">33</xref>
].</p>
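As a concrete, hedged illustration of these penalties (a minimal sketch on synthetic data, not code from any of the cited works), the following Python fragment contrasts an l1 (Lasso) and an l2 (Ridge) penalty: the l1 penalty sets most coefficients exactly to zero and thus performs feature selection, while the l2 penalty only shrinks them.

```python
# Illustrative sketch (synthetic data): l1 vs l2 penalized regression.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
n, p = 50, 200                       # far more variables than samples
X = rng.randn(n, p)
beta = np.zeros(p)
beta[:5] = 2.0                       # only 5 truly relevant variables
y = X @ beta + 0.1 * rng.randn(n)

lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty: shrinkage, no zeros

n_lasso = int(np.sum(lasso.coef_ != 0))
n_ridge = int(np.sum(ridge.coef_ != 0))
print("non-zero coefficients - lasso:", n_lasso, "ridge:", n_ridge)
```

The combination of both penalties (elastic net, as in [19]) is available in scikit-learn as ElasticNet.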
</sec>
<sec>
<title>Multiclass problems</title>
<p>In this study, we specifically focus on multiclass classification problems. Multiclass problems are commonly encountered in microarray studies, and have recently given rise to several contributions in the literature [
<xref ref-type="bibr" rid="B34">34</xref>
] and more recently [
<xref ref-type="bibr" rid="B35">35</xref>
,
<xref ref-type="bibr" rid="B36">36</xref>
]. Extending binary classification approaches to multiclass problems is not a trivial task. Some approaches extend naturally to multiclass problems; this is the case of CART and LDA. Other approaches require the decomposition of the multiclass problem into several binary problems, or the definition of multiclass objective functions. This is the case, for example, of one-vs.-one and one-vs.-rest SVM, or of multiclass SVM.</p>
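To make these decompositions concrete, here is a small scikit-learn sketch (an illustration on a stand-in data set, not the experimental setup of this paper) showing the one-vs.-rest and one-vs.-one strategies wrapped around a linear SVM:

```python
# Illustrative sketch: one-vs.-rest and one-vs.-one decompositions of a
# multiclass problem with a linear SVM (scikit-learn, stand-in data set).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)    # 3 classes stand in for sample classes
K = len(set(y))

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

# one-vs.-rest fits K binary classifiers, one-vs.-one fits K(K-1)/2
print(len(ovr.estimators_), len(ovo.estimators_))
```

The number of binary sub-problems grows linearly (one-vs.-rest) or quadratically (one-vs.-one) with the number of classes, which partly explains the cost of these decompositions on data sets such as GCM.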
</sec>
<sec>
<title>Sparse PLS-DA</title>
<p>We introduce a sparse version of the PLS for discrimination purposes (sPLS-Discriminant Analysis) which is a natural extension to the sPLS proposed by [
<xref ref-type="bibr" rid="B25">25</xref>
,
<xref ref-type="bibr" rid="B26">26</xref>
]. Although PLS is principally designed for regression problems, it performs well for classification problems [
<xref ref-type="bibr" rid="B37">37</xref>
,
<xref ref-type="bibr" rid="B38">38</xref>
]. Using this exploratory approach in a supervised classification context enables us to check the generalization properties of the model and to ensure that the selected variables can help predict the outcome status of the patients. It is also important to check the stability of the selection, as proposed by [
<xref ref-type="bibr" rid="B39">39</xref>
,
<xref ref-type="bibr" rid="B40">40</xref>
]. We show that sPLS-DA has very satisfying predictive performance and is well able to select informative variables. Contrary to the two-stage approach recently proposed by [
<xref ref-type="bibr" rid="B30">30</xref>
], sPLS-DA performs variable selection and classification in a one-step procedure. We also place a strong focus on graphical representations to aid the interpretation of the results. We show that the computational efficiency of sPLS-DA, combined with its graphical outputs, gives sPLS-DA a strong advantage over the other one-step variable selection approaches in the multiclass case.</p>
</sec>
<sec>
<title>Outline of the paper</title>
<p>We will first discuss the number of dimensions to choose in sPLS-DA, and compare its classification performance with multivariate projection-based approaches: variants of sLDA [
<xref ref-type="bibr" rid="B41">41</xref>
], variants of SPLSDA and with SGPLS from [
<xref ref-type="bibr" rid="B30">30</xref>
]; and with five multiclass wrapper approaches (RFE, NSC, RF, OFW-cart, OFW-svm) on four public multiclass microarray data sets and one public SNP data set. All approaches perform internal variable selection and are compared based on their generalization performance and their computational time. We discuss the stability of the variable selection performed with sPLS-DA and the biological relevance of the selected genes. Unlike the other projection-based sparse approaches tested, we show that sPLS-DA provides valuable graphical outputs, also available from our
<monospace>R </monospace>
package
<monospace>mixOmics</monospace>
, to guide the interpretation of the results [
<xref ref-type="bibr" rid="B42">42</xref>
,
<xref ref-type="bibr" rid="B43">43</xref>
].</p>
</sec>
</sec>
<sec>
<title>Results and Discussion</title>
<p>In this section, we compare our proposed sPLS-DA approach with other sparse exploratory approaches such as two sparse Linear Discriminant Analyses (LDA) proposed by [
<xref ref-type="bibr" rid="B41">41</xref>
], and three other versions of sparse PLS from [
<xref ref-type="bibr" rid="B30">30</xref>
]. We also include in our comparisons several wrapper multiclass classification approaches. Comparisons are made on four public cancer microarray data sets and on one SNP data set. All these approaches perform variable selection in a supervised classification setting, i.e. we are looking for the genes/SNPs that can help classify the different sample classes.</p>
<p>We first discuss the choice of the number of dimensions
<italic>H </italic>
in sPLS-DA, the classification performance obtained with the tested approaches, and the computational time required by the exploratory approaches. We then perform a stability analysis with sPLS-DA that can help tune the number of variables to select, and we illustrate some useful graphical outputs resulting from the by-products of sPLS-DA. We finally assess the biological relevance of the list of genes obtained on one data set.</p>
<sec>
<title>Data sets</title>
<sec>
<title>Leukemia</title>
<p>The 3-class Leukemia version [
<xref ref-type="bibr" rid="B1">1</xref>
] with 7129 genes compares the B and T lymphocytes in ALL (Acute Lymphoblastic Leukemia, 38 and 9 cases) and the AML class (Acute Myeloid Leukemia, 25 cases). The classes ALL-B and ALL-T are known to be biologically very similar, which adds complexity to the data set.</p>
</sec>
<sec>
<title>SRBCT</title>
<p>The Small Round Blue-Cell Tumor Data of childhood (SRBCT, [
<xref ref-type="bibr" rid="B44">44</xref>
]) includes 4 different types of tumors with 23, 20, 12 and 8 microarrays per class and 2308 genes.</p>
</sec>
<sec>
<title>Brain</title>
<p>The Brain data set compares 5 embryonal tumors [
<xref ref-type="bibr" rid="B45">45</xref>
] with 5597 gene expressions. Classes 1, 2 and 3 contain 10 microarrays each, and the remaining two classes contain 4 and 8 microarrays respectively.</p>
</sec>
<sec>
<title>GCM</title>
<p>The Multiple Tumor data set initially compared 14 tumors [
<xref ref-type="bibr" rid="B46">46</xref>
] and 7129 gene expressions. We used the normalized data set from [
<xref ref-type="bibr" rid="B47">47</xref>
] with 11 types of tumor. The data set contains 90 samples coming from different tumor types: breast (7), central nervous system (12), colon (10), leukemia (29), lung (6), lymphoma (19), melanoma (5), mesothelioma (11), pancreas (7), renal (8) and uterus (9).</p>
</sec>
<sec>
<title>SNP data</title>
<p>The SNP data set considered in our study is a subsample of the data set studied by [
<xref ref-type="bibr" rid="B48">48</xref>
] in the context of the Human Genome Diversity Project, which was initiated for the purpose of assessing worldwide genetic diversity in humans. The original data set of [
<xref ref-type="bibr" rid="B48">48</xref>
] included the genotypes at 525,910 single-nucleotide polymorphisms (SNPs) of 485 individuals from a worldwide sample of 29 populations. In order to work on a smaller sample size data set with still a large number of classes or populations (
<italic>K </italic>
= 7) and a high classification complexity, we chose to keep only the African populations: Bantu Kenya, Bantu South Africa, Biaka Pygmy, Mandenka, Mbuti Pygmy, San and Yoruba. We kept only the SNPs with a Minor Allele Frequency &gt; 0.05. For computational reasons, in particular to run the evaluation procedures using the wrapper methods, we randomly sampled 20,000 SNPs from the original data set. The aim of this preliminary analysis is to show that sPLS-DA gives satisfying results on biallelic discrete ordinal data (coded 0, 1 or 2, i.e. the number of mutant alleles at one SNP for one individual) compared to the other approaches.</p>
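The MAF filtering step described above can be sketched as follows (a minimal NumPy illustration on simulated genotypes, assuming the 0/1/2 coding; the 0.05 threshold matches the text but the data are synthetic):

```python
# Illustrative sketch (simulated genotypes): Minor Allele Frequency filter
# on biallelic SNPs coded 0/1/2, as described for the SNP data set.
import numpy as np

rng = np.random.RandomState(42)
n, p = 100, 500                          # individuals x SNPs (toy sizes)
allele_freq = rng.uniform(0.01, 0.5, size=p)
genotypes = rng.binomial(2, allele_freq, size=(n, p))

freq = genotypes.mean(axis=0) / 2.0      # empirical allele frequency
maf = np.minimum(freq, 1.0 - freq)       # minor allele frequency
keep = maf > 0.05
filtered = genotypes[:, keep]
print("SNPs kept:", filtered.shape[1], "of", p)
```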
</sec>
</sec>
<sec>
<title>Choosing the number of sPLS-DA dimensions</title>
<p>In the case of LDA or sparse LDA (sLDA), it is conventional to choose the number of discriminant vectors
<italic>H </italic>
≤ min(
<italic>p</italic>
,
<italic>K </italic>
- 1), where
<italic>p </italic>
is the total number of variables and
<italic>K </italic>
is the number of classes. The
<italic>p</italic>
-dimensional data will be projected onto a
<italic>H</italic>
-dimensional space spanned by the first
<italic>H </italic>
discriminant vectors, also called
<italic>dimensions </italic>
in the case of sPLS.</p>
<p>To check if the same applies to sPLS-DA, we have plotted the mean classification error rate (10-fold cross-validation averaged 10 times) for each sPLS-DA dimension (Figure
<xref ref-type="fig" rid="F1">1</xref>
for the Brain and SNP data sets, see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
for the other data sets). We can observe that the estimated error rate stabilizes after the first
<italic>K </italic>
- 1 dimensions for any number of selected variables for the microarray data sets. For the SNP data set,
<italic>H </italic>
should be set to
<italic>K </italic>
- 2. The latter result is surprising, but can be explained by the high similarity between two of the classes, the Bantu Kenya and Bantu South Africa populations, as illustrated later in the text.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Choosing the number of dimensions in sPLS-DA</bold>
. Estimated classification error rates for Brain and SNP (10-fold cross-validation averaged 10 times) with respect to each sPLS-DA dimension. The different lines represent the number of variables selected on each dimension (going from 5 to
<italic>p</italic>
).</p>
</caption>
<graphic xlink:href="1471-2105-12-253-1"></graphic>
</fig>
<p>Therefore, according to these graphics, reducing the subspace to the first
<italic>K </italic>
- 1 (
<italic>K </italic>
- 2) dimensions is sufficient to explain the covariance structure of the microarray (SNP) data. In the following, we only record the classification error rate obtained after
<italic>K </italic>
- 1 (
<italic>K </italic>
- 2) deflation steps have been performed with sPLS-DA - this also applies to the tested variants of SPLS from [
<xref ref-type="bibr" rid="B30">30</xref>
].</p>
</sec>
<sec>
<title>Comparisons with other multiclass classification approaches</title>
<p>We compared the classification performance obtained with state-of-the-art classification approaches: RFE [
<xref ref-type="bibr" rid="B49">49</xref>
], NSC [
<xref ref-type="bibr" rid="B9">9</xref>
] and RF [
<xref ref-type="bibr" rid="B8">8</xref>
], as well as a recently proposed approach: OFW [
<xref ref-type="bibr" rid="B10">10</xref>
] that has been implemented with two types of classifiers, CART or SVM and has also been extended to the multiclass case [
<xref ref-type="bibr" rid="B5">5</xref>
]. These wrapper approaches include an internal variable selection procedure.</p>
<p>We compared the classification performance of sPLS-DA to sLDA variants proposed by [
<xref ref-type="bibr" rid="B41">41</xref>
] based on a pooled centroids formulation of the LDA predictor function. The authors introduced feature selection by using correlation adjusted t-scores to deal with highly dimensional problems. Two shrinkage approaches were proposed, with the classical LDA (subsequently called sLDA) as well as with the diagonal discriminant analysis (sDDA). The reader can refer to [
<xref ref-type="bibr" rid="B41">41</xref>
] for more details and the associated
<monospace>R </monospace>
package
<monospace>sda</monospace>
.</p>
<p>Finally, we included the results obtained with 3 other versions of sparse PLS proposed by [
<xref ref-type="bibr" rid="B30">30</xref>
]. The SPLSDA formulation is very similar to what we propose in sPLS-DA, except that the variable selection and the classification are performed in two stages, whereas the prediction step in sPLS-DA is directly obtained from the by-products of the sPLS (see Section Methods). The authors in [
<xref ref-type="bibr" rid="B30">30</xref>
] therefore proposed to apply different classifiers once the variable selection is performed: Linear Discriminant Analysis (SPLSDA-LDA) or logistic regression (SPLSDA-LOG). The authors also proposed a one-stage approach, SGPLS, which incorporates SPLS into a generalized linear model framework for better sensitivity in multiclass classification. These approaches are implemented in the
<monospace>R</monospace>
package
<monospace>spls</monospace>
.</p>
<p>Figure
<xref ref-type="fig" rid="F2">2</xref>
displays the classification error rates estimated on each of the five data sets for all the tested approaches and Table
<xref ref-type="table" rid="T1">1</xref>
records the computational time required by the exploratory approaches to train the data on a given number of selected variables. Table
<xref ref-type="table" rid="T2">2</xref>
indicates the minimum estimated classification error rate obtained on each data set and for most of the approaches. Note that this table should be interpreted in conjunction with the results displayed in Figure
<xref ref-type="fig" rid="F2">2</xref>
to obtain a comprehensive understanding of how all approaches perform relative to each other.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Comparisons of the classification performance with other variable selection approaches</bold>
. Estimated classification error rates for Leukemia, SRBCT, Brain, GCM and the SNP data set (10-fold cross-validation averaged 10 times) with respect to the number of selected genes (from 5 to
<italic>p</italic>
) for the wrapper approaches and the sparse exploratory approaches.</p>
</caption>
<graphic xlink:href="1471-2105-12-253-2"></graphic>
</fig>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Computational time</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Data set</th>
<th align="center">sDDA</th>
<th align="center">sLDA</th>
<th align="center">sPLS-DA</th>
<th align="center">SPLSDA-LDA</th>
<th align="center">SPLSDA-LOG</th>
<th align="center">SGPLS</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Leukemia</td>
<td align="center">10</td>
<td align="center">32</td>
<td align="center">6</td>
<td align="center">31</td>
<td align="center">29</td>
<td align="center">8</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">SRBCT</td>
<td align="center">1</td>
<td align="center">3</td>
<td align="center">2</td>
<td align="center">3</td>
<td align="center">3</td>
<td align="center">6</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Brain</td>
<td align="center">1</td>
<td align="center">39</td>
<td align="center">6</td>
<td align="center">22</td>
<td align="center">23</td>
<td align="center">29</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">GCM</td>
<td align="center">1</td>
<td align="center">34</td>
<td align="center">11</td>
<td align="center">52</td>
<td align="center">53</td>
<td align="center">252</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">SNP</td>
<td align="center">2</td>
<td align="center">NA</td>
<td align="center">17</td>
<td align="center">749</td>
<td align="center">731</td>
<td align="center">NA</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Computational time in seconds on an Intel(R) Core(TM) 2 Duo CPU 2.40 GHz machine with 4 GB of RAM to run the approaches on the training data for a chosen number of selected variables (50 for the microarray data and 200 for the SNP data).</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Minimum classification error rate (percentage) estimated for each data set for the best-performing approaches, and the associated number of genes/SNPs that were selected.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Data set</th>
<th align="left">rank 1</th>
<th align="left">rank 2</th>
<th align="left">rank 3</th>
<th align="left">rank 4</th>
<th align="left">rank 5</th>
<th align="left">rank 6</th>
<th align="left">rank 7</th>
<th align="left">rank 8</th>
<th align="left">rank 9</th>
</tr>
<tr>
<th align="center">Leukemia</th>
<th align="left">RFE</th>
<th align="left">SPLSDA-
<break></break>
LDA</th>
<th align="left">LDA</th>
<th align="left">SPLSDA-
<break></break>
LOG</th>
<th align="left">RF</th>
<th align="left">DDA</th>
<th align="left">sPLS</th>
<th align="left">NSC</th>
<th align="left">SGPLS</th>
</tr>
<tr>
<th align="center">error rate</th>
<th align="left">20.55</th>
<th align="left">22.36</th>
<th align="left">22.78</th>
<th align="left">23.33</th>
<th align="left">24.17</th>
<th align="left">24.31</th>
<th align="left">24.30</th>
<th align="left">26.25</th>
<th align="left">26.67</th>
</tr>
<tr>
<th align="center"># genes</th>
<th align="left">5</th>
<th align="left">200</th>
<th align="left">7129</th>
<th align="left">500</th>
<th align="left">200</th>
<th align="left">50</th>
<th align="left">10</th>
<th align="left">500</th>
<th align="left">500</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">SRBCT</td>
<td align="left">RF</td>
<td align="left">OFW-
<break></break>
cart</td>
<td align="left">DDA</td>
<td align="left">LDA</td>
<td align="left">sPLS</td>
<td align="left">NSC</td>
<td align="left">SGPLS</td>
<td align="left">RFE</td>
<td align="left">SPLSDA-
<break></break>
LDA</td>
</tr>
<tr>
<td align="center">error rate</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">0.16</td>
<td align="left">0.63</td>
<td align="left">1.27</td>
<td align="left">1.58</td>
<td align="left">1.90</td>
</tr>
<tr>
<td align="center"># genes</td>
<td align="left">30</td>
<td align="left">50</td>
<td align="left">30</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">500</td>
<td align="left">50</td>
<td align="left">5</td>
<td align="left">200</td>
</tr>
<tr>
<td colspan="10">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Brain</td>
<td align="left">RFE</td>
<td align="left">DDA</td>
<td align="left">LDA</td>
<td align="left">sPLS</td>
<td align="left">RF</td>
<td align="left">SPLSDA-
<break></break>
LDA</td>
<td align="left">NSC</td>
<td align="left">OFW-
<break></break>
cart</td>
<td align="left">SPLSDA-
<break></break>
LOG</td>
</tr>
<tr>
<td align="center">error rate</td>
<td align="left">10.56</td>
<td align="left">10.78</td>
<td align="left">11.11</td>
<td align="left">11.22</td>
<td align="left">11.89</td>
<td align="left">14.45</td>
<td align="left">15.11</td>
<td align="left">15.56</td>
<td align="left">17.00</td>
</tr>
<tr>
<td align="center"># genes</td>
<td align="left">10</td>
<td align="left">25</td>
<td align="left">30</td>
<td align="left">6144</td>
<td align="left">500</td>
<td align="left">35</td>
<td align="left">20</td>
<td align="left">35</td>
<td align="left">50</td>
</tr>
<tr>
<td colspan="10">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">GCM</td>
<td align="left">RFE</td>
<td align="left">LDA</td>
<td align="left">RF</td>
<td align="left">SPLSDA-
<break></break>
LDA</td>
<td align="left">sPLS</td>
<td align="left">OFW-
<break></break>
svm</td>
<td align="left">SPLSDA-
<break></break>
LOG</td>
<td align="left">OFW-
<break></break>
cart</td>
<td align="left">NSC</td>
</tr>
<tr>
<td align="center">error rate</td>
<td align="left">0.81</td>
<td align="left">1.14</td>
<td align="left">1.22</td>
<td align="left">1.63</td>
<td align="left">3.41</td>
<td align="left">4.01</td>
<td align="left">4.71</td>
<td align="left">4.88</td>
<td align="left">7.23</td>
</tr>
<tr>
<td align="center"># genes</td>
<td align="left">5</td>
<td align="left">500</td>
<td align="left">500</td>
<td align="left">200</td>
<td align="left">200</td>
<td align="left">500</td>
<td align="left">500</td>
<td align="left">7129</td>
<td align="left">10</td>
</tr>
<tr>
<td colspan="10">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">SNP</td>
<td align="left">NSC</td>
<td align="left">DDA</td>
<td align="left">SPLS</td>
<td align="left">RFE</td>
<td align="left">SPLSDA-
<break></break>
LDA</td>
<td align="left">RF</td>
<td align="left">SPLSDA-
<break></break>
LOG</td>
<td align="left">OFW-
<break></break>
cart</td>
<td align="left">OFW-
<break></break>
svm</td>
</tr>
<tr>
<td align="center">error rate</td>
<td align="left">6.50</td>
<td align="left">11.54</td>
<td align="left">11.71</td>
<td align="left">12.36</td>
<td align="left">13.01</td>
<td align="left">17.40</td>
<td align="left">31.22</td>
<td align="left">49.96</td>
<td align="left">51.67</td>
</tr>
<tr>
<td align="center"># SNPs</td>
<td align="left">5000</td>
<td align="left">1000</td>
<td align="left">2000</td>
<td align="left">20000</td>
<td align="left">2000</td>
<td align="left">20000</td>
<td align="left">200</td>
<td align="left">20000</td>
<td align="left">20000</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The approaches are ranked by their performances.</p>
</table-wrap-foot>
</table-wrap>
<sec>
<title>Details about the analysis</title>
<p>The aim of this section is to compare the classification performance of different types of variable selection approaches that may require some parameters to tune. We performed 10-fold cross-validation and averaged the obtained classification error rate across 10 repetitions, for different variable selection sizes (Figure
<xref ref-type="fig" rid="F2">2</xref>
).</p>
<p>The wrapper approaches were run with the default parameters or the parameters proposed by the authors [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B50">50</xref>
]. The sDDA and sLDA approaches are actually two-stage approaches, as variables need to be ranked before sLDA/sDDA can be applied, but they do not require any input parameter other than the number of variables to select. sPLS-DA, SPLSDA-LOG/LDA and SGPLS require as input the number of PLS dimensions, as discussed above. In addition, while sPLS-DA requires the number of variables to select on each dimension as an input parameter, SPLSDA-LOG/LDA and SGPLS require tuning the
<italic>η </italic>
parameter, which varies between 0 and 1 (the closer to 1, the smaller the variable selection size); we tuned it so that the variable selection sizes matched those of the other approaches. SPLSDA-LOG/LDA are performed in two steps: one step for variable selection with SPLS and one step for classification.</p>
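The evaluation protocol (10-fold cross-validation repeated 10 times, averaging the classification error rate) can be sketched as follows, using a nearest-centroid classifier on a stand-in data set in place of the compared approaches:

```python
# Sketch of the evaluation protocol: stratified 10-fold cross-validation
# repeated 10 times, averaging the classification error rate.
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import NearestCentroid

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
acc = cross_val_score(NearestCentroid(), X, y, cv=cv)  # 100 accuracy values
error_rate = 1.0 - acc.mean()
print("mean error rate over 10 x 10 folds: %.3f" % error_rate)
```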
</sec>
<sec>
<title>Complexity of the data sets</title>
<p>All data sets differ in their complexity. For example, the 4-class SRBCT data set is known to be easy to classify [
<xref ref-type="bibr" rid="B5">5</xref>
] and most approaches - except NSC - have similarly good performances. Analogously, the GCM data set, which contains numerous classes (11), gives similar overall classification error rates for all approaches. The Brain and Leukemia data sets, with 5 and 3 classes respectively, are more complex and therefore lead to more accentuated discrepancies between the different approaches. The SNP data set is more complex due to the discrete ordinal nature of the data (3 possible values for each variable) and the high number of populations (7) with similar characteristics - some of them, for instance Bantu Kenya and Bantu South Africa, are closely related. Consequently, a large number of SNPs can be expected to be needed to discriminate the different populations at best. This is what we observe; nonetheless, most approaches (except OFW) perform well, in particular NSC.</p>
</sec>
<sec>
<title>Computational efficiency</title>
<p>We only recorded the computational time of the exploratory approaches sDDA, sLDA, SPLSDA-LOG, SPLSDA-LDA, SGPLS and sPLS-DA, as the wrapper approaches are computationally very demanding (training could take from 15 min up to 1 h on these data). Some computation times could not be recorded as an
<monospace>R</monospace>
memory allocation problem was encountered (SNP data for sLDA and SGPLS).</p>
<p>The fastest approach is sDDA (except on Leukemia). This approach is not necessarily the one that performs best, but it is certainly the most efficient on large data sets. sPLS-DA is the second fastest. The SPLSDA approaches were efficient on SRBCT but otherwise ranked third, while the SGPLS computation time was similar to that of sPLS-DA except for large multiclass data sets such as GCM.</p>
</sec>
<sec>
<title>Wrapper approaches</title>
<p>Amongst the wrapper approaches, RFE gave the best results for a very small selection of variables in most cases. The performance of RFE then dramatically decreased when the number of selected variables became large. This is due to the backward elimination strategy adopted in the approach: the original variables are progressively discarded until only the 'dominant', mostly uncorrelated variables remain. RF seemed to give the second best performance for a larger number of variables. OFW-cart also performed well, as it aggregates CART classifiers, whereas OFW-svm performed rather poorly. This latter result might be due to the use of the one-vs.-one multiclass SVM. NSC seemed affected by a too large number of variables, but performed surprisingly well on the SNP data.</p>
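For illustration, the backward elimination strategy of RFE can be sketched with scikit-learn (synthetic data and a linear SVM as the base classifier; an assumption for illustration, not the exact setup used in the comparison):

```python
# Illustrative sketch: recursive feature elimination (RFE) around a linear
# SVM, progressively discarding variables (synthetic data).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(60, 40)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 2 informative variables

# drop 10% of the remaining variables at each elimination step
rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=5, step=0.1)
rfe.fit(X, y)
selected = np.where(rfe.support_)[0]
print("variables kept:", selected)
```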
</sec>
<sec>
<title>sDDA/sLDA</title>
<p>Both variants gave similar results, but we could observe some differences in the GCM data set. In fact, [
<xref ref-type="bibr" rid="B41">41</xref>
] advised applying sDDA for extremely high-dimensional data, but when a difference was observed between the two approaches (GCM, Leukemia), sLDA seemed to perform best. sDDA, however, remained the most computationally efficient.</p>
</sec>
<sec>
<title>SPLSDA-LOG/SPLSDA-LDA</title>
<p>SPLSDA-LDA gave better results than SPLSDA-LOG except for SRBCT where both variants performed similarly. On Leukemia, Brain and SNP, SPLSDA-LDA had a similar performance to sPLS-DA but only when the selection size became larger.</p>
</sec>
<sec>
<title>SGPLS</title>
<p>SGPLS performed similarly to sPLS-DA on SRBCT, and on Leukemia when the selection size was large. However, it performed poorly on Brain, where the classes are numerous and very unbalanced. SGPLS could not be run on the GCM data because, while tuning the
<italic>η </italic>
parameter, the smallest variable selection size we could obtain was 100, which did not make SGPLS comparable to the other approaches. On the SNP data, SGPLS encountered
<monospace>R</monospace>
memory allocation issues.</p>
</sec>
<sec>
<title>sPLS-DA</title>
<p>sPLS-DA gave similar results to sDDA and sLDA on the less complex data sets SRBCT and GCM. The performance obtained on Brain was quite poor, but results were very competitive on Leukemia for a number of selected genes varying between 5 and 30. Note that the number of selected variables is the total number of variables selected across the
<italic>K </italic>
- 1(
<italic>K </italic>
- 2) chosen dimensions (SNP data). Overall, sPLS-DA gave better results than the wrapper approaches, and remained very competitive with the other exploratory approaches. One winning advantage of sPLS-DA is the graphical outputs it can provide (see next Section), as well as its computational efficiency.</p>
</sec>
</sec>
<sec>
<title>Stability analysis of sPLS-DA</title>
<p>It is useful to assess how stable the variable selection is when the training set is perturbed, as recently proposed by [
<xref ref-type="bibr" rid="B39">39</xref>
,
<xref ref-type="bibr" rid="B40">40</xref>
]. For instance, the idea of bolasso [
<xref ref-type="bibr" rid="B40">40</xref>
] is to randomize the training set by drawing bootstrap samples or by drawing
<italic>n</italic>
/2 samples in the training set, where
<italic>n </italic>
is the total number of samples. The variable selection algorithm is then applied to each subsample with a fixed number of variables to select, and the selected variables are recorded. [
<xref ref-type="bibr" rid="B40">40</xref>
] proposed to keep in the selection only the variables that were selected in all subsamples, whereas [
<xref ref-type="bibr" rid="B39">39</xref>
] proposed to compute a relative selection frequency and keep the most stable variables in the selection.</p>
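The subsampling scheme can be sketched as follows (a bolasso-style illustration that uses the Lasso as the base selector instead of sPLS-DA, on synthetic data):

```python
# Sketch of a bolasso-style stability analysis (our illustration: the Lasso
# is used as the base selector instead of sPLS-DA, on synthetic data).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p, B = 80, 100, 50                    # samples, variables, bootstraps
X = rng.randn(n, p)
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * rng.randn(n)

counts = np.zeros(p)
for _ in range(B):
    idx = rng.randint(0, n, n)           # bootstrap sample of size n
    coef = Lasso(alpha=0.2).fit(X[idx], y[idx]).coef_
    counts += (coef != 0)
freq = counts / B                        # selection frequency per variable
print("frequency of the 2 relevant variables:", freq[0], freq[1])
```

The relevant variables approach a frequency of 1 while noise variables stay near 0; keeping the most frequently selected variables mirrors the frequency-based rule of [39].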
<p>We chose to illustrate the latter option as we believe that the stability frequency, or probability, gives a better understanding of the number of stable discriminative variables that are selected in sPLS-DA. The variables that are highly correlated with the outcome will have a high probability of being selected in each subsample, while the noisy variables will have a probability close to zero. This stability measure can also guide the user in choosing the number of variables to select on each sPLS-DA dimension. Once the number of variables to select has been chosen for the first dimension, the stability analysis should be run for the second dimension, and so on. Note that [
<xref ref-type="bibr" rid="B39">39</xref>
] proposed an additional perturbation by introducing random weights in the Lasso coefficients, called
<italic>random lasso</italic>
. This latter approach could not, however, be directly applied in the sPLS-DA algorithm due to its iterative nature.</p>
<p>Figure
<xref ref-type="fig" rid="F3">3</xref>
illustrates the stability frequencies for the first two dimensions of the sPLS-DA for the GCM and SNP data sets using bootstrap sampling (i.e. of size
<italic>n</italic>
). The frequencies obtained on the GCM data set clearly show that the first 3 variables are often selected across numerous bootstrap samples on the first dimension. We can see that while most microarray data could achieve a reasonably high stability frequency (see Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
), this was not the case, however, for the SNP data. Several SNPs may contain similar information, which may induce a lower stability across the bootstrap samples for a small variable selection. Once the variable selection size grows larger, there is enough stable information to be retained.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Stability analysis</bold>
. Stability frequency using bolasso for the first two dimensions of sPLS-DA for GCM (top) and SNP data (bottom). One has to sequentially choose the most stable genes/SNPs on the first dimension in order to pursue the stability analysis for the next sPLS-DA dimension.</p>
</caption>
<graphic xlink:href="1471-2105-12-253-3"></graphic>
</fig>
<p>We also noticed that once we reached too many dimensions (i.e. close to
<italic>K </italic>
- 1), then the frequencies of all variables dropped, which clearly showed that sPLS-DA could not distinguish between discriminative variables and noisy variables any more (not shown).</p>
</sec>
<sec>
<title>Data visualization with sPLS-DA</title>
<sec>
<title>Representing the samples and the variables</title>
<p>Data interpretation is crucial for a better understanding of highly dimensional biological data sets. Data visualization is one of the clear advantages of projection-based methods, such as Principal Component Analysis (PCA), the original PLS-DA or sPLS-DA, compared to the other tested approaches (wrapper methods, SPLSDA and SGPLS). The decomposition of the data matrix into loading vectors and latent variables provides valuable graphical outputs to easily visualize the results. For example, the latent variables can be used to represent the similarities and dissimilarities between the samples: Figure
<xref ref-type="fig" rid="F4">4</xref>
illustrates the difference in the sample representation between classical PLS-DA (no variable selection) and sPLS-DA (26 genes selected on the first 2 dimensions) for the Brain data set. Variable selection for highly dimensional data sets can be beneficial to remove noise and improve the clustering of the samples. A 3D graphical representation can be found in Additional file
<xref ref-type="supplementary-material" rid="S3">3</xref>
with sPLS-DA. Figures
<xref ref-type="fig" rid="F5">5</xref>
,
<xref ref-type="fig" rid="F6">6</xref>
and
<xref ref-type="fig" rid="F7">7</xref>
compare the sample representation on the SNP data set using PCA (SNP data set only), classical PLS-DA and sPLS-DA on several principal components or PLS dimensions. On the full data set, PCA is able to discriminate the African hunter-gatherer populations San, Mbuti and Biaka from the 4 other populations, which are very similar (Mandeka, Yoruba, Bantu South Africa and Bantu Kenya). This fact was previously observed [
<xref ref-type="bibr" rid="B48">48</xref>
] and indicates a good quality of the data. With PCA, however, the differentiation between the 4 populations Mandeka, Yoruba, Bantu South Africa and Bantu Kenya is not very clear, even for further dimensions (Figure
<xref ref-type="fig" rid="F5">5</xref>
). In contrast to PCA, PLS-DA (Figure
<xref ref-type="fig" rid="F6">6</xref>
) and sPLS-DA (Figure
<xref ref-type="fig" rid="F7">7</xref>
) are able to discriminate these 4 populations further on dimensions 4 and 5. In particular, the Mandeka population is well differentiated on dimension 4, and so is the Yoruba population on dimension 5. In terms of sample representation, and contrary to what was obtained with the Brain data set (Figure
<xref ref-type="fig" rid="F4">4</xref>
), the difference between PLS-DA and sPLS-DA is not striking on this particular data set. This is probably because the SNP variables, although containing redundant information, are all informative and mostly not noisy. This also explains the good population clusters obtained with PCA (Figure
<xref ref-type="fig" rid="F5">5</xref>
). However, the variable selection performed in sPLS-DA has two advantages: firstly, it reduces the size of the data set for further investigation and analyses; secondly, since each (s)PLS dimension focuses on the differentiation of some particular populations (Figures
<xref ref-type="fig" rid="F5">5</xref>
and
<xref ref-type="fig" rid="F6">6</xref>
) and since sPLS selects an associated subset of variables on each of these dimensions, each of these subsets is well able to differentiate these particular populations. This variable selection therefore gives more insight into the data (see [
<xref ref-type="bibr" rid="B25">25</xref>
] for more details). Figure
<xref ref-type="fig" rid="F8">8</xref>
illustrates the absolute values of the weights in the sparse loading vectors for each sPLS-DA dimension in the Brain data set. Only the genes with a non-zero weight are considered in the sPLS-DA analysis and were included in the gene selection (50 genes in total for this example). Generally, the sparse loading vectors are orthogonal to each other, which allows genes to be selected uniquely across all dimensions. The latent variables can also be used to compute pairwise correlations between the genes, to visualize them on correlation circles and better understand the correlations between the genes on each dimension (Figure
<xref ref-type="fig" rid="F9">9(a)</xref>
). Note that this type of output is commonly used for Canonical Correlation Analysis.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Brain data: sample representation and comparison with classical PLS-DA</bold>
. Comparisons of the sample representation using the first 2 latent variables from PLS-DA (no variable selection) and sPLS-DA (26 genes selected).</p>
</caption>
<graphic xlink:href="1471-2105-12-253-4"></graphic>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>SNP data: sample representation with PCA</bold>
. Sample representations using the first 5 principal components from PCA.</p>
</caption>
<graphic xlink:href="1471-2105-12-253-5"></graphic>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>SNP data: sample representation with classical PLS-DA</bold>
. Sample representation using the first 5 latent variables from PLS-DA (no SNPs selected).</p>
</caption>
<graphic xlink:href="1471-2105-12-253-6"></graphic>
</fig>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>SNP data: sample representation with sPLS-DA</bold>
. Sample representation using the first 5 latent variables from sPLS-DA (1000 SNPs selected on each dimension).</p>
</caption>
<graphic xlink:href="1471-2105-12-253-7"></graphic>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>Brain data: representation of the loading vectors</bold>
. Absolute value of the weights in the loading vectors for each sPLS-DA dimension. Only the genes with non-zero weights are considered in the sPLS-DA analysis and are included in the gene selection.</p>
</caption>
<graphic xlink:href="1471-2105-12-253-8"></graphic>
</fig>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption>
<p>
<bold>Brain data: variable representation</bold>
.
<bold>(a) </bold>
projection of the sPLS-DA selected variables on correlation circles with the
<monospace>R mixOmics</monospace>
package;
<bold>(b) </bold>
biological network generated with GeneGo using the same list of genes. Genes that are present in the network
<bold>(b) </bold>
are labelled in green, red and magenta in
<bold>(a)</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-12-253-9"></graphic>
</fig>
<p>In contrast, the pooled centroid formulation used in sDDA and sLDA does not provide such latent variables and therefore lacks these useful outputs. The same can be said of the wrapper approaches, which often have a much higher computational cost than the sparse exploratory approaches applied in this study.</p>
</sec>
</sec>
<sec>
<title>Brain data set: biological interpretation</title>
<sec>
<title>Comparisons between the gene lists</title>
<p>The ultimate aim when performing variable selection is to investigate whether the selected genes (or SNPs) have a biological meaning. We saw, for example, that some of the tested approaches gave similar performances even though they selected different variables.</p>
<p>We therefore compared the lists of 50 genes selected with the different approaches on the Brain data set. Note that the selection size has to be large enough to extract known biological information from manually curated databases.</p>
<p>Unsurprisingly, given the variety of approaches used, there were not many genes in common: between 12 and 30 genes were shared between sPLS-DA, sDDA, sLDA and SPLSDA, with sDDA and sLDA sharing the largest number of genes (30). The gene selection from SGPLS differed greatly from the other multivariate approaches (between 2 and 9 genes in common). This may explain why the performance of SGPLS was rather poor compared to the other approaches on the Brain data set. RF was the approach that selected the highest number of genes in common with all approaches except NSC (between 10 and 23 genes). As expected, there were very few commonly selected genes between the exploratory approaches and the wrapper approaches (between 2 and 10 genes).</p>
<p>We then investigated further the biological meaning of the selected genes. This analysis was performed with the GeneGo software [
<xref ref-type="bibr" rid="B4">4</xref>
] that outputs process networks, gene ontology processes as well as the list of diseases potentially linked with the selected genes.</p>
<p>It was interesting to see that in all these gene lists (except NSC and RFE), between 3 and 5 genes were linked to networks involved in neurogenesis and apoptosis, as well as DNA damage (sPLS-DA, sDDA) and neurophysiological processes (OFW-cart). Most of the lists selected with the wrapper approaches generated interesting gene ontology processes, such as degeneration of neurons (RF), and synaptic transmission or generation of neurons (OFW-svm). In contrast, the sparse exploratory approaches seemed to pinpoint potential biomarkers linked with relevant diseases: central nervous system and brain tumor (sPLS-DA); Sturge-Weber syndrome, angiomatosis, brain stem (sDDA, sLDA); neurocutaneous syndrome (sDDA); and neurologic manifestations and cognition disorders (SGPLS).</p>
<p>This preliminary analysis shows that the different approaches are able to select relevant genes linked to the biological study and are able to select complementary information. This was also the conclusion drawn in [
<xref ref-type="bibr" rid="B10">10</xref>
].</p>
</sec>
<sec>
<title>Further biological interpretation with the sPLS-DA list</title>
<p>Using the GeneGo software, known biological networks were generated from the list of genes selected with sPLS-DA - 26 genes in total for the first two dimensions. For example, the network represented in Figure
<xref ref-type="fig" rid="F9">9(b)</xref>
is based on 12 of these selected genes (indicated with a red dot), which are involved in biological functions such as cell differentiation, cellular developmental process and central nervous system development. These genes are organised around two transcription factors, ESR1 and SP1. SP1 can activate or repress transcription in response to physiological and pathological stimuli and regulates the expression of a large number of genes involved in a variety of processes such as cell growth, apoptosis, differentiation and immune responses.</p>
<p>Interestingly, all 12 genes present in the network were also found to be highly correlated with the sPLS-DA dimensions 1 and 2 (indicated in green for the ESR1 network, magenta for the SP1 network and red for common genes in both subgraphs). This latter result suggests (a) that the first (second) dimension of sPLS-DA seems to focus on the SP1 (ESR1) network, and (b) that the genes selected with sPLS-DA are of biological relevance (see Table
<xref ref-type="table" rid="T3">3</xref>
for a description of most genes). Further investigation would be required to give more insight into the sPLS-DA gene selection.</p>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Brain data: Biological relevance of some of the selected genes</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left">Bard1</td>
<td align="left">Plays a central role in the control of the cell cycle in response to DNA damage</td>
</tr>
<tr>
<td align="left">PGDH</td>
<td align="left">Possibly involved in development and maintenance of the blood-brain, blood-retina, blood-aqueous humor and blood-testis barrier. It is likely to play important roles in both maturation and maintenance of the central nervous system and male reproductive system</td>
</tr>
<tr>
<td align="left">Na(v) Beta1</td>
<td align="left">Involved in the generation and propagation of action potentials in muscle and neuronal cells</td>
</tr>
<tr>
<td align="left">NDF1</td>
<td align="left">Differentiation factor required for dendrite morphogenesis and maintenance in the cerebellar cortex</td>
</tr>
<tr>
<td align="left">Neuronatin</td>
<td align="left">May participate in the maintenance of segment identity in the hindbrain and pituitary development, and maturation or maintenance of the overall structure of the nervous system</td>
</tr>
<tr>
<td align="left">PEA15</td>
<td align="left">Death effector domain (DED)-containing protein predominantly expressed in the central nervous system, particularly in astrocytes</td>
</tr>
<tr>
<td align="left">CD97</td>
<td align="left">Receptor potentially involved in both adhesion and signalling processes early after leukocyte activation. Plays an essential role in leukocyte migration</td>
</tr>
<tr>
<td align="left">ALDOC</td>
<td align="left">is expressed specifically in the hippocampus and Purkinje cells of the brain</td>
</tr>
<tr>
<td align="left">Cyclin D1</td>
<td align="left">The protein encoded by this gene has been shown to interact with the tumor suppressor protein Rb. Mutations, amplification and overexpression of this gene, which alter cell cycle progression, are observed frequently in a variety of tumors and may contribute to tumorigenesis</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Description of the genes or proteins encoded by the genes selected by sPLS-DA and present in the known GeneGO biological network.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
<sec>
<title>Conclusions</title>
<p>In this article, we showed that sPLS could be naturally extended to sPLS-DA for discrimination purposes by coding the response matrix
<italic>Y </italic>
with dummy variables. sPLS-DA often gave classification performance similar to competing sparse LDA approaches in multiclass problems. Undoubtedly, the sparse approaches that we tested are highly competitive with the wrapper methods, which are often considered black boxes with no intuitive tuning parameters (such as the kernels to use in the SVM). The preliminary biological analysis showed that several of the tested approaches brought relevant biological information. PLS-based approaches, such as the sPLS-DA approach that we propose, have a well-established framework for class prediction. The computational efficiency of sPLS-DA, as well as the valuable graphical outputs that ease the interpretation of the results, make sPLS-DA a great alternative to other types of variable selection techniques in a supervised classification framework. We also showed that a stability analysis could guide the parameter tuning of sPLS-DA. On the Brain data set, we showed that sPLS-DA selected relevant genes that shed more light on the biological study. For these reasons, we believe that sPLS-DA provides an interesting and worthwhile alternative for feature selection in multiclass problems.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<p>In this section, we introduce sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to perform feature selection. sPLS-DA is based on Partial Least Squares regression (PLS) adapted for discriminant analysis, with a Lasso penalization added to select variables. We denote
<italic>X </italic>
the
<italic>n </italic>
×
<italic>p </italic>
sample data matrix, where
<italic>n </italic>
is the number of patients or samples, and
<italic>p </italic>
is the number of variables (genes, SNPs, ...). In this supervised classification framework, we will assume that the samples
<italic>n </italic>
are partitioned into
<italic>K </italic>
groups.</p>
<sec>
<title>Introduction to PLS Discriminant Analysis</title>
<p>Although Partial Least Squares [
<xref ref-type="bibr" rid="B13">13</xref>
] was not originally designed for classification and discrimination problems, it has often been used for that purpose [
<xref ref-type="bibr" rid="B38">38</xref>
,
<xref ref-type="bibr" rid="B51">51</xref>
]. The response matrix
<italic>Y </italic>
is qualitative and is recoded as a dummy block matrix that records the membership of each observation, i.e. each of the response categories are coded via an indicator variable. The PLS regression (now PLS-DA) is then run as if
<italic>Y </italic>
was a continuous matrix. Note that although this may be questionable from a theoretical point of view, it has been previously shown to work well in practice, and many authors have used dummy matrices in PLS for classification [
<xref ref-type="bibr" rid="B30">30</xref>
,
<xref ref-type="bibr" rid="B37">37</xref>
,
<xref ref-type="bibr" rid="B51">51</xref>
,
<xref ref-type="bibr" rid="B52">52</xref>
].</p>
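The dummy recoding of the response can be sketched as follows. This is a minimal Python illustration; the helper name `dummy_matrix` is hypothetical and stands in for the recoding step performed before running PLS-DA.

```python
import numpy as np

def dummy_matrix(y, K):
    """Recode class labels y in {0, ..., K-1} as an n x K indicator matrix:
    row i has a single 1 in the column of the class of sample i."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

y = np.array([0, 2, 1, 1, 0])          # 5 samples, K = 3 classes
Y = dummy_matrix(y, 3)
# Each row sums to 1; PLS-DA then treats Y as a continuous response matrix.
```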
<p>PLS constructs a set of orthogonal components that maximize the sample covariance between the response and the linear combination of the predictor variables. The objective function to be solved can be written as
<disp-formula>
<graphic xlink:href="1471-2105-12-253-i1.gif"></graphic>
</disp-formula>
</p>
<p>where
<italic>u
<sub>h </sub>
</italic>
and
<italic>v
<sub>h </sub>
</italic>
are the
<italic>h</italic>
th left and right singular vector of the singular value decomposition (SVD) of
<italic>X
<sup>T </sup>
Y </italic>
respectively [
<xref ref-type="bibr" rid="B53">53</xref>
] for each iteration or dimension
<italic>h </italic>
of the PLS. These singular vectors are also called loading vectors and are associated to the
<italic>X </italic>
and
<italic>Y </italic>
data set respectively.</p>
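The extraction of the first pair of loading vectors can be sketched numerically as follows (Python with NumPy, on synthetic data). Column-centering of both blocks is an assumption here, and the deflation across further dimensions is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 30, 8, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, K, size=n)
Y = np.eye(K)[y]                       # dummy-coded response matrix

# Center both blocks, then take the first singular vectors of X^T Y:
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
u1, v1 = U[:, 0], Vt[0, :]             # first X- and Y-loading vectors

xi1 = Xc @ u1                          # first latent variable (score vector)
```

By construction, the sample covariance between the latent variables `xi1` and `Yc @ v1` is maximal, and their inner product equals the leading singular value `s[0]`.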
<p>In the case of discrimination problems, the PLS model can be formulated as follows:
<disp-formula>
<graphic xlink:href="1471-2105-12-253-i2.gif"></graphic>
</disp-formula>
</p>
<p>where
<italic>β </italic>
is the matrix of the regression coefficients and
<italic>E </italic>
is the residual matrix. To give more details,
<italic>β </italic>
=
<italic>W</italic>
*
<italic>V </italic>
<italic>
<sup>T</sup>
</italic>
, where <italic>V </italic>is the matrix containing the loading vectors (or right singular vectors from the SVD decomposition) (
<italic>v</italic>
<sub>1</sub>
, ...,
<italic>v
<sub>H </sub>
</italic>
) in columns,
<italic>W* </italic>
=
<italic>W </italic>
(
<italic>U
<sup>T </sup>
W </italic>
)
<sup>-1</sup>
, where
<italic>W </italic>
is the matrix containing the regression coefficients of the regression of
<italic>X </italic>
on the latent variable
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-253-i3.gif"></inline-graphic>
</inline-formula>
, and
<italic>U </italic>
is the matrix containing the loading vectors (or left singular vectors from the SVD decomposition) (
<italic>u</italic>
<sub>1</sub>
, ...,
<italic>u
<sub>H </sub>
</italic>
) in columns. More details about the PLS algorithm and the PLS model can be found in the reviews of [
<xref ref-type="bibr" rid="B53">53</xref>
,
<xref ref-type="bibr" rid="B54">54</xref>
]. The prediction of a new set of samples is then
<disp-formula>
<graphic xlink:href="1471-2105-12-253-i4.gif"></graphic>
</disp-formula>
</p>
<p>The identity of the class membership of each new sample (each row in
<italic>Y
<sub>new </sub>
</italic>
) is assigned as the column index of the element with the largest predicted value in this row.</p>
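This prediction rule can be sketched as follows (Python; the coefficient matrix here is hand-made purely for illustration, not produced by an actual PLS fit):

```python
import numpy as np

def predict_classes(X_new, beta):
    """Predict Y_new = X_new @ beta and assign each sample to the class
    (column index) with the largest predicted value in its row."""
    Y_new = X_new @ beta
    return np.argmax(Y_new, axis=1)

# Tiny illustration with p = 2 variables and K = 3 classes:
beta = np.array([[1.0, 0.0, -1.0],
                 [0.0, 1.0,  0.0]])
X_new = np.array([[ 2.0, 0.1],
                  [ 0.1, 3.0],
                  [-2.0, 0.1]])
pred = predict_classes(X_new, beta)    # → array([0, 1, 2])
```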
<sec>
<title>Discriminant PLS for large data sets</title>
<p>Numerous variants of PLS-DA have been proposed in the literature to be adapted to classification problems for large data sets such as microarray. Iterative Reweighted PLS was first proposed by [
<xref ref-type="bibr" rid="B31">31</xref>
] to extend PLS into the framework of generalized linear models. In the same context, [
<xref ref-type="bibr" rid="B51">51</xref>
,
<xref ref-type="bibr" rid="B55">55</xref>
,
<xref ref-type="bibr" rid="B56">56</xref>
] proposed a two-stage approach, first by extracting the PLS-DA latent variables to reduce the dimension of the data, and then by applying logistic discrimination or polychotomous discrimination in the case of multiclass problems. To avoid infinite parameters estimates and non convergence problems, other authors [
<xref ref-type="bibr" rid="B32">32</xref>
] extended the work of [
<xref ref-type="bibr" rid="B31">31</xref>
] by applying Firth's procedure to avoid (quasi) separation, whereas [
<xref ref-type="bibr" rid="B33">33</xref>
] combined PLS with logistic regression penalized with a ridge parameter. The response variable
<italic>Y </italic>
is modelled either as a dummy matrix [
<xref ref-type="bibr" rid="B51">51</xref>
,
<xref ref-type="bibr" rid="B55">55</xref>
,
<xref ref-type="bibr" rid="B56">56</xref>
], or as a pseudo-response variable whose expected value has a linear relationship with the covariates [
<xref ref-type="bibr" rid="B33">33</xref>
]. The approach proposed by [
<xref ref-type="bibr" rid="B32">32</xref>
] uses the adjusted dependent variable as the response rather than working with the original outcome. While these authors address the problem of dimension reduction, they still require gene filtering to be performed beforehand, with, for example,
<italic>t</italic>
-statistics or other filtering criteria such as the BSS/WSS originally proposed by [
<xref ref-type="bibr" rid="B2">2</xref>
].</p>
</sec>
</sec>
<sec>
<title>sparse PLS Discriminant Analysis</title>
<sec>
<title>sparse PLS for two data sets</title>
<p>The sparse PLS proposed by [
<xref ref-type="bibr" rid="B25">25</xref>
,
<xref ref-type="bibr" rid="B26">26</xref>
] was initially designed to identify subsets of correlated variables of two different types coming from two different data sets
<italic>X </italic>
and
<italic>Y </italic>
of sizes (
<italic>n </italic>
×
<italic>p</italic>
) and (
<italic>n </italic>
×
<italic>q</italic>
) respectively. The original approach was based on Singular Value Decomposition (SVD) of the cross product
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-253-i5.gif"></inline-graphic>
</inline-formula>
. We denote
<italic>u
<sub>h </sub>
</italic>
(
<italic>v
<sub>h</sub>
</italic>
) the left (right) singular vector from the SVD, for iteration
<italic>h</italic>
,
<italic>h </italic>
= 1 ...
<italic>H </italic>
where
<italic>H </italic>
is the number of performed deflations - also called chosen
<italic>dimensions </italic>
of the PLS. These singular vectors are named
<italic>loading vectors </italic>
in the PLS context. Sparse loading vectors were then obtained by applying
<italic>l</italic>
<sub>1 </sub>
penalization on both
<italic>u
<sub>h </sub>
</italic>
and
<italic>v
<sub>h</sub>
</italic>
. The optimization problem of the sPLS minimizes the Frobenius norm between the current cross product matrix and the loading vectors:
<disp-formula id="bmcM1">
<label>(1)</label>
<graphic xlink:href="1471-2105-12-253-i6.gif"></graphic>
</disp-formula>
</p>
<p>where
<italic>P</italic>
<sub>λ1 </sub>
(
<italic>u
<sub>h</sub>
</italic>
) =
<italic>sign</italic>
(
<italic>u
<sub>h</sub>
</italic>
)(|
<italic>u
<sub>h</sub>
</italic>
| - λ
<sub>1</sub>
)
<sub>+</sub>
, and
<italic>P</italic>
<sub>λ2 </sub>
(
<italic>v
<sub>h</sub>
</italic>
) =
<italic>sign</italic>
(
<italic>v
<sub>h</sub>
</italic>
)(|
<italic>v
<sub>h</sub>
</italic>
| - λ
<sub>2</sub>
)
<sub>+ </sub>
are applied componentwise in the vectors
<italic>u
<sub>h </sub>
</italic>
and
<italic>v
<sub>h </sub>
</italic>
and are the soft thresholding functions that approximate Lasso penalty functions [
<xref ref-type="bibr" rid="B21">21</xref>
]. They are simultaneously applied on both loading vectors. The problem (1) is solved with an iterative algorithm and the
<italic>X
<sub>h </sub>
</italic>
and
<italic>Y
<sub>h </sub>
</italic>
matrices are subsequently deflated for each iteration
<italic>h </italic>
(see [
<xref ref-type="bibr" rid="B25">25</xref>
] for more details). For practical purposes, sPLS has been implemented in the
<monospace>R</monospace>
package
<monospace>mixOmics</monospace>
such that the user can input the number of variables to select on each data set rather than the penalization parameters
<italic>λ</italic>
<sub>1 </sub>
and
<italic>λ</italic>
<sub>2</sub>
.</p>
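The soft-thresholding operator, and the mapping from a desired selection size to a penalty value, can be sketched as follows. This is a Python illustration of the idea only; `threshold_for_k` is a hypothetical helper and not the mixOmics implementation.

```python
import numpy as np

def soft_threshold(u, lam):
    """Componentwise soft thresholding: sign(u) * (|u| - lam)_+ ."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def threshold_for_k(u, k):
    """A penalty lam that keeps only the k largest |u| entries non-zero
    (assuming no ties), so the user can think in 'number of variables'
    rather than in penalization parameters."""
    return np.sort(np.abs(u))[::-1][k] if k < len(u) else 0.0

u = np.array([0.9, -0.5, 0.1, -0.05])
shrunk = soft_threshold(u, 0.2)        # → [0.7, -0.3, 0.0, 0.0]
lam = threshold_for_k(u, 2)            # keep only the 2 largest weights
```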
</sec>
<sec>
<title>sPLS extended to sPLS-DA</title>
<p>The extension of sparse PLS to a supervised classification framework is straightforward. The response matrix
<italic>Y </italic>
of size (
<italic>n </italic>
×
<italic>K</italic>
) is coded with dummy variables to indicate the class membership of each sample. Note that in this specific framework, we will
<italic>only perform variable selection on the X data set</italic>
, i.e., we want to select the discriminative features that can help predict the classes of the samples. The
<italic>Y </italic>
dummy matrix remains unchanged. Therefore, we set
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-253-i5.gif"></inline-graphic>
</inline-formula>
and the optimization problem of the sPLS-DA can be written as:
<disp-formula>
<graphic xlink:href="1471-2105-12-253-i7.gif"></graphic>
</disp-formula>
</p>
<p>with the same notation as in sPLS. Therefore, the penalization parameter to tune is λ. Our algorithm has been implemented to choose the number of variables to select rather than λ for practical reasons. For the class prediction of test samples, we use the
<italic>maximum </italic>
distance, as presented above for the PLS case, as it seemed to work best in practice for multiclass problems. Note that other distances, such as the centroid or Mahalanobis distances, are also implemented in the
<monospace>mixOmics</monospace>
package [
<xref ref-type="bibr" rid="B42">42</xref>
,
<xref ref-type="bibr" rid="B43">43</xref>
]. In the results section, we illustrated how to tune the PLS dimension
<italic>H </italic>
as well as the number of
<italic>X </italic>
variables to select.</p>
</sec>
<sec>
<title>sPLS-DA for multiclass classification</title>
<p>In binary problems, sPLS-DA was shown to bring relevant results in microarray cancer data sets (see [
<xref ref-type="bibr" rid="B57">57</xref>
]). In this paper, we investigated the use of sPLS-DA in the more complex multiclass case, as PLS-DA and sPLS-DA are naturally adapted to multiclass problems. We did not attempt to address the specific problem of unbalanced classes, which would require the development of appropriately weighted multiclass objective functions for wrapper classification approaches (see for example [
<xref ref-type="bibr" rid="B58">58</xref>
]).</p>
</sec>
<sec>
<title>Parameters to tune in sPLS-DA</title>
<p>There are two parameters to tune in sPLS-DA: the number of dimensions
<italic>H</italic>
, and the number of variables to select on each dimension. In the Results Section, we showed that for most cases, the user could set
<italic>H </italic>
=
<italic>K </italic>
- 1, similar to what is advised in an LDA case. The number of variables to select is more challenging to determine given the complexity of such data sets, and it remains an open question. The tuning of this parameter can be guided through the estimation of the generalisation classification error and a stability analysis. However, these analyses might be seriously limited by the small number of samples. Most importantly, the user should keep in mind that close interaction with the biologists is necessary to carefully tune this parameter in order to answer the biological questions: a gene selection that is optimal but too short may not suffice for a comprehensive biological interpretation, while experimental validation might be limited if the gene selection is too large.</p>
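A cross-validation tuning loop of this kind can be sketched as follows (Python, on synthetic data). The nearest-centroid rule on a correlation-ranked selection is only a stand-in for the actual sPLS-DA fit, and all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 30
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :5] += y[:, None] * 1.5            # 5 informative variables

def cv_error(X, y, k, folds=5):
    """Cross-validated error for a nearest-centroid rule on the k variables
    most correlated with the class (stand-in for the sPLS-DA selection)."""
    idx = rng.permutation(len(y))
    errs = []
    for f in np.array_split(idx, folds):
        train = np.setdiff1d(idx, f)
        # Rank variables on the training fold only, to avoid selection bias.
        r = np.abs(np.corrcoef(X[train].T, y[train])[-1, :-1])
        keep = np.argsort(r)[-k:]
        c0 = X[train][y[train] == 0][:, keep].mean(0)
        c1 = X[train][y[train] == 1][:, keep].mean(0)
        d0 = ((X[f][:, keep] - c0) ** 2).sum(1)
        d1 = ((X[f][:, keep] - c1) ** 2).sum(1)
        errs.append(np.mean((d1 < d0) != y[f]))
    return np.mean(errs)

errors = {k: cv_error(X, y, k) for k in (2, 5, 10, 30)}
```

Plotting such error estimates against the candidate selection sizes (and combining them with the stability frequencies discussed earlier) gives the user a principled, if imperfect, basis for choosing the number of variables per dimension.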
</sec>
</sec>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>KALC performed the statistical analysis, wrote the R functions and drafted the manuscript. SB preprocessed the SNP data and helped to draft the manuscript. PB participated in the design of the manuscript and helped to draft the manuscript. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>Tuning the number of dimensions in sPLS-DA</bold>
. Estimated classification error rates for Leukemia, SRBCT and GCM (10-fold cross-validation averaged 10 times) with respect to each sPLS-DA dimension. The different lines represent the number of variables selected on each dimension (going from 5 to
<italic>p</italic>
).</p>
</caption>
<media xlink:href="1471-2105-12-253-S1.EPS" mimetype="application" mime-subtype="postscript">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional file 2</title>
<p>
<bold>Stability analysis</bold>
. Stability frequency using bolasso for the first two dimensions of sPLS-DA for Brain (top) and SRBCT data (bottom). One has to sequentially choose the most stable genes/SNPs in the first dimension in order to go on to the next sPLS-DA dimension.</p>
</caption>
<media xlink:href="1471-2105-12-253-S2.EPS" mimetype="application" mime-subtype="postscript">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S3">
<caption>
<title>Additional file 3</title>
<p>
<bold>Brain data: sample representation in 3D</bold>
. Example of 3D samples plot using the first 3 latent variables from sPLS-DA with the
<monospace>R mixOmics</monospace>
package.</p>
</caption>
<media xlink:href="1471-2105-12-253-S3.PNG" mimetype="image" mime-subtype="png">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We would like to thank Dr. Dominique Gorse (QFAB) for his advice on using GeneGo. We are indebted to Pierre-Alain Chaumeil (QFAB) for his support on using the QFAB cluster. We thank the referees for their useful comments that helped improve the manuscript. This work was supported, in part, by the Wound Management Innovation CRC (established and supported under the Australian Government's Cooperative Research Centres Program).</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Golub</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Slonim</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Tamayo</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Huard</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gaasenbeek</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mesirov</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Coller</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Loh</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Downing</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Caligiuri</surname>
<given-names>M</given-names>
</name>
<etal></etal>
<article-title>Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring</article-title>
<source>Science</source>
<year>1999</year>
<volume>286</volume>
<issue>5439</issue>
<fpage>531</fpage>
<pub-id pub-id-type="doi">10.1126/science.286.5439.531</pub-id>
<pub-id pub-id-type="pmid">10521349</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Dudoit</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Fridlyand</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Speed</surname>
<given-names>T</given-names>
</name>
<article-title>Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data</article-title>
<source>Journal of the American Statistical Association</source>
<year>2002</year>
<volume>97</volume>
<issue>457</issue>
<fpage>77</fpage>
<lpage>88</lpage>
<pub-id pub-id-type="doi">10.1198/016214502753479248</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Guyon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Elisseefi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kaelbling</surname>
<given-names>L</given-names>
</name>
<article-title>An Introduction to Variable and Feature Selection</article-title>
<source>Journal of Machine Learning Research</source>
<year>2003</year>
<volume>3</volume>
<issue>7-8</issue>
<fpage>1157</fpage>
<lpage>1182</lpage>
<pub-id pub-id-type="doi">10.1162/153244303322753616</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ball</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Blake</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Botstein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Butler</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Cherry</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Dolinski</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Dwight</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Eppig</surname>
<given-names>J</given-names>
</name>
<etal></etal>
<article-title>Gene Ontology: tool for the unification of biology</article-title>
<source>Nature genetics</source>
<year>2000</year>
<volume>25</volume>
<fpage>25</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="doi">10.1038/75556</pub-id>
<pub-id pub-id-type="pmid">10802651</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Bonnet</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gadat</surname>
<given-names>S</given-names>
</name>
<article-title>Multiclass classification and gene selection with a stochastic algorithm</article-title>
<source>Computational Statistics and Data Analysis</source>
<year>2009</year>
<volume>53</volume>
<fpage>3601</fpage>
<lpage>3615</lpage>
<pub-id pub-id-type="doi">10.1016/j.csda.2009.02.028</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="book">
<name>
<surname>Breiman</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Olshen</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Stone</surname>
<given-names>C</given-names>
</name>
<source>Classification and Regression Trees</source>
<year>1984</year>
<publisher-name>Monterey, CA: Wadsworth and Brooks</publisher-name>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="book">
<name>
<surname>Vapnik</surname>
<given-names>VN</given-names>
</name>
<source>The Nature of Statistical Learning Theory (Information Science and Statistics)</source>
<year>1999</year>
<publisher-name>Springer</publisher-name>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Breiman</surname>
<given-names>L</given-names>
</name>
<article-title>Random forests</article-title>
<source>Machine learning</source>
<year>2001</year>
<volume>45</volume>
<fpage>5</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>G</given-names>
</name>
<article-title>Diagnosis of multiple cancer types by shrunken centroids of gene expression</article-title>
<source>Proceedings of the National Academy of Sciences</source>
<year>2002</year>
<volume>99</volume>
<issue>10</issue>
<fpage>6567</fpage>
<pub-id pub-id-type="doi">10.1073/pnas.082099299</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Goncalves</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Besse</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gadat</surname>
<given-names>S</given-names>
</name>
<article-title>Selection of biologically relevant genes with a wrapper stochastic algorithm</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2007</year>
<volume>6</volume>
<fpage>29</fpage>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Bair</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Paul</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<article-title>Prediction by Supervised Principal Components</article-title>
<source>Journal of the American Statistical Association</source>
<year>2006</year>
<volume>101</volume>
<issue>473</issue>
<fpage>119</fpage>
<lpage>137</lpage>
<pub-id pub-id-type="doi">10.1198/016214505000000628</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Jombart</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Devillard</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Balloux</surname>
<given-names>F</given-names>
</name>
<article-title>Discriminant analysis of principal components: a new method for the analysis of genetically structured populations</article-title>
<source>BMC Genetics</source>
<year>2010</year>
<volume>11</volume>
<issue>94</issue>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="book">
<name>
<surname>Wold</surname>
<given-names>H</given-names>
</name>
<person-group person-group-type="editor">Krishnaiah PR</person-group>
<source>Multivariate Analysis</source>
<year>1966</year>
<publisher-name>Academic Press, New York</publisher-name>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Antoniadis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lambert-Lacroix</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Leblanc</surname>
<given-names>F</given-names>
</name>
<article-title>Effective dimension reduction methods for tumor classification using gene expression data</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>5</issue>
<fpage>563</fpage>
<lpage>570</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg062</pub-id>
<pub-id pub-id-type="pmid">12651713</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Boulesteix</surname>
<given-names>A</given-names>
</name>
<article-title>PLS Dimension Reduction for Classification with Microarray Data</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2004</year>
<volume>3</volume>
<fpage>1075</fpage>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Dai</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lieu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Rocke</surname>
<given-names>D</given-names>
</name>
<article-title>Dimension reduction for classification with gene expression microarray data</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2006</year>
<volume>5</volume>
<fpage>1147</fpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="book">
<name>
<surname>Hoerl</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kennard</surname>
<given-names>R</given-names>
</name>
<source>Ridge regression in 'Encyclopedia of Statistical Sciences'</source>
<year>1984</year>
<volume>8</volume>
<publisher-name>Wiley, New York</publisher-name>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<article-title>Regression shrinkage and selection via the lasso</article-title>
<source>Journal of the Royal Statistical Society, Series B</source>
<year>1996</year>
<volume>58</volume>
<fpage>267</fpage>
<lpage>288</lpage>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Zou</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<article-title>Regularization and variable selection via the elastic net</article-title>
<source>Journal of the Royal Statistical Society Series B</source>
<year>2005</year>
<volume>67</volume>
<issue>2</issue>
<fpage>301</fpage>
<lpage>320</lpage>
<pub-id pub-id-type="doi">10.1111/j.1467-9868.2005.00503.x</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Jolliffe</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Trendafilov</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Uddin</surname>
<given-names>M</given-names>
</name>
<article-title>A Modified Principal Component Technique Based on the LASSO</article-title>
<source>Journal of Computational &amp; Graphical Statistics</source>
<year>2003</year>
<volume>12</volume>
<issue>3</issue>
<fpage>531</fpage>
<lpage>547</lpage>
<pub-id pub-id-type="doi">10.1198/1061860032148</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Shen</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>JZ</given-names>
</name>
<article-title>Sparse Principal Component Analysis via Regularized Low Rank Matrix Approximation</article-title>
<source>Journal of Multivariate Analysis</source>
<year>2008</year>
<volume>99</volume>
<fpage>1015</fpage>
<lpage>1034</lpage>
<pub-id pub-id-type="doi">10.1016/j.jmva.2007.06.007</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Waaijenborg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>de Witt Hamer</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Philip</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Zwinderman</surname>
<given-names>A</given-names>
</name>
<article-title>Quantifying the Association between Gene Expressions and DNA-Markers by Penalized Canonical Correlation Analysis</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2008</year>
<volume>7</volume>
<issue>3</issue>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Parkhomenko</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tritchler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Beyene</surname>
<given-names>J</given-names>
</name>
<article-title>Sparse canonical correlation analysis with application to genomic data integration</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2009</year>
<volume>8</volume>
<fpage>1</fpage>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Witten</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<article-title>A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis</article-title>
<source>Biostatistics</source>
<year>2009</year>
<volume>10</volume>
<issue>3</issue>
<fpage>515</fpage>
<pub-id pub-id-type="doi">10.1093/biostatistics/kxp008</pub-id>
<pub-id pub-id-type="pmid">19377034</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Rossouw</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Robert-Granié</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Besse</surname>
<given-names>P</given-names>
</name>
<article-title>Sparse PLS: Variable Selection when Integrating Omics data</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2008</year>
<volume>7</volume>
<issue>1</issue>
<fpage>37</fpage>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Robert-Granié</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Besse</surname>
<given-names>P</given-names>
</name>
<article-title>ofw: Sparse canonical methods for biological data integration: application to a cross-platform study</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>34</issue>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Chun</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Keleş</surname>
<given-names>S</given-names>
</name>
<article-title>Sparse partial least squares regression for simultaneous dimension reduction and variable selection</article-title>
<source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
<year>2010</year>
<volume>72</volume>
<fpage>3</fpage>
<lpage>25</lpage>
<pub-id pub-id-type="doi">10.1111/j.1467-9868.2009.00723.x</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>W</given-names>
</name>
<article-title>Linear regression and two-class classification with gene expression data</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>16</issue>
<fpage>2072</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg283</pub-id>
<pub-id pub-id-type="pmid">14594712</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="other">
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>J</given-names>
</name>
<article-title>Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<fpage>4991</fpage>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Chung</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Keles</surname>
<given-names>S</given-names>
</name>
<article-title>Sparse Partial Least Squares Classification for High Dimensional Data</article-title>
<source>Statistical Applications in Genetics and Molecular Biology</source>
<year>2010</year>
<volume>9</volume>
<fpage>17</fpage>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="other">
<name>
<surname>Marx</surname>
<given-names>B</given-names>
</name>
<article-title>Iteratively reweighted partial least squares estimation for generalized linear regression</article-title>
<source>Technometrics</source>
<year>1996</year>
<fpage>374</fpage>
<lpage>381</lpage>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Ding</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gentleman</surname>
<given-names>R</given-names>
</name>
<article-title>Classification using generalized partial least squares</article-title>
<source>Journal of Computational and Graphical Statistics</source>
<year>2005</year>
<volume>14</volume>
<issue>2</issue>
<fpage>280</fpage>
<lpage>298</lpage>
<pub-id pub-id-type="doi">10.1198/106186005X47697</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>Fort</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Lambert-Lacroix</surname>
<given-names>S</given-names>
</name>
<article-title>Classification using partial least squares with penalized logistic regression</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<issue>7</issue>
<fpage>1104</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti114</pub-id>
<pub-id pub-id-type="pmid">15531609</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Zhou</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Tuck</surname>
<given-names>D</given-names>
</name>
<article-title>MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>9</issue>
<fpage>1106</fpage>
<lpage>1114</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm036</pub-id>
<pub-id pub-id-type="pmid">17494773</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Yang</surname>
<given-names>T</given-names>
</name>
<article-title>Efficient multi-class cancer diagnosis algorithm, using a global similarity pattern</article-title>
<source>Computational Statistics &amp; Data Analysis</source>
<year>2009</year>
<volume>53</volume>
<issue>3</issue>
<fpage>756</fpage>
<lpage>765</lpage>
<pub-id pub-id-type="doi">10.1016/j.csda.2008.08.028</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C</given-names>
</name>
<article-title>A genetic programming-based approach to the classification of multiclass microarray datasets</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>3</issue>
<fpage>331</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn644</pub-id>
<pub-id pub-id-type="pmid">19088122</pub-id>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="journal">
<name>
<surname>Barker</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rayens</surname>
<given-names>W</given-names>
</name>
<article-title>Partial least squares for discrimination</article-title>
<source>Journal of Chemometrics</source>
<year>2003</year>
<volume>17</volume>
<issue>3</issue>
<fpage>166</fpage>
<lpage>173</lpage>
<pub-id pub-id-type="doi">10.1002/cem.785</pub-id>
</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="journal">
<name>
<surname>Tan</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Tong</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>C</given-names>
</name>
<article-title>Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models</article-title>
<source>Computational Biology and Chemistry</source>
<year>2004</year>
<volume>28</volume>
<issue>3</issue>
<fpage>235</fpage>
<lpage>243</lpage>
<pub-id pub-id-type="doi">10.1016/j.compbiolchem.2004.05.002</pub-id>
<pub-id pub-id-type="pmid">15261154</pub-id>
</mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="book">
<name>
<surname>Meinshausen</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bühlmann</surname>
<given-names>P</given-names>
</name>
<source>Stability selection</source>
<year>2008</year>
<publisher-name>Tech. rep., ETH Zurich</publisher-name>
</mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="book">
<name>
<surname>Bach</surname>
<given-names>F</given-names>
</name>
<source>Model-consistent sparse estimation through the bootstrap</source>
<year>2009</year>
<publisher-name>Tech. rep., Laboratoire d'Informatique de l'Ecole Normale Superieure, Paris</publisher-name>
</mixed-citation>
</ref>
<ref id="B41">
<mixed-citation publication-type="journal">
<name>
<surname>Ahdesmäki</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Strimmer</surname>
<given-names>K</given-names>
</name>
<article-title>Feature selection in omics prediction problems using cat scores and false non-discovery rate control</article-title>
<source>Ann Appl Stat</source>
<year>2010</year>
<volume>4</volume>
<fpage>503</fpage>
<lpage>519</lpage>
<pub-id pub-id-type="doi">10.1214/09-AOAS277</pub-id>
</mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="journal">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>González</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Déjean</surname>
<given-names>S</given-names>
</name>
<article-title>integrOmics: an
<monospace>R</monospace>
package to unravel relationships between two omics data sets</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>21</issue>
<fpage>2855</fpage>
<lpage>2856</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp515</pub-id>
<pub-id pub-id-type="pmid">19706745</pub-id>
</mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="other">
<article-title>mixOmics</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.math.univ-toulouse.fr/~biostat/mixOmics">http://www.math.univ-toulouse.fr/~biostat/mixOmics</ext-link>
</mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="journal">
<name>
<surname>Khan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Ringnér</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Saal</surname>
<given-names>LH</given-names>
</name>
<name>
<surname>Ladanyi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Westermann</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Berthold</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Schwab</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonescu</surname>
<given-names>CR</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Meltzer</surname>
<given-names>PS</given-names>
</name>
<article-title>Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks</article-title>
<source>Nat Med</source>
<year>2001</year>
<volume>7</volume>
<issue>6</issue>
<fpage>673</fpage>
<lpage>679</lpage>
<pub-id pub-id-type="doi">10.1038/89044</pub-id>
<pub-id pub-id-type="pmid">11385503</pub-id>
</mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="journal">
<name>
<surname>Pomeroy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Tamayo</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gaasenbeek</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sturla</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Angelo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>McLaughlin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Goumnerova</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Black</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Lau</surname>
<given-names>C</given-names>
</name>
<etal></etal>
<article-title>Prediction of central nervous system embryonal tumour outcome based on gene expression</article-title>
<source>Nature</source>
<year>2002</year>
<volume>415</volume>
<issue>6870</issue>
<fpage>436</fpage>
<lpage>442</lpage>
<pub-id pub-id-type="doi">10.1038/415436a</pub-id>
<pub-id pub-id-type="pmid">11807556</pub-id>
</mixed-citation>
</ref>
<ref id="B46">
<mixed-citation publication-type="journal">
<name>
<surname>Ramaswamy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Tamayo</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rifkin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Mukherjee</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yeang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Angelo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ladd</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Reich</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Latulippe</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Mesirov</surname>
<given-names>J</given-names>
</name>
<etal></etal>
<article-title>Multiclass cancer diagnosis using tumor gene expression signatures</article-title>
<source>Proceedings of the National Academy of Sciences</source>
<year>2001</year>
<volume>98</volume>
<issue>26</issue>
<fpage>15149</fpage>
<lpage>15154</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.211566398</pub-id>
</mixed-citation>
</ref>
<ref id="B47">
<mixed-citation publication-type="journal">
<name>
<surname>Yeung</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Bumgarner</surname>
<given-names>R</given-names>
</name>
<article-title>Multi-class classification of microarray data with repeated measurements: application to cancer</article-title>
<source>Genome Biology</source>
<year>2003</year>
<volume>4</volume>
<issue>83</issue>
</mixed-citation>
</ref>
<ref id="B48">
<mixed-citation publication-type="journal">
<name>
<surname>Jakobsson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Scheet</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gibbs</surname>
<given-names>J</given-names>
</name>
<name>
<surname>VanLiere</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fung</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Szpiech</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Degnan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Guerreiro</surname>
<given-names>R</given-names>
</name>
<etal></etal>
<article-title>Genotype, haplotype and copy-number variation in worldwide human populations</article-title>
<source>Nature</source>
<year>2008</year>
<volume>451</volume>
<issue>7181</issue>
<fpage>998</fpage>
<lpage>1003</lpage>
<pub-id pub-id-type="doi">10.1038/nature06742</pub-id>
<pub-id pub-id-type="pmid">18288195</pub-id>
</mixed-citation>
</ref>
<ref id="B49">
<mixed-citation publication-type="journal">
<name>
<surname>Guyon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Barnhill</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Vapnik</surname>
<given-names>V</given-names>
</name>
<article-title>Gene selection for cancer classification using support vector machines</article-title>
<source>Machine learning</source>
<year>2002</year>
<volume>46</volume>
<fpage>389</fpage>
<lpage>422</lpage>
<pub-id pub-id-type="doi">10.1023/A:1012487302797</pub-id>
</mixed-citation>
</ref>
<ref id="B50">
<mixed-citation publication-type="journal">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Chabrier</surname>
<given-names>P</given-names>
</name>
<article-title>ofw: An R Package to Select Continuous Variables for Multiclass Classification with a Stochastic Wrapper Method</article-title>
<source>Journal of Statistical Software</source>
<year>2008</year>
<volume>28</volume>
<issue>9</issue>
<fpage>1</fpage>
<lpage>16</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.jstatsoft.org/v28/i09/">http://www.jstatsoft.org/v28/i09/</ext-link>
</mixed-citation>
</ref>
<ref id="B51">
<mixed-citation publication-type="journal">
<name>
<surname>Nguyen</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Rocke</surname>
<given-names>D</given-names>
</name>
<article-title>Tumor classification by partial least squares using microarray gene expression data</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>39</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.1.39</pub-id>
<pub-id pub-id-type="pmid">11836210</pub-id>
</mixed-citation>
</ref>
<ref id="B52">
<mixed-citation publication-type="journal">
<name>
<surname>Boulesteix</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Strimmer</surname>
<given-names>K</given-names>
</name>
<article-title>Partial least squares: a versatile tool for the analysis of high-dimensional genomic data</article-title>
<source>Briefings in Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>32</fpage>
<pub-id pub-id-type="pmid">16772269</pub-id>
</mixed-citation>
</ref>
<ref id="B53">
<mixed-citation publication-type="journal">
<name>
<surname>Höskuldsson</surname>
<given-names>A</given-names>
</name>
<article-title>PLS regression methods</article-title>
<source>Journal of Chemometrics</source>
<year>1988</year>
<volume>2</volume>
<issue>3</issue>
<fpage>211</fpage>
<lpage>228</lpage>
<pub-id pub-id-type="doi">10.1002/cem.1180020306</pub-id>
</mixed-citation>
</ref>
<ref id="B54">
<mixed-citation publication-type="journal">
<name>
<surname>Wold</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sjöström</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Eriksson</surname>
<given-names>L</given-names>
</name>
<article-title>PLS-regression: a basic tool of chemometrics</article-title>
<source>Chemometrics and intelligent laboratory systems</source>
<year>2001</year>
<volume>58</volume>
<issue>2</issue>
<fpage>109</fpage>
<lpage>130</lpage>
<pub-id pub-id-type="doi">10.1016/S0169-7439(01)00155-1</pub-id>
</mixed-citation>
</ref>
<ref id="B55">
<mixed-citation publication-type="journal">
<name>
<surname>Wang</surname>
<given-names>CY</given-names>
</name>
<name>
<surname>Chiang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>ST</given-names>
</name>
<name>
<surname>Chiang</surname>
<given-names>H</given-names>
</name>
<article-title>A probability-based multivariate statistical algorithm for autofluorescence spectroscopic identification of oral carcinogenesis</article-title>
<source>Photochemistry and photobiology</source>
<year>1999</year>
<volume>69</volume>
<issue>4</issue>
<fpage>471</fpage>
<lpage>477</lpage>
<pub-id pub-id-type="doi">10.1111/j.1751-1097.1999.tb03314.x</pub-id>
<pub-id pub-id-type="pmid">10212579</pub-id>
</mixed-citation>
</ref>
<ref id="B56">
<mixed-citation publication-type="journal">
<name>
<surname>Nguyen</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Rocke</surname>
<given-names>D</given-names>
</name>
<article-title>Multi-class cancer classification via partial least squares with gene expression profiles</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>9</issue>
<fpage>1216</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.9.1216</pub-id>
<pub-id pub-id-type="pmid">12217913</pub-id>
</mixed-citation>
</ref>
<ref id="B57">
<mixed-citation publication-type="other">
<name>
<surname>Lê Cao</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Meugnier</surname>
<given-names>E</given-names>
</name>
<name>
<surname>McLachlan</surname>
<given-names>G</given-names>
</name>
<article-title>Integrative mixture of experts to combine clinical factors and gene markers</article-title>
<source>Bioinformatics</source>
<year>2010</year>
</mixed-citation>
</ref>
<ref id="B58">
<mixed-citation publication-type="journal">
<name>
<surname>Qiao</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<article-title>Adaptive weighted learning for unbalanced multicategory classification</article-title>
<source>Biometrics</source>
<year>2009</year>
<volume>65</volume>
<fpage>159</fpage>
<lpage>168</lpage>
<pub-id pub-id-type="doi">10.1111/j.1541-0420.2008.01017.x</pub-id>
<pub-id pub-id-type="pmid">18363773</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Asie/explor/AustralieFrV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002272  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 002272  | SxmlIndent | more
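The `HfdSelect` commands above return the raw XML record. As a minimal sketch of further post-processing, the reference list can be parsed with Python's standard `xml.etree.ElementTree`; the snippet below uses a trimmed stand-in for the real record (the full record carries namespaces and many more fields), not the actual corpus file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for one record, following the JATS-style tags
# (<ref>, <mixed-citation>, <surname>, ...) visible in this record.
sample = """<record>
  <ref-list>
    <ref id="B1">
      <mixed-citation publication-type="journal">
        <name><surname>Golub</surname><given-names>T</given-names></name>
        <article-title>Molecular Classification of Cancer</article-title>
        <year>1999</year>
      </mixed-citation>
    </ref>
  </ref-list>
</record>"""

root = ET.fromstring(sample)
refs = []
for ref in root.iter("ref"):
    # .find(".//tag") returns the first matching descendant, i.e. the
    # first author's surname and the citation's title and year.
    first_author = ref.find(".//surname")
    title = ref.find(".//article-title")
    year = ref.find(".//year")
    refs.append((ref.get("id"),
                 first_author.text if first_author is not None else "",
                 title.text if title is not None else "",
                 year.text if year is not None else ""))

for rid, author, title, year in refs:
    print(f"[{rid}] {author} ({year}): {title}")
# prints: [B1] Golub (1999): Molecular Classification of Cancer
```

In practice the XML would be read from the file extracted by `HfdSelect` rather than an inline string, but the element lookups are the same.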

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Asie
   |area=    AustralieFrV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Tue Dec 5 10:43:12 2017. Site generation: Tue Mar 5 14:07:20 2024