Serveur d'exploration sur les relations entre la France et l'Australie

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed

Internal identifier: 000251 (Pmc/Corpus); previous: 000250; next: 000252

Authors: Laurent Jacob

Source :

RBID : PMC:4679071

Abstract

When dealing with large-scale gene expression studies, observations are commonly contaminated by sources of unwanted variation such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples or to build a corrected version of the dataset (as opposed to the study of an observed factor of interest), taking unwanted variation into account can become a difficult task. The factors driving unwanted variation may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data. The proposed methods are then evaluated on synthetic data and three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections. All proposed methods are implemented in the Bioconductor package RUVnormalize.


Url:
DOI: 10.1093/biostatistics/kxv026
PubMed: 26286812
PubMed Central: 4679071

Links to Exploration step

PMC:4679071

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed</title>
<author>
<name sortKey="Jacob, Laurent" sort="Jacob, Laurent" uniqKey="Jacob L" first="Laurent" last="Jacob">Laurent Jacob</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26286812</idno>
<idno type="pmc">4679071</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4679071</idno>
<idno type="RBID">PMC:4679071</idno>
<idno type="doi">10.1093/biostatistics/kxv026</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000251</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000251</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed</title>
<author>
<name sortKey="Jacob, Laurent" sort="Jacob, Laurent" uniqKey="Jacob L" first="Laurent" last="Jacob">Laurent Jacob</name>
</author>
</analytic>
<series>
<title level="j">Biostatistics (Oxford, England)</title>
<idno type="ISSN">1465-4644</idno>
<idno type="eISSN">1468-4357</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>When dealing with large-scale gene expression studies, observations are commonly contaminated by sources of unwanted variation such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples or to build a corrected version of the dataset (as opposed to the study of an observed factor of interest), taking unwanted variation into account can become a difficult task. The factors driving unwanted variation may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data. The proposed methods are then evaluated on synthetic data and three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections. All proposed methods are implemented in the Bioconductor package
<monospace>RUVnormalize</monospace>
.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Alter, O" uniqKey="Alter O">O. Alter</name>
</author>
<author>
<name sortKey="Brown, P O" uniqKey="Brown P">P. O. Brown</name>
</author>
<author>
<name sortKey="Botstein, D" uniqKey="Botstein D">D. Botstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benito, M" uniqKey="Benito M">M. Benito</name>
</author>
<author>
<name sortKey="Parker, J" uniqKey="Parker J">J. Parker</name>
</author>
<author>
<name sortKey="Du, Q" uniqKey="Du Q">Q. Du</name>
</author>
<author>
<name sortKey="Wu, J" uniqKey="Wu J">J. Wu</name>
</author>
<author>
<name sortKey="Xiang, D" uniqKey="Xiang D">D. Xiang</name>
</author>
<author>
<name sortKey="Perou, C M" uniqKey="Perou C">C. M. Perou</name>
</author>
<author>
<name sortKey="Marron, J S" uniqKey="Marron J">J. S. Marron</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bolstad, B M" uniqKey="Bolstad B">B. M. Bolstad</name>
</author>
<author>
<name sortKey="Irizarry, R A" uniqKey="Irizarry R">R. A. Irizarry</name>
</author>
<author>
<name sortKey="Astr, M" uniqKey="Astr M">M. Åstrand</name>
</author>
<author>
<name sortKey="Speed, T P" uniqKey="Speed T">T. P. Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Livera, A M" uniqKey="De Livera A">A. M. De Livera</name>
</author>
<author>
<name sortKey="Sysi Aho, M" uniqKey="Sysi Aho M">M. Sysi-Aho</name>
</author>
<author>
<name sortKey="Jacob, L" uniqKey="Jacob L">L. Jacob</name>
</author>
<author>
<name sortKey="Gagnon Bartsch, J A" uniqKey="Gagnon Bartsch J">J. A. Gagnon-Bartsch</name>
</author>
<author>
<name sortKey="Castillo, S" uniqKey="Castillo S">S. Castillo</name>
</author>
<author>
<name sortKey="Simpson, J A" uniqKey="Simpson J">J. A. Simpson</name>
</author>
<author>
<name sortKey="Speed, T P" uniqKey="Speed T">T. P. Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freedman, D" uniqKey="Freedman D">D. Freedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gagnon Bartsch, J" uniqKey="Gagnon Bartsch J">J. Gagnon-Bartsch</name>
</author>
<author>
<name sortKey="Jacob, L" uniqKey="Jacob L">L. Jacob</name>
</author>
<author>
<name sortKey="Speed, T P" uniqKey="Speed T">T. P. Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gagnon Bartsch, J A" uniqKey="Gagnon Bartsch J">J. A. Gagnon-Bartsch</name>
</author>
<author>
<name sortKey="Speed, T P" uniqKey="Speed T">T. P. Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hotelling, H" uniqKey="Hotelling H">H. Hotelling</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jacob, L" uniqKey="Jacob L">L. Jacob</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jacob, L" uniqKey="Jacob L">L. Jacob</name>
</author>
<author>
<name sortKey="Van Den Akker, J" uniqKey="Van Den Akker J">J. Van Den Akker</name>
</author>
<author>
<name sortKey="Witteveen, A" uniqKey="Witteveen A">A. Witteveen</name>
</author>
<author>
<name sortKey="Goosens, I" uniqKey="Goosens I">I. Goosens</name>
</author>
<author>
<name sortKey="Speed, T P" uniqKey="Speed T">T. P. Speed</name>
</author>
<author>
<name sortKey="Glas, A" uniqKey="Glas A">A. Glas</name>
</author>
<author>
<name sortKey="Veer, L V" uniqKey="Veer L">L. V. Veer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Johnson, W E" uniqKey="Johnson W">W. E. Johnson</name>
</author>
<author>
<name sortKey="Li, C" uniqKey="Li C">C. Li</name>
</author>
<author>
<name sortKey="Rabinovic, A" uniqKey="Rabinovic A">A. Rabinovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kang, H M" uniqKey="Kang H">H. M. Kang</name>
</author>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C. Ye</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E. Eskin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leek, J T" uniqKey="Leek J">J. T. Leek</name>
</author>
<author>
<name sortKey="Storey, J D" uniqKey="Storey J">J. D. Storey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leek, J T" uniqKey="Leek J">J. T. Leek</name>
</author>
<author>
<name sortKey="Storey, J D" uniqKey="Storey J">J. D. Storey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Listgarten, J" uniqKey="Listgarten J">J. Listgarten</name>
</author>
<author>
<name sortKey="Kadie, C" uniqKey="Kadie C">C. Kadie</name>
</author>
<author>
<name sortKey="Schadt, E E" uniqKey="Schadt E">E. E. Schadt</name>
</author>
<author>
<name sortKey="Heckerman, D" uniqKey="Heckerman D">D. Heckerman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mairal, J" uniqKey="Mairal J">J. Mairal</name>
</author>
<author>
<name sortKey="Bach, F" uniqKey="Bach F">F. Bach</name>
</author>
<author>
<name sortKey="Ponce, J" uniqKey="Ponce J">J. Ponce</name>
</author>
<author>
<name sortKey="Sapiro, G" uniqKey="Sapiro G">G. Sapiro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Risso, D" uniqKey="Risso D">D. Risso</name>
</author>
<author>
<name sortKey="Ngai, J" uniqKey="Ngai J">J. Ngai</name>
</author>
<author>
<name sortKey="Speed, T P" uniqKey="Speed T">T. P. Speed</name>
</author>
<author>
<name sortKey="Dudoit, S" uniqKey="Dudoit S">S. Dudoit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vawter, M P And Others" uniqKey="Vawter M">M. P. Vawter and others</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Biostatistics</journal-id>
<journal-id journal-id-type="iso-abbrev">Biostatistics</journal-id>
<journal-id journal-id-type="publisher-id">biosts</journal-id>
<journal-id journal-id-type="hwp">biosts</journal-id>
<journal-title-group>
<journal-title>Biostatistics (Oxford, England)</journal-title>
</journal-title-group>
<issn pub-type="ppub">1465-4644</issn>
<issn pub-type="epub">1468-4357</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26286812</article-id>
<article-id pub-id-type="pmc">4679071</article-id>
<article-id pub-id-type="doi">10.1093/biostatistics/kxv026</article-id>
<article-id pub-id-type="publisher-id">kxv026</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Jacob</surname>
<given-names>Laurent</given-names>
</name>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<aff>
<addr-line>Laboratoire de Biométrie et Biologie Évolutive, Université de Lyon, Université Lyon 1, CNRS, UMR 5558, Lyon, France</addr-line>
</aff>
</contrib-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Gagnon-Bartsch</surname>
<given-names>Johann A.</given-names>
</name>
</contrib>
<aff>
<addr-line>Department of Statistics, University of California, Berkeley, CA 94720, USA</addr-line>
</aff>
</contrib-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Speed</surname>
<given-names>Terence P.</given-names>
</name>
</contrib>
<aff>
<addr-line>Department of Statistics, University of California, Berkeley, CA 94720, USA and Division of Bioinformatics, Walter and Eliza Hall Institute of Medical Research, Melbourne 3052, Australia</addr-line>
</aff>
</contrib-group>
<author-notes>
<corresp id="cor1">
<label>*</label>
To whom correspondence should be addressed.
<email>laurent.jacob@univ-lyon1.fr</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<month>1</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>17</day>
<month>8</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>17</day>
<month>8</month>
<year>2015</year>
</pub-date>
<volume>17</volume>
<issue>1</issue>
<fpage>16</fpage>
<lpage>28</lpage>
<history>
<date date-type="received">
<day>26</day>
<month>11</month>
<year>2014</year>
</date>
<date date-type="rev-recd">
<day>18</day>
<month>6</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>6</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2015. Published by Oxford University Press.</copyright-statement>
<copyright-year>2015</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="kxv026.pdf"></self-uri>
<abstract>
<p>When dealing with large-scale gene expression studies, observations are commonly contaminated by sources of unwanted variation such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples or to build a corrected version of the dataset (as opposed to the study of an observed factor of interest), taking unwanted variation into account can become a difficult task. The factors driving unwanted variation may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data. The proposed methods are then evaluated on synthetic data and three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections. All proposed methods are implemented in the Bioconductor package
<monospace>RUVnormalize</monospace>
.</p>
</abstract>
<kwd-group>
<kwd>Batch effect</kwd>
<kwd>Control genes</kwd>
<kwd>Gene expression</kwd>
<kwd>Normalization</kwd>
<kwd>Replicate samples</kwd>
</kwd-group>
<funding-group>
<award-group id="funding-1">
<award-id>SU2C-AACR-DT0409</award-id>
</award-group>
<award-group id="funding-2">
<funding-source>Australian National Health and Medical Research Council Program</funding-source>
<award-id>APP1054618</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1.</label>
<title>Introduction</title>
<p>Over the last few years, microarray-based gene expression studies involving a large number of samples have been conducted (
<xref rid="KXV026C4" ref-type="bibr">Cancer Genome Atlas Research Network, 2008</xref>
), with the goal of helping understand or predict some particular
<italic>factors of interest</italic>
like the prognosis or the subtypes of a cancer. Such large gene expression studies are often carried out over several years, may involve several hospitals or research centers and typically contain some
<italic>unwanted variation</italic>
. Sources of unwanted variation can be technical, such as batches, platforms, or laboratories, or biological: any signal that is not the factor of interest of the study, such as heterogeneity in age or ethnicity.</p>
<p>Unwanted variation can easily lead to spurious associations. For example, when one is looking for genes that are differentially expressed between two subtypes of cancer, the observed differential expression of some genes could actually be caused by differences between laboratories if laboratories are partially confounded with subtypes. When clustering to identify new subgroups of the disease, one may actually identify some of the unwanted factors if their effects on gene expression are stronger than the subgroup effect. If one is interested in predicting prognosis, one may end up predicting whether the sample was collected at the beginning or at the end of the study, because patients with a better prognosis were accepted at the end of the study. In this case, the resulting classifier would have little value for predicting the prognosis of new patients.</p>
<p>Similar problems arise when trying to combine several smaller studies rather than working on one large heterogeneous study: in a dataset resulting from the merging of several studies, the strongest effect one can observe is generally related to which study each sample belongs to. An important objective is therefore to remove this unwanted variation without losing the variation of interest.</p>
<p>A large number of methods have been proposed to tackle this problem, mostly using linear models. When both the factor of interest and the unwanted factors are observed, the problem essentially boils down to a linear regression. ComBat (
<xref rid="KXV026C12" ref-type="bibr">Johnson
<italic>and others</italic>
, 2007</xref>
) is an empirical Bayes version of linear regression that has been shown to be quite effective. When the factor of interest is observed but the unwanted factors are not, the latter need to be estimated before a regression is possible. This can be done using some form of factor analysis (
<xref rid="KXV026C14" ref-type="bibr">Leek and Storey, 2007</xref>
,
<xref rid="KXV026C15" ref-type="bibr">2008</xref>
;
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed, 2012</xref>
), or using the entire covariance structure of the gene expression matrix (
<xref rid="KXV026C13" ref-type="bibr">Kang
<italic>and others</italic>
, 2008</xref>
;
<xref rid="KXV026C16" ref-type="bibr">Listgarten
<italic>and others</italic>
, 2010</xref>
). Finally, if the factor of interest itself is not defined, some methods (
<xref rid="KXV026C1" ref-type="bibr">Alter
<italic>and others</italic>
, 2000</xref>
) use singular value decomposition (SVD) on gene expression to identify and remove the unwanted variation, while others (
<xref rid="KXV026C2" ref-type="bibr">Benito
<italic>and others</italic>
, 2004</xref>
) remove observed batch effects by linear regression. A more detailed overview of the literature on this subject is provided in Section 1 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
.</p>
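<p>The SVD-based family of corrections mentioned above can be illustrated with a minimal sketch. It is not the exact procedure of any cited method: the function name, the choice to center each gene, the number of removed components and the toy batch indicator are all assumptions made for illustration, and NumPy is assumed to be available.</p>
<preformat>
```python
import numpy as np

def remove_top_components(Y, k=1):
    """Subtract the rank-k SVD approximation of the column-centered matrix Y.

    Y has one row per sample and one column per gene. This is a generic
    illustration (hypothetical helper), not a published correction method.
    """
    Yc = Y - Y.mean(axis=0, keepdims=True)        # center each gene (column)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]     # leading k components
    return Yc - low_rank

# Toy data: 20 samples, 50 genes, one strong (hypothetical) batch effect.
rng = np.random.default_rng(0)
batch = np.repeat([-1.0, 1.0], 10)[:, None]       # batch indicator, shape (20, 1)
Y = 3.0 * batch @ rng.normal(size=(1, 50)) + rng.normal(size=(20, 50))
Y_clean = remove_top_components(Y, k=1)
```
</preformat>
<p>Such a correction removes whatever dominates the variance. If the factor of interest is correlated with the unwanted factor, or simply stronger than it, part of the signal of interest may be removed as well.</p>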
<p>In this paper, we focus on this latter case where there is no predefined factor of interest. This situation arises when performing unsupervised estimation tasks such as clustering or principal component analysis (PCA) in the presence of unwanted variation. It can also be the case that one needs to normalize a dataset without knowing which factors of interest will be studied. Our objective is to correct the gene expression by estimating and removing the unwanted variation, without removing the (unobserved) variation of interest.</p>
<p>The recent work of
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
suggests that negative control genes can be used to estimate unwanted factors. Here we propose ways to improve estimation of the effect of these factors when the factor of interest is not observed. Our contributions are threefold. First, we propose estimators which, given the unwanted factors estimated by
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
, are well suited to estimating their effect in the presence of unobserved factors of interest. Second, we introduce a different estimator which relies on replicate samples. Finally, we systematically compare existing and proposed correction methods on an extensive set of experiments.</p>
<p>Section
<xref ref-type="sec" rid="s2">2</xref>
recalls the model of
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
and introduces estimators of the effect of given unwanted factors, which are suited to the case where the factor of interest is unobserved. Section
<xref ref-type="sec" rid="s3">3</xref>
describes an alternative estimator of the unwanted variation using replicate samples rather than the unwanted factors previously estimated using control genes. Section
<xref ref-type="sec" rid="s4">4</xref>
compares existing and proposed correction methods on synthetic data, and Section
<xref ref-type="sec" rid="s5">5</xref>
does the same on a gene expression dataset.</p>
</sec>
<sec id="s2">
<label>2.</label>
<title>Correction using negative control genes</title>
<p>The removal of unwanted variation (RUV) model used by
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
is a linear model first introduced in this context by
<xref rid="KXV026C14" ref-type="bibr">Leek and Storey (2007)</xref>
, with a term representing the variation of interest and another term representing the unwanted variation:
<disp-formula id="KXV026M1">
<label>(2.1)</label>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{equation*} Y = X\beta + W\alpha + \varepsilon, \end{equation*}\end{document}</tex-math>
</disp-formula>
with
<inline-formula>
<tex-math id="M2">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y \in \mathbb {R}^{m\times n}$\end{document}</tex-math>
</inline-formula>
,
<inline-formula>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\in \mathbb {R}^{m\times p}$\end{document}</tex-math>
</inline-formula>
,
<inline-formula>
<tex-math id="M4">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta \in \mathbb {R}^{p\times n}$\end{document}</tex-math>
</inline-formula>
,
<inline-formula>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\in \mathbb {R}^{m\times k}$\end{document}</tex-math>
</inline-formula>
,
<inline-formula>
<tex-math id="M6">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha \in \mathbb {R}^{k\times n}$\end{document}</tex-math>
</inline-formula>
, and
<inline-formula>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\varepsilon \in \mathbb {R}^{m\times n}$\end{document}</tex-math>
</inline-formula>
.
<inline-formula>
<tex-math id="M8">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
is the observed matrix of expression of
<inline-formula>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$n$\end{document}</tex-math>
</inline-formula>
genes for
<inline-formula>
<tex-math id="M10">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$m$\end{document}</tex-math>
</inline-formula>
samples,
<inline-formula>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
represents the
<inline-formula>
<tex-math id="M12">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$p$\end{document}</tex-math>
</inline-formula>
factors of interest,
<inline-formula>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
the
<inline-formula>
<tex-math id="M14">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
unwanted factors and
<inline-formula>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\varepsilon $\end{document}</tex-math>
</inline-formula>
some noise, typically
<inline-formula>
<tex-math id="M16">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\varepsilon _j \sim \mathcal {N}(0,\sigma ^2_\varepsilon I_m),\ j=1,\ldots ,n$\end{document}</tex-math>
</inline-formula>
. While
<xref rid="KXV026C14" ref-type="bibr">Leek and Storey (2007)</xref>
and
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
use a gene-specific variance
<inline-formula>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\sigma ^2_{\varepsilon _j}$\end{document}</tex-math>
</inline-formula>
, we restrict ourselves to a common variance in this work; Sections
<inline-formula>
<tex-math id="M18">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$3.7.4$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\textrm {A}.3$\end{document}</tex-math>
</inline-formula>
in
<xref rid="KXV026C7" ref-type="bibr">Gagnon-Bartsch
<italic>and others</italic>
(2013)</xref>
provide a detailed and illustrated discussion of why this approximation is reasonable. Both
<inline-formula>
<tex-math id="M20">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta $\end{document}</tex-math>
</inline-formula>
are modeled as fixed, i.e., non-random.</p>
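<p>To make the dimensions in model (2.1) concrete, the following is a minimal simulation sketch. It is not the implementation in <monospace>RUVnormalize</monospace>: the function name, the default sizes and the Gaussian draws used to pick one fixed value of each matrix are assumptions made for illustration, and NumPy is assumed to be available. Here X and W are drawn independently, which is the easy case; the difficulty discussed in this article arises when they are correlated.</p>
<preformat>
```python
import numpy as np

def simulate_ruv_model(m=20, n=100, p=1, k=2, sigma=0.5, seed=0):
    """Draw one dataset from Y = X beta + W alpha + eps (m samples, n genes).

    Hypothetical helper: alpha and beta are fixed in the model; here they
    are simply generated once to obtain concrete values.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, p))             # factors of interest, m x p
    beta = rng.normal(size=(p, n))          # effect of X on each gene, p x n
    W = rng.normal(size=(m, k))             # unwanted factors, m x k
    alpha = rng.normal(size=(k, n))         # effect of W on each gene, k x n
    eps = sigma * rng.normal(size=(m, n))   # noise with common variance sigma**2
    Y = X @ beta + W @ alpha + eps          # observed expression, m x n
    return Y, X, beta, W, alpha

Y, X, beta, W, alpha = simulate_ruv_model()
# If W alpha were known exactly, subtracting it would leave X beta plus noise.
Y_corrected = Y - W @ alpha
```
</preformat>
<p>In practice neither W nor alpha is observed, so the oracle correction in the last line is only a reference point against which an estimated correction could be compared.</p>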
<p>
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
were mainly interested in the case where
<inline-formula>
<tex-math id="M22">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
is an observed factor of interest, and the objective is to test which genes are affected by it, that is, whether
<inline-formula>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta _j = 0$\end{document}</tex-math>
</inline-formula>
for each gene
<inline-formula>
<tex-math id="M24">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$j$\end{document}</tex-math>
</inline-formula>
. If
<inline-formula>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
is also observed, the maximum likelihood estimator of
<inline-formula>
<tex-math id="M26">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta $\end{document}</tex-math>
</inline-formula>
is a well studied linear regression estimator. The major contribution of
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
is to provide an estimator of
<inline-formula>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
exploiting the fact that some genes are known to be negative controls, i.e., unrelated to the factor of interest
<inline-formula>
<tex-math id="M28">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
. We refer to this estimator as
<inline-formula>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{2}$\end{document}</tex-math>
</inline-formula>
in the remainder of this article. By contrast, in this work, we are interested in the case where
<inline-formula>
<tex-math id="M30">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
is not observed. Our objective in general will be to estimate
<inline-formula>
<tex-math id="M31">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
and remove it from
<inline-formula>
<tex-math id="M32">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
.</p>
<sec id="s2a">
<label>2.1</label>
<title>A random effect version of RUV for unobserved
<inline-formula>
<tex-math id="M33">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
</title>
<p>If
<inline-formula>
<tex-math id="M34">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
is not observed,
<inline-formula>
<tex-math id="M35">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(X\beta , \alpha )$\end{document}</tex-math>
</inline-formula>
become non-identifiable even given
<inline-formula>
<tex-math id="M36">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
. A naive solution is to estimate
<inline-formula>
<tex-math id="M37">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
by regression of
<inline-formula>
<tex-math id="M38">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
on
<inline-formula>
<tex-math id="M39">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
, i.e., to project
<inline-formula>
<tex-math id="M40">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
onto the orthogonal complement of
<inline-formula>
<tex-math id="M41">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
. This approach is referred to as
<italic>naive RUV-2</italic>
in
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
and is expected to be helpful as long as
<inline-formula>
<tex-math id="M42">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M43">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
are not too correlated. If, however, there is some degree of confounding between the factor of interest and the sources of unwanted variation, such a radical removal of the latter will also remove much of the former. In an extreme example, if one studies the effect of a treatment on gene expression and all treated samples are processed on Day 1 and all untreated samples on Day 2, removing all variation along the Day 1–Day 2 axis also removes all variation between treated and untreated samples.</p>
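<p>The extreme case described above can be reproduced on toy data. The following is a minimal sketch, assuming that the unwanted variation factor is known and equal to the treatment indicator; the function name <monospace>naive_ruv2</monospace>, the sizes and the noise level are illustrative assumptions and are not taken from the RUVnormalize package.</p>
<preformat>
```python
import numpy as np

def naive_ruv2(Y, W):
    """Subtract the least-squares fit of Y on W, i.e. project Y onto
    the orthogonal complement of the column space of W."""
    alpha_hat, *_ = np.linalg.lstsq(W, Y, rcond=None)
    return Y - W @ alpha_hat

rng = np.random.default_rng(0)
m, n = 8, 50                                    # samples, genes (toy sizes)
X = np.repeat([1.0, 0.0], m // 2)[:, None]      # treated / untreated indicator
W = X.copy()                                    # Day 1 indicator, identical to X
beta = rng.normal(size=(1, n))
alpha = rng.normal(size=(1, n))
Y = X @ beta + W @ alpha + 0.1 * rng.normal(size=(m, n))

Y_corrected = naive_ruv2(Y, W)
# Component of the data along the factor of interest, before and after
signal_before = np.linalg.norm(X.T @ Y)
signal_left = np.linalg.norm(X.T @ Y_corrected)
```
</preformat>
<p>Because the two indicators coincide in this toy design, the corrected matrix is orthogonal to the treatment indicator up to floating-point error, so nothing of the treatment effect survives the correction.</p>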
<p>We now discuss how a random
<inline-formula>
<tex-math id="M44">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
version of (
<xref rid="KXV026M1" ref-type="disp-formula">2.1</xref>
) could improve estimation when
<inline-formula>
<tex-math id="M45">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M46">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
are not expected to be orthogonal.</p>
<p>The naive RUV-2 estimator of
<inline-formula>
<tex-math id="M47">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
is formally given by
<disp-formula id="KXV026M2">
<label>(2.2)</label>
<tex-math id="M48">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{equation*} \min_{\alpha\in\mathbb{R}^{k\times n}} \|Y - \hat{W}_{2} \alpha\|^2_F, \end{equation*}\end{document}</tex-math>
</disp-formula>
which is the maximum likelihood estimator of
<inline-formula>
<tex-math id="M49">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
for model (
<xref rid="KXV026M1" ref-type="disp-formula">2.1</xref>
) if
<inline-formula>
<tex-math id="M50">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W=\hat {W}_{2}$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M51">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta =0$\end{document}</tex-math>
</inline-formula>
. If we keep the same model and endow
<inline-formula>
<tex-math id="M52">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
with a distribution
<inline-formula>
<tex-math id="M53">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha _j \stackrel {iid}{\sim } \mathcal {N}(0,\sigma _\alpha ^2I_k),\ j=1,\ldots ,n$\end{document}</tex-math>
</inline-formula>
, the maximum a posteriori estimator of
<inline-formula>
<tex-math id="M54">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
becomes:
<disp-formula id="KXV026M3">
<label>(2.3)</label>
<tex-math id="M55">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{equation*} \min_{\alpha\in\mathbb{R}^{k\times n}} \{\|Y - \hat{W}_{2} \alpha\|^2_F + \nu \|\alpha\|^2_F\}, \end{equation*}\end{document}</tex-math>
</disp-formula>
where
<inline-formula>
<tex-math id="M56">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu =\sigma _\varepsilon ^2/\sigma _\alpha ^2$\end{document}</tex-math>
</inline-formula>
. Here again, as with
<inline-formula>
<tex-math id="M57">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\sigma _\varepsilon $\end{document}</tex-math>
</inline-formula>
, we limit ourselves to a model where
<inline-formula>
<tex-math id="M58">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\sigma _\alpha $\end{document}</tex-math>
</inline-formula>
is common to all genes. Sections 14 and 15 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
discuss a related model where
<inline-formula>
<tex-math id="M59">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
rather than
<inline-formula>
<tex-math id="M60">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
is modeled as a random quantity.</p>
<p>The only difference from the naive RUV-2 estimator (
<xref rid="KXV026M2" ref-type="disp-formula">2.2</xref>
) is the
<inline-formula>
<tex-math id="M61">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\ell _2$\end{document}</tex-math>
</inline-formula>
penalty term: (
<xref rid="KXV026M3" ref-type="disp-formula">2.3</xref>
) is a ridge regression against
<inline-formula>
<tex-math id="M62">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{2}$\end{document}</tex-math>
</inline-formula>
whereas (
<xref rid="KXV026M2" ref-type="disp-formula">2.2</xref>
) is an ordinary regression. In this context where
<inline-formula>
<tex-math id="M63">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
is unobserved and
<inline-formula>
<tex-math id="M64">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
is set to 0 to estimate
<inline-formula>
<tex-math id="M65">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
, this difference can be important if
<inline-formula>
<tex-math id="M66">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M67">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
are correlated; for more detail, see Section 3 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
.</p>
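<p>The contrast between the ordinary regression (2.2) and the ridge regression (2.3) can be illustrated with the closed-form ridge solution. This is a sketch under stated assumptions: the unwanted variation factor is taken as known, the helper name <monospace>ridge_alpha</monospace> is hypothetical, and the penalty value 5.0 is arbitrary rather than a recommended choice.</p>
<preformat>
```python
import numpy as np

def ridge_alpha(Y, W, nu):
    """Closed-form minimizer of ||Y - W a||_F^2 + nu ||a||_F^2 over a."""
    k = W.shape[1]
    return np.linalg.solve(W.T @ W + nu * np.eye(k), W.T @ Y)

rng = np.random.default_rng(1)
m, n, k = 20, 100, 2                             # samples, genes, unwanted factors
W = rng.normal(size=(m, k))
x = W[:, :1] + 0.5 * rng.normal(size=(m, 1))     # factor of interest correlated with W
Y = (x @ rng.normal(size=(1, n))
     + W @ rng.normal(size=(k, n))
     + 0.1 * rng.normal(size=(m, n)))

alpha_ols = ridge_alpha(Y, W, nu=0.0)            # nu = 0 recovers the estimator (2.2)
alpha_ridge = ridge_alpha(Y, W, nu=5.0)          # nu picked arbitrarily for illustration

# Amount of variation subtracted from Y by each correction
removed_ols = np.linalg.norm(W @ alpha_ols)
removed_ridge = np.linalg.norm(W @ alpha_ridge)
```
</preformat>
<p>The ridge fit is a shrunken version of the ordinary fit, so it subtracts less from the data; when the factor of interest is correlated with the unwanted variation, this is what leaves room for part of the signal to be preserved.</p>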
</sec>
<sec id="s2b">
<label>2.2</label>
<title>Generalization: joint estimation of
<inline-formula>
<tex-math id="M68">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M69">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
</title>
<p>Assuming some structure is known on the unobserved
<inline-formula>
<tex-math id="M70">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
term, it is possible to write a joint estimator of
<inline-formula>
<tex-math id="M71">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(X\beta , \alpha )$\end{document}</tex-math>
</inline-formula>
given
<inline-formula>
<tex-math id="M72">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
rather than fixing
<inline-formula>
<tex-math id="M73">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta =0$\end{document}</tex-math>
</inline-formula>
:
<disp-formula id="KXV026M4">
<label>(2.4)</label>
<tex-math id="M74">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{equation*} \min_{X\beta \in \mathcal{M}} \min_{\alpha\in\mathbb{R}^{k\times n}} \{\|Y - W\alpha - X\beta\|^2_F + \nu \|\alpha\|^2_F\}, \end{equation*}\end{document}</tex-math>
</disp-formula>
where
<inline-formula>
<tex-math id="M75">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {M}$\end{document}</tex-math>
</inline-formula>
is a subset of
<inline-formula>
<tex-math id="M76">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathbb {R}^{m\times n}$\end{document}</tex-math>
</inline-formula>
. A typical example of
<inline-formula>
<tex-math id="M77">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {M}$\end{document}</tex-math>
</inline-formula>
would be a clustering structure
<inline-formula>
<tex-math id="M78">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\{X\beta ~: X\in ~\mathcal {C}, \beta \in \mathbb {R}^{k\times n}\}$\end{document}</tex-math>
</inline-formula>
, where
<inline-formula>
<tex-math id="M79">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {C}$\end{document}</tex-math>
</inline-formula>
denotes the set of cluster membership matrices
<inline-formula>
<tex-math id="M80">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {C} \stackrel {\Delta }{=} \{M \in \{0,1\}^{m\times k}, \sum _{j=1}^k M_{i,j} = 1, i=1,\ldots ,m \}$\end{document}</tex-math>
</inline-formula>
. In this case, the minimization over
<inline-formula>
<tex-math id="M81">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
in (
<xref rid="KXV026M4" ref-type="disp-formula">2.4</xref>
) for a given
<inline-formula>
<tex-math id="M82">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
can be addressed by a
<italic>k</italic>
-means algorithm over
<inline-formula>
<tex-math id="M83">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - W\alpha $\end{document}</tex-math>
</inline-formula>
. Other typical examples include PCA where
<inline-formula>
<tex-math id="M84">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {M}=\{M :\textrm {rank}(M)\leq p\}$\end{document}</tex-math>
</inline-formula>
and sparse dictionary learning (
<xref rid="KXV026C17" ref-type="bibr">Mairal
<italic>and others</italic>
, 2010</xref>
) where
<inline-formula>
<tex-math id="M85">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {M}=\{X\beta :\textrm {rank}(X\beta )\leq p, \|X_i\|\leq 1, i=1,\ldots ,p, \|\beta \|_1 \leq \mu \}$\end{document}</tex-math>
</inline-formula>
. Section 12 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
provides an alternative formulation of (
<xref rid="KXV026M4" ref-type="disp-formula">2.4</xref>
).</p>
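<p>For the clustering structure, one iteration of the k-means step can be sketched as follows. The matrix <monospace>R</monospace> stands in for the corrected data, the function names are hypothetical, and the sketch assumes that no cluster is empty.</p>
<preformat>
```python
import numpy as np

def kmeans_step(R, labels, k):
    """One Lloyd iteration on the rows of R (standing in for Y - W alpha)."""
    # beta update: row c of beta is the mean profile of cluster c
    beta = np.vstack([R[labels == c].mean(axis=0) for c in range(k)])
    # X update: assign every sample to its nearest centroid
    dist = ((R[:, None, :] - beta[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), beta

def membership(labels, k):
    """Cluster membership matrix: exactly one 1 per row."""
    return np.eye(k)[labels]

rng = np.random.default_rng(2)
centers = np.array([[0.0] * 4, [10.0] * 4])
true_labels = np.array([0, 0, 0, 1, 1, 1])
R = centers[true_labels] + 0.1 * rng.normal(size=(6, 4))
start = np.array([0, 0, 1, 1, 1, 0])             # imperfect initial assignment
labels, beta = kmeans_step(R, start, k=2)        # beta: centroids of the start labels
X = membership(labels, 2)
```
</preformat>
<p>Iterating this step until the labels stop changing gives the usual k-means procedure; the product of the membership matrix and the centroids is then the clustering estimate of the term of interest.</p>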
<p>The objective of this joint modeling can be twofold: one may still just want to estimate
<inline-formula>
<tex-math id="M86">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
in order to return a corrected
<inline-formula>
<tex-math id="M87">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - W\alpha $\end{document}</tex-math>
</inline-formula>
matrix, but hope that the joint estimation will yield a better estimate of
<inline-formula>
<tex-math id="M88">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
(in the sense of
<inline-formula>
<tex-math id="M89">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\|\hat {\alpha } - \alpha \|^2$\end{document}</tex-math>
</inline-formula>
). One may also actually be interested in estimating the unobserved
<inline-formula>
<tex-math id="M90">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
, e.g., to solve a clustering problem in the presence of unwanted variation.</p>
<p>A joint solution for
<inline-formula>
<tex-math id="M91">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(X\beta ,\alpha )$\end{document}</tex-math>
</inline-formula>
is generally not available for (
<xref rid="KXV026M4" ref-type="disp-formula">2.4</xref>
). A possible way of maximizing the likelihood of
<inline-formula>
<tex-math id="M92">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
, however, is to alternate between a step of optimization over
<inline-formula>
<tex-math id="M93">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
for a given
<inline-formula>
<tex-math id="M94">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
, which corresponds to a ridge regression problem, and a step of optimization over
<inline-formula>
<tex-math id="M95">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
for a given
<inline-formula>
<tex-math id="M96">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
using the relevant unsupervised estimation procedure over
<inline-formula>
<tex-math id="M97">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - W\alpha $\end{document}</tex-math>
</inline-formula>
. Each step decreases the objective
<inline-formula>
<tex-math id="M98">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\|Y - X\beta - W\alpha \|_F^2 + \nu \|\alpha \|_F^2$\end{document}</tex-math>
</inline-formula>
, and even though this procedure does not in general converge to the global maximum likelihood estimate of
<inline-formula>
<tex-math id="M99">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(X\beta , \alpha )$\end{document}</tex-math>
</inline-formula>
, it may yield better estimates than (
<xref rid="KXV026M3" ref-type="disp-formula">2.3</xref>
) where
<inline-formula>
<tex-math id="M100">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
is simply assumed to be 0.</p>
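<p>The alternating scheme can be sketched for the PCA case, where the term of interest is constrained to have rank at most p. This is an illustration under stated assumptions rather than the implementation of the RUVnormalize package: the unwanted variation factor is taken as known, the function names are hypothetical, and the number of iterations is fixed instead of being driven by a convergence test.</p>
<preformat>
```python
import numpy as np

def objective(Y, W, alpha, XB, nu):
    return np.linalg.norm(Y - XB - W @ alpha) ** 2 + nu * np.linalg.norm(alpha) ** 2

def ridge_step(R, W, nu):
    """Exact minimizer over alpha for a fixed term of interest."""
    k = W.shape[1]
    return np.linalg.solve(W.T @ W + nu * np.eye(k), W.T @ R)

def low_rank_step(R, p):
    """Best rank-p approximation of R in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :p] * s[:p]) @ Vt[:p]

def alternate(Y, W, nu, p, n_iter=20):
    XB = np.zeros_like(Y)                # first ridge step then coincides with (2.3)
    history = []
    for _ in range(n_iter):
        alpha = ridge_step(Y - XB, W, nu)
        XB = low_rank_step(Y - W @ alpha, p)
        history.append(objective(Y, W, alpha, XB, nu))
    return XB, alpha, history

rng = np.random.default_rng(3)
m, n, k, p = 12, 40, 2, 1
W = rng.normal(size=(m, k))
Y = (rng.normal(size=(m, p)) @ rng.normal(size=(p, n))
     + W @ rng.normal(size=(k, n))
     + 0.1 * rng.normal(size=(m, n)))
XB_hat, alpha_hat, history = alternate(Y, W, nu=1.0, p=p)
```
</preformat>
<p>Since each step is an exact minimization over one block with the other held fixed, the recorded objective values never increase from one iteration to the next, although the limit point need not be a global optimum.</p>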
<p>Finally, such a joint scheme can be used to build a different estimator of
<inline-formula>
<tex-math id="M101">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
, akin to the feasible generalized least squares (
<xref rid="KXV026C6" ref-type="bibr">Freedman, 2005</xref>
) used in regression: once an estimate of
<inline-formula>
<tex-math id="M102">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta |W, \alpha $\end{document}</tex-math>
</inline-formula>
becomes available,
<inline-formula>
<tex-math id="M103">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
can be re-estimated using an SVD on the residuals
<inline-formula>
<tex-math id="M104">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - \hat {X\beta }$\end{document}</tex-math>
</inline-formula>
rather than the control genes. This approach is discussed in Section 2 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
.</p>
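<p>The re-estimation step can be sketched as a truncated SVD of the residuals. The function name <monospace>reestimate_W</monospace> is hypothetical, and scaling the left singular vectors by their singular values is an assumption made here for illustration; only the column space of the result is checked.</p>
<preformat>
```python
import numpy as np

def reestimate_W(Y, XB_hat, k):
    """First k left singular vectors of Y - XB_hat, scaled by the
    corresponding singular values (the scaling is an assumption)."""
    U, s, _ = np.linalg.svd(Y - XB_hat, full_matrices=False)
    return U[:, :k] * s[:k]

rng = np.random.default_rng(4)
m, n, k = 10, 30, 2
W = rng.normal(size=(m, k))
XB = rng.normal(size=(m, 1)) @ rng.normal(size=(1, n))
Y = XB + W @ rng.normal(size=(k, n))             # noise-free toy data
W_hat = reestimate_W(Y, XB, k)                   # XB plays the role of a perfect estimate

# Check that the columns of W lie in the span of the columns of W_hat
coef, *_ = np.linalg.lstsq(W_hat, W, rcond=None)
span_error = np.linalg.norm(W - W_hat @ coef)
```
</preformat>
<p>In this idealized setting the residuals contain only unwanted variation, so the estimated factors span the same space as the true ones; with a noisy estimate of the term of interest the recovery would only be approximate.</p>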
</sec>
</sec>
<sec id="s3">
<label>3.</label>
<title>Correction using replicate samples</title>
<p>We now introduce an alternative estimator of the unwanted variation
<inline-formula>
<tex-math id="M105">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
, which, unlike the ones discussed in Section
<xref ref-type="sec" rid="s2">2</xref>
, does not rely on a previous estimator of
<inline-formula>
<tex-math id="M106">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
. Symmetrically to the negative control genes used to estimate
<inline-formula>
<tex-math id="M107">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
in Section
<xref ref-type="sec" rid="s2">2</xref>
, we now consider negative “control samples” for which the factor of interest
<inline-formula>
<tex-math id="M108">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
is 0.</p>
<p>In practice, one way of obtaining such control samples is to use replicate samples, i.e., samples that come from the same tissue but were hybridized in two different settings, say across time or platform. The profile formed by the difference of two such replicates should only be influenced by unwanted variation, specifically by the unwanted factors whose levels differ between the two replicates. In particular, the
<inline-formula>
<tex-math id="M109">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
of this difference should be 0. By construction, this approach is only able to deal with unwanted variation with respect to which replicates are available, which is often the case for technical unwanted variation but rarely the case for biological unwanted variation.</p>
<p>More generally, when there are more than two replicates, one may take all pairwise differences or the differences between each replicate and the average of the other replicates. We will denote by
<inline-formula>
<tex-math id="M110">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$d$\end{document}</tex-math>
</inline-formula>
the indices of these artificial control samples formed by differences of replicates, and we therefore have
<inline-formula>
<tex-math id="M111">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X^d = 0$\end{document}</tex-math>
</inline-formula>
where
<inline-formula>
<tex-math id="M112">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X^d$\end{document}</tex-math>
</inline-formula>
are the rows of
<inline-formula>
<tex-math id="M113">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
indexed by
<inline-formula>
<tex-math id="M114">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$d$\end{document}</tex-math>
</inline-formula>
.</p>
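<p>A minimal sketch of how such artificial control samples could be built is given below. The function name and the representation of replicate groups as lists of row indices are hypothetical. For a pair of replicates the sketch takes the single pairwise difference; for larger groups it takes the difference between each replicate and the average of the others.</p>
<preformat>
```python
import numpy as np


def replicate_differences(Y, groups):
    """Build control samples Y^d from groups of replicates.

    Y      : (m, n) expression matrix, samples in rows.
    groups : list of lists of row indices; each inner list holds the
             replicates of one biological sample (at least two rows).
    """
    rows = []
    for idx in groups:
        block = Y[idx]                      # (r, n) replicates of one sample
        r = block.shape[0]
        if r == 2:
            rows.append(block[0] - block[1])
        else:
            total = block.sum(axis=0)
            for i in range(r):
                others_mean = (total - block[i]) / (r - 1)
                rows.append(block[i] - others_mean)
    return np.vstack(rows)


# Toy check: the shared signal cancels, only unwanted variation remains.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1, 5))            # X beta, common to all replicates
unwanted = rng.normal(size=(3, 5))          # W alpha, differs per replicate
Yd = replicate_differences(signal + unwanted, [[0, 1, 2]])
Ud = replicate_differences(unwanted, [[0, 1, 2]])
print(np.allclose(Yd, Ud))
```
</preformat>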
<p>Such samples can then be used to estimate
<inline-formula>
<tex-math id="M115">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
in a way that is dual to the way
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
used control genes to estimate
<inline-formula>
<tex-math id="M116">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
, recalled in Section 3 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
. More precisely, we consider the following algorithm:
<list list-type="bullet">
<list-item>
<p>Use the rows of
<inline-formula>
<tex-math id="M117">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
corresponding to control samples
<inline-formula>
<tex-math id="M118">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y^d = W^d\alpha + \varepsilon ^d$\end{document}</tex-math>
</inline-formula>
to estimate
<inline-formula>
<tex-math id="M119">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
. Assuming i.i.d. noise
<inline-formula>
<tex-math id="M120">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\varepsilon _j \sim \mathcal {N}(0,\sigma ^2_\varepsilon I_m),\ j=1,\ldots ,n$\end{document}</tex-math>
</inline-formula>
, the
<inline-formula>
<tex-math id="M121">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(W^d\alpha )$\end{document}</tex-math>
</inline-formula>
matrix maximizing the likelihood of this model is
<inline-formula>
<tex-math id="M122">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}${\rm argmin}_{W^d\alpha ,\,\textrm {rank}\,W^d\alpha \leq k}\|Y^d -W^d\alpha \|_F^2$\end{document}</tex-math>
</inline-formula>
. By the same argument used to compute
<inline-formula>
<tex-math id="M123">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{2}$\end{document}</tex-math>
</inline-formula>
, this argmin is reached for
<inline-formula>
<tex-math id="M124">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\widehat {W^d\alpha } = PE_kQ^\top $\end{document}</tex-math>
</inline-formula>
, where
<inline-formula>
<tex-math id="M125">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y^d = PEQ^\top $\end{document}</tex-math>
</inline-formula>
is the SVD of
<inline-formula>
<tex-math id="M126">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y^d$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M127">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$E_k$\end{document}</tex-math>
</inline-formula>
is the diagonal matrix with the
<inline-formula>
<tex-math id="M128">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
largest singular values as its
<inline-formula>
<tex-math id="M129">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
first entries and 0 on the rest of the diagonal. We can use
<inline-formula>
<tex-math id="M130">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {\alpha } = E_k Q^\top $\end{document}</tex-math>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>Using
<inline-formula>
<tex-math id="M131">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {\alpha }$\end{document}</tex-math>
</inline-formula>
in the restriction
<inline-formula>
<tex-math id="M132">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y_c = W\alpha _c + \varepsilon _c$\end{document}</tex-math>
</inline-formula>
of
<inline-formula>
<tex-math id="M133">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
to the control gene columns, the maximum likelihood estimate of
<inline-formula>
<tex-math id="M134">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
is now obtained by a linear regression,
<inline-formula>
<tex-math id="M135">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{r}\stackrel {\Delta }{=} Y_c \hat {\alpha }^\top _c (\hat {\alpha }_c\hat {\alpha }_c^\top )^{-1}$\end{document}</tex-math>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>Once
<inline-formula>
<tex-math id="M136">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M137">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
are estimated,
<inline-formula>
<tex-math id="M138">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}\hat {\alpha }$\end{document}</tex-math>
</inline-formula>
can be removed from
<inline-formula>
<tex-math id="M139">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
.</p>
</list-item>
</list>
</p>
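<p>The three steps above can be sketched as follows in Python with NumPy. This is an illustrative sketch, not the RUVnormalize implementation: the function name and argument names are hypothetical, and only the k nonzero rows of the estimate of alpha are kept so that the regression in the second step is well defined. It assumes that there are at least k control genes and at least k control samples.</p>
<preformat>
```python
import numpy as np


def ruv_replicates(Y, Yd, ctl, k):
    """Replicate-based correction (illustrative sketch).

    Y   : (m, n) expression matrix, samples in rows, genes in columns.
    Yd  : (d, n) control samples, e.g. differences of replicates.
    ctl : indices of the negative control gene columns.
    k   : assumed rank of the unwanted variation.
    """
    # Step 1: rank-k truncated SVD of Yd; alpha_hat holds the k nonzero
    # rows of E_k Q^T.
    _, s, Qt = np.linalg.svd(Yd, full_matrices=False)
    alpha_hat = s[:k, None] * Qt[:k]                      # (k, n)
    # Step 2: regress the control gene columns of Y on alpha_hat_c,
    # i.e. W_hat = Y_c a_c^T (a_c a_c^T)^{-1}, computed with a linear solve.
    a_c = alpha_hat[:, ctl]                               # (k, n_c)
    W_hat = np.linalg.solve(a_c @ a_c.T, a_c @ Y[:, ctl].T).T
    # Step 3: remove the estimated unwanted variation from Y.
    return Y - W_hat @ alpha_hat, W_hat, alpha_hat


# Noise-free toy data, with beta = 0 on the control genes.
rng = np.random.default_rng(0)
m, n, k = 20, 50, 2
ctl = np.arange(10)
X = rng.normal(size=(m, 1))
beta = rng.normal(size=(1, n))
beta[:, ctl] = 0.0
W = rng.normal(size=(m, k))
alpha = rng.normal(size=(k, n))
Y = X @ beta + W @ alpha
Yd = rng.normal(size=(5, k)) @ alpha        # control samples: X^d = 0
corrected, W_hat, alpha_hat = ruv_replicates(Y, Yd, ctl, k)
print(np.allclose(corrected, X @ beta))
```
</preformat>
<p>In this noise-free toy setting the estimated product of the two factors matches the true unwanted term up to rounding, even though each factor is only identified up to an invertible k by k transformation.</p>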
<p>
<inline-formula>
<tex-math id="M140">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
is not required in this procedure, which constitutes a fully unsupervised correction for
<inline-formula>
<tex-math id="M141">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
.</p>
<p>The extreme case where all genes are used as control genes is of interest, and is discussed in Section 10 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
.</p>
<p>Finally, this replicate-based correction yields an estimator
<inline-formula>
<tex-math id="M142">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{r}$\end{document}</tex-math>
</inline-formula>
of
<inline-formula>
<tex-math id="M143">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
, obtained by regression of
<inline-formula>
<tex-math id="M144">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y_c$\end{document}</tex-math>
</inline-formula>
against
<inline-formula>
<tex-math id="M145">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {\alpha }_c$\end{document}</tex-math>
</inline-formula>
. This estimator could be used in any of the methods described in Section
<xref ref-type="sec" rid="s2">2</xref>
. Section 11 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
discusses the difference between
<inline-formula>
<tex-math id="M146">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{r}$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M147">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_{2}$\end{document}</tex-math>
</inline-formula>
.</p>
</sec>
<sec id="s4">
<label>4.</label>
<title>Results on synthetic data</title>
<p>We start with a set of experiments on synthetic data, where we generate the data ourselves. In this case, we are able to measure how well each correction method recovers the true matrix
<inline-formula>
<tex-math id="M148">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - W\alpha $\end{document}</tex-math>
</inline-formula>
and it makes sense to talk about “right” or “best” corrections.</p>
<sec id="s4a">
<label>4.1</label>
<title>Protocol</title>
<p>We generate data according to model (
<xref rid="KXV026M1" ref-type="disp-formula">2.1</xref>
).
<inline-formula>
<tex-math id="M149">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
has two columns: one is a binary variable, the other an associated continuous variable. One can think of the binary variable as some clinical grouping of the tumor, such as the ER status for breast cancer. The continuous variable could be survival time. More precisely, the survival covariate is sampled from an exponential law with parameter
<inline-formula>
<tex-math id="M150">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$0.05$\end{document}</tex-math>
</inline-formula>
for ER
<inline-formula>
<tex-math id="M151">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$+$\end{document}</tex-math>
</inline-formula>
samples and
<inline-formula>
<tex-math id="M152">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$0.1$\end{document}</tex-math>
</inline-formula>
for ER
<inline-formula>
<tex-math id="M153">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$-$\end{document}</tex-math>
</inline-formula>
samples.
<inline-formula>
<tex-math id="M154">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
also has two columns, a binary one which could be the technical platform, and a continuous one which could be the RNA quality. RNA quality is sampled from a normal distribution, independently of the technical platform.</p>
<p>The columns of
<inline-formula>
<tex-math id="M155">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M156">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
are then transformed to have norm 1. We generate
<inline-formula>
<tex-math id="M157">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$m=100$\end{document}</tex-math>
</inline-formula>
samples with
<inline-formula>
<tex-math id="M158">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$n=10\,000$\end{document}</tex-math>
</inline-formula>
genes each using model (
<xref rid="KXV026M1" ref-type="disp-formula">2.1</xref>
). The
<inline-formula>
<tex-math id="M159">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha _{ij}$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M160">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta _{ij}$\end{document}</tex-math>
</inline-formula>
parameters are i.i.d. sampled from a normal distribution with mean 0 and variance 1, except for the 100 control genes, for which
<inline-formula>
<tex-math id="M161">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta _{ij}=0$\end{document}</tex-math>
</inline-formula>
. The
<inline-formula>
<tex-math id="M162">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\varepsilon _{ij}$\end{document}</tex-math>
</inline-formula>
parameters are i.i.d. sampled from a normal distribution with mean 0 and variance
<inline-formula>
<tex-math id="M163">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$0.01$\end{document}</tex-math>
</inline-formula>
. We also generate 10 additional samples with 2 replicates each, with the same values of
<inline-formula>
<tex-math id="M164">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
but different values of
<inline-formula>
<tex-math id="M165">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
.</p>
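<p>A sketch of this generation protocol for the setting where the factor of interest and the unwanted factors are drawn independently is shown below. Several details are assumptions made for illustration: the model is taken to be the sum of the signal term, the unwanted term, and the noise; the binary variables are drawn with probability 0.5; the exponential parameters are interpreted as rates; RNA quality is standard normal; and the control genes are placed in the first columns. The additional replicate samples are not generated here.</p>
<preformat>
```python
import numpy as np


def simulate(m=100, n=10_000, n_ctl=100, noise_var=0.01, seed=0):
    """Generate Y = X beta + W alpha + eps (independent setting, a sketch)."""
    rng = np.random.default_rng(seed)
    # Factor of interest X: binary ER status and a survival-like covariate.
    er = rng.integers(0, 2, size=m)             # assumed Bernoulli(0.5)
    rate = np.where(er == 1, 0.05, 0.1)         # parameters read as rates
    survival = rng.exponential(scale=1.0 / rate)
    X = np.column_stack([er, survival]).astype(float)
    # Unwanted factors W: binary platform and continuous RNA quality.
    platform = rng.integers(0, 2, size=m)       # assumed Bernoulli(0.5)
    quality = rng.normal(size=m)                # assumed standard normal
    W = np.column_stack([platform, quality]).astype(float)
    # Scale every column of X and W to unit Euclidean norm.
    X /= np.linalg.norm(X, axis=0)
    W /= np.linalg.norm(W, axis=0)
    # Gene-level coefficients, with beta = 0 on the control genes.
    beta = rng.normal(size=(2, n))
    alpha = rng.normal(size=(2, n))
    ctl = np.arange(n_ctl)                      # assumed position of controls
    beta[:, ctl] = 0.0
    eps = rng.normal(scale=np.sqrt(noise_var), size=(m, n))
    Y = X @ beta + W @ alpha + eps
    return Y, X, W, alpha, beta, ctl
```
</preformat>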
<p>We compare the performance of ComBat (
<xref rid="KXV026C12" ref-type="bibr">Johnson
<italic>and others</italic>
, 2007</xref>
), quantile normalization (QN,
<xref rid="KXV026C3" ref-type="bibr">Bolstad
<italic>and others</italic>
, 2003</xref>
), naive RUV-2, the random effect model above, and the replicate-based model on simulated data. ComBat cannot deal with continuous unwanted variation factors and is only given the binary one. Each method using negative control genes is given either the actual negative control genes, or everything but the negative control genes (“poor control genes” version). Furthermore, we try two iterative methods, as described in Section
<xref ref-type="sec" rid="s2b">2.2</xref>
. We perform 100 iterations: in the first version,
<inline-formula>
<tex-math id="M166">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
is estimated once and for all from control genes, as in the non-iterative random effect estimator, and in the second version it is re-estimated from the residuals
<inline-formula>
<tex-math id="M167">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y-\widehat {X\beta }$\end{document}</tex-math>
</inline-formula>
as detailed in Section 2 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
after every 34 iterations.</p>
<p>Table
<xref ref-type="table" rid="KXV026TB1">1</xref>
shows the performance of each correction method in three different settings: canonical correlations (
<xref rid="KXV026C9" ref-type="bibr">Hotelling, 1936</xref>
) between
<inline-formula>
<tex-math id="M168">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M169">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
equal to
<inline-formula>
<tex-math id="M170">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(0.13, 0.05)$\end{document}</tex-math>
</inline-formula>
(
<italic>Independent</italic>
),
<inline-formula>
<tex-math id="M171">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(1, 1)$\end{document}</tex-math>
</inline-formula>
(
<italic>Confounded</italic>
), and
<inline-formula>
<tex-math id="M172">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(0.99, 0.8)$\end{document}</tex-math>
</inline-formula>
(
<italic>Moderate</italic>
). The normalized reconstruction error is measured by
<inline-formula>
<tex-math id="M173">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\|(Y - \widehat {W\alpha }) - (Y- W\alpha )\|^2/\|Y - W\alpha \|^2$\end{document}</tex-math>
</inline-formula>
: our objective is to remove the unwanted variation from
<inline-formula>
<tex-math id="M174">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
, and only the unwanted variation. These results are obtained when each method is given the hyperparameter values used to generate the data. The effect of misspecification for
<inline-formula>
<tex-math id="M175">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M176">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
is discussed in Section 8 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
.
<table-wrap id="KXV026TB1" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<p>Simulations: reconstruction error after various corrections in the three settings studied</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th align="center" rowspan="1" colspan="1">Independent</th>
<th align="center" rowspan="1" colspan="1">Confounded</th>
<th align="center" rowspan="1" colspan="1">Moderate</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Uncorrected</td>
<td rowspan="1" colspan="1">0.67</td>
<td rowspan="1" colspan="1">0.68</td>
<td rowspan="1" colspan="1">0.67</td>
</tr>
<tr>
<td rowspan="1" colspan="1">QN</td>
<td rowspan="1" colspan="1">0.74</td>
<td rowspan="1" colspan="1">0.66</td>
<td rowspan="1" colspan="1">0.63</td>
</tr>
<tr>
<td rowspan="1" colspan="1">ComBat</td>
<td rowspan="1" colspan="1">0.34</td>
<td rowspan="1" colspan="1">0.66</td>
<td rowspan="1" colspan="1">0.76</td>
</tr>
<tr>
<td rowspan="1" colspan="1">naive RUV-2</td>
<td rowspan="1" colspan="1">0.17</td>
<td rowspan="1" colspan="1">0.67</td>
<td rowspan="1" colspan="1">0.53</td>
</tr>
<tr>
<td rowspan="1" colspan="1">naive RUV-2 poor control</td>
<td rowspan="1" colspan="1">0.74</td>
<td rowspan="1" colspan="1">0.67</td>
<td rowspan="1" colspan="1">0.67</td>
</tr>
<tr>
<td rowspan="1" colspan="1">random effect</td>
<td rowspan="1" colspan="1">0.23</td>
<td rowspan="1" colspan="1">0.34</td>
<td rowspan="1" colspan="1">0.31</td>
</tr>
<tr>
<td rowspan="1" colspan="1">random poor control</td>
<td rowspan="1" colspan="1">0.48</td>
<td rowspan="1" colspan="1">0.37</td>
<td rowspan="1" colspan="1">0.38</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Iterative</td>
<td rowspan="1" colspan="1">0.11</td>
<td rowspan="1" colspan="1">0.46</td>
<td rowspan="1" colspan="1">0.19</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Iterative update</td>
<td rowspan="1" colspan="1">0.06</td>
<td rowspan="1" colspan="1">0.40</td>
<td rowspan="1" colspan="1">0.16</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Replicates</td>
<td rowspan="1" colspan="1">0.32</td>
<td rowspan="1" colspan="1">0.30</td>
<td rowspan="1" colspan="1">0.30</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Replicates poor control</td>
<td rowspan="1" colspan="1">0.28</td>
<td rowspan="1" colspan="1">0.28</td>
<td rowspan="1" colspan="1">0.28</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="s4b">
<label>4.2</label>
<title>Results</title>
<p>In the case where
<inline-formula>
<tex-math id="M177">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M178">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
are generated independently, the QN-corrected matrix yields a larger reconstruction error than the uncorrected one. ComBat helps because it removes all of the platform effect without much affecting the signal along
<inline-formula>
<tex-math id="M179">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
. Naive RUV-2 gets even better results because it also accounts for the continuous unwanted factor, but its performance is severely reduced if we use non-control genes: in this case, the estimated
<inline-formula>
<tex-math id="M180">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
—by PCA on the non-control genes—is associated with the true
<inline-formula>
<tex-math id="M181">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
, which leads to removing too much variance along
<inline-formula>
<tex-math id="M182">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
.</p>
<p>The random effect model performs similarly to naive RUV-2, though slightly worse because it uses
<inline-formula>
<tex-math id="M183">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k=m$\end{document}</tex-math>
</inline-formula>
and therefore shrinks the signal in every direction. It is also affected by the use of non-control genes, but less dramatically than naive RUV-2 because it does not remove all the signal along the estimated
<inline-formula>
<tex-math id="M184">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
. The iterative methods greatly improve performance.</p>
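<p>The contrast between removing all the signal along the estimated unwanted directions and only shrinking it can be illustrated numerically. The sketch below assumes that the random effect correction behaves like a ridge estimate with parameter nu, so that along a singular direction of W with singular value d a fraction d^2 / (d^2 + nu) of the data is removed, whereas a least squares projection (nu = 0) removes everything along that direction. The function name and the example values are hypothetical.</p>

```python
def removed_fraction(singular_values, nu):
    """Fraction of the data removed along each singular direction of W.

    Assumption: the correction subtracts the ridge fit
    W (W'W + nu I)^(-1) W'Y, whose effect along a direction with
    singular value d is the factor d^2 / (d^2 + nu).
    """
    return [d * d / (d * d + nu) for d in singular_values]


# nu = 0 corresponds to a plain projection: everything is removed.
projection = removed_fraction([3.0, 1.0], 0.0)
# A positive nu removes less along weak directions than along strong ones.
ridge = removed_fraction([3.0, 1.0], 1.0)
```

<p>Under these assumptions, the weaker direction is only half removed when nu equals its squared singular value, which is one way to read the more graceful behavior of the random effect model when the estimated W overlaps with X.</p>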
<p>The replicate-based method performs similarly to ComBat. Interestingly, its results are slightly improved by using non-control genes; see Section 10 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
for a detailed discussion of this phenomenon.</p>
<p>When
<inline-formula>
<tex-math id="M185">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X=W$\end{document}</tex-math>
</inline-formula>
(
<italic>Confounded</italic>
), QN, ComBat and naive RUV-2 lead to nearly the same reconstruction error as the uncorrected matrix. The uncorrected and corrected matrices actually fail for opposite reasons: the unwanted variation along
<inline-formula>
<tex-math id="M186">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W=X$\end{document}</tex-math>
</inline-formula>
adds variance along
<inline-formula>
<tex-math id="M187">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
, so the uncorrected matrix has too much variance along this direction. By contrast, ComBat and naive RUV-2 remove too much variance along
<inline-formula>
<tex-math id="M188">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X=W$\end{document}</tex-math>
</inline-formula>
, because they treat all of it as unwanted variation. The random effect method performs somewhat worse than in the independent case, but is not as dramatically affected by the confounding as the fixed effect naive RUV-2 method. Importantly, the iterative methods perform worse than the non-iterative random effect estimator. The iterative methods only help if they manage to estimate
<inline-formula>
<tex-math id="M189">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
well enough to improve the estimation of
<inline-formula>
<tex-math id="M190">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
. In this setting,
<inline-formula>
<tex-math id="M191">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X=W$\end{document}</tex-math>
</inline-formula>
so removing
<inline-formula>
<tex-math id="M192">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\widehat {W\alpha }$\end{document}</tex-math>
</inline-formula>
decreases the variance along
<inline-formula>
<tex-math id="M193">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
too much and makes this factor harder to identify properly.</p>
<p>Finally, as expected from the discussion in Section 11 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
, the replicate-based method is not affected by the confounding at all: by construction, it only considers the variance coming from the technical unwanted variation.</p>
<p>The last column illustrates an intermediate case with moderate confounding. The random effect model still performs much better than the uncorrected data, QN, ComBat and naive RUV-2, and is less affected by the use of non-control genes. It is worth noting that even though
<inline-formula>
<tex-math id="M194">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M195">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
are highly correlated, the iterative methods yield a much lower error than the non-iterative one, suggesting that they only reduce performance in extreme cases such as the total confounding setting.</p>
<p>We now summarize our observations on these synthetic experiments. ComBat removes an observed batch effect well if the batch is largely independent of the signal of interest. Naive RUV-2 removes a batch effect well, whether the batch is observed or not, if it is given good control genes and if the batch is not too strongly associated with the factor of interest. The random effect model performs like naive RUV-2 but is much less affected by confounding and poor control genes. Iterating the estimation of
<inline-formula>
<tex-math id="M196">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
given
<inline-formula>
<tex-math id="M197">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$(W, X\beta )$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M198">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
given
<inline-formula>
<tex-math id="M199">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
improves the estimate of
<inline-formula>
<tex-math id="M200">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
, even more so if the residuals
<inline-formula>
<tex-math id="M201">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - X\beta $\end{document}</tex-math>
</inline-formula>
are used to update the estimate of
<inline-formula>
<tex-math id="M202">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
. This strategy fails, however, when
<inline-formula>
<tex-math id="M203">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
is too hard to estimate, in which case iterations can even reduce performance. Finally, the replicate-based method is essentially unaffected by control gene quality or confounding; it depends only on the number and coverage of replicates. See Sections 10 and 13 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
for additional discussion of these points.</p>
</sec>
</sec>
<sec id="s5">
<label>5.</label>
<title>Results on real data</title>
<p>On real data, we have no way to measure how close we are to the true
<inline-formula>
<tex-math id="M204">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y - W\alpha $\end{document}</tex-math>
</inline-formula>
matrix. As a surrogate, we choose datasets for which we know a factor of interest
<inline-formula>
<tex-math id="M205">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
, and measure how well this factor is recovered by clustering on each corrected gene expression matrix. The correction methods are not allowed to use the known groups, to emulate a problem where the factor of interest is not observed. This surrogate is admittedly imperfect, as other factors of interest may be present in these datasets. Whether or not removing the effect of these (possibly biological) other factors is desirable depends on whether they are considered “wanted” or “unwanted” in any particular analysis. We consider these surrogates to be complementary to the synthetic datasets of Section
<xref ref-type="sec" rid="s4">4</xref>
—which are not real data but where we can define what the right correction is.</p>
<p>This section presents results on the microarray gene expression datasets studied in
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
. We use the gender partition as the factor of interest to be recovered, implicitly treating other factors as unwanted variation, but discuss other options in Section
<inline-formula>
<tex-math id="M206">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$4.3$\end{document}</tex-math>
</inline-formula>
of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
. Two additional datasets are studied in Sections 5 and 6 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
, and Section 7 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
shows the importance of having good negative control genes on correction quality.</p>
<sec id="s5a">
<label>5.1</label>
<title>Protocol</title>
<p>For each correction method that we evaluate, we apply the method to the expression matrix
<inline-formula>
<tex-math id="M207">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y$\end{document}</tex-math>
</inline-formula>
and then estimate the clustering using a
<inline-formula>
<tex-math id="M208">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
-means algorithm. To quantify how close each clustering gets to the objective partition, we adopt the following squared distance between two given partitions
<inline-formula>
<tex-math id="M209">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {C}=(c_1,\ldots ,c_k)$\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M210">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\mathcal {C}'=(c'_1,\ldots ,c'_k)$\end{document}</tex-math>
</inline-formula>
of the samples into
<inline-formula>
<tex-math id="M211">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
clusters:
<inline-formula>
<tex-math id="M212">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$d(\mathcal {C},\mathcal {C}') = k - \sum _{i,j=1}^k ({|c_i \cap c'_j|^2}/{|c_i||c'_j|})$\end{document}</tex-math>
</inline-formula>
, where
<inline-formula>
<tex-math id="M213">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$|S|$\end{document}</tex-math>
</inline-formula>
denotes the cardinality of a set
<inline-formula>
<tex-math id="M214">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$S$\end{document}</tex-math>
</inline-formula>
. This score ranges between 0 when the two partitions are equivalent, and
<inline-formula>
<tex-math id="M215">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k-1$\end{document}</tex-math>
</inline-formula>
when the two partitions are completely different. To give a visual impression of the effect of the corrections on the data, we also plot the data in the space spanned by the first two principal components.</p>
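<p>The partition distance defined above can be computed directly from two label vectors. The following is a minimal sketch in plain Python; the function name and the representation of a partition as one cluster label per sample are illustrative choices, not part of the original protocol.</p>

```python
from collections import Counter


def partition_distance(labels_a, labels_b):
    """d(C, C') = k - sum over i, j of |c_i and c'_j|^2 / (|c_i| |c'_j|).

    Each argument gives one cluster label per sample, in the same
    sample order for both partitions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    size_a = Counter(labels_a)  # |c_i| for each cluster of C
    size_b = Counter(labels_b)  # |c'_j| for each cluster of C'
    if len(size_a) != len(size_b):
        raise ValueError("the formula assumes the same number of clusters k")
    # Size of the intersection of c_i and c'_j for every non-empty pair.
    overlap = Counter(zip(labels_a, labels_b))
    k = len(size_a)
    agreement = sum(
        n_ij ** 2 / (size_a[i] * size_b[j]) for (i, j), n_ij in overlap.items()
    )
    return k - agreement


same_up_to_names = partition_distance([0, 0, 1, 1], [1, 1, 0, 0])
crossed = partition_distance([0, 0, 1, 1], [0, 1, 0, 1])
```

<p>In this toy example with k = 2, the first call returns 0 because the two labelings differ only in cluster names, and the second returns 1 = k - 1 because each cluster of one partition is split evenly across the clusters of the other.</p>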
<p>We evaluate two basic correction methods: the replicate-based procedure described in Section
<xref ref-type="sec" rid="s3">3</xref>
and the random
<inline-formula>
<tex-math id="M216">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
model of Section
<xref ref-type="sec" rid="s2a">2.1</xref>
with
<inline-formula>
<tex-math id="M217">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W} = \hat {W}_{2}$\end{document}</tex-math>
</inline-formula>
. For each of these two methods, we also evaluate iterative versions as described in Section
<xref ref-type="sec" rid="s2b">2.2</xref>
.
<inline-formula>
<tex-math id="M218">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta |W\alpha $\end{document}</tex-math>
</inline-formula>
is estimated using the sparse dictionary estimator of
<xref rid="KXV026C17" ref-type="bibr">Mairal
<italic>and others</italic>
(2010)</xref>
, which minimizes
<inline-formula>
<tex-math id="M219">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$1/2 \|(Y-\widehat {W\alpha }) - X\beta \|_F^2 + \lambda \|\beta \|_1$\end{document}</tex-math>
</inline-formula>
under the constraint that
<inline-formula>
<tex-math id="M220">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
has rank
<inline-formula>
<tex-math id="M221">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$p$\end{document}</tex-math>
</inline-formula>
and the columns of
<inline-formula>
<tex-math id="M222">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X$\end{document}</tex-math>
</inline-formula>
have norm 1.
<inline-formula>
<tex-math id="M223">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W$\end{document}</tex-math>
</inline-formula>
is re-estimated using the
<inline-formula>
<tex-math id="M224">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$Y-\widehat {X\beta }$\end{document}</tex-math>
</inline-formula>
residuals every 10 iterations as discussed in Section 2 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
. In addition, we consider as baselines (i) an absence of correction, (ii) a centering of the data by level of the known unwanted factors, which is similar to the correction provided by ComBat, and (iii) the naive RUV-2 procedure.</p>
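<p>Baseline (ii) can be sketched as follows, assuming that the expression matrix is stored with one row per sample and one column per gene and that a single known batch label is available for each sample. The function name is hypothetical, and this is only the mean-centering step: ComBat itself additionally adjusts scale and applies empirical Bayes shrinkage.</p>

```python
def center_by_batch(Y, batches):
    """Subtract, gene by gene, the mean of each batch from its samples.

    Y is a list of rows (one row per sample, one column per gene) and
    batches gives the known batch label of each sample.
    """
    if len(Y) != len(batches):
        raise ValueError("need one batch label per sample")
    n_genes = len(Y[0])
    # Accumulate per-batch column sums and sample counts.
    sums, counts = {}, {}
    for row, b in zip(Y, batches):
        acc = sums.setdefault(b, [0.0] * n_genes)
        for g, value in enumerate(row):
            acc[g] += value
        counts[b] = counts.get(b, 0) + 1
    # Remove the batch mean from every sample of that batch.
    return [
        [value - sums[b][g] / counts[b] for g, value in enumerate(row)]
        for row, b in zip(Y, batches)
    ]


Y_toy = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [14.0, 28.0]]
centered = center_by_batch(Y_toy, ["a", "a", "b", "b"])
```

<p>After this step every gene has mean zero within each batch, so any signal of interest that is aligned with the batch labels is removed as well, which is the failure mode discussed for the confounded setting.</p>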
<p>Some of the methods we compare require the user to choose hyperparameters: the ranks
<inline-formula>
<tex-math id="M225">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
of
<inline-formula>
<tex-math id="M226">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
and
<inline-formula>
<tex-math id="M227">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$p$\end{document}</tex-math>
</inline-formula>
of
<inline-formula>
<tex-math id="M228">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$X\beta $\end{document}</tex-math>
</inline-formula>
, the ridge parameter
<inline-formula>
<tex-math id="M229">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
and the strength
<inline-formula>
<tex-math id="M230">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\lambda $\end{document}</tex-math>
</inline-formula>
of the
<inline-formula>
<tex-math id="M231">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\ell _1$\end{document}</tex-math>
</inline-formula>
penalty on
<inline-formula>
<tex-math id="M232">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\beta $\end{document}</tex-math>
</inline-formula>
. On synthetic data, it makes sense to define which hyperparameters yield the best correction. On real data, by contrast, different choices of these hyperparameters may discard or keep different signals, which can be good or bad depending on the downstream analysis chosen afterward. This point is illustrated in Section
<inline-formula>
<tex-math id="M233">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$4.3$\end{document}</tex-math>
</inline-formula>
of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
: large values of
<inline-formula>
<tex-math id="M234">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
lead to a clustering by brain region while smaller values lead to a clustering by gender.</p>
<p>No single rule can therefore be given for hyperparameter choice, and judgment is necessary whenever adjustments are performed without a specified factor of interest. We suggest using relative log expression (RLE) plots, clustering with respect to a known factor of interest, or known differentially expressed genes with respect to a known factor of interest as positive controls. In this experiment, we compare normalization methods based on how well they allow clustering to recover the gender partition, so it would not make sense to use the same criterion to choose
<inline-formula>
<tex-math id="M235">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
. In Section 4 of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
, we use RLE plots to pick
<inline-formula>
<tex-math id="M236">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
for this dataset, and also discuss the other criteria (see
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Table 1 of the Supplementary Material</ext-link>
).</p>
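<p>As an illustration of the RLE diagnostic mentioned above, the sketch below computes relative log expression values under the usual definition (each gene's median across samples is subtracted from its log expression) and summarizes each sample by the median and interquartile range that a boxplot would display. It assumes one row per sample and one column per gene; the function names are hypothetical and this is not the procedure of the Supplementary Material.</p>

```python
import statistics


def rle_matrix(log_expr):
    """Relative log expression: subtract each gene's median across samples."""
    n_genes = len(log_expr[0])
    gene_medians = [
        statistics.median(row[g] for row in log_expr) for g in range(n_genes)
    ]
    return [[row[g] - gene_medians[g] for g in range(n_genes)] for row in log_expr]


def rle_summary(log_expr):
    """Per-sample median and interquartile range of the RLE values."""
    summary = []
    for rle_row in rle_matrix(log_expr):
        q1, _, q3 = statistics.quantiles(rle_row, n=4)
        summary.append((statistics.median(rle_row), q3 - q1))
    return summary


# Three samples, three genes; the third sample is shifted upward by 1.
toy = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
toy_rle = rle_matrix(toy)
toy_summary = rle_summary(toy)
```

<p>A common reading of such summaries is that, after a good correction, the per-sample RLE medians should be close to zero and the spreads similar across samples; a sample whose median is clearly shifted, like the third one in the toy data, suggests remaining unwanted variation.</p>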
<p>Since
<inline-formula>
<tex-math id="M237">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
acts on the eigenvalues of
<inline-formula>
<tex-math id="M238">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W^\top W$\end{document}</tex-math>
</inline-formula>
, we recommend considering a grid of powers of 10 of the largest eigenvalue; the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
discusses how to choose the power. The rank
<inline-formula>
<tex-math id="M239">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
of
<inline-formula>
<tex-math id="M240">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$W\alpha $\end{document}</tex-math>
</inline-formula>
was chosen to be close to
<inline-formula>
<tex-math id="M241">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$m/4$\end{document}</tex-math>
</inline-formula>
, or to the number of replicate samples when the latter was smaller than the former. For methods using
<inline-formula>
<tex-math id="M242">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$p$\end{document}</tex-math>
</inline-formula>
, we chose
<inline-formula>
<tex-math id="M243">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$p=k$\end{document}</tex-math>
</inline-formula>
. For random
<inline-formula>
<tex-math id="M244">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
models, we use
<inline-formula>
<tex-math id="M245">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k=m$\end{document}</tex-math>
</inline-formula>
: the model is regularized by the ridge
<inline-formula>
<tex-math id="M246">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
and we do not combine it with a regularization of the rank. Finally, in order to make iterative and non-iterative methods comparable, we choose
<inline-formula>
<tex-math id="M247">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\lambda $\end{document}</tex-math>
</inline-formula>
for each method such that
<inline-formula>
<tex-math id="M248">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\|W\alpha \|_F$\end{document}</tex-math>
</inline-formula>
is close to the one obtained with its non-iterative counterpart.</p>
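<p>As a rough illustration of the hyper-parameter choices described above, the following Python sketch builds a candidate grid for the ridge ν from the largest eigenvalue of W^T W and picks a rank k. The function names, the exponent range, and the reading of the grid as the largest eigenvalue multiplied by powers of 10 are assumptions made for this example; this is not the RUVnormalize code.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def nu_grid(W, exponents=range(-3, 2)):
    """Candidate ridge values: largest eigenvalue of W^T W times powers of 10.

    The exponent range is an arbitrary choice for illustration.
    """
    # Eigenvalues of W^T W are the squared singular values of W.
    largest = np.linalg.svd(W, compute_uv=False)[0] ** 2
    return [largest * 10.0 ** e for e in exponents]

def choose_rank(m, n_replicates=None):
    """Rank k close to m/4, capped by the number of replicate samples if given."""
    k = max(1, round(m / 4))
    if n_replicates is not None:
        k = min(k, n_replicates)
    return k

rng = np.random.default_rng(0)
W = rng.normal(size=(84, 5))   # toy unwanted-factor matrix, m = 84 samples
grid = nu_grid(W)              # five candidate values for nu
k = choose_rank(m=84)          # rank used when no replicates constrain it
```
]]></preformat>
<p>Each value of the grid would then be compared using a quality measure chosen by the analyst.</p>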
</sec>
<sec id="s5b">
<label>5.2</label>
<title>Results</title>
<p>
<xref rid="KXV026C19" ref-type="bibr">Vawter
<italic>and others</italic>
(2004)</xref>
studied differences in gene expression between male and female patients.
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
used the resulting dataset to study the performance of RUV-2.</p>
<p>This gender study is an interesting benchmark for methods aiming at removing unwanted variation, as it is expected to be affected by several technical and biological factors: two microarray platforms, three different labs, and three tissue localizations in the brain. Most of the 10 patients involved in the study had samples taken from the anterior cingulate cortex (a), the dorsolateral prefrontal cortex (d), and the cerebellar hemisphere (c). Most of these samples were sent to three independent labs: UC Irvine (I), UC Davis (D) and University of Michigan, Ann Arbor (M).</p>
<p>Gene expression was measured using either HGU-95A or HGU-95Av2 Affymetrix arrays with 12 600 genes shared between the two platforms. Six of the
<inline-formula>
<tex-math id="M249">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$10\times 3\times 3$\end{document}</tex-math>
</inline-formula>
combinations were missing, leading to 84 samples. As control genes, we use the same 799 housekeeping probesets that were used in
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
. The proportion of genes on the sex chromosomes is similar in the housekeeping genes (
<inline-formula>
<tex-math id="M250">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$3\%$\end{document}</tex-math>
</inline-formula>
) and other genes (
<inline-formula>
<tex-math id="M251">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$4\%$\end{document}</tex-math>
</inline-formula>
).</p>
<p>For the replicate-based method of Section
<xref ref-type="sec" rid="s3">3</xref>
, we use all possible pairs that either (i) differ in lab but are otherwise identical in terms of chip type, patient, and brain region, or (ii) differ in brain region but are otherwise identical in terms of chip type, patient, and lab, leading to 106 differences. Finally, as a pre-processing step, we also mean-center the samples per array type.</p>
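<p>The pair selection rule above can be sketched in Python as follows. The <monospace>Sample</monospace> record, its field names, and the toy metadata are hypothetical and only serve to illustrate the rule; they are not the annotations of the actual dataset.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations
from collections import namedtuple

# Hypothetical sample annotation used only for this illustration.
Sample = namedtuple("Sample", "patient lab region chip")

def replicate_pairs(samples):
    """Index pairs that differ in lab only or in brain region only."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(samples), 2):
        same_base = a.patient == b.patient and a.chip == b.chip
        lab_only = a.lab != b.lab and a.region == b.region
        region_only = a.region != b.region and a.lab == b.lab
        if same_base and (lab_only or region_only):
            pairs.append((i, j))
    return pairs

# Toy metadata: patient 1 measured in two labs and two regions on one chip type.
samples = [
    Sample(1, "I", "a", "95A"),
    Sample(1, "D", "a", "95A"),
    Sample(1, "I", "c", "95A"),
    Sample(1, "D", "c", "95A"),
    Sample(2, "I", "a", "95Av2"),
]
pairs = replicate_pairs(samples)
```
]]></preformat>
<p>Pairs that differ in both lab and brain region, or that come from different patients, are not retained.</p>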
<p>Since most genes function irrespective of gender, clustering by gender generally gives better results when low-variance genes are removed beforehand. For each method, we therefore apply clustering after filtering different numbers of genes based on their variance after correction.</p>
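<p>A minimal Python sketch of this filter-then-cluster evaluation is given below, on simulated data. The small 2-means routine is a stand-in for whichever clustering algorithm is used, and <monospace>mismatch_rate</monospace> is a simple illustrative error measure, not necessarily the clustering error reported in the figures; all names are assumptions of this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def top_variance_genes(Y, n_keep):
    """Column indices of the n_keep genes with highest variance (Y is samples x genes)."""
    variances = Y.var(axis=0)
    return np.argsort(variances)[::-1][:n_keep]

def two_means(Y, n_iter=50):
    """Very small 2-means, initialised with the sign of the first principal component."""
    Yc = Y - Y.mean(axis=0)
    u = np.linalg.svd(Yc, full_matrices=False)[0][:, 0]
    labels = (u > 0).astype(int)
    for _ in range(n_iter):
        if labels.min() == labels.max():   # one cluster is empty: stop
            break
        centers = np.stack([Y[labels == c].mean(axis=0) for c in (0, 1)])
        dist = ((Y[:, None, :] - centers[None]) ** 2).sum(axis=2)
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

def mismatch_rate(labels, truth):
    """Fraction of misassigned samples, minimised over the two possible label swaps."""
    err = np.mean(labels != truth)
    return min(err, 1.0 - err)

rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 10)
Y = rng.normal(size=(20, 200))
Y[truth == 1, :10] += 5.0          # only 10 genes carry the group signal
keep = top_variance_genes(Y, 10)
error = mismatch_rate(two_means(Y[:, keep]), truth)
```
]]></preformat>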
<p>Figure
<xref ref-type="fig" rid="KXV026F1">1</xref>
shows the clustering error for the methods against the number of genes retained. The uncorrected and mean-centering cases are not displayed to avoid cluttering the plot, but give values above
<inline-formula>
<tex-math id="M252">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$0.95$\end{document}</tex-math>
</inline-formula>
for all numbers of genes retained. Figure
<xref ref-type="fig" rid="KXV026F2">2</xref>
shows the samples in the space of the first two principal components in these two cases, keeping the 1260 genes with highest variance. On the uncorrected data (left panel), the samples clearly cluster first by lab, which is the main source of variance, then by brain region, which is the second main source of variance. This explains why the clustering on uncorrected data is far from a clustering by gender. Mean-centering samples by region-lab (right panel) removes all clustering by brain region or lab, but does not make the samples cluster by gender.</p>
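<p>The two views discussed above (principal components before and after centering by a known grouping) can be reproduced on toy data with the Python sketch below. The simulated lab effect and the function names are assumptions of this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def center_by_group(Y, groups):
    """Subtract, for every gene, the mean of the samples sharing the same group label."""
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups)
    out = Y.copy()
    for g in np.unique(groups):
        mask = groups == g
        out[mask] -= Y[mask].mean(axis=0)
    return out

def first_two_pcs(Y):
    """Sample coordinates on the first two principal components."""
    Yc = Y - Y.mean(axis=0)
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :2] * s[:2]

rng = np.random.default_rng(1)
labs = np.repeat(["I", "D", "M"], 8)      # toy lab labels, 8 samples each
Y = rng.normal(size=(24, 100))
Y[labs == "I"] += 3.0                     # simulated lab effect
Y[labs == "M"] -= 3.0
pcs_raw = first_two_pcs(Y)
Y_centered = center_by_group(Y, labs)
pcs_centered = first_two_pcs(Y_centered)
```
]]></preformat>
<p>Centering removes the simulated lab effect by construction, but it cannot by itself reveal a factor that is not part of the grouping.</p>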
<fig id="KXV026F1" orientation="portrait" position="float">
<label>Fig. 1.</label>
<caption>
<p>Clustering error against number of genes selected (based on variance) before clustering. From top to bottom at 1260 genes: replicate-based correction (full purple); naive RUV-2 (full gray); iterated replicate-based correction (dashed purple); random
<inline-formula>
<tex-math id="M253">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
model using
<inline-formula>
<tex-math id="M254">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_2$\end{document}</tex-math>
</inline-formula>
(full green); iterated random
<inline-formula>
<tex-math id="M255">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
model using
<inline-formula>
<tex-math id="M256">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}_2$\end{document}</tex-math>
</inline-formula>
(dashed green).</p>
</caption>
<graphic xlink:href="kxv02601"></graphic>
</fig>
<fig id="KXV026F2" orientation="portrait" position="float">
<label>Fig. 2.</label>
<caption>
<p>Samples of the gender study represented in the space of their first two principal components before correction (left panel) and after centering by lab plus brain region (right panel). Light blue samples are males, dark pink samples are females. The labels indicate the laboratory and brain region of each sample. The capital letter is the laboratory and the lowercase one is the brain region.</p>
</caption>
<graphic xlink:href="kxv02602"></graphic>
</fig>
<p>The gray line of Figure
<xref ref-type="fig" rid="KXV026F1">1</xref>
shows the performance of naive RUV-2 for
<inline-formula>
<tex-math id="M257">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k=20$\end{document}</tex-math>
</inline-formula>
. Since naive RUV-2 is a radical correction which removes all variance along some directions, it is expected to be more sensitive to the choice of
<inline-formula>
<tex-math id="M258">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
. The estimation is damaged by using
<inline-formula>
<tex-math id="M259">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k=40$\end{document}</tex-math>
</inline-formula>
(clustering error
<inline-formula>
<tex-math id="M260">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$0.99$\end{document}</tex-math>
</inline-formula>
). Using
<inline-formula>
<tex-math id="M261">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k=5$\end{document}</tex-math>
</inline-formula>
also degrades performance, except when very few genes are kept.</p>
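<p>To make the role of k concrete, here is a minimal Python sketch of a naive RUV-2 style correction, assuming that the unwanted factors are estimated from the top k singular vectors of the control genes and then regressed out of every gene. Function and variable names are chosen for this example and the data are simulated.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def naive_ruv2(Y, control_idx, k):
    """Remove the top-k directions of variation found in the control genes.

    Y is samples x genes; control_idx indexes the negative control genes.
    """
    Yc = Y[:, control_idx]
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    W = U[:, :k] * s[:k]                            # estimated unwanted factors (m x k)
    alpha, *_ = np.linalg.lstsq(W, Y, rcond=None)   # effect of each factor on each gene
    return Y - W @ alpha, W

rng = np.random.default_rng(3)
m, n, n_ctrl = 30, 200, 40
control_idx = np.arange(n_ctrl)
W_true = rng.normal(size=(m, 2))
alpha_true = rng.normal(size=(2, n))
Y = W_true @ alpha_true + 0.1 * rng.normal(size=(m, n))
Y_corr, W_hat = naive_ruv2(Y, control_idx, k=2)
```
]]></preformat>
<p>Because the regression is unpenalized, the corrected data retain no variance along the k estimated directions, which is why a value of k that is too large can also remove signal of interest.</p>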
<p>The purple lines of Figure
<xref ref-type="fig" rid="KXV026F1">1</xref>
represent the replicate-based corrections. The solid line shows the performance of the non-iterative method described in Section
<xref ref-type="sec" rid="s3">3</xref>
. When very few genes are selected, it leads to a perfect clustering by gender, which no other method achieves regardless of the number of genes they retain. When considering more genes, however, its performance becomes similar to that of naive RUV-2, suggesting that additional genes are influenced by non-gender variation which the replicate-based method does not remove. It is expected that a few genes are more strongly affected by gender than the others, so it makes sense for a correction method to recover a better clustering by gender after restriction to a small number of high-variance genes. In addition, Table
<xref ref-type="table" rid="KXV026TB1">1</xref>
of the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
shows that even though the replicate-based method has a large clustering error, it actually performs as well as or better than other methods in terms of number of differentially expressed genes on the sex chromosomes.</p>
<p>The iterative version (dashed line) leads to much better clustering, except again when very few genes are selected. Figure
<xref ref-type="fig" rid="KXV026F3">3</xref>
shows the samples in the space of the first two principal components after applying the non-iterative (left panel) and iterative (right panel) replicate-based method. The correction shrinks the replicates together, leading to a new variance structure that is more driven by gender, although it does not perfectly separate males and females.</p>
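<p>The idea that differences of replicates expose unwanted variation can be sketched in Python as follows. This is one possible reading of a replicate-based correction, assuming that the gene-level effects are taken from the top right singular vectors of the replicate differences and that the sample-level factors are then fitted on the control genes; the details of the method in Section 3 may differ, and all names and data here are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def replicate_correction(Y, pairs, control_idx, k):
    """Estimate unwanted variation from replicate differences and remove it."""
    # Differences of replicates cancel the signal of interest.
    D = np.stack([Y[i] - Y[j] for i, j in pairs])
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    alpha = Vt[:k]                                  # gene-level effects (k x n)
    alpha_c = alpha[:, control_idx]
    # Least squares fit of W on control genes: W alpha_c ~ Y_c.
    W = np.linalg.lstsq(alpha_c.T, Y[:, control_idx].T, rcond=None)[0].T
    return Y - W @ alpha

rng = np.random.default_rng(2)
n_units, n_genes = 6, 50
control_idx = np.arange(20)
unit = np.repeat(np.arange(n_units), 2)     # two replicates per biological unit
batch = np.tile([0.0, 1.0], n_units)        # replicates processed in different batches
group = (unit >= 3).astype(float)           # factor of interest, unused by the correction
beta = rng.normal(size=n_genes)
beta[control_idx] = 0.0                     # control genes carry no signal of interest
alpha_true = rng.normal(size=n_genes)
Y = (np.outer(group, beta) + np.outer(batch, alpha_true)
     + 0.1 * rng.normal(size=(2 * n_units, n_genes)))
pairs = [(2 * u, 2 * u + 1) for u in range(n_units)]
Y_corr = replicate_correction(Y, pairs, control_idx, k=1)
```
]]></preformat>
<p>On this simulated example, replicates end up closer to each other after correction, which mirrors the shrinkage of replicates described above.</p>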
<fig id="KXV026F3" orientation="portrait" position="float">
<label>Fig. 3.</label>
<caption>
<p>Using replicates. Left: no iteration and right: with iterations.</p>
</caption>
<graphic xlink:href="kxv02603"></graphic>
</fig>
<p>The green lines of Figure
<xref ref-type="fig" rid="KXV026F1">1</xref>
correspond to the random
<inline-formula>
<tex-math id="M262">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
-based corrections. The solid line shows the results for the non-iterative method. These results are good, as illustrated by the reasonably good separation obtained in the space spanned by the first two principal components after correction on the left panel of Figure
<xref ref-type="fig" rid="KXV026F4">4</xref>
. The dashed green line corresponds to the random
<inline-formula>
<tex-math id="M263">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
-based corrections with iterations plus sparsity, which leads to an even lower clustering error.</p>
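<p>The non-iterative random α correction can be read as a ridge regression of the expression data on the estimated unwanted factors. The Python sketch below illustrates this reading on simulated data; the function name, the simulated W, and the values of ν are assumptions of this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def random_alpha_correction(Y, W, nu):
    """Subtract the ridge estimate W (W^T W + nu I)^-1 W^T Y from Y."""
    k = W.shape[1]
    alpha = np.linalg.solve(W.T @ W + nu * np.eye(k), W.T @ Y)
    return Y - W @ alpha

rng = np.random.default_rng(4)
W = rng.normal(size=(30, 3))                 # stand-in for factors estimated from controls
Y = W @ rng.normal(size=(3, 100)) + rng.normal(size=(30, 100))
# Frobenius norm of the removed term for increasing ridge values.
removed = [np.linalg.norm(Y - random_alpha_correction(Y, W, nu))
           for nu in (0.0, 10.0, 1e6)]
```
]]></preformat>
<p>With ν = 0 this reduces to an unpenalized regression on W, while larger values of ν shrink the removed term towards zero instead of removing all variance along the estimated directions.</p>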
<fig id="KXV026F4" orientation="portrait" position="float">
<label>Fig. 4.</label>
<caption>
<p>Random alpha with control genes only. Left: no iteration and right: with iterations.</p>
</caption>
<graphic xlink:href="kxv02604"></graphic>
</fig>
</sec>
</sec>
<sec id="s6">
<label>6.</label>
<title>Discussion</title>
<p>We introduced methods to estimate and remove unobserved unwanted variation from gene expression data when the factor of interest is also unobserved. One method uses the negative control gene-based estimator of unwanted factors introduced in
<xref rid="KXV026C8" ref-type="bibr">Gagnon-Bartsch and Speed (2012)</xref>
, and estimates the effect of these factors on gene expression using a random effect model. The second method relies on replicate samples and estimates the unwanted variation using the variation observed in differences of replicates. Both estimators can be improved by joint modeling of the variation of interest and the unwanted variation. All the methods we introduce are available in the Bioconductor package RUVnormalize (
<xref rid="KXV026C10" ref-type="bibr">Jacob, 2014</xref>
).</p>
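<p>The joint modeling mentioned above can be illustrated by an alternating scheme, sketched here in Python. A truncated SVD of rank p is used as a simple stand-in for the estimator of the signal of interest, which is an assumption of this example and not the estimator used in the paper; names and data are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def low_rank(M, p):
    """Best rank-p approximation of M (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :p] * s[:p]) @ Vt[:p]

def iterative_correction(Y, W, nu, p, n_iter=10):
    """Alternate between the unwanted term W alpha and a rank-p signal term."""
    k = W.shape[1]
    ridge = np.linalg.solve(W.T @ W + nu * np.eye(k), W.T)   # k x m ridge operator
    Xb = np.zeros_like(Y)
    for _ in range(n_iter):
        alpha = ridge @ (Y - Xb)          # unwanted effects given the current signal
        Xb = low_rank(Y - W @ alpha, p)   # signal of interest given the unwanted term
    return Y - W @ alpha, Xb

rng = np.random.default_rng(5)
W = rng.normal(size=(30, 3))
Y = W @ rng.normal(size=(3, 100)) + rng.normal(size=(30, 100))
Y_corr, Xb = iterative_correction(Y, W, nu=1.0, p=2)
```
]]></preformat>
<p>With a single iteration the scheme coincides with the non-iterative ridge correction; further iterations prevent the current estimate of the signal of interest from being attributed to the unwanted factors.</p>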
<p>We systematically compared the proposed correction techniques with state-of-the-art methods on both synthetic and real gene expression data. On synthetic data, we knew what the correct signal was, and could measure how well each correction method recovered this signal. When good control genes were available, the random effect estimator performed much better than existing correction methods in the presence of confounding. The replicate-based method performed less well than the control gene-based one, unless a very large number of replicates was available, but was unaffected by poor-quality control genes and by large confounding levels. We were able to verify that both proposed methods provide a better correction even in the case where the factor of interest and the unwanted factors are totally confounded.</p>
<p>On real gene expression data, where it did not make sense to define a single correct signal to be recovered, we assessed how well clustering recovered a known factor of interest that was not specified at correction time. Here again, the proposed methods led to better reconstruction than existing corrections.</p>
<p>Assessing how well each unsupervised correction method works on a new real dataset is problematic, since the factor of interest is not observed. Clustering with respect to a known biological factor, as we do with gender, is one option to perform this assessment. Other options include using positive control genes and RLE plots, as we do in the
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary Material</ext-link>
. None of these options is perfect but they can be used as guidelines, to monitor whether too much variance is being removed by any correction method. In particular, they can and should be used to choose regularization parameters such as the rank
<inline-formula>
<tex-math id="M264">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$k$\end{document}</tex-math>
</inline-formula>
of
<inline-formula>
<tex-math id="M265">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\hat {W}$\end{document}</tex-math>
</inline-formula>
and the ridge
<inline-formula>
<tex-math id="M266">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\nu $\end{document}</tex-math>
</inline-formula>
of random
<inline-formula>
<tex-math id="M267">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}</tex-math>
</inline-formula>
approaches. In any case, one should keep in mind that optimizing for one known factor may not optimize for another: in our gender data example, the parameters which were chosen by RLE and behaved well for gender recovery are not optimal for recovering a partition by brain region.</p>
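<p>As an illustration of the RLE diagnostic mentioned above, the Python sketch below computes relative log expression as the deviation of each gene from its median across samples, and summarizes it per sample by a median and an interquartile range. The function name and the simulated data are assumptions of this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

def rle_summary(Y):
    """Per-sample median and interquartile range of relative log expression.

    Y is samples x genes, on the log scale.
    """
    rle = Y - np.median(Y, axis=0)            # deviation from each gene's median
    med = np.median(rle, axis=1)
    q75, q25 = np.percentile(rle, [75, 25], axis=1)
    return med, q75 - q25

rng = np.random.default_rng(6)
Y = 8.0 + 0.3 * rng.normal(size=(10, 500))
Y[0] += 2.0                                   # one array with a global shift
med, iqr = rle_summary(Y)
```
]]></preformat>
<p>After a successful correction, one would expect the per-sample medians to be close to zero and the interquartile ranges to be small and similar across samples; in this toy example the shifted array stands out through its median.</p>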
<p>To conclude, our results suggest that it is possible to remove unwanted variation from gene expression without losing the signal of interest, provided enough controls are available: negative control genes which are affected by the unwanted factors only, or replicate samples. Together with other researchers in our groups, we have also started applying some of the methods that we introduce here to RNA-Seq (
<xref rid="KXV026C18" ref-type="bibr">Risso
<italic>and others</italic>
, 2014</xref>
), metabolomics (
<xref rid="KXV026C5" ref-type="bibr">De Livera
<italic>and others</italic>
, 2015</xref>
) and expression array data (
<xref rid="KXV026C11" ref-type="bibr">Jacob
<italic>and others</italic>
, 2015</xref>
) and obtained consistently good results. We hope these extensive evaluations and comparisons will be helpful to future researchers trying to remove unwanted variation from their data.</p>
</sec>
<sec id="s7">
<title>Supplementary material</title>
<p>
<ext-link ext-link-type="uri" xlink:href="http://biostatistics.oxfordjournals.org/lookup/suppl/doi:10.1093/biostatistics/kxv026/-/DC1">Supplementary material is available at http://biostatistics.oxfordjournals.org.</ext-link>
</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work was funded by the
<award-id>SU2C-AACR-DT0409</award-id>
grant. Funding to pay the Open Access publication charges for this article was provided by
<funding-source>Australian National Health and Medical Research Council Program</funding-source>
Grant
<award-id>APP1054618</award-id>
.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>Supplementary Data</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_17_1_16__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_kxv026_kxv026supp.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>The authors thank Julien Mairal, Anne Biton, Leming Shi, Jennifer Fostel, Minjun Chen, and Moshe Olshansky for helpful discussions.
<italic>Conflict of Interest</italic>
: None declared.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="KXV026C1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alter</surname>
<given-names>O.</given-names>
</name>
,
<name>
<surname>Brown</surname>
<given-names>P. O.</given-names>
</name>
,
<name>
<surname>Botstein</surname>
<given-names>D.</given-names>
</name>
</person-group>
(
<year>2000</year>
).
<article-title>Singular value decomposition for genome-wide expression data processing and modeling</article-title>
.
<source>PNAS</source>
<volume>97</volume>
(
<issue>18</issue>
),
<fpage>10101</fpage>
<lpage>10106</lpage>
.
<pub-id pub-id-type="pmid">10963673</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benito</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Parker</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Du</surname>
<given-names>Q.</given-names>
</name>
,
<name>
<surname>Wu</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Xiang</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Perou</surname>
<given-names>C. M.</given-names>
</name>
,
<name>
<surname>Marron</surname>
<given-names>J. S.</given-names>
</name>
</person-group>
(
<year>2004</year>
).
<article-title>Adjustment of systematic microarray data biases</article-title>
.
<source>Bioinformatics</source>
<volume>20</volume>
(
<issue>1</issue>
),
<fpage>105</fpage>
<lpage>114</lpage>
.
<pub-id pub-id-type="pmid">14693816</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bolstad</surname>
<given-names>B. M.</given-names>
</name>
,
<name>
<surname>Irizarry</surname>
<given-names>R. A.</given-names>
</name>
,
<name>
<surname>Åstrand</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Speed</surname>
<given-names>T. P.</given-names>
</name>
</person-group>
(
<year>2003</year>
).
<article-title>A comparison of normalization methods for high density oligonucleotide array data based on variance and bias</article-title>
.
<source>Bioinformatics</source>
<volume>19</volume>
,
<fpage>185</fpage>
<lpage>193</lpage>
.
<pub-id pub-id-type="pmid">12538238</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">Cancer Genome Atlas Research Network</person-group>
(
<year>2008</year>
).
<article-title>Comprehensive genomic characterization defines human glioblastoma genes and core pathways</article-title>
.
<source>Nature</source>
<volume>455</volume>
(
<issue>7216</issue>
),
<fpage>1061</fpage>
<lpage>1068</lpage>
.
<pub-id pub-id-type="pmid">18772890</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>De Livera</surname>
<given-names>A. M.</given-names>
</name>
,
<name>
<surname>Sysi-Aho</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Jacob</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Gagnon-Bartsch</surname>
<given-names>J. A.</given-names>
</name>
,
<name>
<surname>Castillo</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Simpson</surname>
<given-names>J. A.</given-names>
</name>
,
<name>
<surname>Speed</surname>
<given-names>T. P.</given-names>
</name>
</person-group>
(
<year>2015</year>
).
<article-title>Statistical methods for handling unwanted variation in metabolomics data</article-title>
.
<source>Analytical Chemistry</source>
<volume>87</volume>
(
<issue>7</issue>
),
<fpage>3606</fpage>
<lpage>3615</lpage>
.
<pub-id pub-id-type="pmid">25692814</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C6">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Freedman</surname>
<given-names>D.</given-names>
</name>
</person-group>
(
<year>2005</year>
)
<source>Statistical Models: Theory And Practice</source>
.
<publisher-loc>Cambridge</publisher-loc>
:
<publisher-name>Cambridge University Press</publisher-name>
.</mixed-citation>
</ref>
<ref id="KXV026C7">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Gagnon-Bartsch</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Jacob</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Speed</surname>
<given-names>T. P.</given-names>
</name>
</person-group>
(
<year>2013</year>
).
<comment>Removing unwanted variation from high dimensional data with negative controls.
<italic>Technical Report</italic>
, UC Berkeley. Technical report 820. Monograph in preparation</comment>
.</mixed-citation>
</ref>
<ref id="KXV026C8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gagnon-Bartsch</surname>
<given-names>J. A.</given-names>
</name>
,
<name>
<surname>Speed</surname>
<given-names>T. P.</given-names>
</name>
</person-group>
(
<year>2012</year>
).
<article-title>Using control genes to correct for unwanted variation in microarray data</article-title>
.
<source>Biostatistics</source>
<volume>13</volume>
(
<issue>3</issue>
),
<fpage>539</fpage>
<lpage>552</lpage>
.
<pub-id pub-id-type="pmid">22101192</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hotelling</surname>
<given-names>H.</given-names>
</name>
</person-group>
(
<year>1936</year>
).
<article-title>Relation between two sets of variates</article-title>
.
<source>Biometrika</source>
<volume>28</volume>
,
<fpage>322</fpage>
<lpage>377</lpage>
.</mixed-citation>
</ref>
<ref id="KXV026C10">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Jacob</surname>
<given-names>L.</given-names>
</name>
</person-group>
(
<year>2014</year>
).
<comment>
<italic>RUV for Normalization of Expression Array Data</italic>
. Bioconductor
<inline-formula>
<tex-math id="M268">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\geq $\end{document}</tex-math>
</inline-formula>
3.0</comment>
.</mixed-citation>
</ref>
<ref id="KXV026C11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jacob</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Van Den Akker</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Witteveen</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Goosens</surname>
<given-names>I.</given-names>
</name>
,
<name>
<surname>Speed</surname>
<given-names>T. P.</given-names>
</name>
,
<name>
<surname>Glas</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Veer</surname>
<given-names>L. V.</given-names>
</name>
</person-group>
(
<year>2015</year>
).
<article-title>A blueprint for managing microarray technical variations and data processing in the large randomized MINDACT trial</article-title>
(
<comment>in preparation</comment>
).</mixed-citation>
</ref>
<ref id="KXV026C12">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Johnson</surname>
<given-names>W. E.</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Rabinovic</surname>
<given-names>A.</given-names>
</name>
</person-group>
(
<year>2007</year>
).
<article-title>Adjusting batch effects in microarray expression data using empirical Bayes methods</article-title>
.
<source>Biostatistics</source>
<volume>8</volume>
(
<issue>1</issue>
),
<fpage>118</fpage>
<lpage>127</lpage>
.
<pub-id pub-id-type="pmid">16632515</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname>
<given-names>H. M.</given-names>
</name>
,
<name>
<surname>Ye</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Eskin</surname>
<given-names>E.</given-names>
</name>
</person-group>
(
<year>2008</year>
).
<article-title>Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots</article-title>
.
<source>Genetics</source>
<volume>180</volume>
(
<issue>4</issue>
),
<fpage>1909</fpage>
<lpage>1925</lpage>
.
<pub-id pub-id-type="pmid">18791227</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leek</surname>
<given-names>J. T.</given-names>
</name>
,
<name>
<surname>Storey</surname>
<given-names>J. D.</given-names>
</name>
</person-group>
(
<year>2007</year>
).
<article-title>Capturing heterogeneity in gene expression studies by surrogate variable analysis</article-title>
.
<source>PLoS Genetics</source>
<volume>3</volume>
(
<issue>9</issue>
),
<fpage>1724</fpage>
<lpage>1735</lpage>
.
<pub-id pub-id-type="pmid">17907809</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leek</surname>
<given-names>J. T.</given-names>
</name>
,
<name>
<surname>Storey</surname>
<given-names>J. D.</given-names>
</name>
</person-group>
(
<year>2008</year>
).
<article-title>A general framework for multiple testing dependence</article-title>
.
<source>PNAS</source>
<volume>105</volume>
(
<issue>48</issue>
),
<fpage>18718</fpage>
<lpage>18723</lpage>
.
<pub-id pub-id-type="pmid">19033188</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C16">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Listgarten</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Kadie</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Schadt</surname>
<given-names>E. E.</given-names>
</name>
,
<name>
<surname>Heckerman</surname>
<given-names>D.</given-names>
</name>
</person-group>
(
<year>2010</year>
).
<article-title>Correction for hidden confounders in the genetic analysis of gene expression</article-title>
.
<source>PNAS</source>
<volume>107</volume>
(
<issue>38</issue>
),
<fpage>16465</fpage>
<lpage>16470</lpage>
.
<pub-id pub-id-type="pmid">20810919</pub-id>
</mixed-citation>
</ref>
<ref id="KXV026C17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mairal</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Bach</surname>
<given-names>F.</given-names>
</name>
,
<name>
<surname>Ponce</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Sapiro</surname>
<given-names>G.</given-names>
</name>
</person-group>
(
<year>2010</year>
).
<article-title>Online learning for matrix factorization and sparse coding</article-title>
.
<source>Journal of Machine Learning Research</source>
<volume>11</volume>
,
<fpage>19</fpage>
<lpage>60</lpage>
.</mixed-citation>
</ref>
<ref id="KXV026C18">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Risso</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Ngai</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Speed</surname>
<given-names>T. P.</given-names>
</name>
,
<name>
<surname>Dudoit</surname>
<given-names>S.</given-names>
</name>
</person-group>
(
<year>2014</year>
).
<article-title>Normalization of RNA-seq data using factor analysis of control genes or samples</article-title>
.
<source>Nature Biotechnology</source>
<volume>32</volume>
(
<issue>9</issue>
),
<fpage>896</fpage>
<lpage>902</lpage>
.</mixed-citation>
</ref>
<ref id="KXV026C19">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vawter</surname>
<given-names>M. P. and others</given-names>
</name>
</person-group>
(
<year>2004</year>
).
<article-title>Gender-specific gene expression in post-mortem human brain: localization to sex chromosomes</article-title>
.
<source>Neuropsychopharmacology</source>
<volume>29</volume>
(
<issue>2</issue>
),
<fpage>373</fpage>
<lpage>384</lpage>
.
<pub-id pub-id-type="pmid">14583743</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Asie/explor/AustralieFrV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000251 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000251 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Asie
   |area=    AustralieFrV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4679071
   |texte=   Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26286812" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a AustralieFrV1 

This area was generated with Dilib version V0.6.33.
Data generation: Tue Dec 5 10:43:12 2017. Site generation: Tue Mar 5 14:07:20 2024