Exploration server on Pittsburgh

Warning: this site is under development!
Warning: this site is generated computationally from raw corpora.
The information is therefore not validated.

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

Internal identifier: 000440 (Pmc/Corpus); previous: 000439; next: 000441

Authors: Serena G. Liao; Yan Lin; Dongwan D. Kang; Divay Chandra; Jessica Bon; Naftali Kaminski; Frank C. Sciurba; George C. Tseng

Source:

RBID : PMC:4228077

Abstract

Background

In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Although several methods have been developed in the past few years, no complete guideline has yet been proposed for phenomic missing data imputation.

Results

In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of “imputability measure” (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package “phenomeImpute” is made publicly available.
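The four KNN variants differ only in the direction of the neighbor search: KNN-V borrows information from correlated variables, KNN-S from similar subjects, and the hybrids weight the two estimates. As a hedged illustration of the KNN-S direction on a purely numeric matrix (the published phenomeImpute package handles mixed data types, the imputability measure, and the weighted hybrids; none of that is reproduced here):

```python
import numpy as np

def knn_impute_by_subjects(X, k=3):
    """Toy KNN-S imputation on a numeric matrix X
    (rows = subjects, columns = variables); NaN marks missing values.
    Each missing cell is filled with the mean of that variable over
    the k nearest subjects, using only jointly observed variables
    to measure distance."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for s in range(X.shape[0]):
            if s == i or np.isnan(X[s, j]):
                continue  # neighbor must have the target variable observed
            both = ~np.isnan(X[i]) & ~np.isnan(X[s])
            if not both.any():
                continue  # no overlapping observed variables
            dist = np.sqrt(np.mean((X[i, both] - X[s, both]) ** 2))
            candidates.append((dist, X[s, j]))
        candidates.sort(key=lambda t: t[0])
        neighbors = [value for _, value in candidates[:k]]
        if neighbors:
            out[i, j] = np.mean(neighbors)
    return out
```

In this toy setting, imputation by variables (KNN-V) is the same procedure applied to the transposed matrix, and the hybrid variants combine the two estimates with fixed (KNN-H) or adaptive (KNN-A) weights.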

Conclusions

Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and missForest were among the top performers, although no method universally performed the best. Imputing missing values with low imputability measures greatly increased imputation errors and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating the methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the authors’ publication website.
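The "second layer of missingness simulation" behind the STS scheme can be paraphrased as: hide a random subset of the observed entries, run every candidate imputation method, and pick the method whose imputed values come closest to the held-out truth. A minimal sketch under that reading (the error metric and interface are illustrative assumptions, not the paper's implementation, which also handles mixed data types):

```python
import numpy as np

def sts_select(X, methods, holdout_frac=0.1, seed=0):
    """Self-training selection sketch: mask a random fraction of the
    observed cells of numeric matrix X (NaN = missing), impute with
    each candidate method, and return the name of the method with the
    lowest RMSE on the held-out cells, plus all scores."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))
    n_hide = max(1, int(holdout_frac * len(observed)))
    hidden = observed[rng.choice(len(observed), size=n_hide, replace=False)]
    rows, cols = hidden[:, 0], hidden[:, 1]
    truth = X[rows, cols]
    X_masked = X.copy()
    X_masked[rows, cols] = np.nan  # second layer of missingness
    scores = {}
    for name, impute in methods.items():
        filled = impute(X_masked)
        scores[name] = np.sqrt(np.mean((filled[rows, cols] - truth) ** 2))
    return min(scores, key=scores.get), scores
```

Each candidate method is passed as a function mapping a matrix with NaNs to a completed matrix, so mean imputation, the KNN variants, or wrappers around MICE/missForest can all compete under the same holdout.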

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0346-6) contains supplementary material, which is available to authorized users.


Url:
DOI: 10.1186/s12859-014-0346-6
PubMed: 25371041
PubMed Central: 4228077

Links to Exploration step

PMC:4228077

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Missing value imputation in high-dimensional phenomic data: imputable or not, and how?</title>
<author>
<name sortKey="Liao, Serena G" sort="Liao, Serena G" uniqKey="Liao S" first="Serena G" last="Liao">Serena G. Liao</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lin, Yan" sort="Lin, Yan" uniqKey="Lin Y" first="Yan" last="Lin">Yan Lin</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kang, Dongwan D" sort="Kang, Dongwan D" uniqKey="Kang D" first="Dongwan D" last="Kang">Dongwan D. Kang</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chandra, Divay" sort="Chandra, Divay" uniqKey="Chandra D" first="Divay" last="Chandra">Divay Chandra</name>
<affiliation>
<nlm:aff id="Aff4">Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bon, Jessica" sort="Bon, Jessica" uniqKey="Bon J" first="Jessica" last="Bon">Jessica Bon</name>
<affiliation>
<nlm:aff id="Aff4">Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kaminski, Naftali" sort="Kaminski, Naftali" uniqKey="Kaminski N" first="Naftali" last="Kaminski">Naftali Kaminski</name>
<affiliation>
<nlm:aff id="Aff4">Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sciurba, Frank C" sort="Sciurba, Frank C" uniqKey="Sciurba F" first="Frank C" last="Sciurba">Frank C. Sciurba</name>
<affiliation>
<nlm:aff id="Aff5">Department of Medicine, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tseng, George C" sort="Tseng, George C" uniqKey="Tseng G" first="George C" last="Tseng">George C. Tseng</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25371041</idno>
<idno type="pmc">4228077</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228077</idno>
<idno type="RBID">PMC:4228077</idno>
<idno type="doi">10.1186/s12859-014-0346-6</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000440</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000440</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Missing value imputation in high-dimensional phenomic data: imputable or not, and how?</title>
<author>
<name sortKey="Liao, Serena G" sort="Liao, Serena G" uniqKey="Liao S" first="Serena G" last="Liao">Serena G. Liao</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lin, Yan" sort="Lin, Yan" uniqKey="Lin Y" first="Yan" last="Lin">Yan Lin</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kang, Dongwan D" sort="Kang, Dongwan D" uniqKey="Kang D" first="Dongwan D" last="Kang">Dongwan D. Kang</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chandra, Divay" sort="Chandra, Divay" uniqKey="Chandra D" first="Divay" last="Chandra">Divay Chandra</name>
<affiliation>
<nlm:aff id="Aff4">Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bon, Jessica" sort="Bon, Jessica" uniqKey="Bon J" first="Jessica" last="Bon">Jessica Bon</name>
<affiliation>
<nlm:aff id="Aff4">Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kaminski, Naftali" sort="Kaminski, Naftali" uniqKey="Kaminski N" first="Naftali" last="Kaminski">Naftali Kaminski</name>
<affiliation>
<nlm:aff id="Aff4">Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sciurba, Frank C" sort="Sciurba, Frank C" uniqKey="Sciurba F" first="Frank C" last="Sciurba">Frank C. Sciurba</name>
<affiliation>
<nlm:aff id="Aff5">Department of Medicine, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tseng, George C" sort="Tseng, George C" uniqKey="Tseng G" first="George C" last="Tseng">George C. Tseng</name>
<affiliation>
<nlm:aff id="Aff1">Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Although several methods have been developed in the past few years, no complete guideline has yet been proposed for phenomic missing data imputation.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of “imputability measure” (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package “phenomeImpute” is made publicly available.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and missForest were among the top performers, although no method universally performed the best. Imputing missing values with low imputability measures greatly increased imputation errors and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating the methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the authors’ publication website.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-014-0346-6) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Denny, Jc" uniqKey="Denny J">JC Denny</name>
</author>
<author>
<name sortKey="Ritchie, Md" uniqKey="Ritchie M">MD Ritchie</name>
</author>
<author>
<name sortKey="Basford, Ma" uniqKey="Basford M">MA Basford</name>
</author>
<author>
<name sortKey="Pulley, Jm" uniqKey="Pulley J">JM Pulley</name>
</author>
<author>
<name sortKey="Bastarache, L" uniqKey="Bastarache L">L Bastarache</name>
</author>
<author>
<name sortKey="Brown Gentry, K" uniqKey="Brown Gentry K">K Brown-Gentry</name>
</author>
<author>
<name sortKey="Wang, D" uniqKey="Wang D">D Wang</name>
</author>
<author>
<name sortKey="Masys, Dr" uniqKey="Masys D">DR Masys</name>
</author>
<author>
<name sortKey="Roden, Dm" uniqKey="Roden D">DM Roden</name>
</author>
<author>
<name sortKey="Crawford, Dc" uniqKey="Crawford D">DC Crawford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hanauer, Da" uniqKey="Hanauer D">DA Hanauer</name>
</author>
<author>
<name sortKey="Ramakrishnan, N" uniqKey="Ramakrishnan N">N Ramakrishnan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lyalina, S" uniqKey="Lyalina S">S Lyalina</name>
</author>
<author>
<name sortKey="Percha, B" uniqKey="Percha B">B Percha</name>
</author>
<author>
<name sortKey="Lependu, P" uniqKey="Lependu P">P Lependu</name>
</author>
<author>
<name sortKey="Iyer, Sv" uniqKey="Iyer S">SV Iyer</name>
</author>
<author>
<name sortKey="Altman, Rb" uniqKey="Altman R">RB Altman</name>
</author>
<author>
<name sortKey="Shah, Nh" uniqKey="Shah N">NH Shah</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ritchie, Md" uniqKey="Ritchie M">MD Ritchie</name>
</author>
<author>
<name sortKey="Denny, Jc" uniqKey="Denny J">JC Denny</name>
</author>
<author>
<name sortKey="Zuvich, Rl" uniqKey="Zuvich R">RL Zuvich</name>
</author>
<author>
<name sortKey="Crawford, Dc" uniqKey="Crawford D">DC Crawford</name>
</author>
<author>
<name sortKey="Schildcrout, Js" uniqKey="Schildcrout J">JS Schildcrout</name>
</author>
<author>
<name sortKey="Bastarache, L" uniqKey="Bastarache L">L Bastarache</name>
</author>
<author>
<name sortKey="Ramirez, Ah" uniqKey="Ramirez A">AH Ramirez</name>
</author>
<author>
<name sortKey="Mosley, Jd" uniqKey="Mosley J">JD Mosley</name>
</author>
<author>
<name sortKey="Pulley, Jm" uniqKey="Pulley J">JM Pulley</name>
</author>
<author>
<name sortKey="Basford, Ma" uniqKey="Basford M">MA Basford</name>
</author>
<author>
<name sortKey="Bradford, Y" uniqKey="Bradford Y">Y Bradford</name>
</author>
<author>
<name sortKey="Rasmussen, Lv" uniqKey="Rasmussen L">LV Rasmussen</name>
</author>
<author>
<name sortKey="Pathak, J" uniqKey="Pathak J">J Pathak</name>
</author>
<author>
<name sortKey="Chute, Cg" uniqKey="Chute C">CG Chute</name>
</author>
<author>
<name sortKey="Kullo, Ij" uniqKey="Kullo I">IJ Kullo</name>
</author>
<author>
<name sortKey="Mccarty, Ca" uniqKey="Mccarty C">CA McCarty</name>
</author>
<author>
<name sortKey="Chisholm, Rl" uniqKey="Chisholm R">RL Chisholm</name>
</author>
<author>
<name sortKey="Kho, An" uniqKey="Kho A">AN Kho</name>
</author>
<author>
<name sortKey="Carlson, Cs" uniqKey="Carlson C">CS Carlson</name>
</author>
<author>
<name sortKey="Larson, Eb" uniqKey="Larson E">EB Larson</name>
</author>
<author>
<name sortKey="Jarvik, Gp" uniqKey="Jarvik G">GP Jarvik</name>
</author>
<author>
<name sortKey="Sotoodehnia, N" uniqKey="Sotoodehnia N">N Sotoodehnia</name>
</author>
<author>
<name sortKey="Manolio, Ta" uniqKey="Manolio T">TA Manolio</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Masys, Dr" uniqKey="Masys D">DR Masys</name>
</author>
<author>
<name sortKey="Haines, Jl" uniqKey="Haines J">JL Haines</name>
</author>
<author>
<name sortKey="Roden, Dm" uniqKey="Roden D">DM Roden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warner, Jl" uniqKey="Warner J">JL Warner</name>
</author>
<author>
<name sortKey="Alterovitz, G" uniqKey="Alterovitz G">G Alterovitz</name>
</author>
<author>
<name sortKey="Bodio, K" uniqKey="Bodio K">K Bodio</name>
</author>
<author>
<name sortKey="Joyce, Rm" uniqKey="Joyce R">RM Joyce</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fernald, Gh" uniqKey="Fernald G">GH Fernald</name>
</author>
<author>
<name sortKey="Capriotti, E" uniqKey="Capriotti E">E Capriotti</name>
</author>
<author>
<name sortKey="Daneshjou, R" uniqKey="Daneshjou R">R Daneshjou</name>
</author>
<author>
<name sortKey="Karczewski, Kj" uniqKey="Karczewski K">KJ Karczewski</name>
</author>
<author>
<name sortKey="Altman, Rb" uniqKey="Altman R">RB Altman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Singer, E" uniqKey="Singer E">E Singer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sterne, Ja" uniqKey="Sterne J">JA Sterne</name>
</author>
<author>
<name sortKey="White, Ir" uniqKey="White I">IR White</name>
</author>
<author>
<name sortKey="Carlin, Jb" uniqKey="Carlin J">JB Carlin</name>
</author>
<author>
<name sortKey="Spratt, M" uniqKey="Spratt M">M Spratt</name>
</author>
<author>
<name sortKey="Royston, P" uniqKey="Royston P">P Royston</name>
</author>
<author>
<name sortKey="Kenward, Mg" uniqKey="Kenward M">MG Kenward</name>
</author>
<author>
<name sortKey="Wood, Am" uniqKey="Wood A">AM Wood</name>
</author>
<author>
<name sortKey="Carpenter, Jr" uniqKey="Carpenter J">JR Carpenter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Little, Rj" uniqKey="Little R">RJ Little</name>
</author>
<author>
<name sortKey="D Gostino, R" uniqKey="D Gostino R">R D’Agostino</name>
</author>
<author>
<name sortKey="Cohen, Ml" uniqKey="Cohen M">ML Cohen</name>
</author>
<author>
<name sortKey="Dickersin, K" uniqKey="Dickersin K">K Dickersin</name>
</author>
<author>
<name sortKey="Emerson, Ss" uniqKey="Emerson S">SS Emerson</name>
</author>
<author>
<name sortKey="Farrar, Jt" uniqKey="Farrar J">JT Farrar</name>
</author>
<author>
<name sortKey="Frangakis, C" uniqKey="Frangakis C">C Frangakis</name>
</author>
<author>
<name sortKey="Hogan, Jw" uniqKey="Hogan J">JW Hogan</name>
</author>
<author>
<name sortKey="Molenberghs, G" uniqKey="Molenberghs G">G Molenberghs</name>
</author>
<author>
<name sortKey="Murphy, Sa" uniqKey="Murphy S">SA Murphy</name>
</author>
<author>
<name sortKey="Neaton, Jd" uniqKey="Neaton J">JD Neaton</name>
</author>
<author>
<name sortKey="Rotnitzky, A" uniqKey="Rotnitzky A">A Rotnitzky</name>
</author>
<author>
<name sortKey="Scharfstein, D" uniqKey="Scharfstein D">D Scharfstein</name>
</author>
<author>
<name sortKey="Shih, Wj" uniqKey="Shih W">WJ Shih</name>
</author>
<author>
<name sortKey="Siegel, Jp" uniqKey="Siegel J">JP Siegel</name>
</author>
<author>
<name sortKey="Stern, H" uniqKey="Stern H">H Stern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tanner, Ma" uniqKey="Tanner M">MA Tanner</name>
</author>
<author>
<name sortKey="Wong, Wh" uniqKey="Wong W">WH Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tanner, Ma" uniqKey="Tanner M">MA Tanner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, C" uniqKey="Liu C">C Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Little, Rja" uniqKey="Little R">RJA Little</name>
</author>
<author>
<name sortKey="Rubin, Db" uniqKey="Rubin D">DB Rubin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raghunathan, Te" uniqKey="Raghunathan T">TE Raghunathan</name>
</author>
<author>
<name sortKey="Lepkowski, Jm" uniqKey="Lepkowski J">JM Lepkowski</name>
</author>
<author>
<name sortKey="Hoewyk, Jv" uniqKey="Hoewyk J">JV Hoewyk</name>
</author>
<author>
<name sortKey="Solenberger, P" uniqKey="Solenberger P">P Solenberger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rubin, Db" uniqKey="Rubin D">DB Rubin</name>
</author>
<author>
<name sortKey="Schafer, Jl" uniqKey="Schafer J">JL Schafer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Buuren Kg O, S" uniqKey="Van Buuren Kg O S">S van Buuren KG-O</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Andridge, Rr" uniqKey="Andridge R">RR Andridge</name>
</author>
<author>
<name sortKey="Little, Rj" uniqKey="Little R">RJ Little</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Little, Rj" uniqKey="Little R">RJ Little</name>
</author>
<author>
<name sortKey="Yosef, M" uniqKey="Yosef M">M Yosef</name>
</author>
<author>
<name sortKey="Cain, Kc" uniqKey="Cain K">KC Cain</name>
</author>
<author>
<name sortKey="Nan, B" uniqKey="Nan B">B Nan</name>
</author>
<author>
<name sortKey="Harlow, Sd" uniqKey="Harlow S">SD Harlow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rubin, Db" uniqKey="Rubin D">DB Rubin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raghunathan, Te" uniqKey="Raghunathan T">TE Raghunathan</name>
</author>
<author>
<name sortKey="Grizzle, Je" uniqKey="Grizzle J">JE Grizzle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raghunathan, Te" uniqKey="Raghunathan T">TE Raghunathan</name>
</author>
<author>
<name sortKey="Siscovick, Ds" uniqKey="Siscovick D">DS Siscovick</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schafer, Jl" uniqKey="Schafer J">JL Schafer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jerez, Jm" uniqKey="Jerez J">JM Jerez</name>
</author>
<author>
<name sortKey="Molina, I" uniqKey="Molina I">I Molina</name>
</author>
<author>
<name sortKey="Garcia Laencina, Pj" uniqKey="Garcia Laencina P">PJ Garcia-Laencina</name>
</author>
<author>
<name sortKey="Alba, E" uniqKey="Alba E">E Alba</name>
</author>
<author>
<name sortKey="Ribelles, N" uniqKey="Ribelles N">N Ribelles</name>
</author>
<author>
<name sortKey="Martin, M" uniqKey="Martin M">M Martin</name>
</author>
<author>
<name sortKey="Franco, L" uniqKey="Franco L">L Franco</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brock, Gn" uniqKey="Brock G">GN Brock</name>
</author>
<author>
<name sortKey="Shaffer, Jr" uniqKey="Shaffer J">JR Shaffer</name>
</author>
<author>
<name sortKey="Blakesley, Re" uniqKey="Blakesley R">RE Blakesley</name>
</author>
<author>
<name sortKey="Lotz, Mj" uniqKey="Lotz M">MJ Lotz</name>
</author>
<author>
<name sortKey="Tseng, Gc" uniqKey="Tseng G">GC Tseng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sunghee Oh, Ddk" uniqKey="Sunghee Oh D">DDK Sunghee Oh</name>
</author>
<author>
<name sortKey="Brock, Gn" uniqKey="Brock G">GN Brock</name>
</author>
<author>
<name sortKey="Tseng, Gc" uniqKey="Tseng G">GC Tseng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buhlmann, Djsp" uniqKey="Buhlmann D">DJSP Buhlmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Acuna, E" uniqKey="Acuna E">E Acuna</name>
</author>
<author>
<name sortKey="Rodriguez, C" uniqKey="Rodriguez C">C Rodriguez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="B, Th" uniqKey="B T">TH Bø</name>
</author>
<author>
<name sortKey="Dysvik, B" uniqKey="Dysvik B">B Dysvik</name>
</author>
<author>
<name sortKey="Jonassen, I" uniqKey="Jonassen I">I Jonassen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Olkin, I" uniqKey="Olkin I">I Olkin</name>
</author>
<author>
<name sortKey="Tate, Rf" uniqKey="Tate R">RF Tate</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Agresti, A" uniqKey="Agresti A">A Agresti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ulf Olsson, Fd" uniqKey="Ulf Olsson F">FD Ulf Olsson</name>
</author>
<author>
<name sortKey="Dorans, Nj" uniqKey="Dorans N">NJ Dorans</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Olsson, U" uniqKey="Olsson U">U Olsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boas, F" uniqKey="Boas F">F Boas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, K" uniqKey="Pearson K">K Pearson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yule, Gu" uniqKey="Yule G">GU Yule</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cramer, H" uniqKey="Cramer H">H Cramér</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gower, Jc" uniqKey="Gower J">JC Gower</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25371041</article-id>
<article-id pub-id-type="pmc">4228077</article-id>
<article-id pub-id-type="publisher-id">346</article-id>
<article-id pub-id-type="doi">10.1186/s12859-014-0346-6</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Missing value imputation in high-dimensional phenomic data: imputable or not, and how?</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Liao</surname>
<given-names>Serena G</given-names>
</name>
<address>
<email>liaoge.serena@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Lin</surname>
<given-names>Yan</given-names>
</name>
<address>
<email>yal14@pitt.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kang</surname>
<given-names>Dongwan D</given-names>
</name>
<address>
<email>donkang75@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chandra</surname>
<given-names>Divay</given-names>
</name>
<address>
<email>chandrad@upmc.edu</email>
</address>
<xref ref-type="aff" rid="Aff4"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bon</surname>
<given-names>Jessica</given-names>
</name>
<address>
<email>bonjm@upmc.edu</email>
</address>
<xref ref-type="aff" rid="Aff4"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kaminski</surname>
<given-names>Naftali</given-names>
</name>
<address>
<email>naftali.kaminski@yale.edu</email>
</address>
<xref ref-type="aff" rid="Aff4"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sciurba</surname>
<given-names>Frank C</given-names>
</name>
<address>
<email>sciurbafc@upmc.edu</email>
</address>
<xref ref-type="aff" rid="Aff5"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Tseng</surname>
<given-names>George C</given-names>
</name>
<address>
<email>ctseng@pitt.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA USA</aff>
<aff id="Aff2">
<label></label>
Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA USA</aff>
<aff id="Aff3">
<label></label>
Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA USA</aff>
<aff id="Aff4">
<label></label>
Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT USA</aff>
<aff id="Aff5">
<label></label>
Department of Medicine, University of Pittsburgh, Pittsburgh, PA USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>5</day>
<month>11</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>5</day>
<month>11</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<volume>15</volume>
<issue>1</issue>
<elocation-id>346</elocation-id>
<history>
<date date-type="received">
<day>6</day>
<month>3</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>6</day>
<month>10</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>© Liao et al; licensee BioMed Central Ltd. 2014</copyright-statement>
<license license-type="open-access">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Although several methods have been developed in the past few years, no complete guideline has yet been proposed for phenomic missing data imputation.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of “imputability measure” (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package “phenomeImpute” is made publicly available.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers although no method universally performed the best. Imputation of missing values with low imputability measures increased imputation errors greatly and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author’s publication website.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-014-0346-6) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Missing data</kwd>
<kwd>K-nearest-neighbor</kwd>
<kwd>Phenomic data</kwd>
<kwd>Self-training selection</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2014</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1" sec-type="introduction">
<title>Background</title>
<p>In many studies of complex diseases, a large number of demographic, environmental and clinical variables are collected, and missing values (MVs) are inevitable in the data collection process. Major categories of variables include, but are not limited to: (1) demographic measures, such as gender, race, education and marital status; (2) environmental exposures, such as pollen, feather pillows and pollution; (3) living habits, such as exercise, sleep, diet, vitamin supplements and smoking; (4) measures of general health status or organ function, such as body mass index (BMI), blood pressure, walking speed and forced vital capacity (FVC); (5) summary measures from medical images, such as fMRI and PET scans; (6) drug history; and (7) family disease history. The dimension of the data can easily range from several hundred to nearly a thousand, and we refer to such data as “phenomic data” hereafter. It has been shown recently that systematic analysis of phenomic data and integration with other genomic information provide further understanding of diseases [
<xref ref-type="bibr" rid="CR1">1</xref>
-
<xref ref-type="bibr" rid="CR5">5</xref>
], and enhance disease subtype discovery towards precision medicine [
<xref ref-type="bibr" rid="CR6">6</xref>
,
<xref ref-type="bibr" rid="CR7">7</xref>
]. The presence of missing values in clinical research not only reduces statistical power of the study but also impedes the implementation of many statistical and bioinformatic methods that require a complete dataset (e.g. principal component analysis, clustering analysis, machine learning and graphical models). Many have pointed out that “missing value has the potential to undermine the validity of epidemiologic and clinical research and lead the conclusion to bias” [
<xref ref-type="bibr" rid="CR8">8</xref>
].</p>
<p>Standard statistical methods for analysis of data with missing values include list-wise deletion or complete-case analysis (i.e. discard any subject with a missing value), likelihood-based methods, data augmentation and imputation [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR10">10</xref>
]. List-wise deletion generally leads to loss of statistical power and biased results when data are not missing completely at random. Likelihood-based methods and data augmentation are popular for low-dimensional data with parametric models for the missing-data process [
<xref ref-type="bibr" rid="CR10">10</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
]. However, their application to high-dimensional data is problematic, especially when the missing-data pattern is complicated, and the intensive computation required is often insurmountable. In contrast, imputation provides an intuitive and powerful tool for analysis of data with complex missing-data patterns [
<xref ref-type="bibr" rid="CR12">12</xref>
-
<xref ref-type="bibr" rid="CR16">16</xref>
]. Explicit imputation methods such as mean imputation or stochastic imputation either undermine the variability of the data or require parametric assumptions on the data, and subsequently face challenges similar to those of likelihood-based methods and data augmentation [
<xref ref-type="bibr" rid="CR12">12</xref>
-
<xref ref-type="bibr" rid="CR14">14</xref>
,
<xref ref-type="bibr" rid="CR16">16</xref>
]. Implicit imputation methods such as nearest-neighbor imputation, hot-deck and fractional imputation provide flexible and powerful approaches for analysis of data with complex missing-data patterns, even though the implicit imputation model is not coherent with the assumed model for the underlying complete data [
<xref ref-type="bibr" rid="CR13">13</xref>
,
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR18">18</xref>
]. Multiple imputation is usually employed to account for the variability due to imputation [
<xref ref-type="bibr" rid="CR13">13</xref>
,
<xref ref-type="bibr" rid="CR14">14</xref>
,
<xref ref-type="bibr" rid="CR16">16</xref>
,
<xref ref-type="bibr" rid="CR19">19</xref>
].</p>
<p>Except for some implicit imputation methods, the above-mentioned methods rely on correct modelling of the missing-data process and work well in traditional settings with a large number of subjects and a small number of variables (large n, small p). With the trend of an increasing number of variables (large p) in phenomic data, model fitting, diagnostic checks and sensitivity analyses become difficult, which undermines the success of multiple imputation or maximum likelihood imputation. The complexity of phenomic data with mixed data types (binary, multi-class categorical, ordinal and continuous) further aggravates the difficulty of modeling the joint distribution of all variables. Although a few algorithms are designed to handle datasets with both continuous and categorical variables [
<xref ref-type="bibr" rid="CR14">14</xref>
,
<xref ref-type="bibr" rid="CR20">20</xref>
-
<xref ref-type="bibr" rid="CR22">22</xref>
], the implementation of most of these complicated methods in high-dimensional phenomic data is not straightforward. Imputation methods based on exact statistical modeling often suffer from the “curse of dimensionality”. Jerez and colleagues compared machine learning methods, such as multi-layer perceptron (MLP), self-organizing maps (SOM) and k-nearest neighbor (KNN), with traditional statistical imputation methods in a large breast cancer dataset and concluded that the machine learning imputation methods seemed to perform better in this large clinical dataset [
<xref ref-type="bibr" rid="CR23">23</xref>
].</p>
<p>In the past decade, missing value imputation for high-throughput experimental data (e.g. microarray data) has drawn great attention, and many methods have been developed and widely used (see [
<xref ref-type="bibr" rid="CR24">24</xref>
], [
<xref ref-type="bibr" rid="CR25">25</xref>
] for reviews and comparative studies). Imputation of phenomic data differs from that of microarray data and brings new challenges for two major reasons. Firstly, microarray data contain entirely continuous intensity measurements, while phenomic data have mixed data types. This voids the majority of established microarray imputation methods for phenomic data. Secondly, microarray data monitor the expression of thousands of genes, and the majority of the genes are believed to be co-regulated with others in a systemic sense, which leads to a highly correlated data structure and makes imputation intrinsically easier. Phenomic data, on the other hand, are more likely to contain isolated variables (or samples) that are “not imputable” from other observed variables (samples).</p>
<p>There are at least three aspects of novelty in this paper. Firstly, to our knowledge, this is the first systematic comparative study of missing value imputation methods for large-scale phenomic data. We will compare two existing methods (missForest [
<xref ref-type="bibr" rid="CR26">26</xref>
] and multivariate imputation by chained equations, MICE [
<xref ref-type="bibr" rid="CR16">16</xref>
]) and develop four variants of the KNN imputation method that has been popular in microarray analysis [
<xref ref-type="bibr" rid="CR27">27</xref>
]. Secondly, to characterize and identify missing values that are “not imputable” from other observed values in phenomic data, we propose an “imputability measure” (IM) to quantify the imputability of a missing value. When a variable or subject has an overall small IM across its missing values, it is recommended to remove that variable or subject from further analysis (or impute with caution). Thirdly, we propose a self-training selection (STS) scheme [
<xref ref-type="bibr" rid="CR24">24</xref>
] to select the best missing value imputation method for each data type in a given dataset. The result provides a practical guideline for applications. The IM and the STS selection tool will remain useful as more powerful methods for phenomic data imputation are developed in the future.</p>
</sec>
<sec id="Sec2" sec-type="materials|methods">
<title>Methods</title>
<sec id="Sec3">
<title>Real data</title>
<p>The current work is motivated by three high-dimensional phenomic datasets, all of which have a mixture of continuous, ordinal, binary and nominal covariates. The Chronic Obstructive Pulmonary Disease (COPD) dataset was generated from a COPD study conducted in the Division of Pulmonary, Department of Medicine at the University of Pittsburgh. The second dataset is the phenotypic data set of the Lung Tissue Research Consortium (LTRC,
<ext-link ext-link-type="uri" xlink:href="http://www.nhlbi.nih.gov/resources/ltrc.htm">http://www.nhlbi.nih.gov/resources/ltrc.htm</ext-link>
). The third dataset is obtained from the Severe Asthma Research Program (SARP) study (
<ext-link ext-link-type="uri" xlink:href="http://www.severeasthma.org/">http://www.severeasthma.org/</ext-link>
). These datasets represent different variable/subject ratios and different proportions of data types in the variables. In Table 
<xref rid="Tab1" ref-type="table">1</xref>
, Raw Data (RD) refers to the original raw data with missing values as initially obtained. Complete Data (CD) represents a complete dataset without any missing value, obtained after iteratively removing variables and subjects with large missing value percentages. CDs contain no missing values and are ideal for simulations evaluating different methods (see section
<xref rid="Sec12" ref-type="sec">Simulated datasets</xref>
).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>
<bold>Descriptions of three real data sets</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr valign="top">
<th>
<bold>Number of variables and subjects</bold>
</th>
<th>
<bold>COPD</bold>
</th>
<th>
<bold>LTRC</bold>
</th>
<th>
<bold>SARP</bold>
</th>
</tr>
</thead>
<tbody>
<tr valign="top">
<td>Subjects (RD/CD)</td>
<td>699/491</td>
<td>1428/709</td>
<td>1671/640</td>
</tr>
<tr valign="top">
<td>Variables (RD/CD)</td>
<td>528/257</td>
<td>1568/129</td>
<td>1761/135</td>
</tr>
<tr valign="top">
<td>Continuous variables (Con)</td>
<td>113</td>
<td>11</td>
<td>27</td>
</tr>
<tr valign="top">
<td>Multi-class categorical variables (Cat)</td>
<td>12</td>
<td>27</td>
<td>6</td>
</tr>
<tr valign="top">
<td>Binary variables (Bin)</td>
<td>78</td>
<td>0</td>
<td>86</td>
</tr>
<tr valign="top">
<td>Ordinal variables (Ord)</td>
<td>54</td>
<td>91</td>
<td>16</td>
</tr>
<tr valign="top">
<td>Total variables in CD</td>
<td>257</td>
<td>129</td>
<td>135</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec4">
<title>Imputation methods</title>
<p>We will compare four newly developed KNN methods with the MICE and the missForest methods in this paper. The methods and detailed implementations are described below.</p>
<sec id="Sec5">
<title>Two existing methods MICE and missForest</title>
<p>Multivariate Imputation by Chained Equations (MICE) is a popular method to impute multivariate missing data. It factorizes the joint conditional density into a sequence of conditional distributions and sequentially imputes missing values using a regression model appropriate to the type of each missing covariate. Gibbs sampling is used to estimate the parameters, and an imputation is then drawn for each variable conditional on all the other variables. We used the R package “MICE” to implement this method.</p>
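The chained-equations idea can be sketched outside of R as well. The following minimal Python example uses scikit-learn's IterativeImputer as a stand-in for the regression cycle (single imputation only, on simulated data); the paper itself uses the R “MICE” package, so this is an illustrative analog rather than the authors' implementation.

```python
# Illustrative sketch of chained-equations imputation (not the R "MICE" package):
# each variable with missing entries is regressed on all the others, cycling
# through the variables until the imputed values stabilize.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=50)  # make one column correlated
X[rng.random(X.shape) < 0.1] = np.nan                # ~10% missing at random

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X)
print(np.isnan(X_imp).any())  # the imputed matrix is complete
```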
<p>MissForest is a random forest based method to impute phenomic data [
<xref ref-type="bibr" rid="CR26">26</xref>
]. The method treats the variable containing the missing value as the response variable and borrows information from other variables through resampling-based classification and regression trees to grow a random forest for the final prediction. The procedure is repeated until the imputed values converge. The method is implemented in the “missForest” R package.</p>
</sec>
<sec id="Sec6">
<title>KNN imputation methods</title>
<p>The KNN method is popular due to its simplicity and proven effectiveness in many missing value imputation problems. For a missing value, the method seeks its K nearest variables or subjects and imputes by a weighted average of the observed values of the identified neighbors. We adopted the weight choice from the LSimpute method used for microarray missing value imputation [
<xref ref-type="bibr" rid="CR28">28</xref>
]. LSimpute is an extension of KNN that utilizes correlations between both genes and arrays; the missing values are imputed by a weighted average of the gene-based and array-based estimates. Specifically, the weight for the k
<sup>th</sup>
neighbor of a missing variable or subject was given by
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\mathrm{w}}_{\mathrm{k}}={\left({\mathrm{r}}_{\mathrm{k}}^2/\left(1-{\mathrm{r}}_{\mathrm{k}}^2+\upvarepsilon \right)\right)}^2 $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where r
<sub>k</sub>
is the correlation between the k
<sup>th</sup>
neighbor and the missing variable or subject and ε = 10
<sup>− 6</sup>
. As a result, this algorithm gives more weight to closer neighbors. Here, we extended the two KNN methods of LSimpute, imputation by the nearest variables (KNN-V) and imputation by the nearest subjects (KNN-S), so that they could be used to impute the phenomic data with mixed types of variables. Furthermore, we developed a hybrid of these two methods using global variable/subject weights (KNN-H) and adaptive variable/subject weights (KNN-A).</p>
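As a minimal sketch (illustrative Python, not the authors' R code), the neighbor weight formula above can be written directly:

```python
# Neighbor weights used by all four KNN variants:
# w_k = (r_k^2 / (1 - r_k^2 + eps))^2, with eps = 1e-6,
# so neighbors more correlated with the target dominate the weighted average.
import numpy as np

def knn_weights(r, eps=1e-6):
    """r: correlations between the K selected neighbors and the target."""
    r2 = np.asarray(r, dtype=float) ** 2
    return (r2 / (1.0 - r2 + eps)) ** 2

w = knn_weights([0.9, 0.5, 0.1])
print(np.round(w / w.sum(), 3))  # almost all weight goes to the r = 0.9 neighbor
```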
</sec>
<sec id="Sec7">
<title>Impute by nearest variables (KNN-V)</title>
<p>To extend the KNN imputation method to data with mixed types of variables, we used established statistical correlation measures to quantify the similarity between variables of different types. As described in Table
<xref rid="Tab1" ref-type="table">1</xref>
, the phenomic data usually contain four types of variables – continuous (Con), binary (Bin), multi-class categorical (Cat) and ordinal (Ord). Table 
<xref rid="Tab2" ref-type="table">2</xref>
lists correlation measures across different data types to construct the correlation matrix for KNN-V (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
contains more detailed description):
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>
<bold>Correlation measures between different types of variables</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr valign="top">
<th>
<bold>Variables</bold>
</th>
<th>
<bold>Con</bold>
</th>
<th>
<bold>Ord</bold>
</th>
<th>
<bold>Bin</bold>
</th>
<th>
<bold>Cat</bold>
</th>
</tr>
</thead>
<tbody>
<tr valign="top">
<td>Con</td>
<td>Spearman</td>
<td>--</td>
<td>--</td>
<td>--</td>
</tr>
<tr valign="top">
<td>Ord</td>
<td>Polyserial</td>
<td>Polychoric</td>
<td>--</td>
<td>--</td>
</tr>
<tr valign="top">
<td>Bin</td>
<td>Point Biserial</td>
<td>Rank Biserial</td>
<td>Phi</td>
<td>--</td>
</tr>
<tr valign="top">
<td>Cat</td>
<td>Point Biserial extension</td>
<td>Rank Biserial extension</td>
<td>Cramer’s V</td>
<td>Cramer’s V</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Spearman’s rank correlation (Con vs. Con): we use Spearman’s rank correlation to measure the correlation between two continuous variables. It is equivalent to computing the Pearson correlation on the ranks:
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M2">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \mathrm{r}=1-6\times \frac{{\displaystyle {\sum}_{\mathrm{i}=1}^{\mathrm{N}}}{\mathrm{d}}_{\mathrm{i}}^2}{\mathrm{N}\times \left({\mathrm{N}}^2-1\right)} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where d
<sub>i</sub>
is the rank difference of each corresponding observation and N is the number of subjects.</p>
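The rank-difference formula can be checked numerically against a library implementation (illustrative Python on made-up data, assuming no ties; the paper's own code is in R):

```python
# Spearman's rank correlation from the rank-difference formula
# r = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), compared with scipy's spearmanr.
import numpy as np
from scipy import stats

x = np.array([2.0, 5.0, 1.0, 4.0, 3.0])
y = np.array([1.0, 4.0, 2.0, 5.0, 3.0])

d = stats.rankdata(x) - stats.rankdata(y)  # per-subject rank differences
n = len(x)
r_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
r_scipy = stats.spearmanr(x, y)[0]
print(round(r_formula, 3), round(r_scipy, 3))  # identical when there are no ties
```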
<p>Point biserial correlation (Con vs. Bin) and its extension (Con vs. Cat): Point biserial correlation between a continuous variable X and a dichotomous variable Y (Y = 0 or 1) is defined as
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \mathrm{r}=\frac{{\overline{\mathrm{X}}}_1-{\overline{\mathrm{X}}}_0}{{\mathrm{S}}_{\mathrm{X}}/\sqrt{{\mathrm{p}}_{\mathrm{Y}}\times \left(1-{\mathrm{p}}_{\mathrm{Y}}\right)}} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M4">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\overline{\mathrm{X}}}_1 $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\overline{\mathrm{X}}}_0 $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
represent the means of X given Y = 1 and 0 respectively, S
<sub>X</sub>
, the standard deviation of X and p
<sub>Y</sub>
, the proportion of subjects with Y = 1. Note that the point biserial correlation is mathematically equivalent to the Pearson correlation and there is no underlying assumption for Y. When Y is a multi-level categorical variable with more than two possible values, the point biserial correlation can be generalized, assuming Y follows a multinomial distribution and the conditional distribution of X given Y is normal [
<xref ref-type="bibr" rid="CR29">29</xref>
]. It is implemented by the “biserial.cor” function in the “ltm” R package.</p>
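The equivalence between the point biserial correlation and the Pearson correlation of X with the 0/1 coding of Y can be verified numerically (illustrative Python sketch with made-up data; the paper uses the “ltm” R package):

```python
# Point biserial correlation written as
# r = (mean(X|Y=1) - mean(X|Y=0)) * sqrt(p * (1 - p)) / S_X,
# which coincides with the Pearson correlation of X with the 0/1 coding of Y
# when S_X is the population standard deviation (ddof = 0).
import numpy as np

x = np.array([1.2, 3.4, 2.2, 4.8, 3.9, 0.7, 4.1, 2.9])
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])

p = y.mean()  # proportion of subjects with Y = 1
r_pb = (x[y == 1].mean() - x[y == 0].mean()) * np.sqrt(p * (1 - p)) / x.std()
r_pearson = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_pb, r_pearson))  # True
```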
<p>Rank biserial correlation (Ord vs Bin) and its extension (Ord vs Cat): The rank biserial correlation replaces the continuous variable X in point biserial correlation with ranks. To calculate the correlation between an ordinal and a nominal variable (binary or multi-class), we transform the ordinal variable into ranks and then apply rank biserial correlation or its extension for the calculation [
<xref ref-type="bibr" rid="CR30">30</xref>
].</p>
<p>Polyserial correlation (Con vs Ord): Polyserial correlation measures the correlation between a continuous X and an ordinal variable Y. Y is assumed to be defined from a latent continuous variable η, generated with equal space and is strictly monotonic. The joint distribution of the observed continuous variable X and η is assumed to be bivariate normal. The Polyserial correlation is the estimated correlation between X and η and is estimated by maximum likelihood [
<xref ref-type="bibr" rid="CR31">31</xref>
]. It is implemented by the “polyserial” function in the “polycor” R package.</p>
<p>Polychoric correlation (Ord vs Ord): Polychoric correlation measures correlation between two ordinal variables. Similar to the polyserial correlation described above, polychoric correlation estimates the correlation of two underlying latent continuous variables, which are assumed to follow a bivariate normal distribution [
<xref ref-type="bibr" rid="CR32">32</xref>
]. It is implemented by the “polychor” function in the “polycor” R package.</p>
<p>Phi (Bin vs Bin): Phi coefficient measures the correlation between two dichotomous variables. The phi coefficient is the linear correlation of an underlying bivariate discrete distribution [
<xref ref-type="bibr" rid="CR33">33</xref>
-
<xref ref-type="bibr" rid="CR35">35</xref>
]. The Phi correlation is calculated as
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M6">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \mathrm{r}=\sqrt{{\mathrm{X}}^2/\mathrm{N}} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where N is the number of subjects and X
<sup>2</sup>
is the chi-square statistic for the 2 × 2 contingency table of the two binary variables.</p>
<p>Cramer’s V (Bin vs Cat and Cat vs Cat): Cramer’s V measures correlation between two nominal variables with two or more levels. It is based on the Pearson’s chi-square statistic [
<xref ref-type="bibr" rid="CR36">36</xref>
]. The formula is given by:
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \mathrm{r}=\sqrt{\frac{{\mathrm{X}}^2}{\mathrm{N}\times \left(\mathrm{H}-1\right)}} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where N is the number of subjects, X
<sup>2</sup>
is the chi-square statistic for the contingency table and H is the number of rows or columns, whichever is less.</p>
<p>We note that all correlation measures in Table 
<xref rid="Tab2" ref-type="table">2</xref>
are based on the classical Pearson correlation (some with additional Gaussian assumptions on the data); as a result, the correlations from different data types are comparable when selecting the K nearest neighbors. A corresponding distance measure can be computed as d = |1 − r|, where r is the correlation between a pair of variables. Given a missing value in the data matrix for variable x (missing on subject i), only the K nearest neighbors of x (denoted as y
<sub>1</sub>
… y
<sub>K</sub>
) are included in the prediction model. In addition, none of y
<sub>1</sub>
, …, y
<sub>K</sub>
is allowed to have a missing value for the same subject as the missing value to be predicted. For each neighbor, a generalized linear regression model with a single predictor is constructed: g(μ) = α + βy
<sub>k</sub>
using available cases, where μ = E(x) and g(·) is the link function. The regression methods used for the imputation of different types of variables are listed in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. Missing values could be imputed by
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M8">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\widehat{\mathrm{x}}}_{\mathrm{i}\left(\mathrm{k}\right)}={\mathrm{g}}^{-1}\left(\upalpha +{\upbeta \mathrm{y}}_{\mathrm{i}\mathrm{k}}\right) $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Finally, the weighted average of the estimated imputed values from the K nearest neighbors is used to impute a missing value of continuous type. For nominal variables (binary or multi-class categorical), a weighted majority vote from the K nearest neighbors is used. For ordinal variables, we treat the levels as positive integers (i.e. 1, 2, 3,…, q) and the imputed value is given by the rounded weighted average.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>
<bold>Methods for aggregating imputation information of different data types from K nearest neighbors</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr valign="top">
<th>
<bold>Variables</bold>
</th>
<th>
<bold>Regression methods</bold>
</th>
<th>
<bold>Final imputed value</bold>
</th>
</tr>
</thead>
<tbody>
<tr valign="top">
<td>Con</td>
<td>Linear regression</td>
<td>
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\displaystyle \sum }{\mathrm{w}}_{\mathrm{k}}\widehat{{\mathrm{y}}_{\mathrm{k}}}/{\displaystyle \sum }{\mathrm{w}}_{\mathrm{k}} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
</tr>
<tr valign="top">
<td>Ord</td>
<td>Ordinal logistic regression</td>
<td>
<inline-formula id="IEq10">
<alternatives>
<tex-math id="M10">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \min \left( \max \left(1,\left[{\displaystyle \sum }{\mathrm{w}}_{\mathrm{k}}\widehat{{\mathrm{y}}_{\mathrm{k}}}/{\displaystyle \sum }{\mathrm{w}}_{\mathrm{k}}\right]\right),\mathrm{q}\right) $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
</tr>
<tr valign="top">
<td>Bin</td>
<td>Logistic regression</td>
<td>Weighted majority vote</td>
</tr>
<tr valign="top">
<td>Cat</td>
<td>Multinomial logistic regression</td>
<td>Weighted majority vote</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>(q: number of levels of the ordinal variable).</p>
</table-wrap-foot>
</table-wrap>
</p>
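Putting the pieces together, the KNN-V prediction step for a continuous target can be sketched as follows (illustrative Python on simulated data, with the identity link playing the role of g and the correlation-based weights defined earlier; the paper's implementation is the R package “phenomeImpute”):

```python
# Minimal sketch of KNN-V for a continuous target: each of the K neighbor
# variables y_k feeds one single-predictor regression g(mu) = a + b*y_k
# (identity link here), fitted on available cases only, and the K predictions
# are combined with the weights w_k = (r^2 / (1 - r^2 + eps))^2.
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)  # target variable; pretend x[0] is missing
neighbors = [0.9 * x + 0.1 * rng.normal(size=n),
             0.6 * x + 0.4 * rng.normal(size=n)]

preds, weights = [], []
for y in neighbors:
    r = np.corrcoef(x[1:], y[1:])[0, 1]  # correlation on available cases
    b, a = np.polyfit(y[1:], x[1:], 1)   # single-predictor linear regression
    preds.append(a + b * y[0])           # this neighbor's estimate of x[0]
    weights.append((r ** 2 / (1 - r ** 2 + 1e-6)) ** 2)

x_hat = np.average(preds, weights=weights)
print(abs(x_hat - x[0]) < 1.0)  # close to the held-out value
```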
</sec>
<sec id="Sec8">
<title>Impute by nearest subjects (KNN-S)</title>
<p>The procedure of KNN-S is generally the same as that of KNN-V, except that we borrow information from the nearest subjects instead of the nearest variables. Thus, each vector (subject) contains mixed types of values. We defined the similarity of a pair of subjects by Gower’s distance [
<xref ref-type="bibr" rid="CR37">37</xref>
]. For each pair of subjects, it is the average of the per-variable dissimilarities over all variables observed for both subjects:
<inline-formula id="IEq11">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\mathrm{d}}_{\mathrm{ij}}=\frac{{\displaystyle {\sum}_{\mathrm{v}=1}^{\mathrm{V}}}{\updelta}_{\mathrm{ij}\mathrm{v}}{\mathrm{d}}_{\mathrm{ij}\mathrm{v}}}{{\displaystyle {\sum}_{\mathrm{v}=1}^{\mathrm{V}}}{\updelta}_{\mathrm{ij}\mathrm{v}}} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where d
<sub>ijv</sub>
is the dissimilarity score between subject i and j for the v
<sup>th</sup>
variable and δ
<sub>ijv</sub>
indicates whether the v
<sup>th</sup>
variable is available for both subjects i and j, taking the value 0 or 1. Depending on the variable type, d
<sub>ijv</sub>
is defined differently: (1) for dichotomous and multi-level categorical variables, d
<sub>ijv</sub>
 = 0 if the two subjects agree on the v
<sup>th</sup>
variable, otherwise d
<sub>ijv</sub>
 = 1; (2) for continuous and ordinal variables, the contribution is the absolute difference of the two values divided by the total range of that variable [
<xref ref-type="bibr" rid="CR37">37</xref>
]. The calculation of the Gower’s distance is implemented by the “daisy” function in the “cluster” R package.</p>
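A hand-rolled version of the same computation (illustrative Python with made-up subjects; in practice the “daisy” function handles this, and the variable kinds and ranges below are assumptions for the example):

```python
# Gower's distance between two subjects with mixed variable types:
# categorical variables contribute a 0/1 mismatch, continuous/ordinal ones
# contribute |a_v - b_v| / range_v, and variable pairs with a missing entry
# are skipped (delta_ijv = 0).
import numpy as np

def gower(a, b, kinds, ranges):
    """a, b: one subject each; kinds: 'cat' or 'num' per variable;
    ranges: total range of each numeric variable (ignored for 'cat')."""
    num, den = 0.0, 0.0
    for av, bv, kind, range_v in zip(a, b, kinds, ranges):
        if av is None or bv is None:  # value missing for either subject: skip
            continue
        den += 1.0
        if kind == "cat":
            num += 0.0 if av == bv else 1.0
        else:
            num += abs(av - bv) / range_v
    return num / den

d = gower([1.70, "M", 3], [1.80, "F", 1],
          kinds=["num", "cat", "num"], ranges=[0.5, None, 4])
print(round(d, 3))  # (0.2 + 1 + 0.5) / 3 = 0.567
```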
</sec>
<sec id="Sec9">
<title>Hybrid imputation by nearest subjects and variables (KNN-H)</title>
<p>Since the nearest variables and the nearest subjects often both contain information to improve imputation, we propose to combine imputed values from KNN-S and KNN-V by:
<disp-formula id="Equa">
<alternatives>
<tex-math id="M12">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \mathrm{K}\mathrm{N}\mathrm{N}-\mathrm{H}=p\times \mathrm{K}\mathrm{N}\mathrm{N}-\mathrm{S}+\left(1-p\right)\times \mathrm{K}\mathrm{N}\mathrm{N}-\mathrm{V}. $$ \end{document}</tex-math>
<graphic xlink:href="12859_2014_346_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Following Bø et al. [
<xref ref-type="bibr" rid="CR28">28</xref>
], we estimated p by simulating 5% secondary missing values in the dataset. Define a dataset (D
<sub>ij</sub>
)
<sub>NP</sub>
with missing value indicator I
<sub>ij</sub>
 = 1 if missing and 0 otherwise. We simulate a second layer of missing values randomly (I
<sub>ij</sub>
’ = 1 if the entry for subject i and variable j is missing in the second layer), perform imputation, and assess the normalized squared error of each imputed value using KNN-S and KNN-V (
<inline-formula id="IEq12">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\mathrm{e}}_{\mathrm{S}}^2 $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq13">
<alternatives>
<tex-math id="M14">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\mathrm{e}}_{\mathrm{V}}^2 $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
).
<italic>p</italic>
is chosen to minimize
<disp-formula id="Equb">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\displaystyle \sum }{\mathrm{e}}_{\mathrm{H}}^2={\displaystyle \sum }{\mathrm{p}}^2{\mathrm{e}}_{\mathrm{S}}^2+2\mathrm{p}\left(1-\mathrm{p}\right){\mathrm{e}}_{\mathrm{S}}\cdot {\mathrm{e}}_{\mathrm{V}}+{\left(1-\mathrm{p}\right)}^2{\mathrm{e}}_{\mathrm{V}}^2. $$ \end{document}</tex-math>
<graphic xlink:href="12859_2014_346_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Thus,
<inline-formula id="IEq14">
<alternatives>
<tex-math id="M16">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \widehat{\mathrm{p}}= \min \left( \max \left(\frac{{\displaystyle \sum {e}_v^2-{\displaystyle \sum {e}_v{e}_s}}}{{\displaystyle \sum {e}_s^2-2{\displaystyle \sum {e}_v{e}_s+{\displaystyle \sum {e}_v^2}}}},0\right),1\right) $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq14.gif"></inline-graphic>
</alternatives>
</inline-formula>
. We simulated the second layer of missing values 20 times and estimated
<inline-formula id="IEq15">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\widehat{p}}_i $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq15.gif"></inline-graphic>
</alternatives>
</inline-formula>
and took the average
<inline-formula id="IEq16">
<alternatives>
<tex-math id="M18">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \frac{{\displaystyle {\sum}_1^{20}}{\widehat{p}}_i}{20} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq16.gif"></inline-graphic>
</alternatives>
</inline-formula>
as the estimate of p. As in KNN-V imputation, KNN-H imputed values are rounded to the closest integer for ordinal variables and determined by weighted majority vote for nominal variables.</p>
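As a concrete sketch of this weight estimation (function names are ours, not from the phenomeImpute package): minimizing the quadratic error Σe_H² = Σ(p·e_S + (1 − p)·e_V)² over p and clipping to [0, 1] gives the closed form implemented below, averaged over the repeated second-layer simulations.

```python
import numpy as np

def estimate_hybrid_weight(e_s, e_v):
    """Weight p for KNN-H = p * KNN-S + (1 - p) * KNN-V, estimated from
    paired second-layer errors e_s (KNN-S) and e_v (KNN-V).
    Minimizes sum((p * e_s + (1 - p) * e_v) ** 2), clipped to [0, 1]."""
    e_s = np.asarray(e_s, dtype=float)
    e_v = np.asarray(e_v, dtype=float)
    num = np.sum(e_v ** 2) - np.sum(e_s * e_v)
    den = np.sum(e_s ** 2) - 2 * np.sum(e_s * e_v) + np.sum(e_v ** 2)
    if den == 0:           # e_s == e_v everywhere; any p gives the same error
        return 0.5
    return float(min(max(num / den, 0.0), 1.0))

def average_weight(error_pairs):
    """Average the estimate over repeated second-layer simulations
    (20 repetitions in the text)."""
    return float(np.mean([estimate_hybrid_weight(es, ev) for es, ev in error_pairs]))
```

Note the behavior at the extremes: if KNN-V has twice the error of KNN-S everywhere, the unclipped optimum exceeds 1 and all weight goes to KNN-S; if KNN-V is error-free, p̂ = 0.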
</sec>
<sec id="Sec10">
<title>Hybrid imputation using adaptive weight (KNN-A)</title>
<p>Bø et al. [
<xref ref-type="bibr" rid="CR28">28</xref>
] observed that the log-ratio of the squared errors
<inline-formula id="IEq17">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \log \left({\mathrm{e}}_{\mathrm{v}}^2/{\mathrm{e}}_{\mathrm{s}}^2\right) $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq17.gif"></inline-graphic>
</alternatives>
</inline-formula>
was a decreasing function of r
<sub>max</sub>
in microarray missing value imputation, where r
<sub>max</sub>
is the correlation between the variable with missing value and its closest neighbour. Such a trend suggested that when r
<sub>max</sub>
is larger, more weight should be given to KNN-V. Thus, p should vary for different r
<sub>max</sub>
. We adopted the same procedure to estimate the adaptive weight p: we estimated p based on e
<sub>S</sub>
and e
<sub>V</sub>
within each sliding window of r
<sub>max</sub>
, (r
<sub>max</sub>
 − 0.1, r
<sub>max</sub>
 + 0.1), requiring that at least 10 observations fall within each window for the computation of p.</p>
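The sliding-window estimation can be sketched as follows (a hypothetical helper, not the package implementation; widening the window when fewer than 10 error pairs fall inside it is one reasonable reading of the requirement above):

```python
import numpy as np

def adaptive_weights(r_max, e_s, e_v, half_width=0.1, min_obs=10):
    """Estimate a weight p for each missing entry from the error pairs
    (e_s, e_v) of entries whose r_max lies within +/- half_width of its own.
    The window is widened until it holds at least min_obs pairs."""
    r_max = np.asarray(r_max, dtype=float)
    e_s = np.asarray(e_s, dtype=float)
    e_v = np.asarray(e_v, dtype=float)
    p_hat = np.empty(len(r_max))
    for k, r in enumerate(r_max):
        w = half_width
        idx = np.abs(r_max - r) < w
        while idx.sum() < min_obs and w < 1.0:
            w += 0.05                      # widen until enough observations
            idx = np.abs(r_max - r) < w
        es, ev = e_s[idx], e_v[idx]
        # same closed-form minimizer as the global hybrid weight,
        # computed only from the pairs inside the window
        num = np.sum(ev ** 2) - np.sum(es * ev)
        den = np.sum(es ** 2) - 2 * np.sum(es * ev) + np.sum(ev ** 2)
        p_hat[k] = 0.5 if den == 0 else min(max(num / den, 0.0), 1.0)
    return p_hat
```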
</sec>
</sec>
<sec id="Sec11">
<title>Evaluation method</title>
<p>We compared different missing value imputation methods on both simulated and real datasets. We evaluated imputation performance by calculating the root mean squared error (RMSE) for continuous and ordinal variables and the proportion of false classification (PFC) for nominal variables. The purely simulated data are discussed in
<xref rid="Sec12" ref-type="sec">Simulated datasets</xref>
below. For real datasets, we first generated a complete dataset (CD) from the original raw dataset (RD) with missing values. We then simulated missing values (e.g. randomly at a 5% missing rate) to obtain the dataset with missing values (MD), performed imputation on the MD and assessed the performance by calculating the RMSE between the imputed and the true values. The squared errors are defined as
<inline-formula id="IEq18">
<alternatives>
<tex-math id="M20">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {e}^2=\frac{{\left({\widehat{y}}_{ij}-{y}_{ij}\right)}^2}{var\left({y}_j\right)} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq18.gif"></inline-graphic>
</alternatives>
</inline-formula>
for continuous variables (
<italic>ŷ</italic>
<sub>
<italic>ij</italic>
</sub>
and
<italic>y</italic>
<sub>
<italic>ij</italic>
</sub>
are the imputed and true values for subject i and variable j, and var(
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
) is the variance for variable
<italic>j</italic>
),
<inline-formula id="IEq19">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {e}^2={\left(\frac{{\widehat{y}}_{ij}-{y}_{ij}}{p-1}\right)}^2 $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq19.gif"></inline-graphic>
</alternatives>
</inline-formula>
for ordinal variables (p is the number of possible levels of
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
), and
<italic>e</italic>
<sup>2</sup>
 = χ(
<italic>ŷ</italic>
<sub>
<italic>ij</italic>
</sub>
 ≠ 
<italic>y</italic>
<sub>
<italic>ij</italic>
</sub>
) for nominal variables (χ(⋅) is an indicator function). The RMSE for continuous and ordinal variables is defined as
<inline-formula id="IEq20">
<alternatives>
<tex-math id="M22">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \sqrt{ave\left({e}^2\right)} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq20.gif"></inline-graphic>
</alternatives>
</inline-formula>
and the PFC for nominal variables is
<italic>ave</italic>
(
<italic>e</italic>
). We estimated the RMSE and the PFC from 20 randomly generated MDs.</p>
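The three error definitions above translate directly into code (illustrative helper functions; the names are ours):

```python
import numpy as np

def rmse_continuous(y_true, y_hat, var_j):
    """Normalized RMSE for a continuous variable:
    e^2 = (y_hat - y_true)^2 / Var(y_j)."""
    e2 = (np.asarray(y_hat, dtype=float) - np.asarray(y_true, dtype=float)) ** 2 / var_j
    return float(np.sqrt(e2.mean()))

def rmse_ordinal(y_true, y_hat, n_levels):
    """RMSE for an ordinal variable with n_levels possible levels:
    e^2 = ((y_hat - y_true) / (n_levels - 1))^2."""
    diff = np.asarray(y_hat, dtype=float) - np.asarray(y_true, dtype=float)
    e2 = (diff / (n_levels - 1)) ** 2
    return float(np.sqrt(e2.mean()))

def pfc_nominal(y_true, y_hat):
    """Proportion of false classification:
    mean of the indicator chi(y_hat != y_true)."""
    return float(np.mean(np.asarray(y_hat) != np.asarray(y_true)))
```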
</sec>
<sec id="Sec12">
<title>Simulated datasets</title>
<p>Simulation of complete datasets (CD): To demonstrate the performance of the various methods under different correlation structures, we considered three scenarios simulating N = 600 subjects and P = 300 variables.</p>
<p>Simulation I (six variable clusters + six subject clusters): We first generated the number of subjects in each cluster from Pois(80) and the number of variables in each cluster from Pois(40). To create the correlation structure among variables, we first generated a common basis δ
<sub>i</sub>
(i = 1…6) with length N for the variables in cluster i from N(μ, 4), where μ is randomly sampled from UNIF(−2, 2). Then we generated a set of slopes and intercepts (α
<sub>ip</sub>
, β
<sub>ip</sub>
), p = 1… v
<sub>i</sub>
, so that each variable is a linear transformation of the common basis, preserving the correlation structure. The remaining variables, independent of the grouped variables, were random samples from N(0, 4). The subject correlation structure was generated following a similar strategy: we first generated common bases γ
<sub>j</sub>
(j =1…6) from N(1,2) with length P. For all subjects in cluster j, γ
<sub>j</sub>
was added to each of them to create correlation within subjects. The remaining subjects were generated from N(0, 4 × I
<sub>P × P</sub>
). To create data of mixed types, we randomly converted 100 variables into nominal variables and 60 variables into ordinal variables by randomly generating 3 to 6 nominal/ordinal levels. The proportions of the different variable types were similar to those of the COPD data set. The heatmaps of the subject and variable distance matrices of the simulated data are shown in Figure 
<xref rid="Fig1" ref-type="fig">1</xref>
.
<fig id="Fig1">
<label>Figure 1</label>
<caption>
<p>
<bold>Heatmaps of distance matrices in Simulation I. (a)</bold>
Variable and
<bold>(b)</bold>
Subject distance matrices of Simulation I (black: small distance/high correlation; white: large distance/low correlation).</p>
</caption>
<graphic xlink:href="12859_2014_346_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>Simulation II (twenty variable groups + twenty subject groups): The number of clusters was increased to 20. The number of subjects in each cluster was generated from Pois(25) and the number of variables in each cluster from Pois(15) (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1).</p>
<p>Simulation III (no variable groups + forty subject groups): In this simulation, we generated data with sparse between-variable correlation but strong between-subject correlation, a setting similar to the nominal variables in the SARP data set (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6(c)). The number of subjects in each cluster followed Pois(14). In each subject cluster, a common basis γ
<sub>c</sub>
(c = 1…40) with length P was shared and perturbed by random errors from N(0, 0.01). We created sparse categorical variables by cutting continuous variables at the extreme quantiles (≤ 5% or ≥ 95%) and generating the other cut points randomly from UNIF(0.01, 0.99), which created up to 30 levels (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2).</p>
<p>Generating datasets with missing values (MD) from complete data (CD): MDs were generated by randomly removing m% of the values from the simulated CDs described above or from the CDs of the real data described in Section 
<xref rid="Sec3" ref-type="sec">Real data</xref>
. We considered m% = 5%, 20% and 40% in our simulation studies. All three settings were repeated 20 times.</p>
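Generating an MD from a CD amounts to masking a random m% of the entries; a minimal sketch (our own helper name, with NaN marking missingness):

```python
import numpy as np

def make_missing(cd, m_frac, rng=None):
    """Return a copy of the complete data matrix cd with a random
    m_frac fraction of its entries set to NaN (the original is untouched)."""
    rng = np.random.default_rng(rng)
    md = np.array(cd, dtype=float, copy=True)
    n_remove = int(round(m_frac * md.size))
    idx = rng.choice(md.size, size=n_remove, replace=False)
    md.flat[idx] = np.nan
    return md
```

Repeating this 20 times with different seeds reproduces the 20 replicates used in the evaluations.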
</sec>
<sec id="Sec13">
<title>Imputability measure</title>
<p>Current practice in the field is to impute all missing data after filtering out variables or subjects with more than a fixed percentage (e.g. 20%) of missing values. This practice implicitly assumes that all missing values are imputable by borrowing information from other variables or subjects. This assumption is usually true in microarray or other high-throughput marker data, since genes usually interact with each other and are co-regulated at the systemic level. For high-dimensional phenomic data, however, we have observed that many variables do not associate or interact with other variables and are difficult to impute. Therefore, to identify these missing values, we introduce a novel concept of “imputability” and develop a quantitative “imputability measure” (IM). Specifically, given a dataset with missing values, we generate a “second layer” of missing values as described above and perform the KNN-V and KNN-S methods on this secondary simulated layer. The procedure is repeated t times (t = 10 is usually sufficient), and E
<sub>i</sub>
and E
<sub>j</sub>
are calculated as the averages, over the t imputations, of the RMSEs for the second-layer missing values of subject i (i = 1,…,N) and variable j (j = 1,…,P). Let IMs
<sub>i</sub>
 = exp(−E
<sub>i</sub>
) and IMv
<sub>j</sub>
 = exp(−E
<sub>j</sub>
). The IM for a missing value D<sub>ij</sub> is defined as max(IMs
<sub>i</sub>
, IMv
<sub>j</sub>
). IM provides quantitative evidence of how well each missing value can be imputed by borrowing information from other variables or subjects. IM ranges between 0 and 1, and small IM values represent large imputation errors that should raise concerns about the use of imputation. The detailed procedure for generating IM is described in Additional file
<xref rid="MOESM2" ref-type="media">2</xref>
, Algorithm 1. In the application guideline proposed in the Results section, we recommend that users avoid imputation, or impute with caution, for missing values with IM below a pre-specified threshold.</p>
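Given the per-subject and per-variable second-layer errors, computing IM is a one-liner per missing entry. A sketch (argument names E_subject/E_variable are ours, mirroring E_i and E_j above; the full procedure is in Additional file 2):

```python
import numpy as np

def imputability(E_subject, E_variable, missing_idx):
    """IM for each missing entry (i, j): max(exp(-E_i), exp(-E_j)).

    E_subject[i] / E_variable[j] are the average second-layer RMSEs
    for subject i / variable j over the t repetitions."""
    IMs = np.exp(-np.asarray(E_subject, dtype=float))   # subject-wise IM
    IMv = np.exp(-np.asarray(E_variable, dtype=float))  # variable-wise IM
    return {(i, j): float(max(IMs[i], IMv[j])) for i, j in missing_idx}
```

Entries with IM below a chosen threshold (e.g. the lowest quartile) can then be flagged as un-imputable.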
</sec>
<sec id="Sec14">
<title>The self-training selection (STS) scheme</title>
<p>In our analyses, no imputation method performed universally better than all the others. Thus, the best choice of imputation method depends on the particular structure of a given dataset. Previously, we proposed a Self-Training Selection (STS) scheme for microarray missing value imputation [
<xref ref-type="bibr" rid="CR24">24</xref>
]. Here we applied the STS scheme and evaluated its performance in the complete real datasets. Figure 
<xref rid="Fig2" ref-type="fig">2</xref>
 shows a diagram of the STS scheme and how we evaluated it. From a CD, we simulated 20 MDs (MD
<sub>1</sub>
, MD
<sub>2</sub>
, …, MD
<sub>20</sub>
). Our goal was to identify the best method for the data set. To achieve that, we randomly generated a second layer of missing values within each MD
<sub>b</sub>
(1 ≤ b ≤ 20) for 20 times and denoted the data sets with two layers of missing values as MD
<sub>b,i</sub>
(1 ≤ i ≤ 20). The method that performed best in imputing the second-layer missing values, i.e., generated the smallest average RMSE, was identified as the method selected by the STS scheme for missing value imputation of MD
<sub>b</sub>
(denoted as M
<sub>b, STS</sub>
). We considered the optimal method for the first layer of missing values as the “true” optimal imputation method, denoted as M
<sub>b*</sub>
, and defined the proportion of the 20 simulations in which M
<sub>b, STS</sub>
 = M
<sub>b*</sub>
(i.e.
<inline-formula id="IEq21">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {\displaystyle {\sum}_{\mathrm{b}=1}^{20}\mathrm{I}\left({\mathrm{M}}_{\mathrm{b},\mathrm{S}\mathrm{T}\mathrm{S}}={\mathrm{M}}_{\mathrm{b}*}\right)} $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq21.gif"></inline-graphic>
</alternatives>
</inline-formula>
/20, where I(⋅) is the indicator function) as the accuracy of the STS scheme.
<fig id="Fig2">
<label>Figure 2</label>
<caption>
<p>
<bold>Diagram of evaluating performance of STS scheme in a real complete data set (CD).</bold>
Missing data sets are randomly generated 20 times (MD
<sub>1</sub>
, ⋅⋅⋅, MD
<sub>20</sub>
). The STS scheme is applied to learn the best method from STS simulation (denoted as M
<sub>b,STS</sub>
for the b-th missing data set MD
<sub>b</sub>
). The true best (in terms of RMSE) method for MD
<sub>b</sub>
is denoted as M
<sub>b*</sub>
and the STS best (in terms of RMSE across MD
<sub>b,1</sub>
, …, MD
<sub>b,20</sub>
) method is denoted as M
<sub>b,STS</sub>
. When M
<sub>b,STS</sub>
 = M
<sub>b*</sub>
, the STS scheme successfully selects the optimal method.</p>
</caption>
<graphic xlink:href="12859_2014_346_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
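The STS selection loop itself is short; a sketch under stated assumptions (the impute, second-layer simulation and error helpers are caller-supplied placeholders, not the package API):

```python
import numpy as np

def sts_select(md, methods, simulate_second_layer, error_of, n_rep=20, rng=None):
    """Pick the imputation method with the smallest average error over
    n_rep second-layer simulations on one missing dataset md.

    methods: dict name -> impute(data) callable.
    simulate_second_layer(md, rng) -> (md_with_extra_missing, hidden_truth).
    error_of(imputed, truth) -> scalar RMSE/PFC-style error."""
    rng = np.random.default_rng(rng)
    totals = {name: 0.0 for name in methods}
    for _ in range(n_rep):
        md2, truth = simulate_second_layer(md, rng)   # hide extra values
        for name, impute in methods.items():
            totals[name] += error_of(impute(md2), truth)
    return min(totals, key=totals.get)                # smallest total error
```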
</sec>
</sec>
<sec id="Sec15" sec-type="results">
<title>Results</title>
<sec id="Sec16">
<title>Simulation results</title>
<p>We compared the performance of seven methods – mean imputation (MeanImp), KNN-V, KNN-S, KNN-H, KNN-A, missForest and MICE – on the three simulation scenarios described above. When implementing MICE, the R package returned errors when a nominal or ordinal variable contained a large number of levels and any level contained a small number of observations. As a result, MICE was not applied in the Simulation III evaluation. We first performed simulations to determine the effect of the choice of K on imputation. We tested K = 5, 10 and 15 at missing rates of 5%, 10% and 20% on different types of data. The imputation results with different K values are similar (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3). We thus chose K = 5 for both simulation and real data applications as it generated good performance in most situations.</p>
<p>Figure 
<xref rid="Fig3" ref-type="fig">3</xref>
shows the boxplots of the RMSEs of the three types of variables from 20 simulations for the three simulation scenarios. For simulations I and II, we observed that missForest performed the best for all three data types. MICE performed better than the KNN methods in nominal missing value imputation, but worse in the imputation of continuous and ordinal variables. The two hybrid KNN methods (KNN-A and KNN-H) consistently performed better than KNN-V and KNN-S, showing the effectiveness of combining information from variables and subjects. KNN-A performed slightly better than KNN-H, especially in the first two simulation scenarios, indicating the advantage of adaptive weighting in combining KNN-V and KNN-S information. For simulation III, KNN-S performed the best overall while KNN-V failed. This is expected due to the lack of correlation between variables. missForest was also not as good as KNN-S in the continuous and nominal variable imputations. In this case, the performance of KNN-S, KNN-H and KNN-A was not affected much by the missing percentage, owing to the strong correlation among subjects.
<fig id="Fig3">
<label>Figure 3</label>
<caption>
<p>
<bold>Boxplots of RMSE/PFC for (a) Simulation I and (b) Simulation II and (c) Simulation III.</bold>
KNN-based methods: KNN-V, KNN-S, KNN-H and KNN-A; RF: missForest algorithm; MICE: multivariate imputation by chained equations; MeanImp: mean imputation.</p>
</caption>
<graphic xlink:href="12859_2014_346_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
</sec>
<sec id="Sec17">
<title>Real data applications</title>
<p>Next we compared the different methods on three real datasets. As in the simulation study above, we first investigated the choice of K for the simulations from the real datasets and reached the same conclusion (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4). In order to implement MICE in our comparative analysis, we had to remove categorical variables with any sparse level (i.e. having <10% of the total observations) and those with more than 10 levels. The numbers of variables after such filtering are shown in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1. Since only 26% (38/144), 14% (16/118) and 45% (49/108) of nominal and ordinal variables were retained after the filtering, we decided to remove MICE from the comparison and report the comparative results of the remaining methods with the unfiltered data in Figure 
<xref rid="Fig4" ref-type="fig">4</xref>
. The comparative results for all methods including MICE on the filtered data are available in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S5. As expected, the mean imputation almost always performed the worst (Figure 
<xref rid="Fig4" ref-type="fig">4</xref>
). KNN-V usually performed better than KNN-S (except for the nominal variables in SARP), indicating that more information was borrowed from neighboring variables than from subjects. The hybrid methods KNN-H and KNN-A performed better than either KNN-S or KNN-V alone, and KNN-A seemed to slightly outperform KNN-H. missForest was usually the best performer, with the exception of the nominal variables in the SARP data set. This is probably because of the low mutual correlation of the nominal variables with other variables in this data set, as demonstrated in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6 (note that missForest only borrows information from variables). Overall, no method universally outperformed the others. In Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S5, after filtering, the comparative results are similar to those in Figure 
<xref rid="Fig4" ref-type="fig">4</xref>
for the KNN methods and missForest. The MICE method had unstable performance, sometimes performing among the best and sometimes much worse than all the others.
<fig id="Fig4">
<label>Figure 4</label>
<caption>
<p>
<bold>Boxplots of RMSE/PFC for (a) COPD; (b) SARP and (c) LTRC.</bold>
KNN-based methods: KNN-V, KNN-S, KNN-H and KNN-A; RF: MissForest algorithm; MeanImp: Mean imputation.</p>
</caption>
<graphic xlink:href="12859_2014_346_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
</sec>
<sec id="Sec18">
<title>Imputability measure</title>
<p>The motivation for the imputability concept is that some variables or subjects have no near neighbours from which to borrow information and hence cannot be imputed accurately. The distributions of the imputability measure (IM; defined in Section 
<xref rid="Sec13" ref-type="sec">Imputability measure</xref>
) of the variables (IMv) and subjects (IMs) of COPD, LTRC and SARP data are shown in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S7. We observed heavy left tails, indicating the existence of many un-imputable subjects and variables. By including these poorly imputed values, we risk reducing the accuracy and power of downstream analyses. To demonstrate the usefulness of IM, we compared the RMSE/PFC before and after removing un-imputable values. Figure 
<xref rid="Fig5" ref-type="fig">5</xref>
shows a significant reduction in RMSE and PFC after removing the missing values with the lowest 25% of IMs. In Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S8, heatmaps of IMs for the three real datasets are presented. Values colored in green have low IMs and should be imputed with caution.
<fig id="Fig5">
<label>Figure 5</label>
<caption>
<p>
<bold>Boxplots of RMSE/PFC evaluated using (1) all imputed values and (2) only imputable values in LTRC dataset.</bold>
Boxplots of RMSE/PFC evaluated using (1) all imputed values and (2) only imputable values in LTRC dataset with m =5% missingness. Color: grey (evaluation using all imputed values); white (evaluation using only imputable values).</p>
</caption>
<graphic xlink:href="12859_2014_346_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
</sec>
<sec id="Sec19">
<title>The self-training selection scheme (STS) and an application guideline</title>
<p>Finally, we applied the STS scheme to the real datasets and the performance is reported in Table 
<xref rid="Tab4" ref-type="table">4</xref>
. Methods with RMSEs within 5% of one another were considered comparable. Thus, if a method generated an RMSE within 5% of the minimum RMSE across all methods, we considered it indistinguishable from the optimal method and also an optimal choice. We found that the STS scheme almost always selected the true optimal imputation method with perfect accuracy (with only a few exceptions ranging from 75% to 95% accuracy). Figure 
<xref rid="Fig6" ref-type="fig">6</xref>
describes an application guideline for phenomic missing value imputation. First, the STS scheme is applied to the MD of each data type separately to identify the best imputation method. The IMs are then calculated based on the selected optimal method. Finally, imputation is performed with the optimal method selected by the STS scheme, and users have two options for proceeding to downstream analyses. In Option A, all missing values are imputed, accompanied by IMs that can be incorporated into downstream analyses. In Option B, only missing values with IMs higher than a pre-specified threshold are imputed and reported.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>
<bold>Accuracy of STS in real data applications</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr valign="top">
<th rowspan="2">
<bold>Data</bold>
</th>
<th rowspan="2">
<bold>m%</bold>
</th>
<th colspan="2">
<bold>Continuous variables</bold>
</th>
<th colspan="2">
<bold>Nominal variables</bold>
</th>
<th colspan="2">
<bold>Ordinal variables</bold>
</th>
</tr>
<tr valign="top">
<th>
<bold>Predicted optimal method (No. of time selected)</bold>
</th>
<th>
<bold>Accuracy</bold>
</th>
<th>
<bold>Predicted optimal method (No. of time selected)</bold>
</th>
<th>
<bold>Accuracy</bold>
</th>
<th>
<bold>Predicted optimal method (No. of time selected)</bold>
</th>
<th>
<bold>Accuracy</bold>
</th>
</tr>
</thead>
<tbody>
<tr valign="top">
<td rowspan="3">COPD</td>
<td>5%</td>
<td>KNN-V(10), RF(10)</td>
<td>100%</td>
<td>RF(10), KNN-A(8), KNN-V(2)</td>
<td>100%</td>
<td>RF(20)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td>20%</td>
<td>KNN-V(13), RF(6), KNN-H(1)</td>
<td>100%</td>
<td>RF(14), KNN-A(4), KNN-V(2)</td>
<td>100%</td>
<td>RF(20)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td>40%</td>
<td>KNN-V(10), RF(10)</td>
<td>100%</td>
<td>KNN-V(16), RF(1), KNN-A(3)</td>
<td>95%</td>
<td>RF(20)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td rowspan="3">LTRC</td>
<td>5%</td>
<td>KNN-V(15), KNN-A(3), RF(2)</td>
<td>95%</td>
<td>RF(14), KNN-A(3), KNN-V(3)</td>
<td>75%</td>
<td>RF(19), KNN-A(1)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td>20%</td>
<td>KNN-V(12), RF(8)</td>
<td>85%</td>
<td>RF(15), KNN-V(1), KNN-A(4)</td>
<td>100%</td>
<td>RF(16), KNN-A(4)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td>40%</td>
<td>RF(13), KNN-V(7)</td>
<td>90%</td>
<td>KNN-A(13), RF(6), KNN-V(1)</td>
<td>100%</td>
<td>RF(20)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td rowspan="3">SARP</td>
<td>5%</td>
<td>KNN-V(13), KNN-A(6), RF(1)</td>
<td>100%</td>
<td>KNN-A(20)</td>
<td>100%</td>
<td>RF(18), KNN-H(2)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td>20%</td>
<td>KNN-V(16), KNN-A(4)</td>
<td>100%</td>
<td>KNN-A(20)</td>
<td>100%</td>
<td>RF(16), KNN-H(4)</td>
<td>100%</td>
</tr>
<tr valign="top">
<td>40%</td>
<td>KNN-V(17), KNN-A(3)</td>
<td>100%</td>
<td>KNN-A(20)</td>
<td>100%</td>
<td>RF(20)</td>
<td>100%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Note: Here “predicted optimal method” means the predicted method with minimal RMSE for second layer of missing values; and “accuracy” means the chances we correctly predict optimal method. (
<inline-formula id="IEq22">
<alternatives>
<tex-math id="M24">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \mathrm{Accuracy} = \frac{{\displaystyle {\sum}_{b=1}^{20}}I\left({M}_{b,STS}={M}_{b*}\right)}{20}\times 100\% $$ \end{document}</tex-math>
<inline-graphic xlink:href="12859_2014_346_Article_IEq22.gif"></inline-graphic>
</alternatives>
</inline-formula>
).</p>
</table-wrap-foot>
</table-wrap>
<fig id="Fig6">
<label>Figure 6</label>
<caption>
<p>
<bold>An application guideline to apply the STS scheme for a real dataset with missing values.</bold>
</p>
</caption>
<graphic xlink:href="12859_2014_346_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
</sec>
</sec>
<sec id="Sec20" sec-type="discussion">
<title>Discussion</title>
<p>In our comparative study of the imputation methods available for phenomic data, MICE encountered difficulty with nominal and ordinal data types when any level of a variable had few observations. This limited its application to some real data. It also had unstable performance: in some situations it was among the top performers, while in others it performed much worse than the KNN methods and missForest. Among the KNN methods, the hybrid methods (KNN-H and KNN-A) that combined information from neighboring subjects and variables usually performed better than borrowing information from either subjects (KNN-S) or variables (KNN-V) alone. missForest was usually among the top performers, although it could fail when correlations among variables are sparse. In the proposed KNN-based methods, when there are many nominal variables with sparse levels, ordinary logistic regression also fails to work; when this happens, a contingency table is used to impute the missing values. This partly explains why the accuracy remained mostly unchanged across different missing percentages (5% to 40%); it is also due to the lack of similar variables with nominal missing values. Overall, no method universally performed the best in all situations. Thus, we implemented a STS scheme [
<xref ref-type="bibr" rid="CR24">24</xref>
] previously developed for microarray missing value imputation to identify the best method for phenomic data. Our evaluation showed that STS selected the true best method with almost perfect accuracy.</p>
<p>In missing value imputation of microarray data, it is common practice to impute all missing values and return a complete data matrix for downstream analyses. In our analysis, however, we found that many variables or subjects in phenomic data are intrinsically difficult to impute. Our proposed IM was effective in identifying missing values that intrinsically cannot be imputed well, and its use improved the imputation performance. As a result, our application guideline recommends always reporting both the imputed values and IMs when all missing values are imputed (Option A), or imputing only the missing values with high IMs (Option B). With the former output, it is possible to incorporate the IM values into downstream analyses (e.g. by down-weighting imputed values with low IMs).</p>
<p>We note that RMSE has been used to evaluate the performance of the different methods in this paper. Depending on the final biological objectives, there are many choices of downstream analyses after imputation; for example, association analysis, cluster analysis, classification analysis, pathway enrichment analysis and graphical models, to name a few. While the impact of imputation methods on these downstream analyses is of ultimate interest, it is beyond the scope of this paper; we considered RMSE the most direct assessment with which to evaluate the methods. In our simulations and real data, we examined data sizes of hundreds of clinical variables and hundreds of samples, a common scale for phenomic datasets. In the future, if larger numbers of variables or patients are expected (e.g. up to thousands), further evaluation of the methodological and computational capabilities of the different methods will be needed.</p>
<p>With the accelerated pace of phenomic data generation in many complex diseases nowadays, missing values are almost always inevitable. Ignoring subjects or variables with any missing value is no longer practical, as it significantly reduces statistical power and may distort conclusions. Missing value imputation is a practical and powerful solution, yet its use in high-dimensional phenomic data has not drawn much attention in the literature. To our knowledge, our pipeline is the first complete guideline for missing value imputation in high-dimensional phenomic data. We believe that the methods, the imputability concept, the STS scheme and the application guideline proposed in this paper will provide practical guidance to researchers in the field.</p>
</sec>
<sec id="Sec21" sec-type="conclusion">
<title>Conclusions</title>
<p>In this paper, we conducted a comprehensive comparison of existing imputation methods for phenomic data, including four variations of KNN imputation developed in this paper, missForest and MICE, using three simulation scenarios and three real phenomic datasets. We proposed a novel “imputability” concept with a quantitative imputability measure (IM) to characterize whether a missing value is imputable or not. More importantly, since the choice of the best imputation method depends on the data types and data structure, we implemented a simulation-based “self-training selection” (STS) scheme to select the best method in a given application. Finally, we illustrated an application guideline for practitioners to apply in real phenomic applications. The R package “phenomeImpute” is available to implement all methods and the analytical pipeline proposed in this paper.</p>
<sec id="Sec22">
<title>Availability of supporting data</title>
<p>The R package “phenomeImpute” is available at
<ext-link ext-link-type="uri" xlink:href="http://tsenglab.biostat.pitt.edu/software.htm">http://tsenglab.biostat.pitt.edu/software.htm</ext-link>
. The three real datasets and R code are available at
<ext-link ext-link-type="uri" xlink:href="http://tsenglab.biostat.pitt.edu/publication.htm">http://tsenglab.biostat.pitt.edu/publication.htm</ext-link>
.</p>
</sec>
</sec>
<sec sec-type="supplementary-material">
<title>Additional files</title>
<sec id="Sec23">
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12859_2014_346_MOESM1_ESM.docx">
<label>Additional file 1:</label>
<caption>
<p>
<bold>Supplementary materials.</bold>
This file contains supplementary figures, tables and detailed description of correlation measures.
<bold>Figure S1.</bold>
Heatmaps of (a) Variable (b) Subject distance in Simulation II.
<bold>Figure S2.</bold>
Heatmaps of (a) Variable (b) Subject distance in Simulation III.
<bold>Figure S3.</bold>
Selection of
<italic>K</italic>
for KNN-S (A) and KNN-V (B). First row: Simulation I; Second row: Simulation II; Third row: Simulation III.
<bold>Figure S4.</bold>
Selection of
<italic>K</italic>
for KNN-S (A) and KNN-V (B). First row: COPD; Second row: LTRC; Third row: SARP.
<bold>Figure S5.</bold>
Comparison of different missing value imputation methods in filtered data such that MICE can be implemented (First row: COPD; Second row: LTRC; Third row: SARP).
<bold>Figure S6.</bold>
Heatmaps of variable distance matrix (above) and subject distance matrix (below) of real data (COPD/LTRC/SARP).
<bold>Figure S7.</bold>
Density of IMv and IMs for three real datasets.
<bold>Figure S8.</bold>
Heatmaps of imputability measures for (a) COPD; (b) LTRC; (c) SARP. Red indicates larger imputability measures; green indicates smaller imputability measures. Detailed description of correlation measures.
<bold>Table S1.</bold>
Number of variables after filtering out sparse ordinal or nominal variables for MICE implementation.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM2">
<media xlink:href="12859_2014_346_MOESM2_ESM.docx">
<label>Additional file 2:</label>
<caption>
<p>
<bold>Algorithm 1.</bold>
Procedure of generating Imputability Measure (IM).</p>
</caption>
</media>
</supplementary-material>
</sec>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>Serena G Liao and Yan Lin contributed equally to this work.</p>
</fn>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>GCT supervised the whole project. SGL developed all statistical analyses. DDK was involved in the initial discussion and method development. NK and FCS provided the clinical datasets for method evaluation. SGL, YL and GCT drafted the manuscript. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>Funding: This study was supported by NIH grants R21MH094862, U01HL108642, U01HL112707 and RC2HL101715.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Denny</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Ritchie</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Basford</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Pulley</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Bastarache</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Brown-Gentry</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Masys</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Roden</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Crawford</surname>
<given-names>DC</given-names>
</name>
</person-group>
<article-title>PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>9</issue>
<fpage>1205</fpage>
<lpage>1210</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq126</pub-id>
<pub-id pub-id-type="pmid">20335276</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hanauer</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Ramakrishnan</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Modeling temporal relationships in large scale clinical associations</article-title>
<source>J Am Med Inform Assoc</source>
<year>2013</year>
<volume>20</volume>
<issue>2</issue>
<fpage>332</fpage>
<lpage>341</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2012-001117</pub-id>
<pub-id pub-id-type="pmid">23019240</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lyalina</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Percha</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Lependu</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Iyer</surname>
<given-names>SV</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>RB</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>NH</given-names>
</name>
</person-group>
<article-title>Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records</article-title>
<source>J Am Med Inform Assoc</source>
<year>2013</year>
<volume>20</volume>
<issue>e2</issue>
<fpage>e297</fpage>
<lpage>e305</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2013-001933</pub-id>
<pub-id pub-id-type="pmid">23956017</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ritchie</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Denny</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Zuvich</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Crawford</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Schildcrout</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Bastarache</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ramirez</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Mosley</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Pulley</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Basford</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Bradford</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rasmussen</surname>
<given-names>LV</given-names>
</name>
<name>
<surname>Pathak</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chute</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Kullo</surname>
<given-names>IJ</given-names>
</name>
<name>
<surname>McCarty</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Chisholm</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Kho</surname>
<given-names>AN</given-names>
</name>
<name>
<surname>Carlson</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Larson</surname>
<given-names>EB</given-names>
</name>
<name>
<surname>Jarvik</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Sotoodehnia</surname>
<given-names>N</given-names>
</name>
<collab>Cohorts for Heart Aging Research in Genomic Epidemiology (CHARGE) QRS Group</collab>
<name>
<surname>Manolio</surname>
<given-names>TA</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Masys</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Haines</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Roden</surname>
<given-names>DM</given-names>
</name>
</person-group>
<article-title>Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk</article-title>
<source>Circulation</source>
<year>2013</year>
<volume>127</volume>
<issue>13</issue>
<fpage>1377</fpage>
<lpage>1385</lpage>
<pub-id pub-id-type="doi">10.1161/CIRCULATIONAHA.112.000604</pub-id>
<pub-id pub-id-type="pmid">23463857</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Warner</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Alterovitz</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bodio</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Joyce</surname>
<given-names>RM</given-names>
</name>
</person-group>
<article-title>External phenome analysis enables a rational federated query strategy to detect changing rates of treatment-related complications associated with multiple myeloma</article-title>
<source>J Am Med Inform Assoc</source>
<year>2013</year>
<volume>20</volume>
<issue>4</issue>
<fpage>696</fpage>
<lpage>699</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2012-001355</pub-id>
<pub-id pub-id-type="pmid">23515788</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fernald</surname>
<given-names>GH</given-names>
</name>
<name>
<surname>Capriotti</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Daneshjou</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Karczewski</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>RB</given-names>
</name>
</person-group>
<article-title>Bioinformatics challenges for personalized medicine</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>13</issue>
<fpage>1741</fpage>
<lpage>1748</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr295</pub-id>
<pub-id pub-id-type="pmid">21596790</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Singer</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>“Phenome” project set to pin down subgroups of autism</article-title>
<source>Nat Med</source>
<year>2005</year>
<volume>11</volume>
<issue>6</issue>
<fpage>583</fpage>
<pub-id pub-id-type="doi">10.1038/nm0605-583a</pub-id>
<pub-id pub-id-type="pmid">15937456</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sterne</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>White</surname>
<given-names>IR</given-names>
</name>
<name>
<surname>Carlin</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Spratt</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Royston</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kenward</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Wood</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Carpenter</surname>
<given-names>JR</given-names>
</name>
</person-group>
<article-title>Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls</article-title>
<source>BMJ</source>
<year>2009</year>
<volume>338</volume>
<issue>jun29 1</issue>
<fpage>b2393</fpage>
<pub-id pub-id-type="doi">10.1136/bmj.b2393</pub-id>
<pub-id pub-id-type="pmid">19564179</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Little</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>D’Agostino</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Dickersin</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Emerson</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Farrar</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Frangakis</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hogan</surname>
<given-names>JW</given-names>
</name>
<name>
<surname>Molenberghs</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Murphy</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Neaton</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Rotnitzky</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Scharfstein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Shih</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Siegel</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Stern</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>The prevention and treatment of missing data in clinical trials</article-title>
<source>N Engl J Med</source>
<year>2012</year>
<volume>367</volume>
<fpage>1355</fpage>
<lpage>1360</lpage>
<pub-id pub-id-type="doi">10.1056/NEJMsr1203730</pub-id>
<pub-id pub-id-type="pmid">23034025</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tanner</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>WH</given-names>
</name>
</person-group>
<article-title>The calculation of posterior distributions by data augmentation</article-title>
<source>J Am Stat Assoc</source>
<year>1987</year>
<volume>82</volume>
<fpage>528</fpage>
<lpage>550</lpage>
<pub-id pub-id-type="doi">10.1080/01621459.1987.10478458</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Tanner</surname>
<given-names>MA</given-names>
</name>
</person-group>
<source>Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions</source>
<year>1996</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Springer-Verlag</publisher-name>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Missing data imputation using the multivariate
<italic>t</italic>
distribution</article-title>
<source>J Multivar Anal</source>
<year>1995</year>
<volume>53</volume>
<issue>1</issue>
<fpage>139</fpage>
<lpage>158</lpage>
<pub-id pub-id-type="doi">10.1006/jmva.1995.1029</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Little</surname>
<given-names>RJA</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>DB</given-names>
</name>
</person-group>
<source>Statistical Analysis with Missing Data</source>
<year>2002</year>
<edition>2</edition>
<publisher-loc>New York</publisher-loc>
<publisher-name>John Wiley</publisher-name>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raghunathan</surname>
<given-names>TE</given-names>
</name>
<name>
<surname>Lepkowski</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Hoewyk</surname>
<given-names>JV</given-names>
</name>
<name>
<surname>Solenberger</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>A multivariate technique for multiply imputing missing values using a sequence of regression models</article-title>
<source>Survey Methodology</source>
<year>2001</year>
<volume>27</volume>
<issue>1</issue>
<fpage>85</fpage>
<lpage>95</lpage>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Rubin</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Schafer</surname>
<given-names>JL</given-names>
</name>
</person-group>
<article-title>Efficiently creating multiple imputations for incomplete multivariate normal data</article-title>
<source>Proceeding of the Statistical Computing Section of the American Statistical Association</source>
<year>1990</year>
<fpage>83</fpage>
<lpage>88</lpage>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>van Buuren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Groothuis-Oudshoorn</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Mice: multivariate imputation by chained equations in R</article-title>
<source>J Stat Softw</source>
<year>2011</year>
<volume>45</volume>
<issue>3</issue>
<fpage>1</fpage>
<lpage>67</lpage>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andridge</surname>
<given-names>RR</given-names>
</name>
<name>
<surname>Little</surname>
<given-names>RJ</given-names>
</name>
</person-group>
<article-title>A review of hot deck imputation for survey non-response</article-title>
<source>Int Stat Rev</source>
<year>2010</year>
<volume>78</volume>
<issue>1</issue>
<fpage>40</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="doi">10.1111/j.1751-5823.2010.00103.x</pub-id>
<pub-id pub-id-type="pmid">21743766</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Little</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Yosef</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Cain</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Nan</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Harlow</surname>
<given-names>SD</given-names>
</name>
</person-group>
<article-title>A hot-deck multiple imputation procedure for gaps in longitudinal data on recurrent events</article-title>
<source>Stat Med</source>
<year>2008</year>
<volume>27</volume>
<issue>1</issue>
<fpage>103</fpage>
<lpage>120</lpage>
<pub-id pub-id-type="doi">10.1002/sim.2939</pub-id>
<pub-id pub-id-type="pmid">17592832</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Rubin</surname>
<given-names>DB</given-names>
</name>
</person-group>
<source>Multiple Imputation for Nonresponse in Surveys</source>
<year>1987</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Wiley</publisher-name>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raghunathan</surname>
<given-names>TE</given-names>
</name>
<name>
<surname>Grizzle</surname>
<given-names>JE</given-names>
</name>
</person-group>
<article-title>A split questionnaire survey design</article-title>
<source>J Am Stat Assoc</source>
<year>1995</year>
<volume>90</volume>
<fpage>54</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1080/01621459.1995.10476488</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raghunathan</surname>
<given-names>TE</given-names>
</name>
<name>
<surname>Siscovick</surname>
<given-names>DS</given-names>
</name>
</person-group>
<article-title>A multiple imputation analysis of a case–control study of the risk of primary cardiac arrest among pharmacologically treated hypertensives</article-title>
<source>Appl Stat</source>
<year>1996</year>
<volume>45</volume>
<fpage>335</fpage>
<lpage>352</lpage>
<pub-id pub-id-type="doi">10.2307/2986092</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Schafer</surname>
<given-names>JL</given-names>
</name>
</person-group>
<source>Analysis of Incomplete Multivariate Data by Simulation</source>
<year>1997</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Chapman and Hall</publisher-name>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jerez</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Molina</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Garcia-Laencina</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Alba</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Ribelles</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Franco</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Missing data imputation using statistical and machine learning methods in a real breast cancer problem</article-title>
<source>Artif Intell Med</source>
<year>2010</year>
<volume>50</volume>
<issue>2</issue>
<fpage>105</fpage>
<lpage>115</lpage>
<pub-id pub-id-type="doi">10.1016/j.artmed.2010.05.002</pub-id>
<pub-id pub-id-type="pmid">20638252</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brock</surname>
<given-names>GN</given-names>
</name>
<name>
<surname>Shaffer</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Blakesley</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Lotz</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Tseng</surname>
<given-names>GC</given-names>
</name>
</person-group>
<article-title>Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>12</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-12</pub-id>
<pub-id pub-id-type="pmid">18186917</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oh</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>DD</given-names>
</name>
<name>
<surname>Brock</surname>
<given-names>GN</given-names>
</name>
<name>
<surname>Tseng</surname>
<given-names>GC</given-names>
</name>
</person-group>
<article-title>Biological impact of missing-value imputation on downstream analyses of gene expression profiles</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>1</issue>
<fpage>78</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq613</pub-id>
<pub-id pub-id-type="pmid">21045072</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stekhoven</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Bühlmann</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>MissForest - nonparametric missing value imputation for mixed-type data</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>28</volume>
<fpage>113</fpage>
<lpage>118</lpage>
</element-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Acuna</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Rodriguez</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>The treatment of missing values and its effect in the classifier accuracy</article-title>
<source>Clustering and Data Mining Applications</source>
<year>2004</year>
<fpage>639</fpage>
<lpage>648</lpage>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bø</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Dysvik</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Jonassen</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>LSimpute: accurate estimation of missing values in microarray data with least squares methods</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<issue>3</issue>
<fpage>e34</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gnh026</pub-id>
<pub-id pub-id-type="pmid">14978222</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Olkin</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Tate</surname>
<given-names>RF</given-names>
</name>
</person-group>
<article-title>Multivariate correlation models with mixed discrete and continuous variables</article-title>
<source>Ann Math Stat</source>
<year>1961</year>
<volume>32</volume>
<issue>2</issue>
<fpage>448</fpage>
<lpage>465</lpage>
<pub-id pub-id-type="doi">10.1214/aoms/1177705052</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Agresti</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Measures of nominal-ordinal association</article-title>
<source>J Am Stat Assoc</source>
<year>1981</year>
<volume>76</volume>
<issue>375</issue>
<fpage>524</fpage>
<lpage>529</lpage>
<pub-id pub-id-type="doi">10.1080/01621459.1981.10477679</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Olsson</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Drasgow</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Dorans</surname>
<given-names>NJ</given-names>
</name>
</person-group>
<article-title>The polyserial correlation coefficient</article-title>
<source>Psychometrika</source>
<year>1982</year>
<volume>47</volume>
<issue>3</issue>
<fpage>337</fpage>
<lpage>347</lpage>
<pub-id pub-id-type="doi">10.1007/BF02294164</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Olsson</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Maximum likelihood estimation of the polychoric correlation coefficient</article-title>
<source>Psychometrika</source>
<year>1979</year>
<volume>44</volume>
<issue>4</issue>
<fpage>443</fpage>
<lpage>460</lpage>
<pub-id pub-id-type="doi">10.1007/BF02296207</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boas</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Determination of the coefficient of correlation</article-title>
<source>Science</source>
<year>1909</year>
<volume>29</volume>
<fpage>823</fpage>
<lpage>824</lpage>
<pub-id pub-id-type="doi">10.1126/science.29.751.823</pub-id>
<pub-id pub-id-type="pmid">17743093</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pearson</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable</article-title>
<source>Philos Trans R Soc Lond Ser A Math Phys Eng Sci</source>
<year>1900</year>
<volume>195</volume>
<fpage>1</fpage>
<lpage>47</lpage>
<pub-id pub-id-type="doi">10.1098/rsta.1900.0022</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yule</surname>
<given-names>GU</given-names>
</name>
</person-group>
<article-title>On the methods of measuring the association between two attributes</article-title>
<source>J Roy Statist Soc</source>
<year>1912</year>
<volume>75</volume>
<fpage>579</fpage>
<lpage>652</lpage>
<pub-id pub-id-type="doi">10.2307/2340126</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Cramér</surname>
<given-names>H</given-names>
</name>
</person-group>
<source>Mathematical Methods of Statistics</source>
<year>1946</year>
<publisher-loc>Princeton</publisher-loc>
<publisher-name>Princeton University Press</publisher-name>
</element-citation>
</ref>
<ref id="CR37">
<label>37.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gower</surname>
<given-names>JC</given-names>
</name>
</person-group>
<article-title>A general coefficient of similarity and some of its properties</article-title>
<source>Biometrics</source>
<year>1971</year>
<volume>27</volume>
<issue>4</issue>
<fpage>857</fpage>
<lpage>871</lpage>
<pub-id pub-id-type="doi">10.2307/2528823</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
