AIDS in Sub-Saharan Africa (exploration server)

Constructing socio-economic status indices: how to use principal components analysis

Internal identifier: 004132 (Istex/Corpus); previous: 004131; next: 004133

Authors: Seema Vyas; Lilani Kumaranayake

RBID : ISTEX:C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6

Abstract

Theoretically, measures of household wealth can be reflected by income, consumption or expenditure information. However, the collection of accurate income and consumption data requires extensive resources for household surveys. Given the increasingly routine application of principal components analysis (PCA) using asset data in creating socio-economic status (SES) indices, we review how PCA-based indices are constructed, how they can be used, and their validity and limitations. Specifically, issues related to choice of variables, data preparation and problems such as data clustering are addressed. Interpretation of results and methods of classifying households into SES groups are also discussed. PCA has been validated as a method to describe SES differentiation within a population. Issues related to the underlying data will affect PCA and this should be considered when generating and interpreting results.

DOI: 10.1093/heapol/czl029

Links to Exploration step

ISTEX:C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Constructing socio-economic status indices: how to use principal components analysis</title>
<author>
<name sortKey="Vyas, Seema" sort="Vyas, Seema" uniqKey="Vyas S" first="Seema" last="Vyas">Seema Vyas</name>
<affiliation>
<mods:affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: seema.vyas@lshtm.ac.uk</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Correspondence: Seema Vyas, HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK. Tel: +44</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Kumaranayake, Lilani" sort="Kumaranayake, Lilani" uniqKey="Kumaranayake L" first="Lilani" last="Kumaranayake">Lilani Kumaranayake</name>
<affiliation>
<mods:affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6</idno>
<date when="2006" year="2006">2006</date>
<idno type="doi">10.1093/heapol/czl029</idno>
<idno type="url">https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">004132</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">004132</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Constructing socio-economic status indices: how to use principal components analysis</title>
<author>
<name sortKey="Vyas, Seema" sort="Vyas, Seema" uniqKey="Vyas S" first="Seema" last="Vyas">Seema Vyas</name>
<affiliation>
<mods:affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: seema.vyas@lshtm.ac.uk</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Correspondence: Seema Vyas, HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK. Tel: +44</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Kumaranayake, Lilani" sort="Kumaranayake, Lilani" uniqKey="Kumaranayake L" first="Lilani" last="Kumaranayake">Lilani Kumaranayake</name>
<affiliation>
<mods:affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Health Policy and Planning</title>
<idno type="ISSN">0268-1080</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2006-11">2006-11</date>
<biblScope unit="volume">21</biblScope>
<biblScope unit="issue">6</biblScope>
<biblScope unit="page" from="459">459</biblScope>
<biblScope unit="page" to="468">468</biblScope>
</imprint>
<idno type="ISSN">0268-1080</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0268-1080</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract">Theoretically, measures of household wealth can be reflected by income, consumption or expenditure information. However, the collection of accurate income and consumption data requires extensive resources for household surveys. Given the increasingly routine application of principal components analysis (PCA) using asset data in creating socio-economic status (SES) indices, we review how PCA-based indices are constructed, how they can be used, and their validity and limitations. Specifically, issues related to choice of variables, data preparation and problems such as data clustering are addressed. Interpretation of results and methods of classifying households into SES groups are also discussed. PCA has been validated as a method to describe SES differentiation within a population. Issues related to the underlying data will affect PCA and this should be considered when generating and interpreting results.</div>
</front>
</TEI>
<istex>
<corpusName>oup</corpusName>
<author>
<json:item>
<name>Seema Vyas</name>
<affiliations>
<json:string>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</json:string>
<json:string>E-mail: seema.vyas@lshtm.ac.uk</json:string>
</affiliations>
</json:item>
<json:item>
<name>Lilani Kumaranayake</name>
<affiliations>
<json:string>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</json:string>
</affiliations>
</json:item>
</author>
<subject>
<json:item>
<value>How to do (or not to do) …</value>
</json:item>
<json:item>
<value>socio-economic status</value>
</json:item>
<json:item>
<value>principal components analysis</value>
</json:item>
<json:item>
<value>cluster analysis</value>
</json:item>
<json:item>
<value>methodology</value>
</json:item>
</subject>
<language>
<json:string>unknown</json:string>
</language>
<originalGenre>
<json:string>research-article</json:string>
</originalGenre>
<abstract>Theoretically, measures of household wealth can be reflected by income, consumption or expenditure information. However, the collection of accurate income and consumption data requires extensive resources for household surveys. Given the increasingly routine application of principal components analysis (PCA) using asset data in creating socio-economic status (SES) indices, we review how PCA-based indices are constructed, how they can be used, and their validity and limitations. Specifically, issues related to choice of variables, data preparation and problems such as data clustering are addressed. Interpretation of results and methods of classifying households into SES groups are also discussed. PCA has been validated as a method to describe SES differentiation within a population. Issues related to the underlying data will affect PCA and this should be considered when generating and interpreting results.</abstract>
<qualityIndicators>
<score>8.548</score>
<pdfWordCount>6131</pdfWordCount>
<pdfCharCount>42318</pdfCharCount>
<pdfVersion>1.5</pdfVersion>
<pdfPageCount>10</pdfPageCount>
<pdfPageSize>612 x 791 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<abstractWordCount>129</abstractWordCount>
<abstractCharCount>918</abstractCharCount>
<keywordCount>5</keywordCount>
</qualityIndicators>
<title>Constructing socio-economic status indices: how to use principal components analysis</title>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<title>Health Policy and Planning</title>
<language>
<json:string>unknown</json:string>
</language>
<issn>
<json:string>0268-1080</json:string>
</issn>
<publisherId>
<json:string>heapol</json:string>
</publisherId>
<volume>21</volume>
<issue>6</issue>
<pages>
<first>459</first>
<last>468</last>
</pages>
<genre>
<json:string>journal</json:string>
</genre>
</host>
<categories>
<wos>
<json:string>social science</json:string>
<json:string>health policy &amp; services</json:string>
<json:string>science</json:string>
<json:string>health care sciences &amp; services</json:string>
</wos>
<scienceMetrix>
<json:string>health sciences</json:string>
<json:string>public health &amp; health services</json:string>
<json:string>health policy &amp; services</json:string>
</scienceMetrix>
</categories>
<publicationDate>2006</publicationDate>
<copyrightDate>2006</copyrightDate>
<doi>
<json:string>10.1093/heapol/czl029</json:string>
</doi>
<id>C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6</id>
<score>1</score>
<fulltext>
<json:item>
<extension>pdf</extension>
<original>true</original>
<mimetype>application/pdf</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/fulltext/pdf</uri>
</json:item>
<json:item>
<extension>zip</extension>
<original>false</original>
<mimetype>application/zip</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a">Constructing socio-economic status indices: how to use principal components analysis</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher scheme="https://publisher-list.data.istex.fr">Oxford University Press</publisher>
<availability>
<licence>
<p>© The Author 2006. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine. All rights reserved.</p>
</licence>
<p scheme="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-GTWS0RDP-M">oup</p>
</availability>
<date>2006-10-09</date>
</publicationStmt>
<notesStmt>
<note type="research-article" scheme="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</note>
<note type="journal" scheme="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</note>
</notesStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a">Constructing socio-economic status indices: how to use principal components analysis</title>
<author xml:id="author-0000" corresp="yes">
<persName>
<forename type="first">Seema</forename>
<surname>Vyas</surname>
</persName>
<email>seema.vyas@lshtm.ac.uk</email>
<note type="biography">Seema Vyas is a Research Fellow with the Health Policy Unit, Department of Public Health and Policy, LSHTM. She specializes in quantitative and private health sector analysis.</note>
<affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</affiliation>
<affiliation>Correspondence: Seema Vyas, HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK. Tel: +44</affiliation>
</author>
<author xml:id="author-0001">
<persName>
<forename type="first">Lilani</forename>
<surname>Kumaranayake</surname>
</persName>
<note type="biography">Lilani Kumaranayake is a Lecturer in Health Economics and Policy, Department of Public Health and Policy, LSHTM. She specializes in the economics of HIV/AIDS, private health sector and quantitative analysis.</note>
<affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</affiliation>
</author>
<idno type="istex">C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6</idno>
<idno type="ark">ark:/67375/HXZ-04ZP596X-N</idno>
<idno type="DOI">10.1093/heapol/czl029</idno>
</analytic>
<monogr>
<title level="j">Health Policy and Planning</title>
<idno type="pISSN">0268-1080</idno>
<idno type="publisher-id">heapol</idno>
<idno type="PublisherID-hwp">heapol</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2006-11"></date>
<biblScope unit="volume">21</biblScope>
<biblScope unit="issue">6</biblScope>
<biblScope unit="page" from="459">459</biblScope>
<biblScope unit="page" to="468">468</biblScope>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2006-10-09</date>
</creation>
<abstract>
<p>Theoretically, measures of household wealth can be reflected by income, consumption or expenditure information. However, the collection of accurate income and consumption data requires extensive resources for household surveys. Given the increasingly routine application of principal components analysis (PCA) using asset data in creating socio-economic status (SES) indices, we review how PCA-based indices are constructed, how they can be used, and their validity and limitations. Specifically, issues related to choice of variables, data preparation and problems such as data clustering are addressed. Interpretation of results and methods of classifying households into SES groups are also discussed. PCA has been validated as a method to describe SES differentiation within a population. Issues related to the underlying data will affect PCA and this should be considered when generating and interpreting results.</p>
</abstract>
<textClass>
<keywords scheme="keyword">
<list>
<head>keywords</head>
<item>
<term>socio-economic status</term>
</item>
<item>
<term>principal components analysis</term>
</item>
<item>
<term>cluster analysis</term>
</item>
<item>
<term>methodology</term>
</item>
</list>
</keywords>
</textClass>
<textClass>
<keywords scheme="Journal Subject">
<list>
<head></head>
<item>
<term>How to do (or not to do) …</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="2006-10-09">Created</change>
<change when="2006-11">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<extension>txt</extension>
<original>false</original>
<mimetype>text/plain</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="corpus oup, element #text not found" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="utf-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" URI="journalpublishing.dtd" name="istex:docType"></istex:docType>
<istex:document>
<article article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="hwp">heapol</journal-id>
<journal-id journal-id-type="publisher-id">heapol</journal-id>
<journal-id journal-id-type="pmc">heapol</journal-id>
<journal-title>Health Policy and Planning</journal-title>
<issn pub-type="ppub">0268-1080</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.1093/heapol/czl029</article-id>
<article-categories>
<subj-group>
<subject>How to do (or not to do) …</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Constructing socio-economic status indices: how to use principal components analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Vyas</surname>
<given-names>Seema</given-names>
</name>
<bio>
<p>Seema Vyas is a Research Fellow with the Health Policy Unit, Department of Public Health and Policy, LSHTM. She specializes in quantitative and private health sector analysis.</p>
</bio>
<xref ref-type="aff" rid="AFF1"></xref>
<xref ref-type="corresp" rid="COR1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kumaranayake</surname>
<given-names>Lilani</given-names>
</name>
<bio>
<p>Lilani Kumaranayake is a Lecturer in Health Economics and Policy, Department of Public Health and Policy, LSHTM. She specializes in the economics of HIV/AIDS, private health sector and quantitative analysis.</p>
</bio>
</contrib>
</contrib-group>
<aff id="AFF1">HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</aff>
<author-notes>
<corresp id="COR1">
<italic>Correspondence</italic>
: Seema Vyas, HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK. Tel: +44 (0) 20 7612 7828; Fax: +44 (0) 20 7637 5391; E-mail:
<email>seema.vyas@lshtm.ac.uk</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<month>11</month>
<year>2006</year>
</pub-date>
<pub-date pub-type="epub">
<day>9</day>
<month>10</month>
<year>2006</year>
</pub-date>
<volume>21</volume>
<issue>6</issue>
<fpage>459</fpage>
<lpage>468</lpage>
<copyright-statement>© The Author 2006. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine. All rights reserved.</copyright-statement>
<copyright-year>2006</copyright-year>
<abstract>
<p>Theoretically, measures of household wealth can be reflected by income, consumption or expenditure information. However, the collection of accurate income and consumption data requires extensive resources for household surveys. Given the increasingly routine application of principal components analysis (PCA) using asset data in creating socio-economic status (SES) indices, we review how PCA-based indices are constructed, how they can be used, and their validity and limitations. Specifically, issues related to choice of variables, data preparation and problems such as data clustering are addressed. Interpretation of results and methods of classifying households into SES groups are also discussed. PCA has been validated as a method to describe SES differentiation within a population. Issues related to the underlying data will affect PCA and this should be considered when generating and interpreting results.</p>
</abstract>
<kwd-group>
<kwd>socio-economic status</kwd>
<kwd>principal components analysis</kwd>
<kwd>cluster analysis</kwd>
<kwd>methodology</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="SEC1">
<title>1. Introduction</title>
<p>Common to health research and policy interventions is the concern that there is a differential impact with respect to health outcomes or health service utilization based on socio-economic status (SES) (Deaton
<xref ref-type="bibr" rid="B3">2003</xref>
; Schellenberg et al.
<xref ref-type="bibr" rid="B19">2003</xref>
). Thus, information about how households vary by SES, and the extent to which this relates to variables of interest, is central to questions such as how to target the poorest. Standard economic measures of SES use monetary information, such as income or consumption expenditure. However, the collection of accurate income data is a demanding task (Montgomery et al.
<xref ref-type="bibr" rid="B16">2000</xref>
), requiring extensive resources for household surveys; for example, allowances need to be made for households and individuals drawing income from multiple sources. Also, in some instances, an indicator of income is quite difficult to use (Cortinovis et al.
<xref ref-type="bibr" rid="B2">1993</xref>
). For example, income information does not capture the fact that people (and especially the poor) may have income in kind, such as crops which are traded, and measuring income can be difficult for the self-employed or those in transitory employment (e.g. agricultural work), due to accounting issues and seasonality (McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
).</p>
<p>By comparison, consumption or expenditure measures are much more reliable and are easier to collect than income, especially in most rural settings (Filmer and Pritchett
<xref ref-type="bibr" rid="B5">2001</xref>
). However, again a limitation is the extensive data collection required, which is time-consuming and therefore costly. Given the resource constraints to measuring household income or expenditure in low- and middle-income country settings, other methods of developing SES indices are being used which streamline the variables required, enabling data to be collected more rapidly. Rather than income or expenditure, data are collected for variables that capture living standards, such as household ownership of durable assets (e.g. TV, car) and infrastructure and housing characteristics (e.g. source of water, sanitation facility).</p>
<p>While asset-based measures are increasingly being used, there continues to be some debate about their use. Importantly, a key argument revolves around their interpretation. These measures are more reflective of longer-run household wealth or living standards, failing to take account of short-run or temporary interruptions, or shocks to the household (Filmer and Pritchett
<xref ref-type="bibr" rid="B5">2001</xref>
). Therefore, if the outcome of interest is associated with current resources available to the household, then an index based on assets may not be the appropriate measure.</p>
<p>Falkingham and Namazie (
<xref ref-type="bibr" rid="B4">2002</xref>
) highlight a second issue which is that ownership does not always capture the quality of assets. For example, collecting information on TV ownership does not distinguish between better-off households that are more likely to own a newer or colour TV, and less well-off households that may own an older or black and white one. However, they also point out that in many countries, this would not alter the overall picture of wealth.</p>
<p>A third issue is that some variables may have a different relationship with SES across sub-groups; for example, ownership of farmland may be more reflective of wealth in rural areas.</p>
<p>A final issue is how to aggregate over the range of different variables to derive a uni-dimensional measure of SES, and produce a range of critical points differentiating socio-economic levels. This is because each variable, used individually, may not be sufficient to differentiate household SES. One approach has been to sum the number of assets in households, for example Montgomery et al. (
<xref ref-type="bibr" rid="B16">2000</xref>
), but this assumes that all assets should be weighted equally. More recently, studies have applied principal components analysis (PCA) to such data to derive a SES index (Gwatkin et al. 2000; Filmer and Pritchett
<xref ref-type="bibr" rid="B5">2001</xref>
; McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
), and then grouped households into pre-determined categories, such as quintiles, reflecting different SES levels.</p>
<p>Given the increasingly routine application of PCA using asset data in creating SES indices, we review how PCA-based indices are constructed and how they can be used, and assess their advantages and limitations by presenting a worked example. PCA is explained in
<xref ref-type="sec" rid="SEC2">section 2</xref>
, and the construction and use of a SES index are demonstrated in
<xref ref-type="sec" rid="SEC3">section 3</xref>
, with data from both urban and rural settings. An evaluation of PCA-based indices is undertaken in
<xref ref-type="sec" rid="SEC4">section 4</xref>
.</p>
</sec>
<sec id="SEC2">
<title>2. What is PCA?</title>
<p>PCA is a multivariate statistical technique used to reduce the number of variables in a data set into a smaller number of ‘dimensions’. In mathematical terms, from an initial set of
<italic>n</italic>
correlated variables, PCA creates uncorrelated indices or components, where each component is a linear weighted combination of the initial variables. For example, from a set of variables
<italic>X</italic>
<sub>1</sub>
through to
<italic>X
<sub>n</sub>
</italic>
,
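<disp-formula>
<graphic xlink:href="czl029um1"></graphic>
<tex-math>\begin{array}{l} PC_1 = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1n}X_n \\ \quad\vdots \\ PC_m = a_{m1}X_1 + a_{m2}X_2 + \cdots + a_{mn}X_n \end{array}</tex-math>
</disp-formula>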
</p>
<p>where
<italic>a
<sub>mn</sub>
</italic>
represents the weight for the
<italic>m</italic>
th principal component and the
<italic>n</italic>
th variable.</p>
<p>Diagrammatically, the concept of PCA can be shown as in
<xref ref-type="fig" rid="F1">Figure 1</xref>
. The uncorrelated property of the components is highlighted by the fact that they are perpendicular, i.e. at right angles to each other, which means the indices are measuring different dimensions in the data (Manly
<xref ref-type="bibr" rid="B14">1994</xref>
).
<fig id="F1">
<label>
<bold>Figure 1.</bold>
</label>
<caption>
<p>Representation of two sequential components in PCA.
<italic>Source</italic>
: [
<ext-link ext-link-type="uri" xlink:href="http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm">http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm</ext-link>
], accessed 7 September 2005</p>
</caption>
<graphic xlink:href="czl029f1"></graphic>
</fig>
</p>
<p>The weights for each principal component are given by the eigenvectors of the correlation matrix, or if the original data were standardized, the co-variance matrix. The variance (λ) for each principal component is given by the eigenvalue of the corresponding eigenvector.
<xref ref-type="fn" rid="FN1">
<sup>1</sup>
</xref>
The components are ordered so that the first component (PC
<sub>1</sub>
) explains the largest possible amount of variation in the original data, subject to the constraint that the sum of the squared weights
<inline-formula>
<inline-graphic xlink:href="czl029i1"></inline-graphic>
<tex-math>\left(a_{11}^2 + a_{12}^2 + \cdots + a_{1n}^2\right)</tex-math>
</inline-formula>
is equal to one. As the sum of the eigenvalues equals the number of variables in the initial data set, the proportion of the total variation in the original data set accounted for by each principal component is given by λ
<sub>i</sub>
/
<italic>n</italic>
. The second component (PC
<sub>2</sub>
) is completely uncorrelated with the first component, and explains additional but less variation than the first component, subject to the same constraint. Subsequent components are uncorrelated with previous components; therefore, each component captures an additional dimension in the data, while explaining smaller and smaller proportions of the variation of the original variables. The higher the degree of correlation among the original variables in the data, the fewer components required to capture common information.</p>
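<p>As a rough illustration of the properties described above, the following Python sketch derives principal components from the correlation matrix of a small simulated data set. It is not the authors' STATA procedure: the simulated binary asset indicators, the function name pca_from_correlation and the use of NumPy are all assumptions introduced here for illustration. Under those assumptions it shows that the eigenvalues sum to the number of variables, that each set of weights has squared entries summing to one, and that the resulting component scores are mutually uncorrelated.</p>
<preformat><![CDATA[
```python
import numpy as np


def pca_from_correlation(data):
    """Eigen-decompose the correlation matrix of `data` (rows = households).

    Returns eigenvalues in descending order and the matching eigenvectors
    (one column of weights per principal component).
    """
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    order = np.argsort(eigenvalues)[::-1]             # largest first
    return eigenvalues[order], eigenvectors[:, order]


# Hypothetical data: 500 households, 4 correlated 0/1 asset indicators
# driven by one shared latent factor plus independent noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
noise = rng.normal(size=(500, 4))
assets = ((latent[:, None] + noise) > 0).astype(float)

eigenvalues, weights = pca_from_correlation(assets)

# Share of total variation per component: lambda_i / n
explained = eigenvalues / assets.shape[1]

# Component scores are weighted sums of the standardized variables.
standardized = (assets - assets.mean(axis=0)) / assets.std(axis=0)
scores = standardized @ weights

print("eigenvalues:", np.round(eigenvalues, 3))
print("proportion explained:", np.round(explained, 3))
```
]]></preformat>
<p>Because the simulated indicators share a single latent factor, the first component in this sketch should account for a clearly larger share of the variation than the later ones; with weakly correlated variables the shares would be closer to equal.</p>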
</sec>
<sec id="SEC3">
<title>3. Constructing a SES index with PCA</title>
<p>Using data from the Demographic and Health Surveys (DHS) (from [
<ext-link ext-link-type="uri" xlink:href="http://www.measuredhs.com">http://www.measuredhs.com</ext-link>
]), PCA-based SES measures are derived in this section for two contrasting countries, Brazil and Ethiopia.
<xref ref-type="fn" rid="FN2">
<sup>2</sup>
</xref>
DHS household surveys have been undertaken in more than 60 countries, focusing on health outcomes and nutrition, and contain data on household characteristics rather than income or expenditure. The World Bank, in its series of ‘Socio-economic differences in health, nutrition, and population’, has also constructed PCA-based asset indices using DHS data (e.g. Gwatkin et al. 2000), building an index for each country as a whole. In our example, we construct a socio-economic index for each site, that is, households in urban and rural locations in both countries, to illustrate some of the issues that arise when using and interpreting PCA-based SES indices. Standard statistical software can be used; in this instance, we used STATA (Version 8.1).</p>
<p>We divide this section into four parts to reflect the main steps in constructing a SES index: selection of asset variables; application of PCA; interpretation of results; and classification of households into socio-economic groups. The first part examines the issues relating to the choice of assets and variables that have been commonly used, in particular, clumping and truncation, stability of household classification and reliability. The second highlights methodological issues such as preparation of data, and identifying the number of principal components to extract that would measure SES. Results of PCA on asset data are interpreted in the third sub-section, and the methods used to classify households into socio-economic groups are presented in the fourth.</p>
<sec id="SEC3.1">
<title>3.1 Selection of asset variables</title>
<p>To measure SES, studies have used variables such as ownership of land (Filmer and Pritchett
<xref ref-type="bibr" rid="B5">2001</xref>
), farm animals and whether living in rented or owner-occupied housing (Schellenberg et al.
<xref ref-type="bibr" rid="B19">2003</xref>
), literacy or education level of head of household, demographic conditions (e.g. the ratio of number of people to the number of rooms in the household to proxy crowding), and other economic proxies such as occupation of head of household (Cortinovis et al.
<xref ref-type="bibr" rid="B2">1993</xref>
). Montgomery et al. (
<xref ref-type="bibr" rid="B16">2000</xref>
) identified the absence of a ‘best practice’ approach of selecting variables to proxy living standards, as, in many studies, variables were chosen on an ‘ad-hoc’ basis.</p>
<p>In the DHS, information is collected on durable asset ownership, access to utilities and infrastructure (e.g. sanitation facility and source of water), and housing characteristics (e.g. number of rooms for sleeping and building material), which we include in our analysis.</p>
<p>PCA works best when asset variables are correlated, but also when the distribution of variables varies across cases, or in this instance, households. It is the assets that are more unequally distributed between households that are given more weight in PCA (McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
). Variables with low standard deviations would carry a low weight from the PCA; for example, an asset which all households own or which no households own (i.e. zero standard deviation) would exhibit no variation between households and would be zero weighted, and so of little use in differentiating SES.</p>
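<p>The screening idea in the preceding paragraph can be sketched in a few lines of Python. This is an illustrative assumption rather than part of the original analysis: the ownership shares, the variable names and the helper functions binary_std and screen_assets are hypothetical. The sketch relies only on the fact that a 0/1 indicator with mean p has standard deviation equal to the square root of p(1 - p), so an asset owned by every household or by none has zero standard deviation and cannot be standardized for a correlation-based PCA.</p>
<preformat><![CDATA[
```python
import math


def binary_std(p):
    """Population standard deviation of a 0/1 indicator whose mean is p."""
    return math.sqrt(p * (1.0 - p))


def screen_assets(means, min_std=0.0):
    """Split indicators into those kept and those dropped for lack of variation.

    `means` maps a variable name to the share of households owning the asset.
    A variable is dropped when its standard deviation is not above `min_std`.
    """
    kept, dropped = {}, {}
    for name, p in means.items():
        sd = binary_std(p)
        if sd > min_std:
            kept[name] = sd
        else:
            dropped[name] = sd
    return kept, dropped


# Hypothetical ownership shares, for illustration only.
means = {"radio": 0.75, "car": 0.10, "owned_by_all": 1.0, "owned_by_none": 0.0}
kept, dropped = screen_assets(means)

for name, sd in kept.items():
    print(f"keep {name}: std. dev. = {sd:.3f}")
for name in dropped:
    print(f"drop {name}: no variation between households")
```
]]></preformat>
<p>Raising min_std above zero would also exclude near-universal or very rare assets; where to set such a threshold is a judgement call that the sketch does not settle.</p>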
<p>Therefore, as a first step, we carried out descriptive analyses for all the variables, looking at means, frequencies and standard deviations (see
<xref ref-type="table" rid="T1">Table 1</xref>
). Descriptive analysis can inform decisions on which variables to include in the analysis, and highlight data management issues, such as coding of variables and missing values. In rural Brazil and urban Ethiopia, indicators of durable asset ownership range from the majority of households owning a radio to a few owning a car. Also, the source of water supply in rural Brazil, and type of floor material in urban Ethiopia, vary across households. In urban Brazil, the vast majority of households owned all or most of the assets listed, and had a tap in residence, though there is variation in type of sanitation facility. However, in rural Ethiopia, few households have assets or any formal sanitation facility, and most have rudimentary types of flooring material (e.g. earth or sand, dung).
<table-wrap id="T1">
<label>
<bold>Table 1.</bold>
</label>
<caption>
<p>Results from principal components analysis</p>
</caption>
<table frame="hsides">
<thead align="left">
<tr>
<th>Variable description Brazil/Ethiopia</th>
<th colspan="3">Brazil urban</th>
<th colspan="3">Brazil rural</th>
<th colspan="3">Ethiopia urban</th>
<th colspan="3">Ethiopia rural</th>
</tr>
<tr>
<th></th>
<th colspan="3">
<hr></hr>
</th>
<th colspan="3">
<hr></hr>
</th>
<th colspan="3">
<hr></hr>
</th>
<th colspan="3">
<hr></hr>
</th>
</tr>
<tr>
<th></th>
<th>Mean</th>
<th>Std. dev.</th>
<th>Factor score</th>
<th>Mean</th>
<th>Std. dev.</th>
<th>Factor score</th>
<th>Mean</th>
<th>Std. dev.</th>
<th>Factor score</th>
<th>Mean</th>
<th>Std. dev.</th>
<th>Factor score</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td>Electricity</td>
<td>0.987</td>
<td>0.114</td>
<td>0.158</td>
<td>0.694</td>
<td>0.461</td>
<td>0.347</td>
<td>0.829</td>
<td>0.376</td>
<td>0.297</td>
<td>0.012</td>
<td>0.107</td>
<td>0.171</td>
</tr>
<tr>
<td>Radio</td>
<td>0.881</td>
<td>0.323</td>
<td>0.216</td>
<td>0.765</td>
<td>0.423</td>
<td>0.171</td>
<td>0.689</td>
<td>0.463</td>
<td>0.294</td>
<td>0.139</td>
<td>0.345</td>
<td>0.210</td>
</tr>
<tr>
<td>Television</td>
<td>0.716</td>
<td>0.450</td>
<td>0.372</td>
<td>0.314</td>
<td>0.463</td>
<td>0.345</td>
<td>0.215</td>
<td>0.411</td>
<td>0.327</td>
<td>0.000</td>
<td>0.010</td>
<td>0.024</td>
</tr>
<tr>
<td>Refrigerator</td>
<td>0.821</td>
<td>0.383</td>
<td>0.363</td>
<td>0.425</td>
<td>0.493</td>
<td>0.397</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Car</td>
<td>0.296</td>
<td>0.456</td>
<td>0.295</td>
<td>0.135</td>
<td>0.341</td>
<td>0.256</td>
<td>0.035</td>
<td>0.184</td>
<td>0.176</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bicycle</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.003</td>
<td>0.058</td>
<td>0.106</td>
</tr>
<tr>
<td>Telephone</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.136</td>
<td>0.343</td>
<td>0.291</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>No. of rooms for sleeping</td>
<td>2.150</td>
<td>0.899</td>
<td>0.143</td>
<td>2.175</td>
<td>0.916</td>
<td>0.105</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="13">
<bold>Source of water supply</bold>
</td>
</tr>
<tr>
<td>Piped into residence/dwelling</td>
<td>0.760</td>
<td>0.427</td>
<td>0.243</td>
<td>0.200</td>
<td>0.400</td>
<td>0.179</td>
<td>0.007</td>
<td>0.086</td>
<td>0.033</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Piped into yard, plot/compound</td>
<td>0.044</td>
<td>0.204</td>
<td>−0.182</td>
<td>0.051</td>
<td>0.219</td>
<td>−0.033</td>
<td>0.414</td>
<td>0.493</td>
<td>0.367</td>
<td>0.001</td>
<td>0.024</td>
<td>0.105</td>
</tr>
<tr>
<td>/Piped outside compound</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.441</td>
<td>0.497</td>
<td>−0.221</td>
<td>0.061</td>
<td>0.239</td>
<td>0.092</td>
</tr>
<tr>
<td>Well, spring inside/covered well</td>
<td>0.075</td>
<td>0.264</td>
<td>−0.126</td>
<td>0.381</td>
<td>0.485</td>
<td>0.096</td>
<td>0.012</td>
<td>0.108</td>
<td>−0.077</td>
<td>0.065</td>
<td>0.247</td>
<td>0.103</td>
</tr>
<tr>
<td>Well or spring outside/open well</td>
<td>0.054</td>
<td>0.227</td>
<td>0.122</td>
<td>0.276</td>
<td>0.447</td>
<td>−0.154</td>
<td>0.049</td>
<td>0.217</td>
<td>−0.060</td>
<td>0.072</td>
<td>0.259</td>
<td>−0.106</td>
</tr>
<tr>
<td>Bottled water/</td>
<td>0.047</td>
<td>0.212</td>
<td>0.062</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>/Covered, open spring</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.023</td>
<td>0.151</td>
<td>−0.103</td>
<td>0.427</td>
<td>0.495</td>
<td>0.071</td>
</tr>
<tr>
<td>/River</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.048</td>
<td>0.214</td>
<td>−0.152</td>
<td>0.333</td>
<td>0.471</td>
<td>−0.108</td>
</tr>
<tr>
<td>Other/and pond, lake, dam, rain</td>
<td>0.019</td>
<td>0.138</td>
<td>−0.135</td>
<td>0.092</td>
<td>0.289</td>
<td>−0.143</td>
<td>0.005</td>
<td>0.070</td>
<td>−0.011</td>
<td>0.041</td>
<td>0.199</td>
<td>−0.033</td>
</tr>
<tr>
<td colspan="13">
<bold>Sanitation facility</bold>
</td>
</tr>
<tr>
<td>Toilet to sewer/flush toilet</td>
<td>0.410</td>
<td>0.491</td>
<td>0.277</td>
<td>0.059</td>
<td>0.235</td>
<td>0.109</td>
<td>0.035</td>
<td>0.184</td>
<td>0.147</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Toilet to open space or river/</td>
<td>0.054</td>
<td>0.225</td>
<td>−0.089</td>
<td>0.093</td>
<td>0.290</td>
<td>0.057</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Latrine to sewer/</td>
<td>0.128</td>
<td>0.334</td>
<td>0.062</td>
<td>0.035</td>
<td>0.182</td>
<td>0.098</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Latrine no connection/</td>
<td>0.217</td>
<td>0.411</td>
<td>−0.049</td>
<td>0.176</td>
<td>0.380</td>
<td>0.210</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Traditional latrine/pit</td>
<td>0.138</td>
<td>0.345</td>
<td>−0.184</td>
<td>0.218</td>
<td>0.412</td>
<td>0.133</td>
<td>0.714</td>
<td>0.452</td>
<td>0.218</td>
<td>0.096</td>
<td>0.294</td>
<td>0.580</td>
</tr>
<tr>
<td>/Ventilated improved pit latrine</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.033</td>
<td>0.177</td>
<td>0.056</td>
<td>0.001</td>
<td>0.026</td>
<td>0.077</td>
</tr>
<tr>
<td>No facility/and bush or field</td>
<td>0.053</td>
<td>0.224</td>
<td>−0.238</td>
<td>0.420</td>
<td>0.493</td>
<td>−0.395</td>
<td>0.218</td>
<td>0.413</td>
<td>−0.328</td>
<td>0.904</td>
<td>0.295</td>
<td>−0.586</td>
</tr>
<tr>
<td colspan="13">
<bold>Type of floor material</bold>
</td>
</tr>
<tr>
<td>Earth or sand</td>
<td>0.032</td>
<td>0.175</td>
<td>−0.175</td>
<td>0.191</td>
<td>0.393</td>
<td>−0.294</td>
<td>0.345</td>
<td>0.475</td>
<td>−0.312</td>
<td>0.696</td>
<td>0.460</td>
<td>−0.263</td>
</tr>
<tr>
<td>/Dung</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.108</td>
<td>0.310</td>
<td>−0.121</td>
<td>0.283</td>
<td>0.451</td>
<td>0.228</td>
</tr>
<tr>
<td>Wood planks/and reed or bamboo</td>
<td>0.070</td>
<td>0.255</td>
<td>0.004</td>
<td>0.059</td>
<td>0.236</td>
<td>0.096</td>
<td>0.028</td>
<td>0.166</td>
<td>0.037</td>
<td>0.002</td>
<td>0.048</td>
<td>0.065</td>
</tr>
<tr>
<td>Polished wood/and parquet</td>
<td>0.097</td>
<td>0.296</td>
<td>0.116</td>
<td>0.071</td>
<td>0.256</td>
<td>0.161</td>
<td>0.237</td>
<td>0.425</td>
<td>0.131</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Vinyl/and sheet tiles</td>
<td>0.007</td>
<td>0.082</td>
<td>0.043</td>
<td>0.004</td>
<td>0.063</td>
<td>0.049</td>
<td>0.030</td>
<td>0.171</td>
<td>0.166</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Ceramic tiles/and brick</td>
<td>0.317</td>
<td>0.465</td>
<td>0.275</td>
<td>0.083</td>
<td>0.277</td>
<td>0.192</td>
<td>0.040</td>
<td>0.195</td>
<td>0.095</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cement</td>
<td>0.436</td>
<td>0.496</td>
<td>−0.309</td>
<td>0.568</td>
<td>0.495</td>
<td>−0.009</td>
<td>0.170</td>
<td>0.375</td>
<td>0.150</td>
<td>0.005</td>
<td>0.074</td>
<td>0.205</td>
</tr>
<tr>
<td>Carpet</td>
<td>0.036</td>
<td>0.186</td>
<td>0.090</td>
<td>0.008</td>
<td>0.091</td>
<td>0.062</td>
<td>0.026</td>
<td>0.160</td>
<td>0.071</td>
<td>0.013</td>
<td>0.112</td>
<td>−0.007</td>
</tr>
<tr>
<td>Other</td>
<td>0.006</td>
<td>0.076</td>
<td>0.000</td>
<td>0.016</td>
<td>0.124</td>
<td>−0.048</td>
<td>0.016</td>
<td>0.125</td>
<td>0.007</td>
<td>0.000</td>
<td>0.022</td>
<td>0.049</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
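<p>The descriptive screening step described above can be sketched as follows. This is a minimal illustration that assumes binary (0/1) asset indicators; the toy households, the function names and the 0.05 threshold are hypothetical and are not taken from the DHS data or from our analysis.</p>

```python
from statistics import mean, pstdev

def describe_assets(households):
    """Return {variable: (mean, std dev)} for a list of household dicts."""
    variables = sorted(households[0])
    summary = {}
    for var in variables:
        values = [h[var] for h in households]
        summary[var] = (mean(values), pstdev(values))
    return summary

def low_variation(summary, min_std=0.05):
    """List variables whose standard deviation does not exceed min_std."""
    return [var for var, (_, sd) in summary.items() if not sd > min_std]

# Hypothetical toy data: 1 = household owns the asset, 0 = it does not.
households = [
    {"radio": 1, "television": 0, "electricity": 1},
    {"radio": 1, "television": 0, "electricity": 1},
    {"radio": 0, "television": 0, "electricity": 1},
    {"radio": 1, "television": 0, "electricity": 1},
]
summary = describe_assets(households)
flagged = low_variation(summary)  # variables that barely differ between households
```

<p>In this toy example, an asset that every household owns and an asset that no household owns are both flagged, since neither can help to differentiate SES.</p>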
<p>McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
) highlights that a major challenge for PCA-based asset indices is to ensure the range of asset variables included is broad enough to avoid problems of ‘clumping’ and ‘truncation’. Clumping, or clustering, describes households being grouped together in a small number of distinct clusters. Truncation implies a more even distribution of SES, but spread over a narrow range, which makes it difficult to differentiate between socio-economic groups (e.g. to distinguish between the poor and the very poor). From the distribution of asset ownership, access to utilities and infrastructure, and housing characteristics in our analysis, clumping and truncation are likely to be issues for the data from rural Ethiopia. This is because many households do not own the durable items, have similar access to utilities and infrastructure, and similar housing characteristics, and so will be grouped together. Also, the households that do own assets tend to own the same ones, which will make differentiating among them difficult. Clumping and truncation may be an issue for urban Brazil due to high levels of ownership of most of the included durable assets, but we do not expect them to be an issue for rural Brazil or urban Ethiopia.</p>
<p>If clumping and truncation are identified as potential problems from the descriptive analysis, as is the case for rural Ethiopia, then one method that could solve this issue is to add more variables to the analysis. The number of variables used in studies has ranged from 10 (Schellenberg et al.
<xref ref-type="bibr" rid="B19">2003</xref>
) to 30 (McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
). Other methods could be to use continuous variables (e.g. the number of acres of land) and to use a combination of durable asset ownership, access to utilities and infrastructure, housing characteristics and other variables that appear relevant in assessing household wealth. A preliminary analysis correlating assets and monthly household expenditure was used to inform the choice of indicators to be collected in a study by Hanson et al. (
<xref ref-type="bibr" rid="B10">2005</xref>
). The analysis used the Living Standards Measurement Survey, which collected both expenditure and asset data. Only asset variables that were significantly correlated with expenditure were included in their subsequent survey.</p>
<p>However, the key is to include additional variables that capture inequality between households. McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
) compared SES distributions using housing characteristics only, access to utilities and infrastructure only, durable asset ownership only and all three categories of variables. For both the housing characteristics only and utilities only distributions, there was evidence of clumping and truncation, while durable asset ownership showed some evidence of truncation. The index based on combined assets showed no evidence of clumping or truncation and yielded the smoothest distribution of SES.</p>
<p>Another issue related to selection of asset variables is the stability of household classification into SES groups. In some studies, this has been found to be closely associated with the choice of variables included in the index. For example, Houweling et al. (
<xref ref-type="bibr" rid="B11">2003</xref>
) compared the relative economic position of households using either durable assets, infrastructure, housing characteristics or a combination of all variables to derive four different PCA-based measures. However, Filmer and Pritchett (
<xref ref-type="bibr" rid="B5">2001</xref>
), in their analysis, concluded the categorization of households was robust to the measure used.</p>
<p>In addition, Houweling et al. (
<xref ref-type="bibr" rid="B11">2003</xref>
) found that variables included in the index that were directly associated with child health outcomes (e.g. sanitation facility) increased inequality among households. Similarly, Lindelow (
<xref ref-type="bibr" rid="B13">2002</xref>
) found including infrastructure variables such as source of water increased socio-economic inequality in health facility utilization. Higher quality infrastructure variables were geographically biased to urban locations where access to health facilities is assumed to be greater. Including infrastructure variables in the index increased the representation of households from urban areas into the richer groups, and subsequently increased inequality. An explanatory analysis should consider an index without direct determinants of the outcome of interest. However, exclusion of variables may make it more difficult to divide households, particularly when considering similar groups, for example in a rural community.</p>
<p>An advantage of collecting asset data, highlighted by McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
), is that measurement error is minimized. Onwujekwe et al. (
<xref ref-type="bibr" rid="B17">2006</xref>
) report on the reliability of collecting some asset data commonly used in generating SES indices (e.g. radio, bicycle). Two methods of assessing reliability were used. The first employed two different interviewers to measure observations separated by up to 5 days (inter-rater), and the second employed the same interviewer to measure observations within 1 month of the original survey being administered (test-retest). In both cases, reliability was found not to be high, and resulted in differences in classification of households into SES groups. Therefore, the user should be aware of issues relating to the accuracy of data collection. A possible way to improve reliability is to include assets that are observable by interviewers, but this may not always be feasible.</p>
</sec>
<sec id="SEC3.2">
<title>3.2 Application of PCA</title>
<p>Data in categorical form (such as religion) are not suitable for PCA, as the categories are converted into a quantitative scale which does not have any meaning.
<xref ref-type="fn" rid="FN3">
<sup>3</sup>
</xref>
To avoid this, qualitative categorical variables should be re-coded into binary variables. In our example, similar variables with low frequencies were combined: for example, ‘covered spring’ and ‘spring’ were combined for Ethiopia data; ‘toilet to open space’ and ‘toilet to river’ for Brazil data. Similar variables with relatively high frequencies were kept as separate variables (spring and river for Ethiopia data). We included all binary variables created from a categorical variable, including those that had low frequencies but were not similar enough to another variable to combine, to ensure that all the data for each household were measured. We excluded durable assets that were initially binary with very low counts, for instance motorcycle in urban Ethiopia, which was owned by 0.1% of households.</p>
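<p>The re-coding of a categorical variable into binary indicators, with similar low-frequency categories combined first, might look like the sketch below. The responses, the category labels and the function name are hypothetical; which categories count as ‘similar’ remains an analyst's judgement.</p>

```python
from collections import Counter

def recode_to_binary(values, merge=None):
    """Turn one categorical variable into 0/1 indicator columns.

    merge maps an original category to the combined label it should share
    with a similar category (an illustrative choice made by the analyst).
    """
    merge = merge or {}
    merged = [merge.get(v, v) for v in values]
    categories = sorted(set(merged))
    return {c: [1 if v == c else 0 for v in merged] for c in categories}

# Hypothetical water-source responses for six households.
water = ["river", "spring", "covered spring", "river", "piped", "spring"]
frequencies = Counter(water)  # inspect frequencies before deciding what to merge
columns = recode_to_binary(
    water,
    merge={"spring": "covered/open spring", "covered spring": "covered/open spring"},
)
```

<p>Each household then has a value of one on exactly one indicator of the set, so no response is lost by the re-coding.</p>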
<p>Another data issue is that of missing values. Cortinovis et al. (
<xref ref-type="bibr" rid="B2">1993</xref>
) excluded households with at least one missing value from their analysis to develop socio-economic groups. Gwatkin et al. (2000) replaced missing values with the mean value for that variable. Exclusion of households based on missing socio-economic data could significantly lower sample sizes and the statistical power of study results, and may lead to bias towards higher SES households, as missing data may occur more frequently in lower social classes (Cortinovis et al.
<xref ref-type="bibr" rid="B2">1993</xref>
). However, attributing mean scores for missing values reduces variation among households, and increases the potential for clumping and truncation. This is more pronounced with high numbers of missing values, though software packages such as STATA offer a range of methods for estimating missing values. In our example, the percentage of households with missing data was small (less than 1% in each site). We expect inclusion or exclusion of these households would have little impact on the distribution of SES, but for variables with missing values, we chose to impute the mean value of that variable.</p>
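<p>The two ways of handling missing values discussed above can be contrasted in a short sketch. The data and function names are hypothetical, and missing responses are assumed to be coded as <italic>None</italic>. The sketch also illustrates why mean imputation reduces variation among households.</p>

```python
from statistics import mean, pvariance

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_incomplete(households):
    """Keep only households with no missing value on any variable."""
    return [h for h in households if None not in h.values()]

# Hypothetical ownership indicator with two missing responses.
radio = [1, 0, None, 1, 0, None, 1, 1]
imputed = impute_mean(radio)
observed_var = pvariance([v for v in radio if v is not None])
imputed_var = pvariance(imputed)  # smaller: imputed households sit at the mean
```

<p>Because every imputed household is placed exactly at the mean, the variance of the imputed variable is lower than that of the observed responses, which is the mechanism behind the increased potential for clumping and truncation.</p>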
<p>The analysis of data on household characteristics and asset ownership is complicated by the fact that there are potentially a large number of variables which could be collected, some of which may yield similar information. Thus a natural approach is to use methods such as PCA to organize the data and reduce their dimensionality while losing as little as possible of the total variation these variables explain (Giri
<xref ref-type="bibr" rid="B6">2004</xref>
).</p>
<p>In STATA, when specifying PCA, the user is given the choice of deriving eigenvectors (weights) from either the correlation matrix or the co-variance matrix of the data. If the raw data have been standardized, then PCA should use the co-variance matrix.
<xref ref-type="fn" rid="FN4">
<sup>4</sup>
</xref>
As we did not standardize our data, and they are therefore not expressed in the same units, we ran the analysis using the correlation matrix to ensure that all data have equal weight. For example, the number of rooms for sleeping is a quantitative variable and has greater variance than the binary variables, and would therefore dominate the first principal component if the co-variance matrix were used.</p>
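<p>The difference between the two matrices can be illustrated with a small sketch. The data are hypothetical (two binary assets and one count variable) and the helper names are our own; the point is only that the co-variance matrix of standardized data coincides with the correlation matrix of the raw data.</p>

```python
from statistics import mean, pstdev

def standardize(values):
    """Centre on the mean and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def covariance(x, y):
    """Population covariance of two equally long lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def matrix(data, names, standardized):
    """Covariance matrix of the raw or of the standardized variables."""
    cols = {n: standardize(data[n]) if standardized else data[n] for n in names}
    return [[covariance(cols[a], cols[b]) for b in names] for a in names]

# Hypothetical data: two binary assets and one count variable.
data = {
    "radio": [1, 1, 0, 1, 0, 1],
    "television": [1, 0, 0, 1, 0, 0],
    "rooms": [3, 2, 1, 4, 1, 2],
}
names = list(data)
cov = matrix(data, names, standardized=False)
corr = matrix(data, names, standardized=True)  # equals the correlation matrix
cov_diag = [cov[i][i] for i in range(3)]    # variances differ by variable
corr_diag = [corr[i][i] for i in range(3)]  # every variance equals one
```

<p>On the raw scale the count variable has by far the largest variance, so it would carry the most weight; after standardization every variable contributes a variance of one.</p>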
<p>The number of principal components extracted can also be defined by the user, and a common method used is to select components where the associated eigenvalue is greater than one. However, it is assumed that the first principal component is a measure of economic status (Houweling et al.
<xref ref-type="bibr" rid="B11">2003</xref>
). McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
) considered the use of additional principal components in characterizing household SES, investigating whether they related to non-durable consumption, and concluded that only the first principal component was necessary for measuring wealth. Filmer and Pritchett (
<xref ref-type="bibr" rid="B5">2001</xref>
) also considered the use of additional components in their analysis, and though they found the factor scores for each variable difficult to interpret, they included ‘higher order’ components in a multivariate regression analysis, and concluded their results were robust to including additional components.</p>
<p>The eigenvalue (variance) for each principal component indicates the percentage of variation in the total data explained. In the studies included in this review, the first principal component accounted for a range from 12% (Houweling et al.
<xref ref-type="bibr" rid="B11">2003</xref>
) to 27% (McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
) of total variation. These percentages are not high, and this could reflect the number of variables included in the analysis or the complexity of correlations between variables, as each included variable may have its own determinant other than SES.</p>
<p>Results from the first principal component for each site are shown in
<xref ref-type="table" rid="T1">Table 1</xref>
, and their associated eigenvalues are 4 (rural Brazil and urban Ethiopia), 3.5 (urban Brazil) and 2.2 (rural Ethiopia), accounting for 16.0%, 14.9%, 13.4% and 11.1%, respectively, of the variation in the original data.</p>
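<p>A minimal sketch of extracting the first principal component from a correlation matrix, and of converting its eigenvalue into a share of total variation, is given below. It uses power iteration purely for illustration (statistical packages use full eigendecompositions), assumes positively correlated variables so that the uniform starting vector converges, and relies on hypothetical data and function names.</p>

```python
from math import sqrt
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Correlation matrix of a list of equally long numeric columns."""
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def first_component(matrix, iterations=500):
    """Leading eigenvalue and unit eigenvector by power iteration."""
    size = len(matrix)
    vec = [1.0 / sqrt(size)] * size
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(size))
        for i in range(size)
    )
    return eigenvalue, vec

def explained_share(eigenvalue, n_variables):
    """Share of total variation; a correlation matrix has trace n_variables."""
    return eigenvalue / n_variables

# Hypothetical data: radio, television, number of rooms for sleeping.
columns = [
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [3, 2, 1, 4, 1, 2],
]
corr = correlation_matrix(columns)
eigenvalue, weights = first_component(corr)
share = explained_share(eigenvalue, len(columns))
```

<p>As a rough check of the same arithmetic, and assuming that the 26 urban Brazil variables listed in Table 1 all entered the analysis, an eigenvalue of 3.5 would correspond to 3.5/26, or about 13.5% of total variation, which is close to the reported 13.4% given that the eigenvalue is rounded.</p>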
</sec>
<sec id="SEC3.3">
<title>3.3 Interpretation of results</title>
<p>The output from a PCA is a table of factor scores or weights for each variable (see
<xref ref-type="table" rid="T1">Table 1</xref>
). Generally, a variable with a positive factor score is associated with higher SES, and conversely a variable with a negative factor score is associated with lower SES. It is useful to note that in some studies, ownership of a durable asset such as a bicycle has been attributed a negative weight from PCA (Gwatkin et al. 2000; Houweling et al.
<xref ref-type="bibr" rid="B11">2003</xref>
; McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
). This implies, all things being equal, that a household with a bicycle will be ranked lower in terms of SES than a household that does not own a bicycle. Such a result may arise because ownership of a bicycle is more strongly correlated with variables that are expected to be associated with lower SES, for instance lower quality housing and sanitation conditions. Findings like these can occur when indices have been constructed for combined urban and rural locations, or regions, where the asset represents wealth in some parts of the country but not others. However, in Gwatkin et al. (2000) and McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
), the weights for ownership of a bicycle were among the smallest in absolute terms compared with other durable assets, and Houweling et al. (
<xref ref-type="bibr" rid="B11">2003</xref>
) argued their finding was not likely to have influenced their overall conclusions.
<xref ref-type="fn" rid="FN5">
<sup>5</sup>
</xref>
</p>
<p>As we constructed a separate index for urban and rural locations in both countries, we find that for each site the factor scores are positive for all durable assets, as are those for use of a higher quality source of water and sanitation facility (relative to the alternatives available). Low quality flooring (e.g. earth or sand) has a negative factor score in all sites.</p>
<p>As a further analysis, we considered an additional principal component. The second principal component showed that for urban Brazil the weights were concentrated on source of water, and on floor type for rural Brazil and rural Ethiopia. For urban Ethiopia, the weights were concentrated on sanitation facility. In all cases, the second principal component explained a sub-group of variables. Therefore, we conclude that the first principal component provided a measure of wealth.</p>
<p>Using the factor scores from the first principal component as weights, a dependent variable can then be constructed for each household (Y
<sub>1</sub>
) which has a mean equal to zero, and a standard deviation equal to one. This dependent variable can be regarded as the household’s ‘socio-economic’ score, and the higher the household socio-economic score, the higher the implied SES of that household. The issue of adjusting for household size was raised by McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
). As in the study by Filmer and Pritchett (
<xref ref-type="bibr" rid="B5">2001</xref>
), McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
) does not adjust for household size, arguing the benefits of indicators used are available at household level.</p>
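<p>The construction of a household score from the first-component weights can be sketched as below. The sketch assumes that each variable is standardized (mean subtracted, divided by its standard deviation) before it is weighted, following the formulation of Filmer and Pritchett (2001). It uses the means, standard deviations and factor scores of only three urban Ethiopia variables from Table 1, so the resulting numbers are partial illustrations rather than actual household scores; the function and variable names are our own.</p>

```python
def household_score(household, stats):
    """Sum of weight * standardized value over the variables in stats."""
    return sum(
        weight * (household[var] - avg) / sd
        for var, (avg, sd, weight) in stats.items()
    )

# (mean, std dev, first-component weight) for three of the urban Ethiopia
# variables in Table 1; a full score would sum over every included variable.
stats = {
    "electricity": (0.829, 0.376, 0.297),
    "radio": (0.689, 0.463, 0.294),
    "television": (0.215, 0.411, 0.327),
}
owns_all = {"electricity": 1, "radio": 1, "television": 1}
owns_none = {"electricity": 0, "radio": 0, "television": 0}
score_all = household_score(owns_all, stats)
score_none = household_score(owns_none, stats)
```

<p>A household owning all three items receives a positive partial score and a household owning none receives a negative one. Television, the least commonly owned of the three, contributes most to the positive score.</p>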
<p>Interpreting the weights from our example, an urban Brazil household with more assets, piped drinking water to residence, sanitation facility that leads to a sewer, finished floor coverings and higher number of rooms for sleeping would attain a higher SES score. The finding is similar for rural Brazil, except it includes any sanitation facility and a well in residence. In urban Ethiopia, a household with more assets and drinking water piped to compound would attain a higher SES score. In rural Ethiopia, ownership of any asset, or access to infrastructure facilities such as water or sanitation, would lead to a higher SES score.</p>
</sec>
<sec id="SEC3.4">
<title>3.4 Classification of households into socio-economic groups</title>
<p>The constructed household socio-economic score (Y
<sub>1</sub>
) could be included as a continuous independent variable in a regression model, though the estimated coefficient may not be easy to interpret. Other studies have used cut-off points to differentiate households into broad socio-economic categories, and the approaches used were either arbitrarily defined (based on the assumption SES is uniformly distributed), or data driven. Commonly used arbitrary cut-off points are classification of the lowest 40% of households into ‘poor’, the highest 20% as ‘rich’ and the rest as the ‘middle’ group (Filmer and Pritchett
<xref ref-type="bibr" rid="B5">2001</xref>
), or the division of households into quintiles (Gwatkin et al. 2000). We classified households into quintiles and calculated the mean socio-economic score for each group (
<xref ref-type="table" rid="T2">Table 2</xref>
), because if SES is uniformly distributed, the difference in mean socio-economic score between adjoining quintiles should be even. The differences in the average scores were even for rural Brazil and urban Ethiopia. The mean difference is higher between the poorest and second poorest group for urban Brazil than any other adjoining quintile. For rural Ethiopia, the difference is small among the poorest three quintiles, as each group has a similar mean score.
<table-wrap id="T2">
<label>
<bold>Table 2.</bold>
</label>
<caption>
<p>Mean socio-economic score by quintile</p>
</caption>
<table frame="hsides">
<thead align="left">
<tr>
<th>Site</th>
<th>N</th>
<th>Poorest</th>
<th>Second</th>
<th>Middle</th>
<th>Fourth</th>
<th>Richest</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td>Urban Brazil</td>
<td>10 527</td>
<td>−2.96</td>
<td>−0.82</td>
<td>0.35</td>
<td>1.33</td>
<td>2.14</td>
</tr>
<tr>
<td>Rural Brazil</td>
<td>2756</td>
<td>−2.68</td>
<td>−1.44</td>
<td>−0.01</td>
<td>1.40</td>
<td>2.80</td>
</tr>
<tr>
<td>Urban Ethiopia</td>
<td>3629</td>
<td>−2.82</td>
<td>−1.17</td>
<td>0.02</td>
<td>1.22</td>
<td>2.83</td>
</tr>
<tr>
<td>Rural Ethiopia</td>
<td>10 443</td>
<td>−1.08</td>
<td>−0.72</td>
<td>−0.43</td>
<td>0.20</td>
<td>2.85</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
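<p>Both cut-off schemes, and the comparison of mean scores between adjoining groups, can be sketched as follows. Group membership is assigned by rank, so tied scores are split arbitrarily, which is an assumption of this sketch rather than a description of how any of the cited studies handled ties. The ten scores are hypothetical; the final example uses the urban Brazil means from Table 2.</p>

```python
from statistics import mean

def assign_groups(scores, shares):
    """Assign each score to a group by rank; shares are cumulative cut-offs.

    shares=[0.2, 0.4, 0.6, 0.8, 1.0] gives quintiles; [0.4, 0.8, 1.0]
    gives the poorest 40%, the middle 40% and the richest 20%.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        position = (rank + 1) / len(scores)
        groups[i] = next(
            g for g, cut in enumerate(shares) if cut >= position - 1e-12
        )
    return groups

def mean_by_group(scores, groups):
    """Mean score within each group, keyed by group number (0 = poorest)."""
    return {
        g: mean([s for s, k in zip(scores, groups) if k == g])
        for g in sorted(set(groups))
    }

def adjacent_gaps(means):
    """Difference in mean score between each pair of adjoining groups."""
    return [round(b - a, 2) for a, b in zip(means, means[1:])]

# Hypothetical household scores.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 1.5, 1.8]
quintiles = assign_groups(scores, [0.2, 0.4, 0.6, 0.8, 1.0])
three_groups = assign_groups(scores, [0.4, 0.8, 1.0])
quintile_means = mean_by_group(scores, quintiles)

# Urban Brazil quintile means from Table 2.
urban_brazil_gaps = adjacent_gaps([-2.96, -0.82, 0.35, 1.33, 2.14])
```

<p>For urban Brazil the gaps are 2.14, 1.17, 0.98 and 0.81, which shows the larger difference between the poorest and second poorest groups noted above.</p>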
<p>Internal coherence compares the mean value for each asset variable by socio-economic group, in our example, quintiles. Filmer and Pritchett (
<xref ref-type="bibr" rid="B5">2001</xref>
) and McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
) examined internal coherence of the asset-based index in their studies, and both found mean asset ownership differed by socio-economic group. In our example, ownership of all asset variables, piped water in residence and toilet to sewer increased by socio-economic group in urban and rural Brazil. For example, 31% and 0.2% of households owned a refrigerator in the poorest quintile (urban and rural Brazil, respectively), compared with over 99% in the richest quintiles in both sites (data not shown). In urban Ethiopia, ownership of all assets (except telephone), piped water in residence, tap in compound and use of a flush toilet increased by socio-economic group. In rural Ethiopia, access to a pit latrine increased by socio-economic group, and the proportion of households reporting no sanitation facility decreased by socio-economic group. However, there was no clear trend by socio-economic group of sources of water or most types of floor material (
<xref ref-type="table" rid="T3">Table 3</xref>
).
<table-wrap id="T3">
<label>
<bold>Table 3.</bold>
</label>
<caption>
<p>Ownership of durable assets and housing characteristics by SES quintile</p>
</caption>
<table frame="hsides">
<thead align="left">
<tr>
<th>Variable description</th>
<th colspan="5">Urban Ethiopia</th>
<th colspan="5">Rural Ethiopia</th>
</tr>
<tr>
<th></th>
<th colspan="5">
<hr></hr>
</th>
<th colspan="5">
<hr></hr>
</th>
</tr>
<tr>
<th></th>
<th>Poorest</th>
<th>Second</th>
<th>Middle</th>
<th>Fourth</th>
<th>Richest</th>
<th>Poorest</th>
<th>Second</th>
<th>Middle</th>
<th>Fourth</th>
<th>Richest</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td>Electricity</td>
<td>0.350</td>
<td>0.824</td>
<td>0.980</td>
<td>0.997</td>
<td>1.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.074</td>
</tr>
<tr>
<td>Radio</td>
<td>0.237</td>
<td>0.621</td>
<td>0.741</td>
<td>0.864</td>
<td>0.990</td>
<td>0.000</td>
<td>0.000</td>
<td>0.284</td>
<td>0.178</td>
<td>0.407</td>
</tr>
<tr>
<td>Television</td>
<td>0.000</td>
<td>0.007</td>
<td>0.039</td>
<td>0.170</td>
<td>0.869</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.001</td>
</tr>
<tr>
<td>Car</td>
<td>0.000</td>
<td>0.000</td>
<td>0.003</td>
<td>0.010</td>
<td>0.165</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bicycle</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.022</td>
</tr>
<tr>
<td>Telephone</td>
<td>0.000</td>
<td>0.001</td>
<td>0.110</td>
<td>0.059</td>
<td>0.614</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="11">
<bold>Source of water supply</bold>
</td>
</tr>
<tr>
<td>In-residence tap</td>
<td>0.001</td>
<td>0.003</td>
<td>0.001</td>
<td>0.014</td>
<td>0.018</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>In-compound tap</td>
<td>0.008</td>
<td>0.055</td>
<td>0.300</td>
<td>0.782</td>
<td>0.949</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.004</td>
</tr>
<tr>
<td>Out-of-compound tap</td>
<td>0.590</td>
<td>0.732</td>
<td>0.643</td>
<td>0.187</td>
<td>0.033</td>
<td>0.000</td>
<td>0.000</td>
<td>0.295</td>
<td>0.027</td>
<td>0.135</td>
</tr>
<tr>
<td>Covered well</td>
<td>0.062</td>
<td>0.039</td>
<td>0.014</td>
<td>0.001</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.295</td>
<td>0.020</td>
<td>0.175</td>
</tr>
<tr>
<td>Open well</td>
<td>0.034</td>
<td>0.019</td>
<td>0.005</td>
<td>0.000</td>
<td>0.000</td>
<td>0.220</td>
<td>0.003</td>
<td>0.118</td>
<td>0.002</td>
<td>0.027</td>
</tr>
<tr>
<td>Covered spring</td>
<td>0.118</td>
<td>0.099</td>
<td>0.022</td>
<td>0.006</td>
<td>0.000</td>
<td>0.000</td>
<td>0.893</td>
<td>0.022</td>
<td>0.599</td>
<td>0.384</td>
</tr>
<tr>
<td>River</td>
<td>0.182</td>
<td>0.047</td>
<td>0.007</td>
<td>0.003</td>
<td>0.000</td>
<td>0.780</td>
<td>0.001</td>
<td>0.238</td>
<td>0.320</td>
<td>0.244</td>
</tr>
<tr>
<td>Other water</td>
<td>0.004</td>
<td>0.005</td>
<td>0.008</td>
<td>0.007</td>
<td>0.000</td>
<td>0.000</td>
<td>0.104</td>
<td>0.031</td>
<td>0.032</td>
<td>0.030</td>
</tr>
<tr>
<td colspan="11">
<bold>Sanitation facility</bold>
</td>
</tr>
<tr>
<td>Flush toilet</td>
<td>0.000</td>
<td>0.000</td>
<td>0.005</td>
<td>0.028</td>
<td>0.144</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Traditional pit latrine</td>
<td>0.190</td>
<td>0.750</td>
<td>0.924</td>
<td>0.917</td>
<td>0.792</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.615</td>
</tr>
<tr>
<td>Ventilated improved pit latrine</td>
<td>0.000</td>
<td>0.028</td>
<td>0.023</td>
<td>0.051</td>
<td>0.061</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.004</td>
</tr>
<tr>
<td>No sanitation facility</td>
<td>0.810</td>
<td>0.222</td>
<td>0.047</td>
<td>0.004</td>
<td>0.003</td>
<td>1.000</td>
<td>1.000</td>
<td>1.000</td>
<td>1.000</td>
<td>0.381</td>
</tr>
<tr>
<td colspan="11">
<bold>Type of floor material</bold>
</td>
</tr>
<tr>
<td>Earth floor</td>
<td>0.795</td>
<td>0.614</td>
<td>0.270</td>
<td>0.027</td>
<td>0.003</td>
<td>1.000</td>
<td>0.998</td>
<td>0.877</td>
<td>0.169</td>
<td>0.439</td>
</tr>
<tr>
<td>Dung floor</td>
<td>0.194</td>
<td>0.191</td>
<td>0.099</td>
<td>0.050</td>
<td>0.001</td>
<td>0.000</td>
<td>0.000</td>
<td>0.053</td>
<td>0.821</td>
<td>0.498</td>
</tr>
<tr>
<td>Wood floor</td>
<td>0.001</td>
<td>0.016</td>
<td>0.043</td>
<td>0.047</td>
<td>0.035</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.014</td>
</tr>
<tr>
<td>Polished wood/parquet floor</td>
<td>0.000</td>
<td>0.005</td>
<td>0.003</td>
<td>0.058</td>
<td>0.135</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Vinyl floor</td>
<td>0.001</td>
<td>0.038</td>
<td>0.218</td>
<td>0.269</td>
<td>0.328</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Ceramic/tiles/brick floor</td>
<td>0.001</td>
<td>0.001</td>
<td>0.007</td>
<td>0.038</td>
<td>0.085</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cement floor</td>
<td>0.004</td>
<td>0.108</td>
<td>0.305</td>
<td>0.431</td>
<td>0.344</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.035</td>
</tr>
<tr>
<td>Carpet floor</td>
<td>0.000</td>
<td>0.011</td>
<td>0.030</td>
<td>0.045</td>
<td>0.065</td>
<td>0.000</td>
<td>0.002</td>
<td>0.070</td>
<td>0.011</td>
<td>0.010</td>
</tr>
<tr>
<td>Other floor</td>
<td>0.001</td>
<td>0.015</td>
<td>0.024</td>
<td>0.035</td>
<td>0.004</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.000</td>
<td>0.003</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>While we find evidence of internal coherence for urban and rural Brazil and urban Ethiopia, we cannot conclude that the index is internally coherent for rural Ethiopia.</p>
<p>The assumption that the distribution of SES is quite uniform may not be appropriate in all settings, for example in rural Ethiopia. Histograms of the household socio-economic scores for each site are shown in
<xref ref-type="fig" rid="F2">Figure 2</xref>
. The distribution of scores tends to follow a normal curve for rural Brazil and urban Ethiopia. For urban Brazil, it is skewed to the left. For rural Ethiopia, it is heavily skewed to the right, highlighting the extent of clumping and truncation which have made it difficult to differentiate between socio-economic groups.
<fig id="F2">
<label>
<bold>Figure 2.</bold>
</label>
<caption>
<p>Distribution of socio-economic scores</p>
</caption>
<graphic xlink:href="czl029f2"></graphic>
</fig>
</p>
<p>A data-driven approach to classifying households is cluster analysis, as used in a study by Cortinovis et al. (
<xref ref-type="bibr" rid="B2">1993</xref>
). Cluster analysis is a statistical procedure that assigns cases to a fixed number of groups or clusters according to a set of variables. The procedure groups cases and derives cluster centres so that the difference between cluster means is as large as possible. We used cluster analysis on the household socio-economic score derived for each site to investigate the distribution of ‘low’, ‘medium’ and ‘high’ socio-economic groups (
<xref ref-type="table" rid="T4">Table 4</xref>
). Cluster analysis generally fitted the patterns found from the distribution of the household socio-economic scores shown in the histograms. So in our case, applying arbitrary cut-off points, such as the 40–40–20 split as in Filmer and Pritchett (
<xref ref-type="bibr" rid="B5">2001</xref>
), would disaggregate the distribution for urban Ethiopia, for example, but it would not reflect the clustered nature of the underlying data for rural Ethiopia.
<table-wrap id="T4">
<label>
<bold>Table 4.</bold>
</label>
<caption>
<p>Proportion of households in low, medium and high socio-economic group for entire sample</p>
</caption>
<table frame="hsides">
<thead align="left">
<tr>
<th>Site</th>
<th>N</th>
<th>Low (%)</th>
<th>Medium (%)</th>
<th>High (%)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td>Urban Brazil</td>
<td>10 527</td>
<td>17.77</td>
<td>36.28</td>
<td>45.95</td>
</tr>
<tr>
<td>Rural Brazil</td>
<td>2 756</td>
<td>35.92</td>
<td>29.75</td>
<td>34.33</td>
</tr>
<tr>
<td>Urban Ethiopia</td>
<td>3 629</td>
<td>38.58</td>
<td>40.20</td>
<td>21.22</td>
</tr>
<tr>
<td>Rural Ethiopia</td>
<td>10 443</td>
<td>59.26</td>
<td>30.73</td>
<td>10.01</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
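<p>The grouping of household scores into ‘low’, ‘medium’ and ‘high’ clusters can be illustrated with a minimal sketch. It assumes a k-means style procedure with three clusters applied to one-dimensional scores; the text does not state which clustering algorithm or software was used, so the function names (kmeans_1d, assign_groups, group_shares) and the scores below are hypothetical and serve only as illustration.</p>
<preformat>
```python
def kmeans_1d(scores, k=3, max_iter=100):
    """Lloyd's k-means on one-dimensional scores; returns sorted centres."""
    lo, hi = min(scores), max(scores)
    # Assumption: deterministic start, centres spread evenly over the range.
    centres = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for s in scores:
            nearest = min(range(k), key=lambda j: abs(s - centres[j]))
            groups[nearest].append(s)
        # An empty cluster keeps its previous centre.
        updated = [sum(g) / len(g) if g else centres[j]
                   for j, g in enumerate(groups)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)


def assign_groups(scores, centres):
    """Index of the nearest centre for each score (0 = lowest group)."""
    return [min(range(len(centres)), key=lambda j: abs(s - centres[j]))
            for s in scores]


def group_shares(labels, k=3):
    """Percentage of households falling in each group."""
    return [100.0 * labels.count(j) / len(labels) for j in range(k)]


# Hypothetical household scores, clumped at the low end of the range.
scores = [-1.2, -1.1, -1.1, -1.0, -1.0, -0.9, 0.4, 0.5, 0.6, 2.8]
centres = kmeans_1d(scores)
labels = assign_groups(scores, centres)
shares = group_shares(labels)
print(shares)
```
</preformat>
<p>With clumped scores such as these, the group sizes follow the data rather than a fixed split such as 40–40–20, which is the contrast drawn in the paragraph above.</p>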
<p>To summarize, for rural Brazil and urban Ethiopia, the distribution of socio-economic scores shows little evidence of clumping and truncation, suggesting an appropriate and sufficient choice of variables, and the results were found to be internally coherent across quintiles. While the results for urban Brazil were internally coherent, there is some evidence of truncation at the top, suggesting the variables included in the analysis were not sufficient to distinguish households among the rich.</p>
<p>For rural Ethiopia, the distribution of SES was heavily skewed, reflected by almost 60% of households being classified into the low socio-economic group using cluster analysis. The example for rural Ethiopia has highlighted the difficulties of using asset-based indices in some settings. Clumping or truncation can result from using variables which are unable to distinguish households, or it could reflect that households are in fact homogeneous in terms of SES.</p>
<p>The decision on whether to construct a socio-economic index at country level (e.g. Gwatkin et al. 2000) or at community level (e.g. Schellenberg et al.
<xref ref-type="bibr" rid="B19">2003</xref>
) depends on the objectives of the study and the comparisons to be made. Constructing an index at country level risks failing to capture wealth differences in, for example, rural or regional communities, and constructing an index at community level increases the risk of clumping and truncation. If the analysis is to be undertaken for a rural community, Houweling et al. (
<xref ref-type="bibr" rid="B11">2003</xref>
) advise including items associated with SES for that location. Planning surveys beforehand, and using local knowledge to pick out variables that could discriminate households into groups, could help to determine such a list of indicators. However, there will continue to be a trade-off in terms of the additional expense of obtaining more specialized data for a particular setting, and the simplicity of using asset-based measures.</p>
</sec>
</sec>
<sec sec-type="discussion" id="SEC4">
<title>4. Discussion</title>
<p>This paper describes the process of deriving a SES index in the absence of income or consumption data by performing PCA on durable asset ownership, access to utilities and infrastructure, and housing characteristic variables. The main advantage of this method over the more traditional methods based on income and consumption expenditure is that it avoids many of the measurement problems associated with income- and consumption-based methods, such as recall bias, seasonality and data collection time. Compared with other statistical alternatives, PCA is computationally easier, can use the type of data that can be more easily collected in household surveys, and uses all of the variables in reducing the dimensionality of the data (Jobson
<xref ref-type="bibr" rid="B12">1992</xref>
). Socio-economic categorization is obtained by ranking then classifying households within the distribution into various groupings. The indices derived are relative measures of SES, so while this type of measure is useful for considering inequality between households, it cannot provide information on absolute levels of poverty within a community (McKenzie
<xref ref-type="bibr" rid="B15">2003</xref>
). It can be used for comparison across countries or settings (such as urban/rural), or over time, provided the separate indices are calculated with the same variables.</p>
<p>Debate about the use of PCA reflects the fact that principal components are artificially constructed indices. Critics of PCA argue that the technique is arbitrary and that the method of choosing the number of components and the variables to include is not well defined. The empirical basis for the technique rests on whether the first principal component can predict SES. This depends entirely on the nature of the data, the relationships between the variables being considered, the validity of the variables included and their reliability.</p>
<p>The choice of variables included can have an impact on the observed poor-rich difference in health outcomes. For example, Houweling et al. (
<xref ref-type="bibr" rid="B11">2003</xref>
) found variables which were a direct determinant of child health outcomes influenced the classification of socio-economic groups, and Lindelow (
<xref ref-type="bibr" rid="B13">2002</xref>
) found geographic bias with the inclusion of infrastructure variables. For comparative purposes, consideration needs to be given to the variables used.</p>
<p>Many studies using asset-based indices appear to have relied on the ‘face validity’ of the variables included, i.e. they appear to capture household wealth. Validation of PCA-based SES indices has been undertaken by Filmer and Pritchett (
<xref ref-type="bibr" rid="B5">2001</xref>
) for data from India, Indonesia, Pakistan and Nepal. Their study contained both asset and expenditure information and found coherence between the results of the PCA- and expenditure-based classifications, and also concluded that the index was robust to the variables included. In addition, Lindelow (
<xref ref-type="bibr" rid="B13">2002</xref>
) concluded that consumption expenditure and the PCA-based index are different proxies for the same underlying construct of interest.</p>
<p>Few studies have considered the reliability of collecting asset data. While a study by Onwujekwe et al. (2006) found that the reliability of collecting some asset variables was not high, Montgomery et al. (
<xref ref-type="bibr" rid="B16">2000</xref>
) suggest that household income data are also unreliable.</p>
<p>There are alternatives to PCA that can reduce the dimensionality of the data using methods such as correspondence analysis, multivariate regression or factor analysis. Cortinovis et al. (
<xref ref-type="bibr" rid="B2">1993</xref>
) used correspondence analysis to derive a SES measure. However, the analysis can only be used for categorical data (nominal and ordinal); continuous data would need to be reorganized into ranges. With multivariate regression, dimensionality reduction is accomplished by simply choosing which variables to leave out, at the expense of ignoring some dimensions of the data. Factor analysis was used by Sahn and Stifel (
<xref ref-type="bibr" rid="B18">2003</xref>
) and has a similar aim to PCA, in terms of expressing a set of variables as a smaller number of indices or factors. The difference between the two is that while there are no assumptions associated with PCA, the factors derived from factor analysis are assumed to represent the underlying processes that result in the correlations between the variables.</p>
<p>Issues related to the underlying data will affect PCA and this should be considered when creating and interpreting results. Clearly, there are methodological issues that need to be considered when developing PCA-based indices. The recent work on PCA-based SES indices suggests that these can be validated and are robust. McKenzie (
<xref ref-type="bibr" rid="B15">2003</xref>
) states that there are a number of theoretical questions of interest in which wealth inequality is more important than consumption or income inequality, so an asset-based inequality measure may be preferred in empirical tests. However, it is up to the user to bear in mind that PCA is best considered as a summary empirical method.</p>
</sec>
</body>
<back>
<fn-group>
<fn id="FN1">
<p>
<sup>1</sup>
A vector that results in a scalar multiple of itself when multiplied by a matrix is known as an eigenvector, and the scalar is its associated eigenvalue. Eigenvectors can only be found for square matrices (though not all), and for an n × n matrix, there are n eigenvectors. For a more detailed description of matrix algebra, and in particular eigenvectors and eigenvalues, see Manly (
<xref ref-type="bibr" rid="B14">1994</xref>
).</p>
</fn>
<fn id="FN2">
<p>
<sup>2</sup>
Brazil is a lower-middle-income country with a GNI per capita of US$3090. With a GNI per capita of US$110 Ethiopia is one of the world's poorest countries ([
<ext-link ext-link-type="uri" xlink:href="http://www.worldbank.org">http://www.worldbank.org</ext-link>
]). The urban population was 83% in Brazil and 16% in Ethiopia in 2003 (UNDP
<xref ref-type="bibr" rid="B20">2005</xref>
). We used the 1996 Brazil DHS and 2000 Ethiopia DHS.</p>
</fn>
<fn id="FN3">
<p>
<sup>3</sup>
The construction of a number of binary variables from categorical variables is another way to organize the data, although nominally new variables are created. For example, converting the categorical variable RELIGION, with the values Christian, Muslim, Jewish and Buddhist, to binary form would mean creating four new variables, CHRISTIAN, MUSLIM, JEWISH and BUDDHIST, each of which takes the value 0 or 1. As the nature of categorical variables is that there is no hierarchical relationship between the variables (which is why they cannot be converted into a meaningful quantitative scale), their conversion into binary variables and inclusion as additional variables does not change the relationship between the variables nor add any additional variation or correlation in the dataset. Rather, with individual variables, PCA can determine which of the particular religion variables can differentiate between households.</p>
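<p>The conversion described in this note can be sketched as follows. This is a minimal illustration rather than the procedure used for the analysis; the helper name to_binary_indicators and the five example households are hypothetical.</p>
<preformat>
```python
def to_binary_indicators(values):
    """Turn one categorical variable into one 0/1 variable per category."""
    categories = sorted(set(values))
    return {c.upper(): [1 if v == c else 0 for v in values]
            for c in categories}


# Hypothetical RELIGION values for five households.
religion = ["Christian", "Muslim", "Jewish", "Buddhist", "Muslim"]
indicators = to_binary_indicators(religion)

for name, column in indicators.items():
    print(name, column)
```
</preformat>
<p>Each household has a value of 1 on exactly one of the new variables, so no ordering between the categories is implied.</p>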
</fn>
<fn id="FN4">
<p>
<sup>4</sup>
PCA is not invariant to differences in the units of measurement among variables, therefore it is usual to standardize the variables in this instance (Bolch and Huang
<xref ref-type="bibr" rid="B1">1974</xref>
). Standardization is the process of transforming variables so that the new set of scores has a mean equal to zero and a standard deviation equal to one. The correlation matrix is a standardized version of the covariance matrix.</p>
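<p>As a small worked sketch of this note, the code below standardizes two variables measured in very different units and checks that the covariance of the standardized variables equals the correlation of the original ones. The variable names and values (rooms, land_m2) are hypothetical, and the sample (n − 1) form of the variance is an assumption.</p>
<preformat>
```python
import math


def covariance(a, b):
    """Sample covariance of two equally long lists (n - 1 denominator)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def standardize(column):
    """Rescale a variable to mean zero and standard deviation one."""
    mean = sum(column) / len(column)
    sd = math.sqrt(covariance(column, column))
    return [(x - mean) / sd for x in column]


def correlation(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


# Hypothetical variables measured in very different units.
rooms = [1.0, 2.0, 2.0, 3.0, 5.0]
land_m2 = [50.0, 400.0, 150.0, 900.0, 1200.0]

z_rooms = standardize(rooms)
z_land = standardize(land_m2)

# Covariance of the standardized variables equals the correlation
# of the original variables (up to rounding error).
print(covariance(z_rooms, z_land), correlation(rooms, land_m2))
```
</preformat>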
</fn>
<fn id="FN5">
<p>
<sup>5</sup>
Factor score for ownership of a bicycle not stated in Houweling et al. (
<xref ref-type="bibr" rid="B11">2003</xref>
).</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bolch</surname>
<given-names>BW</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>CJ</given-names>
</name>
</person-group>
<source>Multivariate statistical methods for business and economics</source>
<year>1974</year>
<publisher-loc>Englewood Cliffs, NJ</publisher-loc>
<publisher-name>Prentice Hall</publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cortinovis</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Vela</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Ndiku</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Construction of a socio-economic index to facilitate analysis of health data in developing countries</article-title>
<source>Social Science and Medicine</source>
<year>1993</year>
<volume>36</volume>
<fpage>1087</fpage>
<lpage>97</lpage>
</nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deaton</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Health, inequality and economic development</article-title>
<source>Journal of Economic Literature</source>
<year>2003</year>
<volume>41</volume>
<fpage>113</fpage>
<lpage>58</lpage>
</nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Falkingham</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Namazie</surname>
<given-names>C</given-names>
</name>
</person-group>
<source>Measuring health and poverty: a review of approaches to identifying the poor</source>
<year>2002</year>
<access-date>Accessed 5 April 2006</access-date>
<publisher-loc>London</publisher-loc>
<publisher-name>Department for International Development Health Systems Resource Centre (DFID HSRC)</publisher-name>
<comment>at: [
<ext-link ext-link-type="uri" xlink:href="http://www.eldis.org/static/DOC11501.htm">http://www.eldis.org/static/DOC11501.htm</ext-link>
]</comment>
</nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Filmer</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pritchett</surname>
<given-names>LH</given-names>
</name>
</person-group>
<article-title>Estimating wealth effects without expenditure data – or tears: an application to educational enrollments in states of India</article-title>
<source>Demography</source>
<year>2001</year>
<volume>38</volume>
<fpage>115</fpage>
<lpage>32</lpage>
</nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Giri</surname>
<given-names>NC</given-names>
</name>
</person-group>
<source>Multivariate statistical analysis</source>
<year>2004</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Marcel Dekker Inc</publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gwatkin</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Rutstein</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<source>Socio-economic differences in Brazil</source>
<year>2000</year>
<access-date>Accessed 5 January 2004</access-date>
<publisher-loc>Washington, DC</publisher-loc>
<publisher-name>HNP/Poverty Thematic Group of the World Bank</publisher-name>
<comment>online at: [
<ext-link ext-link-type="uri" xlink:href="http://www.worldbank.org/poverty/health/data/index.htm#lcr">http://www.worldbank.org/poverty/health/data/index.htm#lcr</ext-link>
]</comment>
</nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gwatkin</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Rutstein</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<source>Socio-economic differences in Ethiopia. Health, Nutrition, and Population in Ethiopia</source>
<year>2000</year>
<access-date>Accessed 5 January 2004</access-date>
<publisher-loc>Washington, DC</publisher-loc>
<publisher-name>HNP/Poverty Thematic Group of the World Bank</publisher-name>
<comment>online at [
<ext-link ext-link-type="uri" xlink:href="http://www.worldbank.org/poverty/health/data/index.htm#lcr">http://www.worldbank.org/poverty/health/data/index.htm#lcr</ext-link>
]</comment>
</nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gwatkin</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Rutstein</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<source>Socio-economic differences in Nigeria. Health, Nutrition, and Population in Nigeria</source>
<year>2000</year>
<access-date>Accessed 19 March 2002</access-date>
<publisher-loc>Washington, DC</publisher-loc>
<publisher-name>HNP/Poverty Thematic Group of the World Bank</publisher-name>
<comment>online at: [
<ext-link ext-link-type="uri" xlink:href="http://www.worldbank.org/poverty/health/data/index.htm#lcr">http://www.worldbank.org/poverty/health/data/index.htm#lcr</ext-link>
]</comment>
</nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hanson</surname>
<given-names>K</given-names>
</name>
<name>
<surname>McPake</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Nakamba</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Archard</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Preferences for hospital quality in Zambia: results from a discrete choice experiment</article-title>
<source>Health Economics</source>
<year>2005</year>
<volume>14</volume>
<fpage>687</fpage>
<lpage>701</lpage>
</nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Houweling</surname>
<given-names>TAJ</given-names>
</name>
<name>
<surname>Kunst</surname>
<given-names>AE</given-names>
</name>
<name>
<surname>Mackenbach</surname>
<given-names>JP</given-names>
</name>
</person-group>
<article-title>Measuring health inequality among children in developing countries: does the choice of the indicator of economic status matter?</article-title>
<source>International Journal for Equity in Health</source>
<year>2003</year>
<volume>2</volume>
<fpage>8</fpage>
</nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jobson</surname>
<given-names>JD</given-names>
</name>
</person-group>
<source>Applied multivariate data analysis</source>
<year>1992</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Springer-Verlag</publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lindelow</surname>
<given-names>M</given-names>
</name>
</person-group>
<source>Sometimes more equal than others: How the choice of welfare indicator can affect the measurement of health inequalities and the incidence of public spending</source>
<year>2002</year>
<conf-name>CSAE Working Paper Series 2002–15</conf-name>
<conf-loc>Oxford: Centre for Study of African Economies, University of Oxford</conf-loc>
</nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Manly</surname>
<given-names>BFJ</given-names>
</name>
</person-group>
<source>Multivariate statistical methods. A primer</source>
<year>1994</year>
<edition>2nd Edition</edition>
<publisher-loc>London</publisher-loc>
<publisher-name>Chapman and Hall</publisher-name>
</nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>McKenzie</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Measuring inequality with asset indicators.
<italic>BREAD Working Paper</italic>
No. 042</article-title>
<year>2003</year>
<conf-loc>Cambridge, MA: Bureau for Research and Economic Analysis of Development, Center for International Development, Harvard University</conf-loc>
</nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Montgomery</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Gragnolati</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Burke</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Paredes</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Measuring living standards with proxy variables</article-title>
<source>Demography</source>
<year>2000</year>
<volume>37</volume>
<fpage>155</fpage>
<lpage>74</lpage>
</nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Onwujekwe</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Hanson</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Fox-Rushby</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Some indicators of socio-economic status may not be reliable and use of indices with these data could worsen equity</article-title>
<source>Health Economics</source>
<year>2006</year>
<volume>15</volume>
<fpage>639</fpage>
<lpage>44</lpage>
</nlm-citation>
</ref>
<ref id="B18">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sahn</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Stifel</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Exploring alternative measures of welfare in the absence of expenditure data</article-title>
<source>Review of Income and Wealth</source>
<year>2003</year>
<volume>49</volume>
<fpage>463</fpage>
<lpage>89</lpage>
</nlm-citation>
</ref>
<ref id="B19">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schellenberg</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Victora</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Mushi</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Inequities among the very poor: health care for children in southern Tanzania</article-title>
<source>The Lancet</source>
<year>2003</year>
<volume>361</volume>
<fpage>561</fpage>
<lpage>6</lpage>
</nlm-citation>
</ref>
<ref id="B20">
<nlm-citation citation-type="gov">
<collab>UNDP</collab>
<source>Human development report 2005. International cooperation at a crossroads: aid, trade and security in an unequal world</source>
<year>2005</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Oxford University Press for the United Nations Development Programme (UNDP)</publisher-name>
</nlm-citation>
</ref>
</ref-list>
<ack>
<title>Acknowledgements</title>
<p>We would like to thank Kara Hanson, Peter Vickerman and the two anonymous reviewers for their technical input and comments on a draft of the manuscript. We would also like to thank Jo Borghi, Jolene Skordis, Damian Walker and Natasha Palmer for comments on an earlier version of the manuscript.</p>
</ack>
</back>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo>
<title>Constructing socio-economic status indices: how to use principal components analysis</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA">
<title>Constructing socio-economic status indices: how to use principal components analysis</title>
</titleInfo>
<name type="personal" displayLabel="corresp">
<namePart type="given">Seema</namePart>
<namePart type="family">Vyas</namePart>
<affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</affiliation>
<affiliation>E-mail: seema.vyas@lshtm.ac.uk</affiliation>
<affiliation>Correspondence: Seema Vyas, HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK. Tel: +44</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
<description>Seema Vyas is a Research Fellow with the Health Policy Unit, Department of Public Health and Policy, LSHTM. She specializes in quantitative and private health sector analysis.</description>
</name>
<name type="personal">
<namePart type="given">Lilani</namePart>
<namePart type="family">Kumaranayake</namePart>
<affiliation>HIVTools Research Group, Health Policy Unit, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
<description>Lilani Kumaranayake is a Lecturer in Health Economics and Policy, Department of Public Health and Policy, LSHTM. She specializes in the economics of HIV/AIDS, private health sector and quantitative analysis.</description>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article" authority="ISTEX" authorityURI="https://content-type.data.istex.fr" valueURI="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</genre>
<originInfo>
<publisher>Oxford University Press</publisher>
<dateIssued encoding="w3cdtf">2006-11</dateIssued>
<dateCreated encoding="w3cdtf">2006-10-09</dateCreated>
<copyrightDate encoding="w3cdtf">2006</copyrightDate>
</originInfo>
<abstract>Theoretically, measures of household wealth can be reflected by income, consumption or expenditure information. However, the collection of accurate income and consumption data requires extensive resources for household surveys. Given the increasingly routine application of principal components analysis (PCA) using asset data in creating socio-economic status (SES) indices, we review how PCA-based indices are constructed, how they can be used, and their validity and limitations. Specifically, issues related to choice of variables, data preparation and problems such as data clustering are addressed. Interpretation of results and methods of classifying households into SES groups are also discussed. PCA has been validated as a method to describe SES differentiation within a population. Issues related to the underlying data will affect PCA and this should be considered when generating and interpreting results.</abstract>
<subject>
<genre>keywords</genre>
<topic>socio-economic status</topic>
<topic>principal components analysis</topic>
<topic>cluster analysis</topic>
<topic>methodology</topic>
</subject>
<relatedItem type="host">
<titleInfo>
<title>Health Policy and Planning</title>
</titleInfo>
<genre type="journal" authority="ISTEX" authorityURI="https://publication-type.data.istex.fr" valueURI="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</genre>
<subject>
<topic>How to do (or not to do) …</topic>
</subject>
<identifier type="ISSN">0268-1080</identifier>
<identifier type="PublisherID">heapol</identifier>
<identifier type="PublisherID-hwp">heapol</identifier>
<part>
<date>2006</date>
<detail type="volume">
<caption>vol.</caption>
<number>21</number>
</detail>
<detail type="issue">
<caption>no.</caption>
<number>6</number>
</detail>
<extent unit="pages">
<start>459</start>
<end>468</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6</identifier>
<identifier type="DOI">10.1093/heapol/czl029</identifier>
<accessCondition type="use and reproduction" contentType="copyright">© The Author 2006. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine. All rights reserved.</accessCondition>
<recordInfo>
<recordContentSource authority="ISTEX" authorityURI="https://loaded-corpus.data.istex.fr" valueURI="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-GTWS0RDP-M">oup</recordContentSource>
<recordOrigin>© The Author 2006. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine. All rights reserved.</recordOrigin>
</recordInfo>
</mods>
<json:item>
<extension>json</extension>
<original>false</original>
<mimetype>application/json</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/metadata/json</uri>
</json:item>
</metadata>
<covers>
<json:item>
<extension>tiff</extension>
<original>true</original>
<mimetype>image/tiff</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/covers/tiff</uri>
</json:item>
</covers>
<annexes>
<json:item>
<extension>jpeg</extension>
<original>true</original>
<mimetype>image/jpeg</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/annexes/jpeg</uri>
</json:item>
<json:item>
<extension>gif</extension>
<original>true</original>
<mimetype>image/gif</mimetype>
<uri>https://api.istex.fr/document/C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6/annexes/gif</uri>
</json:item>
</annexes>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sante/explor/SidaSubSaharaV1/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 004132 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 004132 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Sante
   |area=    SidaSubSaharaV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:C867F36DE82CC38B57BD6FA6578E4D4C347ED9A6
   |texte=   Constructing socio-economic status indices: how to use principal components analysis
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Mon Nov 13 19:31:10 2017. Site generation: Wed Mar 6 19:14:32 2024