Lymphoedema exploration server


A robust model for read count data in exome sequencing experiments and implications for copy number variant calling

Internal identifier: 003B85 (Pmc/Corpus); previous: 003B84; next: 003B86

Authors: Vincent Plagnol; James Curtis; Michael Epstein; Kin Y. Mok; Emma Stebbings; Sofia Grigoriadou; Nicholas W. Wood; Sophie Hambleton; Siobhan O. Burns; Adrian J. Thrasher; Dinakantha Kumararatne; Rainer Doffinger; Sergey Nejentsev

RBID : PMC:3476336

Abstract

Motivation: Exome sequencing has proven to be an effective tool to discover the genetic basis of Mendelian disorders. It is well established that copy number variants (CNVs) contribute to the etiology of these disorders. However, calling CNVs from exome sequence data is challenging. A typical read depth strategy consists of using another sample (or a combination of samples) as a reference to control for the variability at the capture and sequencing steps. However, technical variability between samples complicates the analysis and can create spurious CNV calls.

Results: Here, we introduce ExomeDepth, a new CNV calling algorithm designed to control for this technical variability. ExomeDepth uses a robust model for the read count data and uses this model to build an optimized reference set in order to maximize the power to detect CNVs. As a result, ExomeDepth is effective across a wider range of exome datasets than the previously existing tools, even for small (e.g. one to two exons) and heterozygous deletions. We used this new approach to analyse exome data from 24 patients with primary immunodeficiencies. Depending on data quality and the exact target region, we find between 170 and 250 exonic CNV calls per sample. Our analysis identified two novel causative deletions in the genes GATA2 and DOCK8.

Availability: The code used in this analysis has been implemented into an R package called ExomeDepth and is available at the Comprehensive R Archive Network (CRAN).

Contact: v.plagnol@ucl.ac.uk

Supplementary Information: Supplementary data are available at Bioinformatics online.


DOI: 10.1093/bioinformatics/bts526
PubMed: 22942019
PubMed Central: 3476336

Links to Exploration step

PMC:3476336

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A robust model for read count data in exome sequencing experiments and implications for copy number variant calling</title>
<author>
<name sortKey="Plagnol, Vincent" sort="Plagnol, Vincent" uniqKey="Plagnol V" first="Vincent" last="Plagnol">Vincent Plagnol</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Genetics Institute, UCL, London,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Curtis, James" sort="Curtis, James" uniqKey="Curtis J" first="James" last="Curtis">James Curtis</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Medicine, University of Cambridge, Cambridge,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Epstein, Michael" sort="Epstein, Michael" uniqKey="Epstein M" first="Michael" last="Epstein">Michael Epstein</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Genetics Institute, UCL, London,</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL CoMPLEX program,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mok, Kin Y" sort="Mok, Kin Y" uniqKey="Mok K" first="Kin Y." last="Mok">Kin Y. Mok</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Institute of Neurology, UCL,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stebbings, Emma" sort="Stebbings, Emma" uniqKey="Stebbings E" first="Emma" last="Stebbings">Emma Stebbings</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Medicine, University of Cambridge, Cambridge,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Grigoriadou, Sofia" sort="Grigoriadou, Sofia" uniqKey="Grigoriadou S" first="Sofia" last="Grigoriadou">Sofia Grigoriadou</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Royal London Hospital, London,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wood, Nicholas W" sort="Wood, Nicholas W" uniqKey="Wood N" first="Nicholas W." last="Wood">Nicholas W. Wood</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Institute of Neurology, UCL,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hambleton, Sophie" sort="Hambleton, Sophie" uniqKey="Hambleton S" first="Sophie" last="Hambleton">Sophie Hambleton</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Burns, Siobhan O" sort="Burns, Siobhan O" uniqKey="Burns S" first="Siobhan O." last="Burns">Siobhan O. Burns</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="bts526-AFF1">Molecular Immunology Unit, Wolfson Centre for Gene Therapy of Childhood Disease, UCL Institute of Child Health, Great Ormond Street Hospital for Children, London</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Thrasher, Adrian J" sort="Thrasher, Adrian J" uniqKey="Thrasher A" first="Adrian J." last="Thrasher">Adrian J. Thrasher</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="bts526-AFF1">Molecular Immunology Unit, Wolfson Centre for Gene Therapy of Childhood Disease, UCL Institute of Child Health, Great Ormond Street Hospital for Children, London</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kumararatne, Dinakantha" sort="Kumararatne, Dinakantha" uniqKey="Kumararatne D" first="Dinakantha" last="Kumararatne">Dinakantha Kumararatne</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Clinical Biochemistry and Immunology, Addenbrookes Hospital, Cambridge, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Doffinger, Rainer" sort="Doffinger, Rainer" uniqKey="Doffinger R" first="Rainer" last="Doffinger">Rainer Doffinger</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Clinical Biochemistry and Immunology, Addenbrookes Hospital, Cambridge, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Nejentsev, Sergey" sort="Nejentsev, Sergey" uniqKey="Nejentsev S" first="Sergey" last="Nejentsev">Sergey Nejentsev</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Medicine, University of Cambridge, Cambridge,</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22942019</idno>
<idno type="pmc">3476336</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3476336</idno>
<idno type="RBID">PMC:3476336</idno>
<idno type="doi">10.1093/bioinformatics/bts526</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">003B85</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">003B85</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A robust model for read count data in exome sequencing experiments and implications for copy number variant calling</title>
<author>
<name sortKey="Plagnol, Vincent" sort="Plagnol, Vincent" uniqKey="Plagnol V" first="Vincent" last="Plagnol">Vincent Plagnol</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Genetics Institute, UCL, London,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Curtis, James" sort="Curtis, James" uniqKey="Curtis J" first="James" last="Curtis">James Curtis</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Medicine, University of Cambridge, Cambridge,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Epstein, Michael" sort="Epstein, Michael" uniqKey="Epstein M" first="Michael" last="Epstein">Michael Epstein</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Genetics Institute, UCL, London,</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL CoMPLEX program,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mok, Kin Y" sort="Mok, Kin Y" uniqKey="Mok K" first="Kin Y." last="Mok">Kin Y. Mok</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Institute of Neurology, UCL,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stebbings, Emma" sort="Stebbings, Emma" uniqKey="Stebbings E" first="Emma" last="Stebbings">Emma Stebbings</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Medicine, University of Cambridge, Cambridge,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Grigoriadou, Sofia" sort="Grigoriadou, Sofia" uniqKey="Grigoriadou S" first="Sofia" last="Grigoriadou">Sofia Grigoriadou</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Royal London Hospital, London,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wood, Nicholas W" sort="Wood, Nicholas W" uniqKey="Wood N" first="Nicholas W." last="Wood">Nicholas W. Wood</name>
<affiliation>
<nlm:aff id="bts526-AFF1">UCL Institute of Neurology, UCL,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hambleton, Sophie" sort="Hambleton, Sophie" uniqKey="Hambleton S" first="Sophie" last="Hambleton">Sophie Hambleton</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Burns, Siobhan O" sort="Burns, Siobhan O" uniqKey="Burns S" first="Siobhan O." last="Burns">Siobhan O. Burns</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="bts526-AFF1">Molecular Immunology Unit, Wolfson Centre for Gene Therapy of Childhood Disease, UCL Institute of Child Health, Great Ormond Street Hospital for Children, London</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Thrasher, Adrian J" sort="Thrasher, Adrian J" uniqKey="Thrasher A" first="Adrian J." last="Thrasher">Adrian J. Thrasher</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="bts526-AFF1">Molecular Immunology Unit, Wolfson Centre for Gene Therapy of Childhood Disease, UCL Institute of Child Health, Great Ormond Street Hospital for Children, London</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kumararatne, Dinakantha" sort="Kumararatne, Dinakantha" uniqKey="Kumararatne D" first="Dinakantha" last="Kumararatne">Dinakantha Kumararatne</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Clinical Biochemistry and Immunology, Addenbrookes Hospital, Cambridge, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Doffinger, Rainer" sort="Doffinger, Rainer" uniqKey="Doffinger R" first="Rainer" last="Doffinger">Rainer Doffinger</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Clinical Biochemistry and Immunology, Addenbrookes Hospital, Cambridge, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Nejentsev, Sergey" sort="Nejentsev, Sergey" uniqKey="Nejentsev S" first="Sergey" last="Nejentsev">Sergey Nejentsev</name>
<affiliation>
<nlm:aff id="bts526-AFF1">Department of Medicine, University of Cambridge, Cambridge,</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation</bold>
: Exome sequencing has proven to be an effective tool to discover the genetic basis of Mendelian disorders. It is well established that copy number variants (CNVs) contribute to the etiology of these disorders. However, calling CNVs from exome sequence data is challenging. A typical read depth strategy consists of using another sample (or a combination of samples) as a reference to control for the variability at the capture and sequencing steps. However, technical variability between samples complicates the analysis and can create spurious CNV calls.</p>
<p>
<bold>Results</bold>
: Here
<bold>,</bold>
we introduce ExomeDepth, a new CNV calling algorithm designed to control for this technical variability. ExomeDepth uses a robust model for the read count data and uses this model to build an optimized reference set in order to maximize the power to detect CNVs. As a result, ExomeDepth is effective across a wider range of exome datasets than the previously existing tools, even for small (e.g. one to two exons) and heterozygous deletions. We used this new approach to analyse exome data from 24 patients with primary immunodeficiencies. Depending on data quality and the exact target region, we find between 170 and 250 exonic CNV calls per sample. Our analysis identified two novel causative deletions in the genes
<italic>GATA2</italic>
and
<italic>DOCK8</italic>
.</p>
<p>
<bold>Availability:</bold>
The code used in this analysis has been implemented into an R package called ExomeDepth and is available at the Comprehensive R Archive Network (CRAN).</p>
<p>
<bold>Contact</bold>
:
<email>v.plagnol@ucl.ac.uk</email>
</p>
<p>
<bold>Supplementary Information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Agresti, A" uniqKey="Agresti A">A Agresti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Conrad, Df" uniqKey="Conrad D">DF Conrad</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karakoc, E" uniqKey="Karakoc E">E Karakoc</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krumm, N" uniqKey="Krumm N">N Krumm</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Love, Mi" uniqKey="Love M">MI Love</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marioni, Jc" uniqKey="Marioni J">JC Marioni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mortazavi, A" uniqKey="Mortazavi A">A Mortazavi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ng, Sb" uniqKey="Ng S">SB Ng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ostergaard, P" uniqKey="Ostergaard P">P Ostergaard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sathirapongsasuti, Jf" uniqKey="Sathirapongsasuti J">JF Sathirapongsasuti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xie, C" uniqKey="Xie C">C Xie</name>
</author>
<author>
<name sortKey="Tammi, Mt" uniqKey="Tammi M">MT Tammi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, K" uniqKey="Ye K">K Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zeitouni, B" uniqKey="Zeitouni B">B Zeitouni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22942019</article-id>
<article-id pub-id-type="pmc">3476336</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/bts526</article-id>
<article-id pub-id-type="publisher-id">bts526</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Papers</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A robust model for read count data in exome sequencing experiments and implications for copy number variant calling</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Plagnol</surname>
<given-names>Vincent</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="bts526-COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Curtis</surname>
<given-names>James</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Epstein</surname>
<given-names>Michael</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Mok</surname>
<given-names>Kin Y.</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stebbings</surname>
<given-names>Emma</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Grigoriadou</surname>
<given-names>Sofia</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wood</surname>
<given-names>Nicholas W.</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hambleton</surname>
<given-names>Sophie</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>6</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Burns</surname>
<given-names>Siobhan O.</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>7</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Thrasher</surname>
<given-names>Adrian J.</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>7</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kumararatne</surname>
<given-names>Dinakantha</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>8</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Doffinger</surname>
<given-names>Rainer</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>8</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Nejentsev</surname>
<given-names>Sergey</given-names>
</name>
<xref ref-type="aff" rid="bts526-AFF1">
<sup>2</sup>
</xref>
</contrib>
</contrib-group>
<aff id="bts526-AFF1">
<sup>1</sup>
UCL Genetics Institute, UCL, London,
<sup>2</sup>
Department of Medicine, University of Cambridge, Cambridge,
<sup>3</sup>
UCL CoMPLEX program,
<sup>4</sup>
UCL Institute of Neurology, UCL,
<sup>5</sup>
Royal London Hospital, London,
<sup>6</sup>
Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne,
<sup>7</sup>
Molecular Immunology Unit, Wolfson Centre for Gene Therapy of Childhood Disease, UCL Institute of Child Health, Great Ormond Street Hospital for Children, London and
<sup>8</sup>
Department of Clinical Biochemistry and Immunology, Addenbrookes Hospital, Cambridge, UK</aff>
<author-notes>
<corresp id="bts526-COR1">*To whom correspondence should be addressed.</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>1</day>
<month>11</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>31</day>
<month>8</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>31</day>
<month>8</month>
<year>2012</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>28</volume>
<issue>21</issue>
<fpage>2747</fpage>
<lpage>2754</lpage>
<history>
<date date-type="received">
<day>1</day>
<month>7</month>
<year>2012</year>
</date>
<date date-type="rev-recd">
<day>12</day>
<month>8</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>8</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2012. Published by Oxford University Press.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/3.0/">http://creativecommons.org/licenses/by/3.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation</bold>
: Exome sequencing has proven to be an effective tool to discover the genetic basis of Mendelian disorders. It is well established that copy number variants (CNVs) contribute to the etiology of these disorders. However, calling CNVs from exome sequence data is challenging. A typical read depth strategy consists of using another sample (or a combination of samples) as a reference to control for the variability at the capture and sequencing steps. However, technical variability between samples complicates the analysis and can create spurious CNV calls.</p>
<p>
<bold>Results</bold>
: Here
<bold>,</bold>
we introduce ExomeDepth, a new CNV calling algorithm designed to control for this technical variability. ExomeDepth uses a robust model for the read count data and uses this model to build an optimized reference set in order to maximize the power to detect CNVs. As a result, ExomeDepth is effective across a wider range of exome datasets than the previously existing tools, even for small (e.g. one to two exons) and heterozygous deletions. We used this new approach to analyse exome data from 24 patients with primary immunodeficiencies. Depending on data quality and the exact target region, we find between 170 and 250 exonic CNV calls per sample. Our analysis identified two novel causative deletions in the genes
<italic>GATA2</italic>
and
<italic>DOCK8</italic>
.</p>
<p>
<bold>Availability:</bold>
The code used in this analysis has been implemented into an R package called ExomeDepth and is available at the Comprehensive R Archive Network (CRAN).</p>
<p>
<bold>Contact</bold>
:
<email>v.plagnol@ucl.ac.uk</email>
</p>
<p>
<bold>Supplementary Information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</abstract>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="SEC">
<title>1 INTRODUCTION</title>
<p>The improvement of DNA sequencing technologies in recent years has radically changed the identification of genetic variants associated with human diseases and, in particular, rare disorders (
<xref ref-type="bibr" rid="bts526-B9">Ng
<italic>et al.</italic>
, 2010</xref>
). The use of sequence capture technologies to target protein-coding regions in the human genome followed by high-throughput DNA sequencing (known as exome sequencing) currently provides a cost-efficient approach to discover causal mutations in patients with Mendelian disorders. The majority of published work using exome sequence data focuses on single nucleotide polymorphisms (SNPs) or small insertions/deletions (indels), mostly because short read DNA sequencing technologies are best suited to call these variants. Nevertheless, copy number variants (CNVs), e.g. larger chromosomal indels, also significantly contribute to the aetiology of Mendelian disorders. Three general strategies exist to call CNVs from short read sequence data (
<xref ref-type="bibr" rid="bts526-B7">Medvedev
<italic>et al.</italic>
, 2009</xref>
): split reads (
<xref ref-type="bibr" rid="bts526-B3">Karakoc
<italic>et al.</italic>
, 2011</xref>
;
<xref ref-type="bibr" rid="bts526-B13">Ye
<italic>et al.</italic>
, 2009</xref>
), paired-end reads (
<xref ref-type="bibr" rid="bts526-B14">Zeitouni
<italic>et al.</italic>
, 2010</xref>
) and read depth approaches (
<xref ref-type="bibr" rid="bts526-B4">Krumm
<italic>et al.</italic>
, 2012</xref>
;
<xref ref-type="bibr" rid="bts526-B11">Sathirapongsasuti
<italic>et al.</italic>
, 2011</xref>
;
<xref ref-type="bibr" rid="bts526-B12">Xie and Tammi, 2009</xref>
). Read depth analysis is particularly effective for exome data as it does not rely on sequencing into or near the CNV breakpoints. Generally speaking, read depth-based approaches for CNV calling compare the number of reads mapping to a chromosome window with its expectation under a statistical model. Deviations from this expectation are indicative of CNV calls. Similar to the array comparative genomic hybridization (aCGH) methodology, the ratio of read count between a test and a reference sample is usually preferred to a single-sample analysis in order to control for the typically extensive variability in capture efficiency across exons (
<xref ref-type="bibr" rid="bts526-B4">Krumm
<italic>et al.</italic>
, 2012</xref>
;
<xref ref-type="bibr" rid="bts526-B11">Sathirapongsasuti
<italic>et al.</italic>
, 2011</xref>
;
<xref ref-type="bibr" rid="bts526-B12">Xie and Tammi, 2009</xref>
). Most of the existing tools for CNV calling that are based on read depth, such as ExomeCNV (
<xref ref-type="bibr" rid="bts526-B11">Sathirapongsasuti
<italic>et al.</italic>
, 2011</xref>
) and CNV-seq (
<xref ref-type="bibr" rid="bts526-B12">Xie and Tammi, 2009</xref>
), make Gaussian assumptions about the distribution of the read count ratio. In the absence of technical variability, the proportion of reads matching to a specific sample should follow a binomial distribution whose success rate is determined by the genome-wide read count ratio between the test sample and the reference set, as well as the potential presence of CNVs. Additional covariates, such as GC content, can alter this success rate in situations where the effects of these covariates vary across samples (
<xref ref-type="bibr" rid="bts526-B6">Marioni
<italic>et al.</italic>
, 2007</xref>
).</p>
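<p>The binomial expectation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ExomeDepth implementation: the function names, the toy read counts and the copy-ratio parameterization (0.5 for a heterozygous deletion, 1.5 for a duplication) are assumptions introduced here.</p>
<preformat preformat-type="code">
```python
import math


def genome_wide_fraction(test_counts, ref_counts):
    """Fraction of all reads (test + reference) that come from the test sample."""
    total_test = sum(test_counts)
    total_ref = sum(ref_counts)
    return total_test / (total_test + total_ref)


def shifted_fraction(p0, copy_ratio):
    """Expected test fraction at an exon where the test sample carries
    copy_ratio times the normal dosage (assumed: 0.5 deletion, 1.5 duplication)."""
    return copy_ratio * p0 / (copy_ratio * p0 + (1.0 - p0))


def binomial_loglik(x, n, p):
    """Log-probability of x test reads out of n total reads under Binomial(n, p)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + x * math.log(p) + (n - x) * math.log(1.0 - p)


# Toy per-exon read counts (hypothetical values, for illustration only).
test_counts = [120, 95, 60, 210]
ref_counts = [130, 100, 125, 205]

p0 = genome_wide_fraction(test_counts, ref_counts)
for x, y in zip(test_counts, ref_counts):
    n = x + y
    ll_normal = binomial_loglik(x, n, p0)
    ll_deletion = binomial_loglik(x, n, shifted_fraction(p0, 0.5))
    # A positive difference favours the heterozygous deletion hypothesis.
    print(x, y, round(ll_deletion - ll_normal, 2))
```
</preformat>
<p>In this sketch, an exon whose test fraction falls well below the genome-wide fraction yields a positive log-likelihood difference in favour of a deletion. A pure binomial likelihood of this kind ignores technical variability between samples.</p>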
<p>Here, we evaluate two different exome sequence datasets and show that Gaussian assumptions generally do not hold. Technical variability at the library preparation, capture and sequencing steps creates noise that affects the number of reads matching particular exons in a sample-specific manner. As a result, the observed variance exceeds what is predicted by a binomial model, which affects the CNV calls. Motivated by this observation, we propose a modified and more robust statistical framework for CNV calling. We apply this model to provide guidelines for the construction of an optimized reference sequence dataset for CNV calling purposes, as well as realistic power estimates. We find that two main factors improve statistical power: increasing the read depth and controlling for any source of technical variability across samples at the capture and sequencing steps. We have developed and coded a new set of tools in an R package called ExomeDepth. We then illustrated its efficiency by discovering novel small causative CNVs in two patients with primary immunodeficiencies: a heterozygous deletion of two exons of the
<italic>GATA2</italic>
gene and a single-exon homozygous deletion in the
<italic>DOCK8</italic>
gene.</p>
</sec>
<sec id="SEC2">
<title>2 SYSTEM AND METHODS</title>
<sec id="SEC2.1">
<title>2.1 Fitting a robust beta-binomial model for the read depth data</title>
<p>We analysed the read count data for 24 exome samples from primary immunodeficiency patients (divided into two datasets,
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Table S1</ext-link>
and Section 3). An overview of a normalized measure of read depth [matching fragments per million reads and per kilobase, FPKM, (
<xref ref-type="bibr" rid="bts526-B8">Mortazavi
<italic>et al.</italic>
, 2008</xref>
),
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
A] showed extensive exon–exon variability. Inference of CNV status therefore cannot rely on the highly variable single-sample read count data. However, a comparison between pairs of exome datasets (
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
A) demonstrates the high level of correlation of the normalized read count data across samples (squared FPKM correlation coefficients 0.98–0.988 among 15 exomes in Dataset 1 and 0.72–0.987 for the 9 exomes in Dataset 2). It is therefore possible to use one exome or combine several exomes to construct a reference set on which to base the CNV inference. Initially, we analysed pairs of exomes and fitted a binomial model to the genome-wide distribution of read depth data for the reference and the test sample (see Section 3). For the purpose of parameter estimation (but not for subsequent CNV calling steps), we removed exons located in regions harbouring common CNVs (
<xref ref-type="bibr" rid="bts526-B2">Conrad
<italic>et al.</italic>
, 2010</xref>
) to limit the possibility that copy number variable regions increase the variance of the read count ratio. The outcomes of two representative comparisons between a pair of exomes from Dataset 1 and a pair of exomes from Dataset 2 are shown in
<xref ref-type="fig" rid="bts526-F1">Figure 1</xref>
B and C, respectively. The larger variance observed in
<xref ref-type="fig" rid="bts526-F1">Figure 1</xref>
C compared with 1B illustrates that the outcome of this analysis varies extensively, depending on how sequencing and capture were conducted.
<fig id="bts526-F1" position="float">
<label>Fig. 1.</label>
<caption>
<p>(
<bold>A</bold>
) Comparison of fragments per kilobase and per million reads (FPKM) between two exomes (FPKM squared correlation coefficient = 0.992). (
<bold>B</bold>
) Total read depth for two typical well-matched exomes (
<italic>y</italic>
-axis) as a function of the proportion of reads mapping to one of two exomes (
<italic>x</italic>
-axis). The red lines show the 99% confidence interval assuming the best fitting binomial distribution for the read count data. The blue lines show the same 99% confidence interval assuming the best fitting beta-binomial robust model for the same dataset. (
<bold>C</bold>
) Same as (
<bold>B</bold>
) but for two typical exomes that are poorly matched to each other. (
<bold>D</bold>
)
<inline-formula>
<inline-graphic xlink:href="bts526i1.jpg"></inline-graphic>
</inline-formula>
statistic (
<italic>x</italic>
-axis) and correlation between FPKM values (
<italic>y</italic>
-axis), both of them computed for each exome with its associated reference set</p>
</caption>
<graphic xlink:href="bts526f1p"></graphic>
</fig>
</p>
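<p>The normalized read depth and the between-sample correlation discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and toy values are hypothetical, and the total read count is approximated here by the sum of the per-exon counts.</p>
<preformat preformat-type="code">
```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of exon and per million reads.

    Assumption: the library size is taken as the sum of the per-exon counts.
    """
    total_millions = sum(counts) / 1e6
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]


def squared_correlation(a, b):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov * cov / (var_a * var_b)


# Toy data: four exons sequenced in two samples (hypothetical values).
lengths_bp = [150, 320, 90, 600]
sample_1 = [300, 800, 40, 2100]
sample_2 = [280, 850, 55, 1900]

r_squared = squared_correlation(fpkm(sample_1, lengths_bp),
                                fpkm(sample_2, lengths_bp))
print(round(r_squared, 3))
```
</preformat>
<p>A squared correlation close to 1 between the FPKM profiles of two exomes suggests that one can serve as a reference for the other, whereas lower values point to poorly matched samples.</p>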
<p>To quantify the variability between
<xref ref-type="fig" rid="bts526-F1">Figure 1</xref>
B and C, we defined the statistic
<inline-formula>
<inline-graphic xlink:href="bts526i2.jpg"></inline-graphic>
</inline-formula>
as the ratio between the standard errors of the beta-binomial model and the binomial model (Section 3). This statistic can be intuitively understood as the ratio between the typical distances separating the blue and red curves in
<xref ref-type="fig" rid="bts526-F1">Figure 1</xref>
B and C.</p>
<p>Our results show that a binomial model fails to properly capture the extensive variability in read count ratio across samples. Even in the best-case scenario of two well-matched exomes (red line in
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
B), 6.8% of the exons were located outside of the 99% confidence interval. When two exomes were poorly matched (red line in
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
C), a total of 23.2% of exons were outside of the 99% confidence interval. We therefore replaced this binomial model with a beta-binomial distribution (see Section 3) to account for the over-dispersion in read count ratio. We further modified the model to account for the observed correlation between depth of sequencing and the over-dispersion parameter (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S1</ext-link>
). This beta-binomial model significantly improved the fit (blue line in
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
B and C). The proportion of exons outside of the 99% confidence interval was reduced to 1.8% for the well-matched pair of exomes and to 2.3% for the poorly matched pair (blue lines in
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
B and C, respectively). To quantify this noise in the sequence data, we defined the statistic
<inline-formula>
<inline-graphic xlink:href="bts526i3.jpg"></inline-graphic>
</inline-formula>
as the ratio between the standard errors of the beta-binomial model and the binomial model. This statistic can be intuitively understood as the ratio between the typical distances separating the blue and red curves in
<xref ref-type="fig" rid="bts526-F1">Figure 1</xref>
B and C. For each sample in Datasets 1 and 2, we estimated the optimum reference set (see below for a description of this procedure) and computed the
<inline-formula>
<inline-graphic xlink:href="bts526i4.jpg"></inline-graphic>
</inline-formula>
statistic. For Dataset 1 typical values of
<inline-formula>
<inline-graphic xlink:href="bts526i5.jpg"></inline-graphic>
</inline-formula>
varied between 1.5 and 2, depending on the sample, with an average value of 1.62. For Dataset 2, typical values of
<inline-formula>
<inline-graphic xlink:href="bts526i6.jpg"></inline-graphic>
</inline-formula>
varied between 2 and 4.5 (average: 2.76). The
<inline-formula>
<inline-graphic xlink:href="bts526i7.jpg"></inline-graphic>
</inline-formula>
statistic measures the correlations across samples and can be well approximated using the squared pairwise correlation coefficient of FPKM values between the test sample and its associated reference set (
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
D). We hypothesized that some of the differences between samples might be explained by a differential effect of the DNA sequence GC content on capture and sequencing efficiency. We therefore added GC content (as a percentage) as a covariate in the regression analysis. The noise reduction was consistent but relatively limited: in Dataset 1, the average
<inline-formula>
<inline-graphic xlink:href="bts526i8.jpg"></inline-graphic>
</inline-formula>
decreased from 1.62 to 1.59. In Dataset 2 the average
<inline-formula>
<inline-graphic xlink:href="bts526i9.jpg"></inline-graphic>
</inline-formula>
decreased from 2.76 to 2.45.</p>
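<p>As a rough numerical illustration of this statistic, the sketch below computes a per-exon ratio of standard errors under one common parameterization of the beta-binomial distribution, in which the variance of a count with n reads and mean proportion p is n p (1 − p)[1 + (n − 1)φ]. This parameterization, the function name se_ratio and the input values are assumptions made for illustration only; how per-exon ratios are summarized into a single per-sample value is not specified in this section.</p>
<preformat preformat-type="code">
```python
import math


def se_ratio(n_reads, phi):
    """Ratio of beta-binomial to binomial standard errors for one exon.

    Assumes Var = n * p * (1 - p) * (1 + (n - 1) * phi), so the ratio is
    sqrt(1 + (n - 1) * phi) and does not depend on p.
    """
    if not (n_reads >= 1 and 1.0 > phi >= 0.0):
        raise ValueError("need n_reads >= 1 and phi in [0, 1)")
    return math.sqrt(1.0 + (n_reads - 1) * phi)


# Illustrative values only: 1100 reads in total and phi = 0.0015.
print(round(se_ratio(1100, 0.0015), 2))
```
</preformat>
<p>Under this assumed parameterization, the ratio equals 1 when φ = 0 (the purely binomial case) and grows with both the over-dispersion and the total read count of the exon.</p>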
</sec>
<sec id="SEC2.2">
<title>2.2 Power study and optimization of the reference exome set</title>
<p>The different levels of noise illustrated in
<xref ref-type="fig" rid="bts526-F1">Figure 1</xref>
B and C have large implications for the power to detect CNVs. We used a single-exon heterozygous deletion as a typical CNV and estimated the expected value of the posterior probability for this heterozygous deletion given different sets of parameters. We considered three scenarios:
<inline-formula>
<inline-graphic xlink:href="bts526i10.jpg"></inline-graphic>
</inline-formula>
= 1 (absence of any technical bias),
<inline-formula>
<inline-graphic xlink:href="bts526i11.jpg"></inline-graphic>
</inline-formula>
= 1.6 (typical of Dataset 1) and
<inline-formula>
<inline-graphic xlink:href="bts526i12.jpg"></inline-graphic>
</inline-formula>
= 2.5 (typical of Dataset 2).</p>
<p>The construction of the optimum reference set is implemented in the select.reference.set function of the ExomeDepth package. Briefly, for each test exome we rank the remaining samples by order of correlation with the test exome. Samples are then added sequentially to the aggregate reference set. At each iteration we fit our robust model and compute the expected value of the posterior probability in favour of a single-exon heterozygous deletion call. Samples are no longer added to the aggregate reference once this posterior probability stops increasing. This optimization is essentially a trade-off between limiting the variance (by increasing the size of the reference set) and increasing the bias (by adding exome samples that are less well correlated with the test exome). In Dataset 1 (
<xref ref-type="fig" rid="bts526-F2">Fig. 2</xref>
A) we found that the optimum size of the reference set was ∼10. In several instances adding further samples in the reference set actually decreased the power.
<fig id="bts526-F2" position="float">
<label>Fig. 2.</label>
<caption>
<p>Power study showing the expected posterior probability for a heterozygous deletion call. (
<bold>A</bold>
) Expected value of the posterior probability (averaged over all exons) for the 15 exomes in Dataset 1 as a function of the (test:reference) read count ratio (which is closely approximated by the number of exomes in the aggregate reference set). Each line shows a different test exome sample and the most correlated exome is added to the reference at each step. (
<bold>B</bold>
) The expected number of reads that would be mapping to a normal copy number exon varies (along the x-axis) but the (reference:test) sequencing depth remains constant at 10 (i.e. the reference set approximately consists of an aggregate of 10 exomes). Other parameters, including the level of correlations between test and reference exome, are kept constant. Power estimates assume a typical exome from each of the two Datasets 1 and 2 (the median value of the posterior probability is shown). (
<bold>C</bold>
), (
<bold>D</bold>
) The number of exomes in the aggregate reference set varies but the expected number of reads mapping to a normal copy number exon for the test sample is set to 100 (C) and 200 (D). For (B), (C) and (D), the black line refers to an optimum dataset in the absence of sample-to-sample technical variability (
<inline-formula>
<inline-graphic xlink:href="bts526i13.jpg"></inline-graphic>
</inline-formula>
= 1), longer dash to the typical dispersion parameter estimated from Dataset 1 (
<italic>R</italic>
<sub>s</sub>
= 1.6) and shorter dash for the typical dispersion parameter estimated from Dataset 2 (
<inline-formula>
<inline-graphic xlink:href="bts526i14.jpg"></inline-graphic>
</inline-formula>
= 2.5)</p>
</caption>
<graphic xlink:href="bts526f2p"></graphic>
</fig>
</p>
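<p>The selection procedure described above can be sketched as follows. This is a simplified illustration in Python and not the code of select.reference.set: the names pearson and select_reference are hypothetical, and the score argument is a placeholder for the model fit that returns the expected posterior probability of a single-exon heterozygous deletion.</p>
<preformat preformat-type="code">
```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_reference(test, candidates, score):
    """Greedy sketch of the reference set construction.

    test       -- per-exon read counts for the test exome
    candidates -- dict mapping sample name to per-exon read counts
    score      -- hypothetical callable(test, aggregate) standing in for
                  the expected posterior probability of a deletion call
    """
    # Rank candidate samples from most to least correlated with the test.
    ranked = sorted(candidates,
                    key=lambda s: pearson(test, candidates[s]),
                    reverse=True)
    aggregate = [0] * len(test)
    chosen, best = [], -math.inf
    for name in ranked:
        trial = [a + c for a, c in zip(aggregate, candidates[name])]
        value = score(test, trial)
        if best >= value:
            break  # the score stopped increasing: keep the previous set
        aggregate, best = trial, value
        chosen.append(name)
    return chosen, best
```
</preformat>
<p>Because the loop stops at the first candidate that fails to improve the score, it returns the first local optimum along the ranked list; evaluating every prefix and keeping the best one would be an equally plausible design.</p>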
<p>
<xref ref-type="fig" rid="bts526-F2">Figure 2</xref>
B investigates the role of read depth on the power to detect a CNV. For Datasets 1 and 2, we selected the optimum reference sets and extracted the parameters associated with this fit. We investigated the effect of read depth by changing the expected number of reads that map to the heterozygous deleted exon in the test sample. In contrast with
<xref ref-type="fig" rid="bts526-F2">Figure 2</xref>
A, this computation holds the
<inline-formula>
<inline-graphic xlink:href="bts526i15.jpg"></inline-graphic>
</inline-formula>
parameter constant, i.e. we assume that all additional exome samples are similarly correlated with the test exome. It therefore only considers the role of read depth and not the added complexity of adding exome samples that are not necessarily as well correlated with the test exome. This analysis showed strong differences between Datasets 1 and 2. In Dataset 1, 300 reads mapping to an exon in the test sample were sufficient to provide complete power, whereas for Dataset 2, the posterior probability in favour of the deletion call never exceeded 30%, even with more than 500 reads. This result indicates that an increase in the read depth cannot compensate for low levels of correlation between the exomes. Using 100-bp paired-end reads, and assuming a 500-bp exon, 300 mapping reads amount to an average read depth of ∼100. This read depth would need to be twice as large (i.e. 200) if the exon were only 250 bp long. To provide a more general boundary for the size of the reference set, we investigated in
<xref ref-type="fig" rid="bts526-F2">Figure 2</xref>
C and D the behaviour of the power estimates as the size of the reference set increases. As for
<xref ref-type="fig" rid="bts526-F2">Figure 2</xref>
B, we used the optimum parameters estimated in
<xref ref-type="fig" rid="bts526-F2">Figure 2</xref>
A for the median samples in Datasets 1 and 2 and kept the
<inline-formula>
<inline-graphic xlink:href="bts526i16.jpg"></inline-graphic>
</inline-formula>
parameter constant. Hence, no bias is created by adding less correlated exome samples and only the effect of the size of the reference set is evaluated. We considered two scenarios of moderate and high read depth (
<xref ref-type="fig" rid="bts526-F2">Fig. 2</xref>
C and D). With these assumptions, while the power keeps increasing, the increase becomes slow once the reference:test ratio reaches a value of 10. This result suggests that very large reference sets would provide only limited increase in power to detect heterozygous deletions. In all tested scenarios, the difference between the power curves estimated from the bias-free model (black line in
<xref ref-type="fig" rid="bts526-F2">Fig. 2</xref>
B–D) and the estimates in either of the datasets (red and blue lines) was large. Datasets 1 and 2 also showed a substantial difference in power (
<xref ref-type="fig" rid="bts526-F2">Fig. 2</xref>
C and D). These observations illustrate an important effect of variability between individual exomes in a dataset, which is captured by the
<inline-formula>
<inline-graphic xlink:href="bts526i17.jpg"></inline-graphic>
</inline-formula>
statistic, on the power to detect CNVs. Note that the power to detect single-exon heterozygous deletions in Dataset 2 remains very low and would not reach 1 in any realistic scenario, because the level of noise is too high. Therefore, in Dataset 2 CNVs need to overlap multiple exons to be detectable. Power estimates for heterozygous duplications were much lower than for heterozygous deletions (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S2</ext-link>
), a difference also commonly observed for all array-based CNV assays. Both datasets lacked power to detect single-exon heterozygous duplications (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S2A–D</ext-link>
), but larger duplications could be identified. For example, in Dataset 1, and to some extent in Dataset 2, a three-exon heterozygous duplication typically can be detected (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S2E–H</ext-link>
). Homozygous deletions are naturally easier to detect than heterozygous deletions. Although quantifying the power is complicated by the arbitrary parameterization of the background level of mapping reads, we found that under most realistic assumptions, an expected number of reads greater than 30 mapping to an exon in the test sample was sufficient to identify a homozygous deletion in either of the two datasets.</p>
</sec>
<sec id="SEC2.3">
<title>2.3 Characteristics of CNVs and comparison with other tools</title>
<p>The probability for the hidden Markov chain to enter a cn ≠ 2 state sets the sensitivity/specificity balance of ExomeDepth. We parameterized this value using the expected number of CNV calls for an exome sequence (Section 3 and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S3</ext-link>
). We found the total number of CNV calls to be relatively stable over the range of parameters considered (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S3</ext-link>
), only increasing sharply for a prior expectation of 1000 CNV calls genome-wide. The ExomeDepth default parameter uses a relatively stringent prior expectation of 20 CNV calls per exome sequence. With this choice, ExomeDepth called a median number of 213 CNVs per sample in Dataset 1, including 62.3% deletions. Consistent with the more limited power in Dataset 2, ExomeDepth identified a lower median number of 177 CNV calls in this dataset (62.9% of them deletions), in spite of a 31.5% larger target region (50 Mb versus 38 Mb). CNVs called by ExomeDepth in Datasets 1 and 2 included a median number of five exons and a median length of 10.6 kb. About 10% of the CNV calls were longer than 100 kb and 0.1% were longer than 500 kb.</p>
<p>Comparison with other algorithms is complicated by the absence of a ‘gold standard’ dataset and the bias inherent to CNV calls in available CNV databases. Nevertheless, the majority of CNVs called in our data should be present in a large-scale database such as the Database of Genomic Variants (DGV) (
<xref ref-type="bibr" rid="bts526-B15">Zhang
<italic>et al.</italic>
, 2006</xref>
). We defined a CNV as previously reported in DGV if a CNV listed in DGV overlaps more than 50% of our CNV call (after excluding DGV CNV calls larger than 500 kb). We found that 13.5% of CNVs in Dataset 1 and 20% of CNVs in Dataset 2 were absent from DGV (
<xref ref-type="table" rid="bts526-T1">Table 1</xref>
).
<table-wrap id="bts526-T1" position="float">
<label>Table 1.</label>
<caption>
<p>Comparison between our package (ExomeDepth) and two other tools: exomeCopy and ExomeCNV</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">ExomeDepth</th>
<th rowspan="1" colspan="1">exomeCopy</th>
<th rowspan="1" colspan="1">ExomeCNV</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">
<bold>Dataset 1 (
<italic>n</italic>
= 15)</bold>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median number of CNVs</td>
<td rowspan="1" colspan="1">213</td>
<td rowspan="1" colspan="1">495</td>
<td rowspan="1" colspan="1">2256</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Percentage in DGV</td>
<td rowspan="1" colspan="1">86.5</td>
<td rowspan="1" colspan="1">67.8</td>
<td rowspan="1" colspan="1">16.3</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median CNV size (kb)</td>
<td rowspan="1" colspan="1">8.9</td>
<td rowspan="1" colspan="1">1.83</td>
<td rowspan="1" colspan="1">0.16</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median CNV size (exons)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<bold>Dataset 2 (
<italic>n</italic>
= 9)</bold>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median number of CNVs</td>
<td rowspan="1" colspan="1">177</td>
<td rowspan="1" colspan="1">1228</td>
<td rowspan="1" colspan="1">11 046</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Percentage in DGV</td>
<td rowspan="1" colspan="1">80</td>
<td rowspan="1" colspan="1">36.9</td>
<td rowspan="1" colspan="1">26.6</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median CNV size (kb)</td>
<td rowspan="1" colspan="1">12.2</td>
<td rowspan="1" colspan="1">10.04</td>
<td rowspan="1" colspan="1">0.26</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median CNV size (exons)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<bold>1000 Genomes (
<italic>n</italic>
= 12)</bold>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median number of CNVs</td>
<td rowspan="1" colspan="1">246</td>
<td rowspan="1" colspan="1">641</td>
<td rowspan="1" colspan="1">5261</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Percentage in DGV</td>
<td rowspan="1" colspan="1">66</td>
<td rowspan="1" colspan="1">37.2</td>
<td rowspan="1" colspan="1">34.2</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median CNV size (kb)</td>
<td rowspan="1" colspan="1">1.7</td>
<td rowspan="1" colspan="1">9.75</td>
<td rowspan="1" colspan="1">0.34</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median CNV size (exons)</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Percentage of known CNVs found</td>
<td rowspan="1" colspan="1">75.2</td>
<td rowspan="1" colspan="1">52.8</td>
<td rowspan="1" colspan="1">41.2</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="bts526-TF1">
<p>We define a CNV called from exome data as ‘in DGV’ (or a ‘known CNV’ in the 1000 Genome analysis) when the CNV in the database overlaps >50% of our CNV call.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
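<p>The overlap criterion used for Table 1 can be written down in a few lines. The sketch below assumes half-open (start, end) coordinates on a single chromosome and tests each database CNV individually rather than the union of all database CNVs; both choices, as well as the function names, are assumptions made for illustration.</p>
<preformat preformat-type="code">
```python
def overlap_fraction(call, known):
    """Fraction of the interval `call` that is covered by `known`.

    Both arguments are (start, end) tuples on the same chromosome,
    treated as half-open intervals with end greater than start.
    """
    start = max(call[0], known[0])
    end = min(call[1], known[1])
    return max(0, end - start) / (call[1] - call[0])


def is_known(call, database, min_fraction=0.5, max_known_size=500_000):
    """True if a single database CNV covers more than min_fraction of call.

    Database CNVs longer than max_known_size are ignored, mirroring the
    exclusion of DGV entries larger than 500 kb.
    """
    return any(
        overlap_fraction(call, known) > min_fraction
        for known in database
        if max_known_size >= known[1] - known[0]
    )
```
</preformat>
<p>With these definitions, a call at (100, 200) counts as known for a database entry at (140, 400), which covers 60% of the call, but not for an entry at (160, 400), which covers only 40%.</p>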
<p>To estimate the false negative rate, we used an additional dataset of 12 high-depth exome samples (1000 Genomes Project) for which an independent experiment generated CNV calls using a high-density Nimblegen CGH array [(
<xref ref-type="bibr" rid="bts526-B2">Conrad
<italic>et al.</italic>
, 2010</xref>
) and Section 3]. Combining all 12 samples, the aCGH experiment identified 1344 exonic CNV calls (303 unique calls).
<xref ref-type="bibr" rid="bts526-B2">Conrad
<italic>et al.</italic>
(2010)</xref>
estimate that 40% of CNVs can be genotyped with their experimental design. The 1344 calls across 12 samples correspond to 112 calls per sample, which at a 40% genotyping rate translates into an approximate expected number of 280 exonic CNV calls per sample, broadly consistent with our findings.</p>
<p>To compare our algorithm with existing tools, we first tested ExomeCNV (
<xref ref-type="bibr" rid="bts526-B11">Sathirapongsasuti
<italic>et al.</italic>
, 2011</xref>
). Its underlying model assumes that the distribution of read count ratio between the test and reference exome is Gaussian and a similar assumption is made by CNV-Seq (
<xref ref-type="bibr" rid="bts526-B12">Xie and Tammi, 2009</xref>
). Second, we tested exomeCopy (
<xref ref-type="bibr" rid="bts526-B5">Love
<italic>et al.</italic>
, 2011</xref>
), which uses a negative binomial model that is related to our beta-binomial approach. In each case, we followed the methods suggested by these publications and we used the suggested default parameters. Venn diagrams summarizing the overlap between calling algorithms are shown in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Figure S4</ext-link>
.</p>
<p>Comparison between these three tools using a dataset of 12 exomes from 1000 Genomes and the two datasets from this study highlighted a clear trend. First, ExomeDepth is more conservative than the other tools, with the median number of CNV calls between 177 and 246 per exome, whereas exomeCopy and ExomeCNV called numerous additional CNVs (
<xref ref-type="table" rid="bts526-T1">Table 1</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S4</ext-link>
). Second, ExomeDepth detected 75.2% of the known exonic CNVs in the 12 exome samples from the 1000 Genomes Project (CNVs identified by the independent aCGH experiment). This was markedly higher than the fraction of CNVs identified by exomeCopy (52.8%) and ExomeCNV (41.2%;
<xref ref-type="table" rid="bts526-T1">Table 1</xref>
), indicating a higher sensitivity. Interestingly, the difference between our analysis and exomeCopy was more limited for Dataset 1, for which the exomes are better matched to each other, than for Dataset 2 and the 1000 Genomes dataset, consistent with the fact that the aggregate reference optimization step implemented in ExomeDepth is more helpful when the variability across samples is larger.</p>
</sec>
<sec id="SEC2.4">
<title>2.4 Discovery of two novel and likely disease-causing deletions in the
<italic>GATA2</italic>
and
<italic>DOCK8</italic>
genes in patients with primary immunodeficiency</title>
<p>We then investigated whether any of the newly discovered rare CNVs in our data affects genes previously implicated in primary immunodeficiencies. In patient P1, ExomeDepth identified a heterozygous deletion of the consecutive exons 6 and 7 of the
<italic>GATA2</italic>
gene with a read count ratio below the 0.5% quantile (
<xref ref-type="fig" rid="bts526-F3">Fig. 3</xref>
A). Independently, each exon would yield a posterior probability for the deletion call of 77% (exon 7) and 15% (exon 6). The combined CNV call has a posterior probability >99.9%. We then designed a custom CGH array (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Data</ext-link>
) containing 26 probes in the
<italic>GATA2</italic>
gene region. In this patient we validated a heterozygous deletion of 6 kb that included
<italic>GATA2</italic>
exons 6 and 7 (
<xref ref-type="fig" rid="bts526-F3">Fig. 3</xref>
B). We then amplified the breakpoint region, sequenced it and mapped the exact boundaries of this 5797-bp deletion (
<xref ref-type="fig" rid="bts526-F3">Fig. 3</xref>
C; coordinates chr3:128,196,444–128,202,240). The clinical presentation for this patient was consistent with previous reports of heterozygous variants in the
<italic>GATA2</italic>
gene (
<xref ref-type="bibr" rid="bts526-B10">Ostergaard
<italic>et al.</italic>
, 2011</xref>
), indicating that this two-exon deletion is very likely to be causal for P1 (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Data</ext-link>
for clinical details).
<fig id="bts526-F3" position="float">
<label>Fig. 3.</label>
<caption>
<p>(
<bold>A</bold>
) Heterozygous deletion of exons 6 and 7 of the
<italic>GATA2</italic>
gene identified by ExomeDepth in the exome sequence data. The red crosses show the ratio of observed/expected number of reads for the test sample. The grey shaded region shows the estimated 99% confidence interval for this observed ratio in the absence of a CNV. The presence of two contiguous exons with a read count ratio located outside of the confidence interval is indicative of a heterozygous deletion in this sample. Independently, each exon would yield a posterior probability for the deletion call of 15% (exon 6) and 77% (exon 7). The combined CNV call has a posterior probability >99.9%. (
<bold>B</bold>
) Validation of a 6-kb deletion using a targeted array CGH (Agilent 15K format) containing 26 probes in the
<italic>GATA2</italic>
gene region. Each cross indicates a probe and red crosses indicate probes located in the region of a heterozygous deletion. (
<bold>C</bold>
) Sequencing of the deletion breakpoints identified the exact boundaries of this 5797-bp deletion overlapping exons 6 and 7 of the
<italic>GATA2</italic>
gene</p>
</caption>
<graphic xlink:href="bts526f3p"></graphic>
</fig>
</p>
<p>In another patient (P2), ExomeDepth identified a deletion of a single exon (exon 8) of the
<italic>DOCK8</italic>
gene (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S5A</ext-link>
). This CNV call had a posterior probability >99.99%. Complete absence of reads mapping to exon 8 is indicative of a homozygous deletion. We sequenced the deletion breakpoints and identified the exact boundaries of a 3197-bp deletion (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S5B</ext-link>
; coordinates chr9:323,591–326,787). The clinical presentation is consistent with previous reports involving homozygous mutations in the
<italic>DOCK8</italic>
gene (
<xref ref-type="bibr" rid="bts526-B16">Zhang
<italic>et al.</italic>
, 2009</xref>
) indicating that this variant is almost certainly causal (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Data</ext-link>
for clinical details).</p>
</sec>
</sec>
<sec id="SEC3">
<title>3 IMPLEMENTATION</title>
<sec id="SEC3.1">
<title>3.1 Primary immunodeficiency patients</title>
<p>We investigated 24 patients who suffered from severe and/or disseminated recurrent infections and had been diagnosed with primary immunodeficiencies (PIDs). Of these patients, 13 were of European descent and 11 were of Asian descent, of whom 5 originated from consanguineous families. All material from patients was obtained with informed consent from adults and from the parents of children who participated in the study, in accordance with the Declaration of Helsinki and with approval from the ethics committees (04/Q0501/119, amendment 2; 06/Q0508/16 and 10/H0906/22).</p>
</sec>
<sec id="SEC3.2">
<title>3.2 Exome data</title>
<p>We isolated DNA samples from blood or peripheral blood mononuclear cells (PBMCs). Exome sequence data were generated in two batches. Dataset 1 comprises exomes from 15 patients and Dataset 2 comprises exomes from 9 patients. For exome target enrichment, Agilent SureSelect 38 Mb and 50 Mb kits were used for samples in Datasets 1 and 2, respectively. Samples in both datasets were sequenced using Illumina HiSeq with 94-bp paired-end reads. Reads were aligned to the hg19 reference sequence using the software novoalign (
<ext-link ext-link-type="uri" xlink:href="www.novocraft.com">www.novocraft.com</ext-link>
). Single sample summary statistics are provided in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Table S1</ext-link>
.</p>
</sec>
<sec id="SEC3.3">
<title>3.3 Read count computation</title>
<p>Exon locations were defined using the Ensembl database release 57 (human genome build hg19). We only considered autosomes (chr1–22) to avoid the additional complication of the sex chromosomes (for which male and female samples would need to be analysed separately). Analysis of the GC content for each exon uses the same Ensembl release and the Perl Ensembl API tool. We used the R package Rsamtools to extract the read count information from the individual BAM files. All reads were paired-end. We only included consistently paired reads (i.e. both reads located less than 1000 bp from each other and in the correct orientation) with Phred-scaled mapping quality ≥20. The location of each read pair was defined as the midpoint between the extreme ends of the two reads. Exons closer than 50 bp were merged into a single location owing to the inability to properly separate reads mapping to either of them. After merging close exons, we considered a total of 229 056 autosomal exons.</p>
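<p>The exon merging and read pair assignment steps can be illustrated with a short sketch. It is written in Python with plain tuples rather than with Rsamtools, and the function names and the inclusive (start, end) coordinate convention are assumptions made for the example; filtering on pairing, orientation and mapping quality is assumed to have been done beforehand.</p>
<preformat preformat-type="code">
```python
def merge_close_exons(exons, min_gap=50):
    """Merge exons on one chromosome that are closer than min_gap bp.

    exons -- iterable of (start, end) tuples with end >= start
    """
    merged = []
    for start, end in sorted(exons):
        if merged and min_gap > start - merged[-1][1]:
            # Too close to the previous exon: extend it instead.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fragment_midpoint(read1, read2):
    """Midpoint between the extreme ends of a read pair.

    Each read is a (start, end) tuple on the same chromosome.
    """
    left = min(read1[0], read2[0])
    right = max(read1[1], read2[1])
    return (left + right) // 2


def count_fragments(exons, midpoints):
    """Number of fragment midpoints falling inside each exon."""
    return [sum(1 for m in midpoints if end >= m >= start)
            for start, end in exons]
```
</preformat>
<p>For example, exons at (100, 200) and (230, 300) are 30 bp apart and would be merged into (100, 300), whereas an exon at (400, 500) would be kept separate.</p>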
</sec>
<sec id="SEC3.4">
<title>3.4 Statistical model</title>
<p>We denote the exonic read count X for the test sample and Y for the aggregate reference. Assuming that the distribution of the read count ratio
<inline-formula>
<inline-graphic xlink:href="bts526i18.jpg"></inline-graphic>
</inline-formula>
is determined only by the relative read depth of the test and reference samples, an appropriate model is binomial, with the probability that a random read is assigned to the test sample given by:
<disp-formula id="bts526-M1">
<label>(1)</label>
<graphic xlink:href="bts526m1"></graphic>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="bts526i19.jpg"></inline-graphic>
</inline-formula>
is the probability that a random read belongs to the test sample (rather than the reference). The intercept parameter
<inline-formula>
<inline-graphic xlink:href="bts526i20.jpg"></inline-graphic>
</inline-formula>
is estimated separately for each test sample. GC refers to the GC content. The index
<italic>i</italic>
denotes the exon and the covariate
<inline-formula>
<inline-graphic xlink:href="bts526i21.jpg"></inline-graphic>
</inline-formula>
relates to the copy number status for the exon
<italic>i</italic>
: the proportion of reads mapping to the test sample for deletions/duplications is computed based on the expected proportion for normal copy number and assuming a read ratio of 0.5 (for a deletion) or 1.5 (for a duplication). A motivation for our work is the fact that this binomial model does not fully capture sample specific biases. We propose instead the robust beta-binomial model (
<xref ref-type="bibr" rid="bts526-B1">Agresti, 2002</xref>
):
<disp-formula id="bts526-M2">
<label>(2)</label>
<graphic xlink:href="bts526m2"></graphic>
</disp-formula>
<disp-formula id="bts526-M3">
<label>(3)</label>
<graphic xlink:href="bts526m3"></graphic>
</disp-formula>
where the over-dispersion parameter
<inline-formula>
<inline-graphic xlink:href="bts526i24.jpg"></inline-graphic>
</inline-formula>
is numerically estimated from the read count data. Under this model, the mean of the beta-binomial variable
<italic>X</italic>
remains unchanged but its variance becomes
<inline-formula>
<inline-graphic xlink:href="bts526i25.jpg"></inline-graphic>
</inline-formula>
where
<inline-formula>
<inline-graphic xlink:href="bts526i26.jpg"></inline-graphic>
</inline-formula>
, adding to the binomial variance an additional over-dispersion term. Last, the addition of GC content to the model contributes to predicting individual specific biases.</p>
<p>However, an analysis of the data showed that a single
<inline-formula>
<inline-graphic xlink:href="bts526i27.jpg"></inline-graphic>
</inline-formula>
typically could not fully summarize the read count variance over the full range of read depth (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Fig. S1</ext-link>
). We therefore modified the model to allow the parameter
<inline-formula>
<inline-graphic xlink:href="bts526i28.jpg"></inline-graphic>
</inline-formula>
to take different values depending on the total read count. We used a linear extrapolation to combine these estimates over the full range of read depth. The number of intervals for the read count data is set to two by default and can be modified by the user.
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Figure S1</ext-link>
describes this fitting process in more detail.</p>
</sec>
<sec id="SEC3.5">
<title>3.5 Numerical estimation</title>
<p>We fitted the binomial logistic model described in
<xref ref-type="disp-formula" rid="bts526-M1">Equation (1</xref>
) using the glm function in R. We fitted the beta-binomial model described in
<xref ref-type="disp-formula" rid="bts526-M2">Equations (2)</xref>
and
<xref ref-type="disp-formula" rid="bts526-M3">(3)</xref>
, including the over-dispersion parameter
<inline-formula>
<inline-graphic xlink:href="bts526i29.jpg"></inline-graphic>
</inline-formula>
, by maximum likelihood as implemented in the R package aod. The procedure estimates, for each exon, an expected read ratio
<inline-formula>
<inline-graphic xlink:href="bts526i30.jpg"></inline-graphic>
</inline-formula>
and a genome-wide over-dispersion parameter
<inline-formula>
<inline-graphic xlink:href="bts526i31.jpg"></inline-graphic>
</inline-formula>
. This parameter estimation is done assuming cn = 2 for all exons. In a second step, and for each exon, covariates for deletions/duplications are added to estimate the likelihood of the read count data for the scenarios cn = 1 and 3. In the beta-binomial models expressed in
<xref ref-type="disp-formula" rid="bts526-M2">Equations (2)</xref>
and
<xref ref-type="disp-formula" rid="bts526-M3">(3)</xref>
, the beta-binomial distribution is usually parameterized using two parameters
<italic>a</italic>
and
<italic>b</italic>
[mean
<inline-formula>
<inline-graphic xlink:href="bts526i32.jpg"></inline-graphic>
</inline-formula>
and variance
<inline-formula>
<inline-graphic xlink:href="bts526i33.jpg"></inline-graphic>
</inline-formula>
]. The regression formulation of the model described above links to this distribution with
<inline-formula>
<inline-graphic xlink:href="bts526i34.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="bts526i35.jpg"></inline-graphic>
</inline-formula>
. Prior to fitting these models (
<xref ref-type="fig" rid="bts526-F1">Fig. 1</xref>
B and C), we removed exons located in regions harbouring common CNVs (
<xref ref-type="bibr" rid="bts526-B2">Conrad
<italic>et al.</italic>
, 2010</xref>
) to limit the possibility that such CNVs significantly increase the variance of the read count ratio.</p>
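<p>The link between the regression parameters and the two shape parameters can be sketched as follows. This is a minimal Python illustration, not the aod implementation: it assumes the usual correspondence p = a / (a + b) and phi = 1 / (a + b + 1), and it assumes that a deletion or duplication acts by multiplying the odds of a read coming from the test sample by 0.5 or 1.5. All function names are hypothetical.</p>
<preformat>
```python
import math


def to_shape_parameters(p, phi):
    """Convert (expected ratio p, over-dispersion phi) to beta shapes (a, b)."""
    scale = (1.0 - phi) / phi  # equals a + b under the assumed convention
    return p * scale, (1.0 - p) * scale


def log_beta(x, y):
    """Logarithm of the beta function."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_logpmf(k, n, p, phi):
    """Log-probability of k test reads out of n total reads."""
    a, b = to_shape_parameters(p, phi)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def shift_ratio(p, copy_ratio):
    """Expected read ratio when the test sample carries copy_ratio times
    the normal number of copies (0.5 for cn = 1, 1.5 for cn = 3)."""
    return copy_ratio * p / (copy_ratio * p + (1.0 - p))
```
</preformat>
<p>Under these assumptions, the likelihood of one exon under the cn = 1 scenario would be beta_binomial_logpmf(k, n, shift_ratio(p, 0.5), phi), with p and phi taken from the fit that assumed cn = 2.</p>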
</sec>
<sec id="SEC3.6">
<title>3.6 Hidden Markov chain and choice of prior probabilities</title>
<p>For each exon, our beta-binomial model generates a likelihood value under three distinct scenarios (deletion, normal copy number and duplication). To combine the likelihoods across multiple exons, we used a hidden Markov model. Each step of the hidden Markov chain corresponds to one exon in the human genome. This model serves the dual purpose of merging CNV calls across exons and specifying a prior probability of observing a CNV for each exon. This prior probability is encoded in the transition probability of the hidden Markov chain between the normal copy number state (cn = 2) and either of the copy number variable states (cn = 1 or cn = 3). We parameterized this prior using the expected number <italic>n<sub>e</sub></italic> of CNVs
<italic>a-priori</italic>
, i.e. the probability of transitioning from cn = 2 to cn = 1 or cn = 3 is
<inline-formula>
<inline-graphic xlink:href="bts526i36.jpg"></inline-graphic>
</inline-formula>
where
<italic>n</italic>
= 229 056 is the total number of exons. We set the default model such that, from the hidden state cn = 2, the probability of moving into a deletion state equals the probability of moving into a duplication state. In a deletion/duplication state, the underlying Markov chain has a default probability of 0.5 of reverting to cn = 2 and a probability of 0.5 of remaining in the same deletion/duplication state. To provide a set of calls for each sample, we use the Viterbi algorithm, which returns the most likely sequence of hidden states. Each version of our statistical models in
<xref ref-type="disp-formula" rid="bts526-M1 bts526-M2 bts526-M3">Equations (1–3)</xref>
generates a likelihood under the cn = 1, 2 and 3 scenarios. For CNVs with a lower (cn = 0 for a homozygous deletion) or higher (cn
<inline-formula>
<inline-graphic xlink:href="bts526i37.jpg"></inline-graphic>
</inline-formula>
4) number of DNA copies, the model with hidden states cn = 1, 2 and 3 rejects the null with added confidence compared with the simpler scenarios cn = 1 or cn = 3. Hence, we found no benefit in considering copy number states beyond cn = 1, 2 and 3. Rather, we estimate the copy number state from the read count ratio after the CNV is detected. Importantly, copy number is always estimated relative to the reference, and the absolute value cannot be estimated by this procedure.</p>
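<p>The three-state chain and the Viterbi decoding described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the ExomeDepth code: the total probability of leaving cn = 2 is taken to be the expected number of CNVs divided by the number of exons and is split equally between deletion and duplication, direct deletion-to-duplication moves are disallowed, and the chain is assumed to start from the normal state. Function and variable names are hypothetical.</p>
<preformat>
```python
import math

STATES = ("deletion", "normal", "duplication")


def transition_log_probs(expected_cnvs, n_exons):
    """Log transition matrix; leaving cn = 2 has total probability
    expected_cnvs / n_exons, split equally (an assumption)."""
    t = expected_cnvs / n_exons
    probs = {
        "normal": {"deletion": t / 2, "normal": 1.0 - t, "duplication": t / 2},
        "deletion": {"deletion": 0.5, "normal": 0.5, "duplication": 0.0},
        "duplication": {"deletion": 0.0, "normal": 0.5, "duplication": 0.5},
    }
    return {
        src: {dst: (math.log(q) if q else -math.inf) for dst, q in row.items()}
        for src, row in probs.items()
    }


def viterbi(exon_log_likelihoods, log_trans):
    """Most likely state path; each exon is a dict mapping state to
    log-likelihood. The chain is assumed to enter from the normal state."""
    score = {s: (0.0 if s == "normal" else -math.inf) for s in STATES}
    back = []
    for emit in exon_log_likelihoods:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: score[r] + log_trans[r][s])
            new_score[s] = score[best_prev] + log_trans[best_prev][s] + emit[s]
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    # back[0] points at the virtual start state, so it is skipped.
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```
</preformat>
<p>Because entering a CNV state is costly and leaving it is cheap, weak evidence at a single exon tends to be absorbed by the normal state, whereas consistent evidence across adjacent exons is merged into one call.</p>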
</sec>
<sec id="SEC3.7">
<title>3.7 Power estimates</title>
<p>Owing to the widespread interest in discovering loss-of-function variants, we considered a heterozygous deletion as a typical scenario for our power estimations (
<xref ref-type="fig" rid="bts526-F2">Fig. 2</xref>
). The heterozygous duplication case is considered in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts526/DC1">Supplementary Figure S2</ext-link>
. Following the parameter estimation step, we can generate, for each CNV and given the test and reference read count data <italic>X</italic> and <italic>Y</italic>, a Bayes factor BF =
<inline-formula>
<inline-graphic xlink:href="bts526i38.jpg"></inline-graphic>
</inline-formula>
. Using our selected prior distribution (
<italic>P</italic>
=
<inline-formula>
<inline-graphic xlink:href="bts526i39.jpg"></inline-graphic>
</inline-formula>
of observing a CNV, i.e. corresponding to an expected number of 20 CNV calls genome-wide), we compute the expected value of the posterior distribution for the CNV call.</p>
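<p>The step from a Bayes factor to a posterior probability can be illustrated with a short Python sketch. It assumes that the Bayes factor is the likelihood of the read counts under the CNV scenario divided by the likelihood under cn = 2, and it leaves the prior as an argument; the function name is hypothetical.</p>
<preformat>
```python
def posterior_cnv_probability(bayes_factor, prior):
    """Posterior probability of a CNV from a Bayes factor and a prior.

    Assumes bayes_factor = P(data | CNV) / P(data | no CNV).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```
</preformat>
<p>A Bayes factor of 1 leaves the prior unchanged, and with a small prior the Bayes factor must be roughly as large as the inverse of the prior before the posterior probability reaches 0.5.</p>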
</sec>
<sec id="SEC3.8">
<title>3.8 Choice of samples for the comparative analysis</title>
<p>We downloaded from the 1000 Genomes Project a dataset of 12 high-depth exome samples generated using the SOLiD sequencing technology (single-end reads), for which an independent experiment generated CNV calls using a high-density NimbleGen CGH array (
<xref ref-type="bibr" rid="bts526-B2">Conrad
<italic>et al.</italic>
, 2010</xref>
). These 12 samples are NA18502, NA19099, NA19239, NA19240 (Yoruba) and NA06985, NA1199, NA11995, NA12004, NA12044, NA12156, NA12414, NA12489 (CEPH).</p>
</sec>
<sec id="SEC3.9">
<title>3.9 Parameters of ExomeCNV/exomeCopy in the comparative analysis</title>
<p>To apply ExomeCNV, we followed the instructions in the user guide, using 0.9999 as the threshold for sensitivity and specificity, with the software set to optimize specificity. For each test sample (labelled as the tumour sample in the ExomeCNV analysis), the aggregate reference set (labelled as normal) consisted of the remaining exomes from the same dataset. We then applied the classify.eCNV function with default parameters, setting the admixture rate to 0 (because we are not concerned with admixture of tumour DNA). Finally, multi.CNV.analyze was used to obtain the final list of merged calls from the list of exonic CNV calls. For exomeCopy, the read count data were computed using the R functions provided by this package. For each sample, the background noise was estimated on the basis of GC content and the read count data from the other exomes in the same dataset. We fitted the negative-binomial model and the hidden Markov chain using the steps recommended in the package vignette and the default parameters. For the exomeCopy analysis, following the recommendation to split long exons into smaller units (and therefore obtain more uniform numbers of reads in each bin) increased the number of CNV calls and decreased the concordance with DGV, strongly suggesting that this step did not improve the overall accuracy. We therefore used our set of exon-delimited regions to compute the read count.</p>
</sec>
</sec>
<sec id="SEC4">
<title>4 DISCUSSION</title>
<p>We have developed a novel CNV calling methodology using read depth information from exome sequence data and implemented it within an R package called ExomeDepth. This allowed us to maximize the statistical power to detect even small CNVs in the presence of the technical variability inherent to high-throughput DNA sequencing technologies. As a consequence, compared with other tools, we found the greatest improvement for datasets that show more technical variability across samples. Although ExomeDepth is designed to be used as standalone software, the construction of a reference set is a problem shared across all read depth CNV calling algorithms for exome data. Therefore, additional refinements proposed by other CNV calling algorithms to improve calling accuracy could potentially be applied after ExomeDepth has identified the optimum aggregate reference set.</p>
<p>Because ExomeDepth assumes that the CNV call is not present in the reference sample, it is best suited to calling rare CNVs. Nevertheless, our analysis of exome samples from the 1000 Genomes Project indicates that it can call common CNVs as well, even though some power is lost when the allele frequency is high. Our computations indicate that the power to detect rare CNVs is maximized for a reference:test ratio of ∼10:1. Hence, while we find no obvious benefit in using a very large dataset (> 100 exomes), it is essential to generate exome data in batches of six or more samples.</p>
<p>Our analysis shows that a Gaussian model for the read count data is not appropriate for exome sequence data, which is likely to explain the larger number of CNV calls generated by ExomeCNV. The discrepancy with exomeCopy is more surprising, because exomeCopy fits a robust negative binomial model related to our model. Additionally, in the exomeCopy analysis, we used the optimized reference set identified by the ExomeDepth analysis. The default parameters of the exomeCopy hidden Markov chain were also similar. Comparing both methods, we find that the main difference is that exomeCopy essentially takes a profile likelihood approach: it uses the median normalized read depth at each exon to account for the variability in exon capture efficiency, which is a nuisance parameter. However, the median read depth can be unreliable if the sample size is small and/or the data are noisier (e.g. in Dataset 2). Instead, in ExomeDepth we used a logistic model, which deals with that nuisance parameter by conditioning on the total read count for each exon. We hypothesize that exomeCopy would show results more similar to ours if the exon-specific parameters were estimated within the negative binomial model rather than prior to model fitting. However, the number of exons, and therefore the number of parameters to estimate, would be large, which makes this difficult in practice.</p>
<p>In our study the statistical power to detect CNVs varied extensively between two datasets and was largely determined by sample-to-sample variability within datasets that emerged either at the exome capture or at the sequencing step. This feature of the data cannot be detected by commonly used single-sample quality metrics: it is the correlations across samples rather than the single-sample summary statistic that are relevant. Owing to its essential role for CNV calling, we argue that a measure of sample-to-sample consistency (e.g. correlation between FPKM values) should be provided by sequencing facilities when exomes are analysed in sufficiently large batches.</p>
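<p>One way to compute such a consistency measure is sketched below in Python. It assumes the usual definition of FPKM (reads per kilobase of exon per million mapped reads, here with the per-sample total taken over the listed exons) and uses the Pearson correlation between two samples; the function names are hypothetical and the sketch is not part of ExomeDepth.</p>
<preformat>
```python
import math


def fpkm(read_counts, exon_lengths):
    """FPKM per exon: reads per kilobase of exon per million mapped reads."""
    total_reads = sum(read_counts)
    return [
        1e9 * count / (length * total_reads)
        for count, length in zip(read_counts, exon_lengths)
    ]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```
</preformat>
<p>Computing pearson(fpkm(counts_a, lengths), fpkm(counts_b, lengths)) for every pair of exomes in a batch gives a matrix of sample-to-sample correlations that could be reported alongside single-sample quality metrics.</p>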
<p>The discovery of the
<italic>GATA2</italic>
and
<italic>DOCK8</italic>
deletions illustrates the power of ExomeDepth to identify even small heterozygous CNVs comprising just one to two exons. We conclude that reducing technical variability between samples and using bioinformatics tools that maximize the statistical power of CNV detection, such as ExomeDepth, will allow efficient CNV identification and will increase the value of future exome sequencing experiments.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>Supplementary Data</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_28_21_2747__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts526_supplemental.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>We thank Dr Mike Potter, Consultant Haematologist, who looked after the patient during bone marrow transplantation.</p>
<p>
<italic>Funding</italic>
: S.N. is a
<funding-source>Wellcome Trust Senior Research Fellow in Basic Biomedical Science</funding-source>
(
<award-id>095198/Z/10/Z</award-id>
); the
<funding-source>Wellcome Trust</funding-source>
grant (
<award-id>088838/Z/09/Z</award-id>
), the
<funding-source>Royal Society Research</funding-source>
grant
<award-id>RG090638</award-id>
, the EU FP7 grant (
<award-id>261441</award-id>
) (PEVNET project) and the ERC Starting grant (
<award-id>260477</award-id>
to S.N.); the
<funding-source>NIHR Cambridge Biomedical Research Centre</funding-source>
(to D.K. and R.D.); the MRC research grant (
<award-id>G1001158</award-id>
) and the
<funding-source>NIHR Moorfields Biomedical Research Council</funding-source>
grant (to V.P.).</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="bts526-B1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Agresti</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Categorical data analysis</article-title>
<source>Wiley Series in Probability and Statistics</source>
<year>2002</year>
<comment>Chapter 13, 2nd edn. Wiley-Interscience, Hoboken, NJ, p. 553</comment>
</element-citation>
</ref>
<ref id="bts526-B2">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Conrad</surname>
<given-names>DF</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Origins and functional impact of copy number variation in the human genome</article-title>
<source>Nature</source>
<year>2010</year>
<volume>464</volume>
<fpage>704</fpage>
<lpage>712</lpage>
<pub-id pub-id-type="pmid">19812545</pub-id>
</element-citation>
</ref>
<ref id="bts526-B3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karakoc</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Detection of structural variants and indels within exome data</article-title>
<source>Nat. Methods</source>
<year>2011</year>
<volume>9</volume>
<fpage>176</fpage>
<lpage>178</lpage>
<pub-id pub-id-type="pmid">22179552</pub-id>
</element-citation>
</ref>
<ref id="bts526-B4">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krumm</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Copy number variation detection and genotyping from exome sequence data</article-title>
<source>Genome Res.</source>
<year>2012</year>
<volume>22</volume>
<fpage>1525</fpage>
<lpage>1532</lpage>
<pub-id pub-id-type="pmid">22585873</pub-id>
</element-citation>
</ref>
<ref id="bts526-B5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Love</surname>
<given-names>MI</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Modeling read counts for CNV detection in exome sequencing data</article-title>
<source>Stat. Appl. Genet. Mol. Biol.</source>
<year>2011</year>
<volume>10</volume>
</element-citation>
</ref>
<ref id="bts526-B6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marioni</surname>
<given-names>JC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization</article-title>
<source>Genome Biol.</source>
<year>2007</year>
<volume>8</volume>
<fpage>R228</fpage>
<pub-id pub-id-type="pmid">17961237</pub-id>
</element-citation>
</ref>
<ref id="bts526-B7">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Computational methods for discovering structural variation with next-generation sequencing</article-title>
<source>Nat. Methods</source>
<year>2009</year>
<volume>6</volume>
<issue>Suppl. 11</issue>
<fpage>S13</fpage>
<lpage>S20</lpage>
<pub-id pub-id-type="pmid">19844226</pub-id>
</element-citation>
</ref>
<ref id="bts526-B8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mortazavi</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Mapping and quantifying mammalian transcriptomes by RNA-Seq</article-title>
<source>Nat. Methods</source>
<year>2008</year>
<volume>5</volume>
<fpage>621</fpage>
<lpage>628</lpage>
<pub-id pub-id-type="pmid">18516045</pub-id>
</element-citation>
</ref>
<ref id="bts526-B9">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ng</surname>
<given-names>SB</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Exome sequencing identifies the cause of a mendelian disorder</article-title>
<source>Nat. Genet.</source>
<year>2010</year>
<volume>42</volume>
<fpage>30</fpage>
<lpage>35</lpage>
<pub-id pub-id-type="pmid">19915526</pub-id>
</element-citation>
</ref>
<ref id="bts526-B10">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ostergaard</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Mutations in GATA2 cause primary lymphedema associated with a predisposition to acute myeloid leukemia (Emberger syndrome)</article-title>
<source>Nat. Genet.</source>
<year>2011</year>
<volume>43</volume>
<fpage>929</fpage>
<lpage>931</lpage>
<pub-id pub-id-type="pmid">21892158</pub-id>
</element-citation>
</ref>
<ref id="bts526-B11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sathirapongsasuti</surname>
<given-names>JF</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>2648</fpage>
<lpage>2654</lpage>
<pub-id pub-id-type="pmid">21828086</pub-id>
</element-citation>
</ref>
<ref id="bts526-B12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xie</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Tammi</surname>
<given-names>MT</given-names>
</name>
</person-group>
<article-title>CNV-seq, a new method to detect copy number variation using high-throughput sequencing</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>80</fpage>
<pub-id pub-id-type="pmid">19267900</pub-id>
</element-citation>
</ref>
<ref id="bts526-B13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ye</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>2865</fpage>
<lpage>2871</lpage>
<pub-id pub-id-type="pmid">19561018</pub-id>
</element-citation>
</ref>
<ref id="bts526-B14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zeitouni</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>1895</fpage>
<lpage>1896</lpage>
<pub-id pub-id-type="pmid">20639544</pub-id>
</element-citation>
</ref>
<ref id="bts526-B15">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome</article-title>
<source>Cytogenet. Genome Res.</source>
<year>2006</year>
<volume>115</volume>
<fpage>205</fpage>
<lpage>214</lpage>
<pub-id pub-id-type="pmid">17124402</pub-id>
</element-citation>
</ref>
<ref id="bts526-B16">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Q</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Combined immunodeficiency associated with DOCK8 mutations</article-title>
<source>New Engl. J. Med.</source>
<year>2009</year>
<volume>361</volume>
<fpage>2046</fpage>
<lpage>2055</lpage>
<pub-id pub-id-type="pmid">19776401</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
