Exploration server on relations between France and Australia

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Combining calls from multiple somatic mutation-callers</title>
<author>
<name sortKey="Kim, Su Yeon" sort="Kim, Su Yeon" uniqKey="Kim S" first="Su Yeon" last="Kim">Su Yeon Kim</name>
<affiliation>
<nlm:aff id="I1">Department of Statistics, University of California at Berkeley, Berkeley CA 94720, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jacob, Laurent" sort="Jacob, Laurent" uniqKey="Jacob L" first="Laurent" last="Jacob">Laurent Jacob</name>
<affiliation>
<nlm:aff id="I2">Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558 Villeurbanne, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Speed, Terence P" sort="Speed, Terence P" uniqKey="Speed T" first="Terence P" last="Speed">Terence P. Speed</name>
<affiliation>
<nlm:aff id="I1">Department of Statistics, University of California at Berkeley, Berkeley CA 94720, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Walter and Eliza Hall Institute of Medical Research and the University of Melbourne, Parkville, Victoria, Australia</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">24885750</idno>
<idno type="pmc">4035752</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4035752</idno>
<idno type="RBID">PMC:4035752</idno>
<idno type="doi">10.1186/1471-2105-15-154</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">001572</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001572</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Combining calls from multiple somatic mutation-callers</title>
<author>
<name sortKey="Kim, Su Yeon" sort="Kim, Su Yeon" uniqKey="Kim S" first="Su Yeon" last="Kim">Su Yeon Kim</name>
<affiliation>
<nlm:aff id="I1">Department of Statistics, University of California at Berkeley, Berkeley CA 94720, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jacob, Laurent" sort="Jacob, Laurent" uniqKey="Jacob L" first="Laurent" last="Jacob">Laurent Jacob</name>
<affiliation>
<nlm:aff id="I2">Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558 Villeurbanne, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Speed, Terence P" sort="Speed, Terence P" uniqKey="Speed T" first="Terence P" last="Speed">Terence P. Speed</name>
<affiliation>
<nlm:aff id="I1">Department of Statistics, University of California at Berkeley, Berkeley CA 94720, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Walter and Eliza Hall Institute of Medical Research and the University of Melbourne, Parkville, Victoria, Australia</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Accurate somatic mutation-calling is essential for insightful mutation analyses in cancer studies. Several mutation-callers are publicly available and more are likely to appear. Nonetheless, mutation-calling is still challenging and there is unlikely to be one established caller that systematically outperforms all others. Therefore, fully utilizing multiple callers can be a powerful way to construct a list of final calls for one’s research.</p>
</sec>
<sec>
<title>Results</title>
<p>Using a set of mutations from multiple callers that are impartially validated, we present a statistical approach for building a combined caller, which can be applied to combine calls in a wider dataset generated using a similar protocol. Using the mutation outputs and the validation data from The Cancer Genome Atlas endometrial study (6,746 sites), we demonstrate how to build a statistical model that predicts the probability of each call being a somatic mutation, based on the detection status of multiple callers and a few associated features.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The approach allows us to build a combined caller across the full range of stringency levels, which outperforms all of the individual callers.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Hansen, Nf" uniqKey="Hansen N">NF Hansen</name>
</author>
<author>
<name sortKey="Gartner, Jj" uniqKey="Gartner J">JJ Gartner</name>
</author>
<author>
<name sortKey="Mei, L" uniqKey="Mei L">L Mei</name>
</author>
<author>
<name sortKey="Samuels, Y" uniqKey="Samuels Y">Y Samuels</name>
</author>
<author>
<name sortKey="Mullikin, Jc" uniqKey="Mullikin J">JC Mullikin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cibulskis, K" uniqKey="Cibulskis K">K Cibulskis</name>
</author>
<author>
<name sortKey="Lawrence, Ms" uniqKey="Lawrence M">MS Lawrence</name>
</author>
<author>
<name sortKey="Carter, Sl" uniqKey="Carter S">SL Carter</name>
</author>
<author>
<name sortKey="Sivachenko, A" uniqKey="Sivachenko A">A Sivachenko</name>
</author>
<author>
<name sortKey="Jaffe, D" uniqKey="Jaffe D">D Jaffe</name>
</author>
<author>
<name sortKey="Sougnez, C" uniqKey="Sougnez C">C Sougnez</name>
</author>
<author>
<name sortKey="Gabriel, S" uniqKey="Gabriel S">S Gabriel</name>
</author>
<author>
<name sortKey="Meyerson, M" uniqKey="Meyerson M">M Meyerson</name>
</author>
<author>
<name sortKey="Lander, Es" uniqKey="Lander E">ES Lander</name>
</author>
<author>
<name sortKey="Getz, G" uniqKey="Getz G">G Getz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saunders, Ct" uniqKey="Saunders C">CT Saunders</name>
</author>
<author>
<name sortKey="Wong, Ws" uniqKey="Wong W">WS Wong</name>
</author>
<author>
<name sortKey="Swamy, S" uniqKey="Swamy S">S Swamy</name>
</author>
<author>
<name sortKey="Becq, J" uniqKey="Becq J">J Becq</name>
</author>
<author>
<name sortKey="Murray, Lj" uniqKey="Murray L">LJ Murray</name>
</author>
<author>
<name sortKey="Cheetham, Rk" uniqKey="Cheetham R">RK Cheetham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ding, J" uniqKey="Ding J">J Ding</name>
</author>
<author>
<name sortKey="Bashashati, A" uniqKey="Bashashati A">A Bashashati</name>
</author>
<author>
<name sortKey="Roth, A" uniqKey="Roth A">A Roth</name>
</author>
<author>
<name sortKey="Oloumi, A" uniqKey="Oloumi A">A Oloumi</name>
</author>
<author>
<name sortKey="Tse, K" uniqKey="Tse K">K Tse</name>
</author>
<author>
<name sortKey="Zeng, T" uniqKey="Zeng T">T Zeng</name>
</author>
<author>
<name sortKey="Haffari, G" uniqKey="Haffari G">G Haffari</name>
</author>
<author>
<name sortKey="Hirst, M" uniqKey="Hirst M">M Hirst</name>
</author>
<author>
<name sortKey="Marra, Ma" uniqKey="Marra M">MA Marra</name>
</author>
<author>
<name sortKey="Condon, A" uniqKey="Condon A">A Condon</name>
</author>
<author>
<name sortKey="Aparicio, S" uniqKey="Aparicio S">S Aparicio</name>
</author>
<author>
<name sortKey="Shah, Sp" uniqKey="Shah S">SP Shah</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roth, A" uniqKey="Roth A">A Roth</name>
</author>
<author>
<name sortKey="Ding, J" uniqKey="Ding J">J Ding</name>
</author>
<author>
<name sortKey="Morin, R" uniqKey="Morin R">R Morin</name>
</author>
<author>
<name sortKey="Crisan, A" uniqKey="Crisan A">A Crisan</name>
</author>
<author>
<name sortKey="Ha, G" uniqKey="Ha G">G Ha</name>
</author>
<author>
<name sortKey="Giuliany, R" uniqKey="Giuliany R">R Giuliany</name>
</author>
<author>
<name sortKey="Bashashati, A" uniqKey="Bashashati A">A Bashashati</name>
</author>
<author>
<name sortKey="Hirst, M" uniqKey="Hirst M">M Hirst</name>
</author>
<author>
<name sortKey="Turashvili, G" uniqKey="Turashvili G">G Turashvili</name>
</author>
<author>
<name sortKey="Oloumi, A" uniqKey="Oloumi A">A Oloumi</name>
</author>
<author>
<name sortKey="Marra, Ma" uniqKey="Marra M">MA Marra</name>
</author>
<author>
<name sortKey="Aparicio, S" uniqKey="Aparicio S">S Aparicio</name>
</author>
<author>
<name sortKey="Shah, Sp" uniqKey="Shah S">SP Shah</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Larson, De" uniqKey="Larson D">DE Larson</name>
</author>
<author>
<name sortKey="Harris, Cc" uniqKey="Harris C">CC Harris</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Koboldt, Dc" uniqKey="Koboldt D">DC Koboldt</name>
</author>
<author>
<name sortKey="Abbott, Te" uniqKey="Abbott T">TE Abbott</name>
</author>
<author>
<name sortKey="Dooling, Dj" uniqKey="Dooling D">DJ Dooling</name>
</author>
<author>
<name sortKey="Ley, Tj" uniqKey="Ley T">TJ Ley</name>
</author>
<author>
<name sortKey="Mardis, Er" uniqKey="Mardis E">ER Mardis</name>
</author>
<author>
<name sortKey="Wilson, Rk" uniqKey="Wilson R">RK Wilson</name>
</author>
<author>
<name sortKey="Ding, L" uniqKey="Ding L">L Ding</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lower, M" uniqKey="Lower M">M Lower</name>
</author>
<author>
<name sortKey="Renard, By" uniqKey="Renard B">BY Renard</name>
</author>
<author>
<name sortKey="De Graaf, J" uniqKey="De Graaf J">J de Graaf</name>
</author>
<author>
<name sortKey="Wagner, M" uniqKey="Wagner M">M Wagner</name>
</author>
<author>
<name sortKey="Paret, C" uniqKey="Paret C">C Paret</name>
</author>
<author>
<name sortKey="Kneip, C" uniqKey="Kneip C">C Kneip</name>
</author>
<author>
<name sortKey="Tureci, O" uniqKey="Tureci O">O Tureci</name>
</author>
<author>
<name sortKey="Diken, M" uniqKey="Diken M">M Diken</name>
</author>
<author>
<name sortKey="Britten, C" uniqKey="Britten C">C Britten</name>
</author>
<author>
<name sortKey="Kreiter, S" uniqKey="Kreiter S">S Kreiter</name>
</author>
<author>
<name sortKey="Koslowski, M" uniqKey="Koslowski M">M Koslowski</name>
</author>
<author>
<name sortKey="Castle, Jc" uniqKey="Castle J">JC Castle</name>
</author>
<author>
<name sortKey="Sahin, U" uniqKey="Sahin U">U Sahin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
<author>
<name sortKey="Friedman, J" uniqKey="Friedman J">J Friedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Breiman, L" uniqKey="Breiman L">L Breiman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Handsaker, B" uniqKey="Handsaker B">B Handsaker</name>
</author>
<author>
<name sortKey="Wysoker, A" uniqKey="Wysoker A">A Wysoker</name>
</author>
<author>
<name sortKey="Fennell, T" uniqKey="Fennell T">T Fennell</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Homer, N" uniqKey="Homer N">N Homer</name>
</author>
<author>
<name sortKey="Marth, G" uniqKey="Marth G">G Marth</name>
</author>
<author>
<name sortKey="Abecasis, G" uniqKey="Abecasis G">G Abecasis</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wolpert, Dh" uniqKey="Wolpert D">DH Wolpert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sill, J" uniqKey="Sill J">J Sill</name>
</author>
<author>
<name sortKey="Takacs, G" uniqKey="Takacs G">G Takács</name>
</author>
<author>
<name sortKey="Mackey, L" uniqKey="Mackey L">L Mackey</name>
</author>
<author>
<name sortKey="Lin, D" uniqKey="Lin D">D Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Friedman, J" uniqKey="Friedman J">J Friedman</name>
</author>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">24885750</article-id>
<article-id pub-id-type="pmc">4035752</article-id>
<article-id pub-id-type="publisher-id">1471-2105-15-154</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-15-154</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Combining calls from multiple somatic mutation-callers</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Kim</surname>
<given-names>Su Yeon</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>suyeonkim08@gmail.com</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Jacob</surname>
<given-names>Laurent</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>laurent.jacob@gmail.com</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A3">
<name>
<surname>Speed</surname>
<given-names>Terence P</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>terry@stat.berkeley.edu</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Department of Statistics, University of California at Berkeley, Berkeley CA 94720, USA</aff>
<aff id="I2">
<label>2</label>
Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558 Villeurbanne, France</aff>
<aff id="I3">
<label>3</label>
Walter and Eliza Hall Institute of Medical Research and the University of Melbourne, Parkville, Victoria, Australia</aff>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>21</day>
<month>5</month>
<year>2014</year>
</pub-date>
<volume>15</volume>
<fpage>154</fpage>
<lpage>154</lpage>
<history>
<date date-type="received">
<day>16</day>
<month>2</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>5</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2014 Kim et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Kim et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/15/154"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Accurate somatic mutation-calling is essential for insightful mutation analyses in cancer studies. Several mutation-callers are publicly available and more are likely to appear. Nonetheless, mutation-calling is still challenging and there is unlikely to be one established caller that systematically outperforms all others. Therefore, fully utilizing multiple callers can be a powerful way to construct a list of final calls for one’s research.</p>
</sec>
<sec>
<title>Results</title>
<p>Using a set of mutations from multiple callers that are impartially validated, we present a statistical approach for building a combined caller, which can be applied to combine calls in a wider dataset generated using a similar protocol. Using the mutation outputs and the validation data from The Cancer Genome Atlas endometrial study (6,746 sites), we demonstrate how to build a statistical model that predicts the probability of each call being a somatic mutation, based on the detection status of multiple callers and a few associated features.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The approach allows us to build a combined caller across the full range of stringency levels, which outperforms all of the individual callers.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Cancer genome</kwd>
<kwd>Somatic mutation-calling</kwd>
<kwd>Combining calls</kwd>
<kwd>Stacking</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Somatic mutations are genetic changes that occur in somatic cells after conception. Cancer is driven by such somatic alterations, and thus cataloging somatic mutations is essential to understand the genetic bases of cancer development. With the burst of high-throughput sequencing data generated in recent years, extensive efforts have been made towards accurate somatic mutation-calling. Many calling algorithms are now publicly available, including Shimmer [
<xref ref-type="bibr" rid="B1">1</xref>
], MuTect [
<xref ref-type="bibr" rid="B2">2</xref>
], Strelka [
<xref ref-type="bibr" rid="B3">3</xref>
], MutationSeq [
<xref ref-type="bibr" rid="B4">4</xref>
], JointSNVMix [
<xref ref-type="bibr" rid="B5">5</xref>
], and SomaticSniper [
<xref ref-type="bibr" rid="B6">6</xref>
]. Additional in-house callers are likely to be under development for ongoing studies. Nonetheless, many challenges remain, for example removing artifactual variants from multiple sources, detecting rare variants in highly heterogeneous tumor samples, and detecting variants at shallower sequencing coverage. Every caller tackles these issues, but each is likely to be more successful on some of them and less so on others. As a consequence, finding the single best-performing caller is not easy, and perhaps not even feasible.</p>
<p>Having multiple callers on hand, anyone conducting a mutation analysis may want to apply all of the callers to their data with the aim of later constructing a list of final calls. In essence, combining calls from multiple callers amounts to developing a strategy for ranking the calls so that the most credible ones are retained as final calls. This can be done effectively if one can systematically assign to every call in the full list a measure of confidence that it is a somatic mutation. In general, pursuing this goal requires at least some validation data. For example, the paper by Lower et al. [
<xref ref-type="bibr" rid="B7">7</xref>
] presented a method to prioritize calls from three methods by assigning false discovery rate confidence values, but it requires the independent sequencing of at least one of the tumor or normal samples.</p>
<p>In our work, we consider a situation in which mutation-calling is done (by multiple callers) for many tumor-normal sequence pairs across large genomic regions such as the whole genome or exome, but only limited resources are available for validation. For example, in practice, often only a small fraction of the detected mutations can be validated, or only a small subset of regions in a selected list of samples is re-sequenced for evaluation purposes. We aim to build a combined caller that is learned from the relatively small validation dataset but can be applied to a wider dataset generated with a similar protocol.</p>
<p>A large corpus in the statistical literature is dedicated to combining individual learners; see
<italic>e.g.</italic>
Chapter 16 of [
<xref ref-type="bibr" rid="B8">8</xref>
]. However, most of these methods (
<italic>e.g.</italic>
, boosting, bagging and random forests) are based on building individual learners from descriptors rather than combining outputs of algorithms.
<italic>Stacking</italic>
[
<xref ref-type="bibr" rid="B9">9</xref>
] was introduced as a means of combining such outputs. In this paper, we exploit this well-established framework to merge the outputs of different callers.</p>
<p>Specifically, we present a statistical approach for combining calls from multiple somatic mutation-callers, when validation is impartially done for all mutations detected by all callers in a selected set of regions or samples. For 194 tumor-normal exome-seq pairs from The Cancer Genome Atlas (TCGA) endometrial study [
<xref ref-type="bibr" rid="B10">10</xref>
], single nucleotide variant (SNV) type mutations (i.e., point mutations) were detected by three somatic mutation-callers. Validation through independent re-sequencing was impartially done for all the mutations detected from 20 selected patients across the whole exome and for those mutations detected within 243 genes of interest across all 194 patients. We used these data to show how our statistical approach improves on the individual callers and on naive combinations based on caller intersection. We also show that this improvement is maintained when the parameters of the model are estimated on a set of samples or regions different from the ones on which the performance is evaluated.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<p>Our aim is to build a combined caller using the mutation outputs generated by multiple callers based on the same paired tumor-normal sequence data (BAMs; [
<xref ref-type="bibr" rid="B11">11</xref>
]), when the mutation calls are impartially validated. For illustration purposes, we assume
<italic>K</italic>
=3 callers (Callers A, B, and C) are used for mutation-calling. The most basic and important information available in each mutation output is the list of positions detected as point mutations. A mutation output may include additional features such as mutation substitution type, mutation quality score, and perhaps details of the filters applied to remove artifactual or low-quality variants. When the raw sequence data are available, genomic features such as sequencing depth and the variant allele fraction (the fraction of reads carrying the variant allele) can be computed at each mutation site for each tumor and normal sample. The more information is available, the more powerful the combined callers that can be constructed.</p>
<sec>
<title>Taking intersections or unions</title>
<p>One natural and simple way to build a combined caller is to take intersections or unions of the calls from the three callers as final calls. For example, one may take the mutations detected by all callers (ABC), or take intersections of mutations from two callers (AB, AC, or BC), or take calls detected by at least two callers (‘2orMore’), or even take calls detected by any caller (Union). This strategy is very intuitive and can be used in practice as soon as a Venn diagram is drawn from the calls, as in Figure
<xref ref-type="fig" rid="F1">1</xref>
. Note that building this type of combined caller does not require a validation dataset — although estimating its performance does.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>Venn diagram of the point mutations detected by three callers on 20 TCGA endometrial tumor-normal exome-seq pairs.</p>
</caption>
<graphic xlink:href="1471-2105-15-154-1"></graphic>
</fig>
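<p>As a minimal sketch of these set-based combinations (not code from the study), the Python example below builds the ABC, pairwise, ‘2orMore’ and Union callers from per-caller sets of sites. The function name combine_calls, the representation of a site as a (chromosome, position) pair, and the toy coordinates are all illustrative assumptions.</p>
<preformat>
```python
from itertools import combinations


def combine_calls(calls_by_caller):
    """Build simple set-based combined callers from per-caller call sets.

    calls_by_caller maps a caller name (e.g. "A") to the set of sites it
    detected. Returns a dict mapping a combination label to a set of sites.
    """
    names = sorted(calls_by_caller)
    sets = [set(calls_by_caller[name]) for name in names]

    combined = {
        # Sites detected by any caller.
        "Union": set().union(*sets),
        # Sites detected by every caller (label "ABC" for callers A, B, C).
        "".join(names): sets[0].intersection(*sets[1:]),
    }

    # Pairwise intersections (AB, AC, BC for three callers).
    pairwise = []
    for (name1, set1), (name2, set2) in combinations(zip(names, sets), 2):
        shared = set1.intersection(set2)
        combined[name1 + name2] = shared
        pairwise.append(shared)

    # A site is detected by at least two callers exactly when it lies in
    # at least one pairwise intersection.
    combined["2orMore"] = set().union(*pairwise)
    return combined


# Hypothetical toy calls: each site is a (chromosome, position) pair.
toy_calls = {
    "A": {("chr1", 100), ("chr1", 200), ("chr2", 50)},
    "B": {("chr1", 100), ("chr2", 50), ("chr3", 7)},
    "C": {("chr1", 100), ("chr1", 200)},
}
combined = combine_calls(toy_calls)
for label in sorted(combined):
    print(label, sorted(combined[label]))
```
</preformat>
<p>In this toy input only one site is shared by all three callers, three sites are detected by at least two callers, and the union contains four sites, which illustrates how stringency decreases from ABC to Union.</p>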
</sec>
<sec>
<title>Cumulatively adding mutation sets based on combination call status</title>
<p>We explained how the sets of mutation sites defined by a Venn diagram could be used to build a combined caller. Restricting ourselves to mutation sets corresponding to a combination of detection statuses of the
<italic>K</italic>
callers, we obtain a partition of the mutation sites into 2
<sup>
<italic>K</italic>
</sup>
-1 disjoint subsets. This partition can be used to systematically sort mutations by some measure of the confidence that we have in their being somatic mutations. In Figure
<xref ref-type="fig" rid="F1">1</xref>
, these 2
<sup>3</sup>
-1=7 disjoint sets are ABC, AB without C, AC without B, BC without A, A only, B only, and C only. We sort these 2
<sup>
<italic>K</italic>
</sup>
-1 disjoint sets by their validation rate,
<italic>i.e.</italic>
, by the proportion of true mutations that they contain, as shown in Table
<xref ref-type="table" rid="T1">1</xref>
. These sorted sets of sites define a sequence of combined callers, ordered by stringency. The most stringent combined caller predicts only the sites in the first set to be mutations. Progressively less stringent combined callers are then defined by cumulatively adding the sites of the next sets in sorted order.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>
<bold>Validation results of the seven disjoint mutation sets shown in Figure </bold>
<xref ref-type="fig" rid="F1">1</xref>
</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="center" valign="bottom">
<bold>Combination</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>Val.</bold>
<hr></hr>
</th>
<th align="right" valign="bottom">
<bold>FP</bold>
<hr></hr>
</th>
<th align="right" valign="bottom">
<bold>TP</bold>
<hr></hr>
</th>
<th align="right" valign="bottom">
<bold>cFP</bold>
<hr></hr>
</th>
<th align="right" valign="bottom">
<bold>cTP</bold>
<hr></hr>
</th>
</tr>
<tr>
<th align="center">
<bold>call status</bold>
</th>
<th align="center">
<bold>rate (%)</bold>
</th>
<th align="right">
<bold>count</bold>
</th>
<th align="right">
<bold>count</bold>
</th>
<th align="right">
<bold>rate</bold>
</th>
<th align="right">
<bold>rate</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center" valign="bottom">All callers
<hr></hr>
</td>
<td align="center" valign="bottom">99.4
<hr></hr>
</td>
<td align="right" valign="bottom">12
<hr></hr>
</td>
<td align="right" valign="bottom">1,914
<hr></hr>
</td>
<td align="right" valign="bottom">1.2
<hr></hr>
</td>
<td align="right" valign="bottom">55.3
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom">Caller A and C only
<hr></hr>
</td>
<td align="center" valign="bottom">96.4
<hr></hr>
</td>
<td align="right" valign="bottom">11
<hr></hr>
</td>
<td align="right" valign="bottom">294
<hr></hr>
</td>
<td align="right" valign="bottom">2.4
<hr></hr>
</td>
<td align="right" valign="bottom">63.8
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom">Caller A and B only
<hr></hr>
</td>
<td align="center" valign="bottom">96.3
<hr></hr>
</td>
<td align="right" valign="bottom">7
<hr></hr>
</td>
<td align="right" valign="bottom">184
<hr></hr>
</td>
<td align="right" valign="bottom">3.1
<hr></hr>
</td>
<td align="right" valign="bottom">69.1
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom">Caller B and C only
<hr></hr>
</td>
<td align="center" valign="bottom">94.4
<hr></hr>
</td>
<td align="right" valign="bottom">2
<hr></hr>
</td>
<td align="right" valign="bottom">34
<hr></hr>
</td>
<td align="right" valign="bottom">3.3
<hr></hr>
</td>
<td align="right" valign="bottom">70.1
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom">Caller C only
<hr></hr>
</td>
<td align="center" valign="bottom">79.6
<hr></hr>
</td>
<td align="right" valign="bottom">11
<hr></hr>
</td>
<td align="right" valign="bottom">43
<hr></hr>
</td>
<td align="right" valign="bottom">4.4
<hr></hr>
</td>
<td align="right" valign="bottom">71.3
<hr></hr>
</td>
</tr>
<tr>
<td align="center" valign="bottom">Caller A only
<hr></hr>
</td>
<td align="center" valign="bottom">59.7
<hr></hr>
</td>
<td align="right" valign="bottom">632
<hr></hr>
</td>
<td align="right" valign="bottom">935
<hr></hr>
</td>
<td align="right" valign="bottom">69.1
<hr></hr>
</td>
<td align="right" valign="bottom">98.4
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Caller B only</td>
<td align="center">15.9</td>
<td align="right">302</td>
<td align="right">57</td>
<td align="right">100</td>
<td align="right">100</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>For each mutation set (row), the validation rate (Val. rate), the false positive (FP) and true positive (TP) counts, and the cumulative false positive (cFP) and cumulative true positive (cTP) rates in percentage, are presented. Mutation sets are ordered by the validation rate.</p>
</table-wrap-foot>
</table-wrap>
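<p>The cumulative columns of Table 1 can be recomputed from the FP and TP counts alone. The Python sketch below assumes, as the table values suggest, that the cFP and cTP rates are the cumulative counts divided by the total FP and TP counts over all seven sets. The names TABLE1_COUNTS and cumulative_callers are hypothetical, and the code is an illustration rather than the implementation used in the study.</p>
<preformat>
```python
# (label, FP count, TP count) for the seven disjoint sets, copied from Table 1.
TABLE1_COUNTS = [
    ("All callers", 12, 1914),
    ("Caller A and C only", 11, 294),
    ("Caller A and B only", 7, 184),
    ("Caller B and C only", 2, 34),
    ("Caller C only", 11, 43),
    ("Caller A only", 632, 935),
    ("Caller B only", 302, 57),
]


def cumulative_callers(counts):
    """Sort disjoint sets by validation rate and accumulate FP and TP rates.

    Each returned row describes the combined caller obtained by adding that
    set to all more reliable sets. Assumes every set is non-empty and that
    the totals of FP and TP counts are both positive.
    """
    total_fp = sum(fp for _, fp, _ in counts)
    total_tp = sum(tp for _, _, tp in counts)

    # Validation rate = TP / (TP + FP); most reliable set first.
    ordered = sorted(
        counts, key=lambda row: row[2] / (row[1] + row[2]), reverse=True
    )

    rows = []
    cum_fp = 0
    cum_tp = 0
    for label, fp, tp in ordered:
        cum_fp += fp
        cum_tp += tp
        rows.append(
            {
                "set": label,
                "val_rate": 100.0 * tp / (tp + fp),
                "cFP": 100.0 * cum_fp / total_fp,
                "cTP": 100.0 * cum_tp / total_tp,
            }
        )
    return rows


rows = cumulative_callers(TABLE1_COUNTS)
for row in rows:
    print(
        f"{row['set']:22s} val={row['val_rate']:5.1f} "
        f"cFP={row['cFP']:5.1f} cTP={row['cTP']:5.1f}"
    )
```
</preformat>
<p>Each row corresponds to one point on the stringency scale: stopping after the first row gives the most stringent combined caller, and including all rows gives the union of the three callers.</p>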
</sec>
<sec>
<title>Fitting logistic models using the call status and genomic features</title>
<p>Stacked generalization was first introduced in the neural network community [
<xref ref-type="bibr" rid="B12">12</xref>
] and later adapted to the statistics literature [
<xref ref-type="bibr" rid="B9">9</xref>
], as a systematic way to combine classifiers.</p>
<p>Given a set of calls
<italic>c</italic>
<sub>
<italic>i</italic>
<italic>k</italic>
</sub>
∈{0,1} for site 1≤
<italic>i</italic>
≤
<italic>n</italic>
and caller 1≤
<italic>k</italic>
≤
<italic>K</italic>
, stacking aims at building a linear function of the calls for each site
<italic>i</italic>
which predicts its true status
<italic>y</italic>
<sub>
<italic>i</italic>
</sub>
as accurately as possible. In other words, we represent each site by its
<italic>K</italic>
calls from the different callers, and learn a new classifier of mutation sites in this feature space. Formally, given a set of
<italic>n</italic>
sites with known calls
<italic>c</italic>
<sub>
<italic>i</italic>
<italic>k</italic>
</sub>
for all callers and known true status
<italic>y</italic>
<sub>
<italic>i</italic>
</sub>
, a linear stacking approach would solve: </p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2105-15-154-i1" overflow="scroll">
<mml:munder>
<mml:mrow>
<mml:mtext>arg min</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mi mathsize="big"></mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ik</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>
<italic>i.e.</italic>
, a linear regression in the call space, estimating weights
<italic>β</italic>
<sub>
<italic>k</italic>
</sub>
such that a linear combination of the calls based on these weights is close to the true mutation status. The mutation status of a new site
<italic>c</italic>
<sub>
<italic>i</italic>
</sub>
defined by its calls from the
<italic>K</italic>
individual callers would then be predicted via </p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M2" name="1471-2105-15-154-i2" overflow="scroll">
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mover>
<mml:mrow>
<mml:mo>=</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>Δ</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ik</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mi>.</mml:mi>
</mml:math>
</disp-formula>
</p>
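<p>As a minimal illustration of (1) and (2), the sketch below evaluates the stacked score of each site and the squared-error objective for one candidate weight vector. It is written in Python rather than the R used for the experiments, and the toy calls, statuses, weights, and function names are hypothetical values chosen only for this example.</p>
<preformat>
```python
def stack_score(beta, calls):
    """Equation (2): weighted sum of one site's K calls."""
    return sum(b * c for b, c in zip(beta, calls))


def squared_loss(beta, call_matrix, status):
    """Objective of equation (1) evaluated at a given weight vector."""
    return sum((y - stack_score(beta, row)) ** 2
               for row, y in zip(call_matrix, status))


# Hypothetical toy data: 4 sites, K = 3 callers.
calls = [[1, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
status = [1, 0, 1, 1]
beta = [0.2, 0.3, 0.5]  # hypothetical weights, not estimated values

scores = [stack_score(beta, row) for row in calls]
loss = squared_loss(beta, calls, status)
```
</preformat>
<p>Solving (1) amounts to searching for the weight vector that makes this loss as small as possible.</p>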
<p>In practice, we use a logistic model rather than a linear one, because it is better suited to binary classification [
<xref ref-type="bibr" rid="B8">8</xref>
] – we only have binary mutation status {0,1} as opposed to scores or continuous confidence measures. Our estimator therefore becomes: </p>
<p>
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M3" name="1471-2105-15-154-i3" overflow="scroll">
<mml:munder>
<mml:mrow><mml:mtext>arg min</mml:mtext></mml:mrow>
<mml:mrow>
<mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub>
</mml:mrow>
</mml:munder>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>n</mml:mi></mml:mrow>
</mml:munderover>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mo>log</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mo>exp</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi></mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">ik</mml:mtext></mml:mrow></mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>-</mml:mo>
<mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi></mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">ik</mml:mtext></mml:mrow></mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mi>.</mml:mi>
</mml:math>
</disp-formula>
</p>
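<p>The objective in (3) is the negative log-likelihood of a logistic regression without intercept. The following Python sketch evaluates it and minimizes it by plain gradient descent. This is not the fitting procedure used in our experiments; the optimizer, step size, iteration count, and toy data are assumptions made only to keep the example self-contained.</p>
<preformat>
```python
import math


def logistic_loss(beta, call_matrix, status):
    """Objective of equation (3): sum over sites of log(1 + exp(s_i)) - y_i * s_i."""
    total = 0.0
    for row, y in zip(call_matrix, status):
        s = sum(b * c for b, c in zip(beta, row))
        total += math.log1p(math.exp(s)) - y * s
    return total


def fit_logistic_stacking(call_matrix, status, step=0.1, n_iter=500):
    """Plain gradient descent on (3); no intercept, as in the equation."""
    k = len(call_matrix[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        for row, y in zip(call_matrix, status):
            s = sum(b * c for b, c in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-s))  # predicted probability of 'somatic'
            for j, c in enumerate(row):
                grad[j] += (p - y) * c
        beta = [b - step * g for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data: 7 sites, K = 3 callers.
calls = [[1, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1],
         [1, 0, 0], [0, 1, 1], [1, 0, 0]]
status = [1, 0, 1, 1, 0, 1, 1]

beta_hat = fit_logistic_stacking(calls, status)
loss_at_zero = logistic_loss([0.0, 0.0, 0.0], calls, status)
loss_at_fit = logistic_loss(beta_hat, calls, status)
```
</preformat>
<p>With all weights at zero, every site contributes log 2 to the loss, so any useful fit must fall below n log 2.</p>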
<p>If the features
<italic>c</italic>
<sub>
<italic>i</italic>
<italic>k</italic>
</sub>
are binary, which is the case if the individual callers returned binary decisions rather than continuous scores, the resulting classifier
<italic>f</italic>
(
<italic>c</italic>
<sub>
<italic>i</italic>
</sub>
) is the sum of weights
<italic>β</italic>
<sub>
<italic>k</italic>
</sub>
for callers which classified the site
<italic>i</italic>
as a somatic mutation. It can only take 2
<sup>
<italic>K</italic>
</sup>
-1 distinct values on sites which were called by at least one caller. Each of these values corresponds to a unique combination of calls by the individual methods, which in turn corresponds to one of the disjoint subsets defined by the Venn diagram discussed in Section ‘Cumulatively adding mutation sets based on combination call status’. If the effects of callers are additive, then the ranking of the sites defined by
<italic>f</italic>
is expected to essentially behave like the more naive one defined in Section ‘Cumulatively adding mutation sets based on combination call status’.</p>
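<p>To make the counting argument concrete, the sketch below enumerates the non-empty call patterns for K = 3 callers and ranks them by the sum of the weights of the callers that made the call. The weights are hypothetical and serve only to illustrate the enumeration.</p>
<preformat>
```python
from itertools import product


def combination_scores(beta):
    """Map each non-empty call pattern to the sum of weights of the callers that called."""
    k = len(beta)
    scores = {}
    for pattern in product([0, 1], repeat=k):
        if any(pattern):  # skip sites called by no caller
            scores[pattern] = sum(b for b, c in zip(beta, pattern) if c)
    return scores


beta = [0.4, 1.1, 2.3]  # hypothetical weights for callers A, B, C
scores = combination_scores(beta)

# Rank the 2**K - 1 disjoint Venn-diagram regions by decreasing score.
ranking = sorted(scores, key=scores.get, reverse=True)
```
</preformat>
<p>Under additive effects, a region can never score lower than any region whose callers are a subset of its own as long as all weights are non-negative, which is why the ranking resembles the naive cumulative ordering.</p>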
<p>The estimators defined by (1) and (3) combine the individual callers uniformly for all sites. It is however conceivable that some callers perform better for some types of sites,
<italic>e.g.</italic>
, those with low coverage, and less well for others. We now assume that some descriptors
<italic>g</italic>
<sub>
<italic>i</italic>
<italic>j</italic>
</sub>
, 1≤
<italic>j</italic>
≤
<italic>p</italic>
, of each site
<italic>i</italic>
are available besides the detection status of the three callers and the validation status. These descriptors could typically be genomic features.</p>
<p>Feature-weighted linear stacking (FWLS, [
<xref ref-type="bibr" rid="B13">13</xref>
]) replaces each parameter
<italic>β</italic>
<sub>
<italic>k</italic>
</sub>
of the stacking regression estimator (3) by a linear combination of the descriptors
<italic>g</italic>
<sub>
<italic>i</italic>
<italic>j</italic>
</sub>
: </p>
<p>
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M4" name="1471-2105-15-154-i4" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">jk</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where the
<italic>α</italic>
<sub>
<italic>j</italic>
<italic>k</italic>
</sub>
parameters are weights corresponding to the relevance of feature
<italic>g</italic>
<sub>
<italic>i</italic>
<italic>j</italic>
</sub>
to measure how predictive caller
<italic>k</italic>
is for site
<italic>i</italic>
. The weights
<italic>β</italic>
<sub>
<italic>k</italic>
</sub>
are therefore site-specific, accounting for the fact that the relevance
<italic>β</italic>
<sub>
<italic>k</italic>
</sub>
of a particular caller
<italic>k</italic>
may be different for sites with different genomic features.</p>
<p>Plugging the weights (4) into the linear function (2) yields a different set of coefficients for each site
<italic>i</italic>
:
<inline-formula>
<mml:math id="M5" name="1471-2105-15-154-i5" overflow="scroll">
<mml:mi>h</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi></mml:mrow>
</mml:msubsup>
<mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">ik</mml:mtext></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi></mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>p</mml:mi></mml:mrow>
</mml:msubsup>
<mml:msub><mml:mrow><mml:mi>α</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">jk</mml:mtext></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">ij</mml:mtext></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">ik</mml:mtext></mml:mrow></mml:msub>
</mml:math>
</inline-formula>
.
<italic>h</italic>
is now a linear function of the
<italic>K</italic>
×
<italic>p</italic>
products of features
<italic>g</italic>
<sub>
<italic>i</italic>
<italic>j</italic>
</sub>
and calls
<italic>c</italic>
<sub>
<italic>i</italic>
<italic>k</italic>
</sub>
so FWLS equivalently amounts to: </p>
<p>• describing each site by this extended set of features, and</p>
<p>• estimating a linear classifier of mutation sites in this space.</p>
<p>Formally, after plugging (4) into our stacking estimator (3), we see that FWLS solves: </p>
<p>
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M6" name="1471-2105-15-154-i6" overflow="scroll">
<mml:munder>
<mml:mrow><mml:mtext>arg min</mml:mtext></mml:mrow>
<mml:mrow>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow></mml:msub>
</mml:mrow>
</mml:munder>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>n</mml:mi></mml:mrow>
</mml:munderover>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>log</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mo>exp</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:munderover>
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">il</mml:mtext></mml:mrow></mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.3em"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>-</mml:mo>
<mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">il</mml:mtext></mml:mrow></mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where
<inline-formula>
<mml:math id="M7" name="1471-2105-15-154-i7" overflow="scroll">
<mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow><mml:mi>ℝ</mml:mi></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
contains all the products of calls and genomic features for site
<italic>i</italic>
. The
<italic>K</italic>
×
<italic>p</italic>
parameters
<italic>γ</italic>
<sub>
<italic>l</italic>
</sub>
are the weights of the logistic regression. They are strictly equivalent to the
<italic>α</italic>
<sub>
<italic>j</italic>
<italic>k</italic>
</sub>
parameters of (4); we introduce them only to emphasize that FWLS can be formulated as a regular logistic regression estimator in an expanded feature space.</p>
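<p>The equivalence between the site-specific weights of (4) and a linear model in the expanded feature space can be checked numerically. In the Python sketch below, the feature values, the α weights, and the flattening order of the products are hypothetical choices made for illustration; setting the first descriptor to the constant 1 is also an assumption of this example.</p>
<preformat>
```python
def expand_features(calls, features):
    """Return the K x p products c_ik * g_ij for one site, flattened caller by caller."""
    return [c * g for c in calls for g in features]


def fwls_score(alpha, calls, features):
    """h(c_i, g_i), with alpha[j][k] the weight of feature j for caller k."""
    return sum(alpha[j][k] * g * c
               for k, c in enumerate(calls)
               for j, g in enumerate(features))


# Hypothetical site: K = 3 calls and p = 2 descriptors (a constant and one feature).
calls = [1, 0, 1]
features = [1.0, 0.5]
alpha = [[0.2, 0.1, 0.4],    # weights of the constant descriptor, per caller
         [1.0, -0.5, 0.3]]   # weights of the second descriptor, per caller

x = expand_features(calls, features)
# Flatten alpha in the same caller-by-caller order to obtain the gamma vector of (5).
gamma = [alpha[j][k] for k in range(len(calls)) for j in range(len(features))]

score_expanded = sum(w * v for w, v in zip(gamma, x))
score_direct = fwls_score(alpha, calls, features)
```
</preformat>
<p>Both scores agree, which is the sense in which the γ and α parameterizations are interchangeable.</p>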
<p>In the experiments of this paper, we consider all combinations of call status defined in Section ‘Cumulatively adding mutation sets based on combination call status’,
<italic>i.e.</italic>
, all products of single calls rather than the single calls. Technically this can still be cast as a FWLS model, by adding all single calls and products of single calls to the set of features
<italic>g</italic>
<sub>
<italic>i</italic>
<italic>j</italic>
</sub>
. In practice, our implementation relies on (5),
<italic>i.e.</italic>
, on a logistic regression in an expanded feature space.</p>
<p>Finally, since the resulting feature space can become large, we choose to use an
<italic>ℓ</italic>
<sub>1</sub>
-penalized version of (5): </p>
<p>
<disp-formula id="bmcM6">
<label>(6)</label>
<mml:math id="M8" name="1471-2105-15-154-i8" overflow="scroll">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:munder>
<mml:mrow><mml:mtext>arg min</mml:mtext></mml:mrow>
<mml:mrow>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow></mml:msub>
</mml:mrow>
</mml:munder>
<mml:mspace width="0.3em"></mml:mspace>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>n</mml:mi></mml:mrow>
</mml:munderover>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>log</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>1</mml:mn>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>+</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>exp</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mspace width="0.3em"></mml:mspace>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">il</mml:mtext></mml:mrow></mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.3em"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>-</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mspace width="0.3em"></mml:mspace>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub>
<mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">il</mml:mtext></mml:mrow></mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>+</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mi>λ</mml:mi>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mspace width="0.3em"></mml:mspace>
<mml:munderover accentunder="false" accent="false">
<mml:mrow><mml:mi mathsize="big">∑</mml:mi></mml:mrow>
<mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mrow><mml:mi>K</mml:mi><mml:mo>×</mml:mo><mml:mi>p</mml:mi></mml:mrow>
</mml:munderover>
<mml:mo>|</mml:mo>
<mml:msub><mml:mrow><mml:mi>γ</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub>
<mml:mo>|</mml:mo>
<mml:mi>.</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Penalizing the
<italic>ℓ</italic>
<sub>1</sub>
norm
<inline-formula>
<mml:math id="M9" name="1471-2105-15-154-i9" overflow="scroll">
<mml:munderover>
<mml:mrow>
<mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>γ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:math>
</inline-formula>
of the parameter is known to lead to sparse estimators [
<xref ref-type="bibr" rid="B14">14</xref>
], and
<inline-formula>
<mml:math id="M10" name="1471-2105-15-154-i10" overflow="scroll">
<mml:mi>λ</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>ℝ</mml:mi>
</mml:math>
</inline-formula>
is used to adjust the level of sparsity.</p>
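<p>The sparsity induced by the penalty in (6) can be illustrated with a proximal gradient (soft-thresholding) iteration. This is a swapped-in algorithm chosen for brevity, not the solver used in our experiments; the step size, iteration count, and toy data in this Python sketch are assumptions.</p>
<preformat>
```python
import math


def soft_threshold(value, amount):
    """Proximal operator of the absolute value: shrink toward zero by 'amount'."""
    if value > amount:
        return value - amount
    if -value > amount:
        return value + amount
    return 0.0


def fit_l1_logistic(x_matrix, status, lam, step=0.05, n_iter=500):
    """Proximal gradient iterations on objective (6); the loss is not rescaled by n."""
    dim = len(x_matrix[0])
    gamma = [0.0] * dim
    for _ in range(n_iter):
        grad = [0.0] * dim
        for row, y in zip(x_matrix, status):
            s = sum(g * x for g, x in zip(gamma, row))
            p = 1.0 / (1.0 + math.exp(-s))
            for idx, x in enumerate(row):
                grad[idx] += (p - y) * x
        # Gradient step on the logistic loss, then shrinkage for the penalty.
        gamma = [soft_threshold(g - step * gr, step * lam)
                 for g, gr in zip(gamma, grad)]
    return gamma


# Hypothetical toy design: 7 sites, 3 expanded features.
x_matrix = [[1, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1],
            [1, 0, 0], [0, 1, 1], [1, 0, 0]]
status = [1, 0, 1, 1, 0, 1, 1]

gamma_unpenalized = fit_l1_logistic(x_matrix, status, lam=0.0)
gamma_heavily_penalized = fit_l1_logistic(x_matrix, status, lam=100.0)
```
</preformat>
<p>With a sufficiently large λ every coefficient is shrunk exactly to zero, while λ = 0 recovers the unpenalized estimator (5); intermediate values trade off fit against the number of non-zero weights.</p>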
</sec>
<sec>
<title>Implementation and evaluation of combined callers</title>
<p>The approach of building a combined caller by taking intersections or unions (Section ‘Taking intersections or unions’) does not require a training set, and evaluation of the caller can be done straightforwardly on a test set. The approach that cumulatively adds disjoint subsets (Section ‘Cumulatively adding mutation sets based on combination call status’) uses a training set to determine the order of subsets (by computing the validation rate of each subset), and evaluates the performance on a test set using the order. For the approach building a caller by fitting a logistic model (Section ‘Fitting logistic models using the call status and genomic features’), a training set is used to estimate the
<italic>γ</italic>
<sub>
<italic>l</italic>
</sub>
parameters of (6). In order to choose the hyperparameter
<italic>λ</italic>
, we perform 10-fold cross validation on the training set for each candidate
<italic>λ</italic>
to estimate the error of the associated model. Then the most parsimonious model whose error is no more than one standard error above the error of the best model is chosen. Once
<italic>λ</italic>
is selected, we re-estimate
<italic>γ</italic>
<sub>
<italic>l</italic>
</sub>
using this
<italic>λ</italic>
on the whole training set, and evaluate its performance on the test set. Experiments were conducted using the R package glmnet [
<xref ref-type="bibr" rid="B15">15</xref>
], which implements penalized GLMs, in particular the
<italic>ℓ</italic>
<sub>1</sub>
penalized logistic regression of which (6) is an instance. The R scripts that contain our detailed implementation are included as Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
.</p>
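<p>The selection rule for λ described above (often called the one-standard-error rule) can be sketched as follows. The candidate values, cross-validation errors, and standard errors in this Python example are hypothetical; in glmnet the analogous quantity is reported by <monospace>cv.glmnet</monospace> as <monospace>lambda.1se</monospace>.</p>
<preformat>
```python
def one_se_choice(lambdas, cv_mean, cv_se):
    """Pick the largest lambda whose CV error is within one SE of the minimum error.

    For an l1 penalty, a larger lambda gives a sparser (more parsimonious) model.
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if threshold >= err]
    return max(eligible)


# Hypothetical 10-fold cross-validation summary for five candidate values.
lambdas = [0.001, 0.01, 0.05, 0.1, 0.5]
cv_mean = [0.212, 0.200, 0.204, 0.208, 0.300]
cv_se = [0.010, 0.010, 0.010, 0.010, 0.010]

chosen = one_se_choice(lambdas, cv_mean, cv_se)
```
</preformat>
<p>In this toy example the minimum error is reached at λ = 0.01, but λ = 0.1 is selected because its error is still within one standard error of that minimum.</p>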
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<p>We have used the mutation datasets generated for the TCGA endometrial study [
<xref ref-type="bibr" rid="B10">10</xref>
]. For 194 tumor-normal Illumina exome-sequence pairs, somatic-mutation calling was done by three centers whose algorithms are referred to here as Callers A, B, and C. In total, 51,648 mutations of the single nucleotide variant (SNV) type were detected. A large fraction of the mutations were targeted for custom capture validation. As explained in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Supplementary Methods, these sites were captured using the NimbleGen technology and then re-sequenced independently using an Illumina HiSeq 2000. In particular, impartial validation (i.e., validating all calls from all callers) was carried out for all mutations in (1) 20 randomly selected patients and (2) an additional 243 genes of interest from the remaining 174 patients. Validation status was successfully determined for all but a small fraction (less than 5%) of the validated mutations. For more details about the validation and determining the validation status, see Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Supplementary Methods. Our final dataset consists of the successfully validated mutations: (1) 4,438 sites in the selected 20 patients and (2) an additional 2,308 sites within the 243 genes of interest. Note that almost all of these sites (> 95%) are included as example datasets in our software package (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
).</p>
<p>For each point mutation site in our final dataset, we know the validation status (‘somatic’ or ‘non-somatic’), the call status (i.e., whether or not it was detected) by each of the three callers, the mutation substitution type (combination of the reference allele and the variant allele), and the sequencing depth and the variant allele fraction in each tumor and normal sample based on the exome sequence data that was used for mutation-calling. A brief summary of our dataset is included as Table
<xref ref-type="table" rid="T1">1</xref>
, Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Table S1 and Figures S1–S4. Caller B provided more information besides the positions of the detected mutations. For a broader set of somatic variants (candidate mutations), it reported the mutation quality score as well as the pass/fail status of individual filters at each site. Although the detailed description of each filter was not available, the filter outcomes were available (Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Table S2), which we were able to use for improving Caller B’s performance (Section ‘Improving a single caller’s performance using details of its filters’). In Section ‘Building and evaluating combined callers’, we demonstrate how to build a combined caller using the calling status of the three individual mutation callers and a few genomic features. In Section ‘Improving a single caller’s performance using details of its filters’, we show the potential for improving the performance of an individual caller using more detailed outputs, using Caller B as an instance.</p>
<sec>
<title>Building and evaluating combined callers</title>
<p>We first used the mutations detected from the 20 selected patients (total: 4,438) to build and evaluate combined callers. Assuming (for illustrative purposes) that the characteristics of our mutations are not affected by sample-specific features, we randomly split the data into 50% training and 50% test sets. Other fractions were explored, but the qualitative conclusions were similar as long as there was enough data to train the model, e.g., more than 20% of the total.</p>
<p>The performance of the combined caller constructed by fitting a logistic model (defined in Section ‘Fitting logistic models using the call status and genomic features’) is shown as a receiver operating characteristic (ROC) curve in Figure
<xref ref-type="fig" rid="F2">2</xref>
. The explanatory variables for this logistic model consist of the combination call status (7-1 variables), sequencing depth and variant allele fraction in each tumor and normal sample (4 variables), mutation substitution type (12-1 variables), and interactions between the combination call status variables and other features (90 variables). Note that we used combination call status (7-1 variables) instead of the call status of each individual caller (3 variables) as shown in (6) in Section ‘Fitting logistic models using the call status and genomic features’. We used the combination call status because we do not want to assume that the effects of callers are necessarily additive. For example, a certain sequence feature may mislead two callers while the remaining caller has a better filter for it. Therefore, rather than imposing additivity, we characterize each combination call status separately. The model was fitted on a randomly selected 50% of the sites (the training set), and predictions were then made on the remaining 50% (the test set), enabling us to rank the mutations. A more stringent caller can be constructed by taking a smaller percentage of high-ranked mutations as final calls, and a more liberal caller can be constructed by including a larger percentage of mutations as final calls.</p>
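<p>The variable counts quoted above can be cross-checked with a short calculation. The Python sketch below assumes that every combination call status indicator is interacted with every depth, allele fraction, and substitution type variable, which is our reading of how the 90 interaction terms arise; the variable names are hypothetical.</p>
<preformat>
```python
# Counts taken from the text; one level of each categorical variable is the baseline.
n_combination = 7 - 1      # combination call status indicators
n_depth_vaf = 4            # depth and variant allele fraction, tumor and normal
n_substitution = 12 - 1    # mutation substitution type indicators

# Assumed structure: each combination indicator times each remaining feature.
n_interactions = n_combination * (n_depth_vaf + n_substitution)

n_total = n_combination + n_depth_vaf + n_substitution + n_interactions
```
</preformat>
<p>Under this assumption the interaction count matches the 90 variables stated above, and the expanded design has 111 columns before the penalty removes any of them.</p>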
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Performances of individual and combined callers.</bold>
Model fitting was done using a random 50% of the point mutations detected from the selected 20 patients, and evaluation was done based on the remaining half. In the main panel, the true positive and the false positive rates of various callers are shown: (1) three individual callers (red filled triangles): Caller A, Caller B, and Caller C, (2) the caller that cumulatively adds mutation sets based on the combination call status in the order of the validation rate (connected blue dots), (3) the combined caller built by fitting a logistic model (for details, see text) (green line). The area near the point showing Caller C’s performance is enlarged and shown as a small sub-panel on the lower right part of the main figure. This panel further indicates the performance of the callers that take unions or intersections of calls from three callers (brown diamonds): all three callers (ABC), intersections of two callers (AB, AC, or BC), and called by two or more callers (‘2orMore’).</p>
</caption>
<graphic xlink:href="1471-2105-15-154-2"></graphic>
</fig>
<p>The performances of individual callers and combined callers are summarized in Figure
<xref ref-type="fig" rid="F2">2</xref>
. Note that validation was done only for the mutations that were detected by at least one of the three callers; therefore, the union of all mutations comprises all true positives and all false positives. The results of the three individual callers appear as three points with different false positive rates, i.e., different stringency levels. Caller A is the most liberal in the sense that it detected many false positives (FP rate of 68%) but also most of the true positives (TP rate of 96%). Caller C has a very small FP rate (4%) but detected only 67% of the true positives. Caller B performs worse than Caller C, since it detected not only more false positives but also fewer true positives. The performance of the callers taking unions or intersections of the calls is marked as another set of points, inside the sub-panel on the lower right part of the main panel. The stringency levels of these callers are not necessarily ordered. For example, each intersection of two callers (AB, AC, or BC) is nested within the set of mutations called by two or more callers (2orMore), but no ordering exists among the three pairwise intersections. In contrast, the performance of the caller adding mutation sets cumulatively is shown as a connected set of blue dots because of the natural ordering determined by the validation rates. In reality, the ordering may not be the same between the training set and the test set. When the validation rates are very similar among the mutation subsets, or the number of mutations in each set is very small, sampling variation could easily result in a different ordering. For example, in the training set, the validation rates of the mutation set called by A and C but not B, and of the set called by A and B but not C, are 97.99% and 97.96%, respectively.</p>
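<p>Because every validated site was called by at least one caller, true and false positive rates can be computed relative to the union of all calls. The Python sketch below traces the connected points of the cumulative caller from per-subset counts. The counts are hypothetical, only three of the seven disjoint subsets are shown, and for brevity the ordering is computed on the same data on which it is evaluated, whereas in our analysis the ordering is learned on the training set and applied to the test set.</p>
<preformat>
```python
def cumulative_roc(subsets):
    """Trace ROC points when adding disjoint subsets by decreasing validation rate.

    subsets: list of (name, n_validated, n_not_validated); each subset is non-empty.
    Returns a list of (name, false_positive_rate, true_positive_rate).
    """
    total_pos = sum(s[1] for s in subsets)
    total_neg = sum(s[2] for s in subsets)
    ordered = sorted(subsets, key=lambda s: s[1] / (s[1] + s[2]), reverse=True)
    tp = 0
    fp = 0
    points = []
    for name, pos, neg in ordered:
        tp += pos
        fp += neg
        points.append((name, fp / total_neg, tp / total_pos))
    return points


# Hypothetical counts for three disjoint call-combination subsets.
subsets = [("ABC", 90, 2), ("A only", 10, 40), ("AC not B", 20, 1)]
points = cumulative_roc(subsets)
```
</preformat>
<p>The last point always sits at (1, 1), since adding every subset recovers the union of all calls.</p>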
<p>Overall, our combined caller obtained by fitting a logistic model outperforms the individual callers and other naive combinations. The ROC curve of this combined caller lies above all the points representing the performance of individual callers, although sometimes only slightly so. Further, the combined caller allows us to assess the performance across the full range of stringency levels.</p>
</sec>
<sec>
<title>Improving a single caller’s performance using details of its filters</title>
<p>For Caller B, mutation quality scores as well as the outcomes of individual filters were available for a broader set of somatic variants. (Note that for each caller, the detected mutations are the somatic variants that passed all the filters implemented by that caller.) In Figure
<xref ref-type="fig" rid="F2">2</xref>
, the performance of Caller B was shown as a single point. Here, we demonstrate how such extra details besides the call status can be used to improve the performance. Furthermore, to prove the validity of our approach in a wider dataset, we trained and tested on two different mutation datasets that were generated for the TCGA endometrial study using the same mutation calling algorithms, but constructed from different genomic regions as well as different tumor and normal samples. Specifically, we trained a model on the mutations from the 243 genes of interest from 174 patients (our second dataset described at the beginning of Section ‘Results’), then evaluated on the mutations from the whole exomes of the 20 patients (first dataset). A similar analysis was performed with the roles of the two datasets switched (Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Figure S5).</p>
<p>Since a mutation quality score was available for Caller B, we first drew an ROC curve by sorting the calls that were detected by Caller B (Figure
<xref ref-type="fig" rid="F3">3</xref>
). As expected, the rightmost point in the ROC curve (besides the one at the FP rate of 1.0) corresponds to the point at which Caller B was previously evaluated. We then fitted a logistic model including the mutation quality score and the individual filter outcomes (indicator of pass/fail) from Caller B as explanatory variables. The estimated coefficients for the individual filters are summarized in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Table S2 (note that these coefficients were estimated from a set of ascertained sites for which each site was called by at least one of the three callers).</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>ROC curve of an improved Caller B built by fitting a logistic model using the mutation quality score and individual filters of Caller B.</bold>
Model fitting was done using the point mutations in 243 genes of interest from 174 patients excluding the 20 patients, and evaluation was done on the point mutations in the 20 selected patients. The performances of three individual callers (red filled triangles), the combined caller that cumulatively adds mutation sets (connected blue dots), and the combined caller by fitting the logistic model (green line) are shown for comparison purposes. ROC curves of two updated versions of Caller B are shown. One version is obtained by ranking the mutations detected by Caller B using the mutation quality score of Caller B (violet line), and the other version by fitting a logistic model using the mutation quality score and the individual filters of Caller B on an extended set of mutations that were detected by at least one of the three callers (orange line).</p>
</caption>
<graphic xlink:href="1471-2105-15-154-3"></graphic>
</fig>
<p>Using the outcomes of the individual filters substantially improved Caller B’s performance (Figure
<xref ref-type="fig" rid="F3">3</xref>
). At a false positive rate of 33%, the true positive rate increases from 63% to 78%, detecting 520 more mutations. This highlights the importance of having the full details of all features involved in the final decision on a variant.</p>
<p>Furthermore, if similar details were available for Callers A and C, then we could generalize the logistic model from the previous section (Section ‘Building and evaluating combined callers’) to include the outcomes of individual filters from all callers, which could lead to higher power as well as better insight into the causes of mutation-calling errors.</p>
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>In this paper, we present an approach for effectively building a combined caller using the outputs from three mutation callers. Our approach remains valid with more than three callers or with less concordant mutation call outputs, as long as impartial validation data are available as training data for all calls from all mutation callers, and the relative performance of the individual callers is expected to be consistent between the training set and the test set. The combining approach could be even more beneficial if the individual callers agreed less, assuming (i) they all had comparable individual performances and (ii) the set of loci on which each caller is right could be characterized in terms of genomic features. In this case, the FWLS approach could learn the type of locus on which each caller is typically right and output the best answer for each new locus, resulting in more accurate calling.</p>
<p>We have analyzed mutation sites that were successfully validated based on the criteria described in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Supplementary Methods. Those validation criteria may not be perfect, but we found them reasonable for demonstrating our approach. Changes in validation criteria can result in changes in the individual callers’ performances and thus in the final estimated model. For example, more stringent criteria are likely to treat all very rare mutations as false calls and thus, in our exercise, may greatly reduce the sensitivity of Caller A. However, our approach still provides a convenient framework for building the best combined model given any validation status. In practice, determining validation status from independent sequencing data can be very challenging, and developing a highly accurate validation method is itself a separate research topic. Improving validation is beyond the scope of this paper, but if uncertainty in the validation could be quantified, it could be used in the logistic model fitting to give more weight to calls whose validation status is more certain.</p>
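<p>One way such weighting might look is sketched below, under stated assumptions: each site carries a hypothetical confidence value between 0 and 1 for its validation status, and that value multiplies the site's contribution to the logistic log-likelihood. The toy data, the confidence values and the function names are illustrative only and are not taken from the study.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(w, row):
    # w[0] is the intercept; w[1:] pairs with the explanatory variables.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))

def weighted_log_likelihood(w, X, y, weights):
    """Sum over sites of weight * log-probability of the observed validation status."""
    total = 0.0
    for row, label, wt in zip(X, y, weights):
        p = sigmoid(linear_score(w, row))
        total += wt * math.log(p if label == 1 else 1.0 - p)
    return total

def fit_weighted_logistic(X, y, weights, lr=0.1, n_iter=1000):
    """Batch gradient ascent on the weighted log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    total_wt = sum(weights)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for row, label, wt in zip(X, y, weights):
            err = wt * (label - sigmoid(linear_score(w, row)))
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * g / total_wt for wj, g in zip(w, grad)]
    return w

# Made-up sites: detection status (1 = called) by Callers A, B and C.
X = [[1, 1, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
# Made-up validation status: 1 = validated somatic mutation, 0 = false call.
y = [1, 0, 1, 1, 0, 0]
# Hypothetical confidence in each validation status, between 0 and 1.
confidence = [1.0, 0.9, 1.0, 0.6, 0.8, 0.3]

w = fit_weighted_logistic(X, y, confidence)
print([round(wj, 3) for wj in w])
```

<p>In this sketch a site with zero confidence has no influence on the fit, and setting every confidence to one recovers the ordinary unweighted logistic fit.</p>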
<p>In practice, an effective validation strategy is essential for building a successful model. In principle, a training dataset should contain sites that are representative of the wider dataset to which one wishes to apply the estimated model. Therefore, a validation dataset needs to include enough sites to learn the behaviors of the mutation-calling algorithms across a broad spectrum of genomic features. Another important issue is to have impartially validated sites. If validation is not impartial, then the composition of the training dataset is biased, and thus the estimated parameters and the measured performance are also biased.</p>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>Our approaches provide a unified framework for dealing with multiple somatic-mutation callers. If the callers provide only the list of positions detected as mutations, then it is difficult to compare them, or to investigate the tradeoff between the stringency of the calling-procedure and its power to detect true mutations. Our combined caller can be used to overcome these difficulties. It offers an evaluation of its performance across the full range as an ROC curve, and in addition, allows easy comparison with individual callers.</p>
<p>Furthermore, we have shown that it is feasible to build a combined caller that performs better than all the individual callers, and that can be better, even if only slightly, than a caller that combines calls based only on detection status. An even more powerful caller could possibly be built when more features associated with calling performance are available, such as the individual details of the filters used by each caller or a measure of strand bias.</p>
<p>Finally, we demonstrate the potential for building a combined caller using a small validation dataset (generated for a subset of regions or samples in the original study), which can be applied to a wider dataset to assign a confidence measure that can be used for ranking the mutations from multiple callers. Our two mutation datasets, one from the selected 20 patients and the other from 243 genes of interest across 174 patients, share protocols (sample preparation, sequencing technology, alignment methods, and the applied mutation-calling algorithms) but differ in the genomic regions and in the tumor and normal samples used for calling. The results from training the model on one of the datasets and evaluating it on the other suggest that the models estimated from these validation datasets are generally applicable to the mutations from the whole exomes of all 194 endometrial patients.</p>
</sec>
<sec>
<title>Abbreviations</title>
<p>TCGA: The Cancer Genome Atlas; SNV: Single nucleotide variant; FWLS: Feature-weighted linear stacking; FP: False positive; FN: False negative; TP: True positive; ROC: Receiver operating characteristic.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>SYK participated in the design of the study, carried out statistical analyses and drafted the manuscript. LJ participated in the design of the study, and drafted the manuscript. TPS conceived the study, participated in its design and helped to draft the manuscript. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>Software package.</bold>
A .tar.gz file that contains R scripts and example datasets to illustrate our approaches. The package also includes a manual file (pdf) explaining how to run the R scripts.</p>
</caption>
<media xlink:href="1471-2105-15-154-S1.zip">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional file 2</title>
<p>
<bold>Supplementary information.</bold>
A .pdf file including Supplementary Methods, Tables and Figures.</p>
</caption>
<media xlink:href="1471-2105-15-154-S2.pdf">
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We thank the TCGA mutation calling group. Special thanks go to David Haussler, Li Ding, David Wheeler, and Gad Getz for their leadership, and to Singer Ma, Cyriac Kandoth, and Kyle Chang for generating the mutation-calling outputs, particularly to Cyriac Kandoth for compiling the mutation outputs as well as the validation data. We would also like to thank Heidi Sofia and Kenna Shaw for coordination and valuable feedback, Paul Spellman for sharing computational facilities, and the members of the Speed lab for discussion and valuable comments.</p>
<p>The results published here are based upon data generated by The Cancer Genome Atlas project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at
<ext-link ext-link-type="uri" xlink:href="http://cancergenome.nih.gov">http://cancergenome.nih.gov</ext-link>
.</p>
</sec>
<sec>
<title>Funding</title>
<p>We gratefully acknowledge support from NIH grant 5 U24 CA143799-04.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Hansen</surname>
<given-names>NF</given-names>
</name>
<name>
<surname>Gartner</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Mei</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Samuels</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Mullikin</surname>
<given-names>JC</given-names>
</name>
<article-title>
<bold>Shimmer: detection of genetic alterations in tumors using next-generation sequence data</bold>
</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>12</issue>
<fpage>1498</fpage>
<lpage>1503</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt183</pub-id>
<pub-id pub-id-type="pmid">23620360</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Cibulskis</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Lawrence</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Carter</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Sivachenko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Jaffe</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sougnez</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gabriel</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Meyerson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Getz</surname>
<given-names>G</given-names>
</name>
<article-title>
<bold>Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples</bold>
</article-title>
<source>Nat Biotechnol</source>
<year>2013</year>
<volume>31</volume>
<issue>3</issue>
<fpage>213</fpage>
<lpage>219</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2514</pub-id>
<pub-id pub-id-type="pmid">23396013</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Saunders</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Swamy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Becq</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Murray</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Cheetham</surname>
<given-names>RK</given-names>
</name>
<article-title>
<bold>Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs</bold>
</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>14</issue>
<fpage>1811</fpage>
<lpage>1817</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts271</pub-id>
<pub-id pub-id-type="pmid">22581179</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Ding</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bashashati</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Roth</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Oloumi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tse</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Haffari</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Marra</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Condon</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Aparicio</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>SP</given-names>
</name>
<article-title>
<bold>Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data</bold>
</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>2</issue>
<fpage>167</fpage>
<lpage>175</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr629</pub-id>
<pub-id pub-id-type="pmid">22084253</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Roth</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Morin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Crisan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Giuliany</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bashashati</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Turashvili</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Oloumi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Marra</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Aparicio</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>SP</given-names>
</name>
<article-title>
<bold>JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data</bold>
</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>7</issue>
<fpage>907</fpage>
<lpage>913</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts053</pub-id>
<pub-id pub-id-type="pmid">22285562</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Larson</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>CC</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Koboldt</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Abbott</surname>
<given-names>TE</given-names>
</name>
<name>
<surname>Dooling</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Ley</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Mardis</surname>
<given-names>ER</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>L</given-names>
</name>
<article-title>
<bold>SomaticSniper: identification of somatic point mutations in whole genome sequencing data</bold>
</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>3</issue>
<fpage>311</fpage>
<lpage>317</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr665</pub-id>
<pub-id pub-id-type="pmid">22155872</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Lower</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Renard</surname>
<given-names>BY</given-names>
</name>
<name>
<surname>de Graaf</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wagner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Paret</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kneip</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Tureci</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Diken</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Britten</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kreiter</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Koslowski</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Castle</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Sahin</surname>
<given-names>U</given-names>
</name>
<article-title>
<bold>Confidence-based somatic mutation evaluation and prioritization</bold>
</article-title>
<source>PLoS Comput Biol</source>
<year>2012</year>
<volume>8</volume>
<issue>9</issue>
<fpage>1002714</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1002714</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="book">
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>J</given-names>
</name>
<source>The Elements of Statistical Learning</source>
<year>2009</year>
<publisher-name>New York: Springer</publisher-name>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Breiman</surname>
<given-names>L</given-names>
</name>
<article-title>
<bold>Stacked regressions</bold>
</article-title>
<source>Mach Learn</source>
<year>1996</year>
<volume>24</volume>
<issue>1</issue>
<fpage>49</fpage>
<lpage>64</lpage>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<collab>The Cancer Genome Atlas Research Network</collab>
<article-title>
<bold>Integrated genomic characterization of endometrial carcinoma</bold>
</article-title>
<source>Nature</source>
<year>2013</year>
<volume>497</volume>
<issue>7447</issue>
<fpage>67</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1038/nature12113</pub-id>
<pub-id pub-id-type="pmid">23636398</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Handsaker</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wysoker</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Fennell</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Homer</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Marth</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Abecasis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>
<bold>The Sequence Alignment/Map format and SAMtools</bold>
</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>16</issue>
<fpage>2078</fpage>
<lpage>2079</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp352</pub-id>
<pub-id pub-id-type="pmid">19505943</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Wolpert</surname>
<given-names>DH</given-names>
</name>
<article-title>
<bold>Stacked generalization</bold>
</article-title>
<source>Neural Netw</source>
<year>1992</year>
<volume>5</volume>
<fpage>241</fpage>
<lpage>259</lpage>
<pub-id pub-id-type="doi">10.1016/S0893-6080(05)80023-1</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="other">
<name>
<surname>Sill</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Takács</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mackey</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>D</given-names>
</name>
<article-title>
<bold>Feature-weighted linear stacking</bold>
</article-title>
<source>CoRR</source>
<year>2009</year>
<comment>
<bold>abs/0911.0460</bold>
. [
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/0911.0460">http://arxiv.org/abs/0911.0460</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<article-title>
<bold>Regression shrinkage and selection via the lasso</bold>
</article-title>
<source>J R Stat Soc B</source>
<year>1996</year>
<volume>58</volume>
<issue>1</issue>
<fpage>267</fpage>
<lpage>288</lpage>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Friedman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<article-title>
<bold>Regularization paths for generalized linear models via coordinate descent</bold>
</article-title>
<source>J Stat Softw</source>
<year>2010</year>
<volume>33</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">20808728</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>


This area was generated with Dilib version V0.6.33.
Data generation: Tue Dec 5 10:43:12 2017. Site generation: Tue Mar 5 14:07:20 2024