Exploration server on open access in Belgium

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000409 (Pmc/Corpus); previous: 0004089; next: 0004100


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques</title>
<author>
<name sortKey="Duitama, Jorge" sort="Duitama, Jorge" uniqKey="Duitama J" first="Jorge" last="Duitama">Jorge Duitama</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="gkr1042-AFF1">VIB Laboratory of Systems Biology &amp; Laboratory for Genetics and Genomics, Center of Microbial and Plant Genetics, K.U.Leuven, Gaston Geenslaan 1, B-3001 Leuven (Heverlee), Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mcewen, Gayle K" sort="Mcewen, Gayle K" uniqKey="Mcewen G" first="Gayle K." last="Mcewen">Gayle K. McEwen</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huebsch, Thomas" sort="Huebsch, Thomas" uniqKey="Huebsch T" first="Thomas" last="Huebsch">Thomas Huebsch</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Palczewski, Stefanie" sort="Palczewski, Stefanie" uniqKey="Palczewski S" first="Stefanie" last="Palczewski">Stefanie Palczewski</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Schulz, Sabrina" sort="Schulz, Sabrina" uniqKey="Schulz S" first="Sabrina" last="Schulz">Sabrina Schulz</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Verstrepen, Kevin" sort="Verstrepen, Kevin" uniqKey="Verstrepen K" first="Kevin" last="Verstrepen">Kevin Verstrepen</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">VIB Laboratory of Systems Biology &amp; Laboratory for Genetics and Genomics, Center of Microbial and Plant Genetics, K.U.Leuven, Gaston Geenslaan 1, B-3001 Leuven (Heverlee), Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Suk, Eun Kyung" sort="Suk, Eun Kyung" uniqKey="Suk E" first="Eun-Kyung" last="Suk">Eun-Kyung Suk</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hoehe, Margret R" sort="Hoehe, Margret R" uniqKey="Hoehe M" first="Margret R." last="Hoehe">Margret R. Hoehe</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22102577</idno>
<idno type="pmc">3299995</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3299995</idno>
<idno type="RBID">PMC:3299995</idno>
<idno type="doi">10.1093/nar/gkr1042</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000409</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques</title>
<author>
<name sortKey="Duitama, Jorge" sort="Duitama, Jorge" uniqKey="Duitama J" first="Jorge" last="Duitama">Jorge Duitama</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="gkr1042-AFF1">VIB Laboratory of Systems Biology &amp; Laboratory for Genetics and Genomics, Center of Microbial and Plant Genetics, K.U.Leuven, Gaston Geenslaan 1, B-3001 Leuven (Heverlee), Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mcewen, Gayle K" sort="Mcewen, Gayle K" uniqKey="Mcewen G" first="Gayle K." last="Mcewen">Gayle K. McEwen</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huebsch, Thomas" sort="Huebsch, Thomas" uniqKey="Huebsch T" first="Thomas" last="Huebsch">Thomas Huebsch</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Palczewski, Stefanie" sort="Palczewski, Stefanie" uniqKey="Palczewski S" first="Stefanie" last="Palczewski">Stefanie Palczewski</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Schulz, Sabrina" sort="Schulz, Sabrina" uniqKey="Schulz S" first="Sabrina" last="Schulz">Sabrina Schulz</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Verstrepen, Kevin" sort="Verstrepen, Kevin" uniqKey="Verstrepen K" first="Kevin" last="Verstrepen">Kevin Verstrepen</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">VIB Laboratory of Systems Biology &amp; Laboratory for Genetics and Genomics, Center of Microbial and Plant Genetics, K.U.Leuven, Gaston Geenslaan 1, B-3001 Leuven (Heverlee), Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Suk, Eun Kyung" sort="Suk, Eun Kyung" uniqKey="Suk E" first="Eun-Kyung" last="Suk">Eun-Kyung Suk</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hoehe, Margret R" sort="Hoehe, Margret R" uniqKey="Hoehe M" first="Margret R." last="Hoehe">Margret R. Hoehe</name>
<affiliation>
<nlm:aff id="gkr1042-AFF1">Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Drysdale, Cm" uniqKey="Drysdale C">CM Drysdale</name>
</author>
<author>
<name sortKey="Mcgraw, Dw" uniqKey="Mcgraw D">DW McGraw</name>
</author>
<author>
<name sortKey="Stack, Cb" uniqKey="Stack C">CB Stack</name>
</author>
<author>
<name sortKey="Stephens, Jc" uniqKey="Stephens J">JC Stephens</name>
</author>
<author>
<name sortKey="Judson, Rs" uniqKey="Judson R">RS Judson</name>
</author>
<author>
<name sortKey="Nandabalan, K" uniqKey="Nandabalan K">K Nandabalan</name>
</author>
<author>
<name sortKey="Arnold, K" uniqKey="Arnold K">K Arnold</name>
</author>
<author>
<name sortKey="Ruano, G" uniqKey="Ruano G">G Ruano</name>
</author>
<author>
<name sortKey="Liggett, Sb" uniqKey="Liggett S">SB Liggett</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hoehe, Mr" uniqKey="Hoehe M">MR Hoehe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hoehe, Mr" uniqKey="Hoehe M">MR Hoehe</name>
</author>
<author>
<name sortKey="Kopke, K" uniqKey="Kopke K">K Köpke</name>
</author>
<author>
<name sortKey="Wendel, B" uniqKey="Wendel B">B Wendel</name>
</author>
<author>
<name sortKey="Rohde, K" uniqKey="Rohde K">K Rohde</name>
</author>
<author>
<name sortKey="Flachmeier, C" uniqKey="Flachmeier C">C Flachmeier</name>
</author>
<author>
<name sortKey="Kidd, Kk" uniqKey="Kidd K">KK Kidd</name>
</author>
<author>
<name sortKey="Berrettini, Wh" uniqKey="Berrettini W">WH Berrettini</name>
</author>
<author>
<name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tewhey, R" uniqKey="Tewhey R">R Tewhey</name>
</author>
<author>
<name sortKey="Bansal, V" uniqKey="Bansal V">V Bansal</name>
</author>
<author>
<name sortKey="Torkamani, A" uniqKey="Torkamani A">A Torkamani</name>
</author>
<author>
<name sortKey="Topol, Ej" uniqKey="Topol E">EJ Topol</name>
</author>
<author>
<name sortKey="Schork, Nj" uniqKey="Schork N">NJ Schork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marchini, J" uniqKey="Marchini J">J Marchini</name>
</author>
<author>
<name sortKey="Cutler, D" uniqKey="Cutler D">D Cutler</name>
</author>
<author>
<name sortKey="Stephens, M" uniqKey="Stephens M">M Stephens</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Halperin, E" uniqKey="Halperin E">E Halperin</name>
</author>
<author>
<name sortKey="Lin, S" uniqKey="Lin S">S Lin</name>
</author>
<author>
<name sortKey="Qin, Zs" uniqKey="Qin Z">ZS Qin</name>
</author>
<author>
<name sortKey="Munro, Hm" uniqKey="Munro H">HM Munro</name>
</author>
<author>
<name sortKey="Abecasis, Gr" uniqKey="Abecasis G">GR Abecasis</name>
</author>
<author>
<name sortKey="Donnelly, P" uniqKey="Donnelly P">P Donnelly</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Scheet, P" uniqKey="Scheet P">P Scheet</name>
</author>
<author>
<name sortKey="Stephens, M" uniqKey="Stephens M">M Stephens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brinza, D" uniqKey="Brinza D">D Brinza</name>
</author>
<author>
<name sortKey="Zelikovsky, A" uniqKey="Zelikovsky A">A Zelikovsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ma, L" uniqKey="Ma L">L Ma</name>
</author>
<author>
<name sortKey="Xiao, Y" uniqKey="Xiao Y">Y Xiao</name>
</author>
<author>
<name sortKey="Huang, H" uniqKey="Huang H">H Huang</name>
</author>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Rao, W" uniqKey="Rao W">W Rao</name>
</author>
<author>
<name sortKey="Feng, Y" uniqKey="Feng Y">Y Feng</name>
</author>
<author>
<name sortKey="Zhang, K" uniqKey="Zhang K">K Zhang</name>
</author>
<author>
<name sortKey="Song, Q" uniqKey="Song Q">Q Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fan, Hc" uniqKey="Fan H">HC Fan</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author>
<name sortKey="Potanina, A" uniqKey="Potanina A">A Potanina</name>
</author>
<author>
<name sortKey="Quake, Sr" uniqKey="Quake S">SR Quake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Levy, S" uniqKey="Levy S">S Levy</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
<author>
<name sortKey="Ng, Pc" uniqKey="Ng P">PC Ng</name>
</author>
<author>
<name sortKey="Feuk, L" uniqKey="Feuk L">L Feuk</name>
</author>
<author>
<name sortKey="Halpern, Al" uniqKey="Halpern A">AL Halpern</name>
</author>
<author>
<name sortKey="Walenz, Bp" uniqKey="Walenz B">BP Walenz</name>
</author>
<author>
<name sortKey="Axelrod, N" uniqKey="Axelrod N">N Axelrod</name>
</author>
<author>
<name sortKey="Huang, J" uniqKey="Huang J">J Huang</name>
</author>
<author>
<name sortKey="Kirkness, Ef" uniqKey="Kirkness E">EF Kirkness</name>
</author>
<author>
<name sortKey="Denisov, G" uniqKey="Denisov G">G Denisov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bentley, Dr" uniqKey="Bentley D">DR Bentley</name>
</author>
<author>
<name sortKey="Balasubramanian, S" uniqKey="Balasubramanian S">S Balasubramanian</name>
</author>
<author>
<name sortKey="Swerdlow, Hp" uniqKey="Swerdlow H">HP Swerdlow</name>
</author>
<author>
<name sortKey="Smith, Gp" uniqKey="Smith G">GP Smith</name>
</author>
<author>
<name sortKey="Milton, J" uniqKey="Milton J">J Milton</name>
</author>
<author>
<name sortKey="Brown, Cg" uniqKey="Brown C">CG Brown</name>
</author>
<author>
<name sortKey="Hall, Kp" uniqKey="Hall K">KP Hall</name>
</author>
<author>
<name sortKey="Evers, Dj" uniqKey="Evers D">DJ Evers</name>
</author>
<author>
<name sortKey="Barnes, Cl" uniqKey="Barnes C">CL Barnes</name>
</author>
<author>
<name sortKey="Bignell, Hr" uniqKey="Bignell H">HR Bignell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mckernan, Kj" uniqKey="Mckernan K">KJ McKernan</name>
</author>
<author>
<name sortKey="Peckham, He" uniqKey="Peckham H">HE Peckham</name>
</author>
<author>
<name sortKey="Costa, Gl" uniqKey="Costa G">GL Costa</name>
</author>
<author>
<name sortKey="Mclaughlin, Sf" uniqKey="Mclaughlin S">SF McLaughlin</name>
</author>
<author>
<name sortKey="Fu, Y" uniqKey="Fu Y">Y Fu</name>
</author>
<author>
<name sortKey="Tsung, Ef" uniqKey="Tsung E">EF Tsung</name>
</author>
<author>
<name sortKey="Clouser, Cr" uniqKey="Clouser C">CR Clouser</name>
</author>
<author>
<name sortKey="Duncan, C" uniqKey="Duncan C">C Duncan</name>
</author>
<author>
<name sortKey="Ichikawa, Jk" uniqKey="Ichikawa J">JK Ichikawa</name>
</author>
<author>
<name sortKey="Lee, Cc" uniqKey="Lee C">CC Lee</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Snyder, M" uniqKey="Snyder M">M Snyder</name>
</author>
<author>
<name sortKey="Du, J" uniqKey="Du J">J Du</name>
</author>
<author>
<name sortKey="Gerstein, M" uniqKey="Gerstein M">M Gerstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burgtorf, C" uniqKey="Burgtorf C">C Burgtorf</name>
</author>
<author>
<name sortKey="Kepper, P" uniqKey="Kepper P">P Kepper</name>
</author>
<author>
<name sortKey="Hoehe, Mr" uniqKey="Hoehe M">MR Hoehe</name>
</author>
<author>
<name sortKey="Schmitt, C" uniqKey="Schmitt C">C Schmitt</name>
</author>
<author>
<name sortKey="Reinhardt, R" uniqKey="Reinhardt R">R Reinhardt</name>
</author>
<author>
<name sortKey="Lehrach, H" uniqKey="Lehrach H">H Lehrach</name>
</author>
<author>
<name sortKey="Sauer, S" uniqKey="Sauer S">S Sauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kitzman, Jo" uniqKey="Kitzman J">JO Kitzman</name>
</author>
<author>
<name sortKey="Mackenzie, Ap" uniqKey="Mackenzie A">AP MacKenzie</name>
</author>
<author>
<name sortKey="Adey, A" uniqKey="Adey A">A Adey</name>
</author>
<author>
<name sortKey="Hiatt, Jb" uniqKey="Hiatt J">JB Hiatt</name>
</author>
<author>
<name sortKey="Patwardhan, Rp" uniqKey="Patwardhan R">RP Patwardhan</name>
</author>
<author>
<name sortKey="Sudmant, Ph" uniqKey="Sudmant P">PH Sudmant</name>
</author>
<author>
<name sortKey="Ng, Sb" uniqKey="Ng S">SB Ng</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Qiu, R" uniqKey="Qiu R">R Qiu</name>
</author>
<author>
<name sortKey="Eichler, Ee" uniqKey="Eichler E">EE Eichler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Suk, E" uniqKey="Suk E">E Suk</name>
</author>
<author>
<name sortKey="Mcewen, Gk" uniqKey="Mcewen G">GK McEwen</name>
</author>
<author>
<name sortKey="Duitama, J" uniqKey="Duitama J">J Duitama</name>
</author>
<author>
<name sortKey="Nowick, K" uniqKey="Nowick K">K Nowick</name>
</author>
<author>
<name sortKey="Schulz, S" uniqKey="Schulz S">S Schulz</name>
</author>
<author>
<name sortKey="Palczewski, S" uniqKey="Palczewski S">S Palczewski</name>
</author>
<author>
<name sortKey="Schreiber, S" uniqKey="Schreiber S">S Schreiber</name>
</author>
<author>
<name sortKey="Holloway, Dt" uniqKey="Holloway D">DT Holloway</name>
</author>
<author>
<name sortKey="Mclaughlin, Sf" uniqKey="Mclaughlin S">SF McLaughlin</name>
</author>
<author>
<name sortKey="Peckham, He" uniqKey="Peckham H">HE Peckham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Panconesi, A" uniqKey="Panconesi A">A Panconesi</name>
</author>
<author>
<name sortKey="Sozio, M" uniqKey="Sozio M">M Sozio</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rizzi, R" uniqKey="Rizzi R">R Rizzi</name>
</author>
<author>
<name sortKey="Bafna, V" uniqKey="Bafna V">V Bafna</name>
</author>
<author>
<name sortKey="Istrail, S" uniqKey="Istrail S">S Istrail</name>
</author>
<author>
<name sortKey="Lancia, G" uniqKey="Lancia G">G Lancia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bansal, V" uniqKey="Bansal V">V Bansal</name>
</author>
<author>
<name sortKey="Bafna, V" uniqKey="Bafna V">V Bafna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lo, C" uniqKey="Lo C">C Lo</name>
</author>
<author>
<name sortKey="Bashir, A" uniqKey="Bashir A">A Bashir</name>
</author>
<author>
<name sortKey="Bansal, V" uniqKey="Bansal V">V Bansal</name>
</author>
<author>
<name sortKey="Bafna, V" uniqKey="Bafna V">V Bafna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Duitama, J" uniqKey="Duitama J">J Duitama</name>
</author>
<author>
<name sortKey="Huebsch, T" uniqKey="Huebsch T">T Huebsch</name>
</author>
<author>
<name sortKey="Mcewen, G" uniqKey="Mcewen G">G McEwen</name>
</author>
<author>
<name sortKey="Suk, E" uniqKey="Suk E">E Suk</name>
</author>
<author>
<name sortKey="Hoehe, Mr" uniqKey="Hoehe M">MR Hoehe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cilibrasi, R" uniqKey="Cilibrasi R">R Cilibrasi</name>
</author>
<author>
<name sortKey="Iersel, Lv" uniqKey="Iersel L">LV Iersel</name>
</author>
<author>
<name sortKey="Kelk, S" uniqKey="Kelk S">S Kelk</name>
</author>
<author>
<name sortKey="Tromp, J" uniqKey="Tromp J">J Tromp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Geraci, F" uniqKey="Geraci F">F Geraci</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="He, D" uniqKey="He D">D He</name>
</author>
<author>
<name sortKey="Choi, A" uniqKey="Choi A">A Choi</name>
</author>
<author>
<name sortKey="Pipatsrisawat, K" uniqKey="Pipatsrisawat K">K Pipatsrisawat</name>
</author>
<author>
<name sortKey="Darwiche, A" uniqKey="Darwiche A">A Darwiche</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Duitama, J" uniqKey="Duitama J">J Duitama</name>
</author>
<author>
<name sortKey="Srivastava, Pk" uniqKey="Srivastava P">PK Srivastava</name>
</author>
<author>
<name sortKey="M Ndoiu, Ii" uniqKey="M Ndoiu I">II Măndoiu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sahni, S" uniqKey="Sahni S">S Sahni</name>
</author>
<author>
<name sortKey="Gonzales, T" uniqKey="Gonzales T">T Gonzales</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Wu, L" uniqKey="Wu L">L Wu</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Wang, R" uniqKey="Wang R">R Wang</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Feng, E" uniqKey="Feng E">E Feng</name>
</author>
<author>
<name sortKey="Wang, R" uniqKey="Wang R">R Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
<author>
<name sortKey="Fu, B" uniqKey="Fu B">B Fu</name>
</author>
<author>
<name sortKey="Schweller, R" uniqKey="Schweller R">R Schweller</name>
</author>
<author>
<name sortKey="Yang, B" uniqKey="Yang B">B Yang</name>
</author>
<author>
<name sortKey="Zhao, Z" uniqKey="Zhao Z">Z Zhao</name>
</author>
<author>
<name sortKey="Zhu, B" uniqKey="Zhu B">B Zhu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Genovese, Lm" uniqKey="Genovese L">LM Genovese</name>
</author>
<author>
<name sortKey="Geraci, F" uniqKey="Geraci F">F Geraci</name>
</author>
<author>
<name sortKey="Pellegrini, M" uniqKey="Pellegrini M">M Pellegrini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Adzhubei, Ia" uniqKey="Adzhubei I">IA Adzhubei</name>
</author>
<author>
<name sortKey="Schmidt, S" uniqKey="Schmidt S">S Schmidt</name>
</author>
<author>
<name sortKey="Peshkin, L" uniqKey="Peshkin L">L Peshkin</name>
</author>
<author>
<name sortKey="Ramensky, Ve" uniqKey="Ramensky V">VE Ramensky</name>
</author>
<author>
<name sortKey="Gerasimova, A" uniqKey="Gerasimova A">A Gerasimova</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
<author>
<name sortKey="Kondrashov, As" uniqKey="Kondrashov A">AS Kondrashov</name>
</author>
<author>
<name sortKey="Sunyaev, Sr" uniqKey="Sunyaev S">SR Sunyaev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schaid, Dj" uniqKey="Schaid D">DJ Schaid</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosenbloom, Kr" uniqKey="Rosenbloom K">KR Rosenbloom</name>
</author>
<author>
<name sortKey="Dreszer, Tr" uniqKey="Dreszer T">TR Dreszer</name>
</author>
<author>
<name sortKey="Pheasant, M" uniqKey="Pheasant M">M Pheasant</name>
</author>
<author>
<name sortKey="Barber, Gp" uniqKey="Barber G">GP Barber</name>
</author>
<author>
<name sortKey="Meyer, Lr" uniqKey="Meyer L">LR Meyer</name>
</author>
<author>
<name sortKey="Pohl, A" uniqKey="Pohl A">A Pohl</name>
</author>
<author>
<name sortKey="Raney, Bj" uniqKey="Raney B">BJ Raney</name>
</author>
<author>
<name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
<author>
<name sortKey="Hinrichs, As" uniqKey="Hinrichs A">AS Hinrichs</name>
</author>
<author>
<name sortKey="Zweig, As" uniqKey="Zweig A">AS Zweig</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huda, A" uniqKey="Huda A">A Huda</name>
</author>
<author>
<name sortKey="Bowen, Nj" uniqKey="Bowen N">NJ Bowen</name>
</author>
<author>
<name sortKey="Conley, Ab" uniqKey="Conley A">AB Conley</name>
</author>
<author>
<name sortKey="Jordan, Ik" uniqKey="Jordan I">IK Jordan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lunshof, Je" uniqKey="Lunshof J">JE Lunshof</name>
</author>
<author>
<name sortKey="Bobe, J" uniqKey="Bobe J">J Bobe</name>
</author>
<author>
<name sortKey="Aach, J" uniqKey="Aach J">J Aach</name>
</author>
<author>
<name sortKey="Angrist, M" uniqKey="Angrist M">M Angrist</name>
</author>
<author>
<name sortKey="Thakuria, Jv" uniqKey="Thakuria J">JV Thakuria</name>
</author>
<author>
<name sortKey="Vorhaus, Db" uniqKey="Vorhaus D">DB Vorhaus</name>
</author>
<author>
<name sortKey="Hoehe, Mr" uniqKey="Hoehe M">MR Hoehe</name>
</author>
<author>
<name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22102577</article-id>
<article-id pub-id-type="pmc">3299995</article-id>
<article-id pub-id-type="doi">10.1093/nar/gkr1042</article-id>
<article-id pub-id-type="publisher-id">gkr1042</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genomics</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Duitama</surname>
<given-names>Jorge</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="gkr1042-COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>McEwen</surname>
<given-names>Gayle K.</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Huebsch</surname>
<given-names>Thomas</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Palczewski</surname>
<given-names>Stefanie</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Schulz</surname>
<given-names>Sabrina</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Verstrepen</surname>
<given-names>Kevin</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Suk</surname>
<given-names>Eun-Kyung</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hoehe</surname>
<given-names>Margret R.</given-names>
</name>
<xref ref-type="aff" rid="gkr1042-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="gkr1042-COR1">*</xref>
</contrib>
</contrib-group>
<aff id="gkr1042-AFF1">
<sup>1</sup>
Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany,
<sup>2</sup>
VIB Laboratory of Systems Biology &amp; Laboratory for Genetics and Genomics, Center of Microbial and Plant Genetics, K.U.Leuven, Gaston Geenslaan 1, B-3001 Leuven (Heverlee), Belgium</aff>
<author-notes>
<corresp id="gkr1042-COR1">*To whom correspondence should be addressed. Tel:
<phone>+49 30 8413 1468</phone>
; Fax:
<fax>+49 30 8413 1462</fax>
; Email:
<email>hoehe@molgen.mpg.de</email>
</corresp>
<corresp>Correspondence can also be addressed to Jorge Duitama. Tel:
<phone>+32 1675 1402</phone>
; Fax:
<fax>+32 1675 1391</fax>
; Email:
<email>Jorge.DuitamaCastellanos@biw.vib-kuleuven.be</email>
</corresp>
<fn>
<p>The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.</p>
</fn>
</author-notes>
<pmc-comment>For NAR both ppub and collection dates generated for PMC processing 1/27/05 beck</pmc-comment>
<pub-date pub-type="collection">
<month>3</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="ppub">
<month>3</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>17</day>
<month>11</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>17</day>
<month>11</month>
<year>2011</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>40</volume>
<issue>5</issue>
<fpage>2041</fpage>
<lpage>2053</lpage>
<history>
<date date-type="received">
<day>10</day>
<month>8</month>
<year>2011</year>
</date>
<date date-type="rev-recd">
<day>4</day>
<month>10</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>10</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2011. Published by Oxford University Press.</copyright-statement>
<copyright-year>2011</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.</p>
</abstract>
<counts>
<page-count count="13"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>INTRODUCTION</title>
<p>Human individuals are diploid, with each somatic cell containing two sets of chromosomes, one from each parent. However, current standard sequencing technologies provide mostly mixed-diploid readout, missing intrinsic information on the unique haploid structures of each individual chromosome. This limits the description, analysis and interpretation of individual genomes and their function. In view of abundant genome sequence variability within a diploid genome, it is essential to determine the specific combinations of variants for each of the two homologous chromosomes (haplotypes). Knowledge of phase may be key to understanding the relationships between genetic variation and gene function, phenotype, and medically relevant traits such as susceptibility to disease and individual response to drugs (
<xref ref-type="bibr" rid="gkr1042-B1">1</xref>
–
<xref ref-type="bibr" rid="gkr1042-B4">4</xref>
).</p>
<p>To resolve the underlying haplotype sequences of individual genomes, both computational and experimental approaches have been developed. Computational approaches to haplotyping are assumption-based (
<xref ref-type="bibr" rid="gkr1042-B5">5</xref>
,
<xref ref-type="bibr" rid="gkr1042-B6">6</xref>
) and require genotypic data from entire populations or trios to predict the most likely haplotypes for an individual. In the case of population-based statistical phasing, phase can be determined at common SNP positions but not for rare and novel SNPs. Also, the quality of this phasing is lower than that of other methods, especially in regions with low linkage disequilibrium. Trio-based phasing (
<xref ref-type="bibr" rid="gkr1042-B5">5</xref>
,
<xref ref-type="bibr" rid="gkr1042-B7">7</xref>
) is generally accurate but unable to phase variants for which both parents are heterozygous (~20% of SNPs). Experimental techniques that attempt to physically separate entire (homologous) chromosomes, such as chromosome micro-dissection (
<xref ref-type="bibr" rid="gkr1042-B8">8</xref>
) or micro-fluidic separation (
<xref ref-type="bibr" rid="gkr1042-B9">9</xref>
) should provide accurate results, but are still very challenging. Thus, no complete whole genome haplotypes have yet been resolved by such a method. A more feasible alternative strategy is to perform shotgun sequencing of an entire genome and then attempt to assemble long, contiguous haplotypes using the heterozygous variant positions within overlapping sequenced fragments. Fragments must be long enough to span at least two heterozygous loci, providing evidence for co-occurrence of alleles on the same chromosome. This approach was taken to assemble the genome of J. Craig Venter, using Sanger sequencing of mate-paired reads (
<xref ref-type="bibr" rid="gkr1042-B10">10</xref>
). However, Sanger sequencing is cost-intensive and in this case only allowed the reconstruction of partial haplotypes with an N50 length close to 300 kb. Specifically, in the context of this article, N50 is defined as the phased block length such that blocks of equal or longer lengths cover half the bases of the total phased portion of the genome. Next-generation sequencing (NGS) technologies provide a cost-effective way to assemble diploid genomes (
<xref ref-type="bibr" rid="gkr1042-B11">11</xref>
,
<xref ref-type="bibr" rid="gkr1042-B12">12</xref>
) but such technologies fail to directly deliver the information required, mainly because reads are too short to cover more than one heterozygous position (
<xref ref-type="bibr" rid="gkr1042-B13">13</xref>
). To provide sequence fragments long enough to assemble large segments of homologous chromosomes, we developed a fosmid pool-based approach to whole genome haplotype analysis (
<xref ref-type="bibr" rid="gkr1042-B14">14</xref>
). This technique yields haploid DNA segments significantly larger than any other standard shotgun sequencing technology (40 kb fosmids) and when used in conjunction with NGS provides a scalable shotgun sequencing technique for individual whole-genome haplotyping [E.-K. Suk
<italic>et al.</italic>
, 2008, Personal Genomes, abstract, (
<xref ref-type="bibr" rid="gkr1042-B15">15</xref>
,
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
)]. Fragments of this size are likely to span several heterozygous variants and can be tiled into large contiguous haplotypes based on identical alleles within regions of overlap. A schematic overview of this method as outlined in detail by Suk
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
) is provided in
<xref ref-type="fig" rid="gkr1042-F1">Figure 1</xref>
. Fosmid-based haplotyping was used to achieve N50 block lengths of about 300 kb (
<xref ref-type="bibr" rid="gkr1042-B15">15</xref>
), similar to the Venter genome (
<xref ref-type="bibr" rid="gkr1042-B10">10</xref>
) and we were able to achieve blocks of almost 1 Mb covering 99% of SNPs in the genome of a European individual (
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
). Although these are not complete chromosomal haplotypes, they are long enough to be used for many practical applications.
<fig id="gkr1042-F1" position="float">
<label>Figure 1.</label>
<caption>
<p>Fosmid pool-based NGS approach to haplotype-resolve whole genomes (
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
). (
<bold>A</bold>
) Diploid genomic DNA of an individual is used to generate approximately 1.5 million fosmid clones, and (
<bold>B</bold>
) partitioned into pools of 15 000 fosmids, each covering about 15% of the genome in 40-kb haploid DNA segments. (
<bold>C</bold>
) Fosmid pools are sequenced using NGS. Here only three pools are shown as an example. (
<bold>D</bold>
) Fosmids are mapped to the genome and positions of heterozygous variants detected. (
<bold>E</bold>
) Single Individual Haplotyping is used to separate fragments into the two underlying haplotypes based on allelic identity at overlapping positions. With low coverage fosmid data, the presence of fosmids on only one haplotype can be used to inform the phase, given accurate SNP calling data. (
<bold>F</bold>
) Long contiguous haplotype blocks are generated, covering the entire genome.</p>
</caption>
<graphic xlink:href="gkr1042f1"></graphic>
</fig>
</p>
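<p>The N50 definition used above can be illustrated with a minimal sketch. The function name and the example block lengths are hypothetical and are not taken from the study; the code only restates the definition given in the text.</p>
<preformat>
```python
def n50(block_lengths):
    """Return the N50 of a list of phased block lengths.

    N50 is the length L such that blocks of length L or longer
    together cover at least half of the total phased bases.
    """
    total = sum(block_lengths)
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # empty input


# Hypothetical block lengths (in kb), chosen only for illustration.
example = [500, 300, 200, 100, 50]
print(n50(example))  # prints 300
```
</preformat>
<p>In this toy example the two largest blocks (500 and 300) together cover 800 of 1150 units, which is the first time at least half of the total is reached, so the N50 is 300.</p>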
<p>The computational problem of reconstructing haplotypes from fragments generated by sequencing is known as Single Individual Haplotyping (SIH), and has been studied from the theoretical perspective for more than 10 years (
<xref ref-type="bibr" rid="gkr1042-B17">17</xref>
,
<xref ref-type="bibr" rid="gkr1042-B18">18</xref>
). In brief, for each chromosome, the two alleles of each heterozygous variant are encoded as 0 and 1 and fragments mapping to that chromosome (fosmids in this case) are aligned as rows of a matrix
<italic>M</italic>
with as many columns as heterozygous variants. Any algorithm aiming to solve this problem has two major tasks: (i) split the fragments (rows of
<italic>M</italic>
) into two disjoint sets such that two fragments extracted from the same chromosome copy belong to the same set; and (ii) group all allele calls belonging to the same chromosome copy to reconstruct the final haplotypes. The outcome of this technique when using real sequencing data is a set of haplotype blocks, where each block contains variants that can be linked together by one or more fragments. The number and composition of blocks depend solely on the information from the fragment matrix and can even be calculated before solving SIH (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
,
<xref ref-type="bibr" rid="gkr1042-B20">20</xref>
). Simulations indicate that longer fragment lengths are able to link more variants with the same coverage (
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
). If fragments could be sequenced without errors, the solution for SIH within each block would be straightforward. Overlapping fragments would be assigned to the same group if they are equal and to different groups if they differ, and subsequently haplotypes could be assembled by simple consensus. However, sequencing errors and uncalled variants make the problem computationally difficult (
<xref ref-type="bibr" rid="gkr1042-B22">22</xref>
), giving rise to a wide variety of problem formulations and algorithms (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
,
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
,
<xref ref-type="bibr" rid="gkr1042-B23">23</xref>
,
<xref ref-type="bibr" rid="gkr1042-B24">24</xref>
). Most of these algorithms aim to find haplotypes that minimize the number of allele calls that have to be corrected in the input matrix to make it consistent (which gives rise to the metric Minimum number of Entries to Correct, MEC). For this reason, SIH can also be seen as an error correction problem (
<xref ref-type="bibr" rid="gkr1042-B20">20</xref>
). Currently, due to lack of real sequence data for testing, most comparisons between algorithms have been carried out on simulated fragments (
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
,
<xref ref-type="bibr" rid="gkr1042-B23">23</xref>
) with MEC generally being used to assess quality of the haplotypes under the assumption that lower MEC implies better quality (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
,
<xref ref-type="bibr" rid="gkr1042-B20">20</xref>
). Real data currently exists for the Venter genome (
<xref ref-type="bibr" rid="gkr1042-B10">10</xref>
), a Gujarati individual (
<xref ref-type="bibr" rid="gkr1042-B15">15</xref>
) and a European genome (
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
) but for none of these is a validated haplotype available to assess the accuracy of the resulting haplotypes. Quality assessment was therefore done indirectly, by comparing output haplotypes with HapMap haplotypes of the population of Utah residents with ancestry from northern and western Europe (CEU) (
<xref ref-type="bibr" rid="gkr1042-B25">25</xref>
).</p>
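<p>As an illustration of the fragment matrix and the MEC metric described above, the following minimal sketch computes the MEC of a candidate haplotype. The string encoding of the matrix, the function names and the toy data are assumptions made for this example; they are not taken from the authors' software.</p>
<preformat>
```python
def mismatches(fragment, haplotype):
    """Count positions where a fragment call disagrees with a haplotype."""
    return sum(
        1
        for call, allele in zip(fragment, haplotype)
        if call != "-" and allele != "-" and call != allele
    )


def complement(haplotype):
    """Flip every allele; uncalled positions ('-') stay uncalled."""
    flip = {"0": "1", "1": "0", "-": "-"}
    return "".join(flip[a] for a in haplotype)


def mec(matrix, haplotype):
    """Number of entries to correct so that every fragment agrees
    with either the haplotype or its complement."""
    other = complement(haplotype)
    return sum(
        min(mismatches(row, haplotype), mismatches(row, other))
        for row in matrix
    )


# Toy fragment matrix: rows are fragments, columns are heterozygous variants.
M = [
    "010--",
    "-101-",
    "0100-",  # the last call conflicts with the other fragments
    "--010",
]
print(mec(M, "01010"))  # prints 1
```
</preformat>
<p>Only one entry (the last call of the third fragment) has to be corrected for this candidate, whereas the all-zero haplotype would require four corrections on the same matrix.</p>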
<p>In this work we generated whole genome fosmid sequence data for NA12878, a HapMap trio child from the CEU population, providing molecular contiguity over 40 kb haploid DNA segments. Confident trio-based phasing of about 80% of the SNPs for which NA12878 is heterozygous has been provided as part of the 1000 Genomes Project. Using this trio-based haplotype as a gold standard, we can directly assess both the validity of our fosmid pool-based NGS approach to haplotype-resolve whole genomes and the accuracy of SIH algorithms for assembly using real (molecular) sequence data. Specifically, we implemented and compared eight published SIH algorithms, including our own algorithm ReFHap (
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
). We provide, for the first time, solid evidence that fosmid pool-based whole genome haplotyping can deliver highly accurate results even at low fosmid coverages. We examine current quality metrics and propose alternative ones to compare different algorithms for SIH. In particular, we find that minimizing MEC does not guarantee finding the true haplotypes and that lower MEC solutions do not imply better quality haplotypes. This justifies the use of efficient heuristic algorithms such as ReFHap to assemble confident haplotypes, and indeed we find that ReFHap delivers the highest quality haplotypes of all algorithms compared, in a computationally efficient manner. We provide publicly available implementations of several alternative fast heuristics for SIH, including ReFHap, under the GPL license (see Web Resources). Finally, we have expanded the haplotypes for NA12878 to almost the full set of SNPs detected by the 1000 Genomes Project by combining haplotypes assembled by fosmid pool-based NGS with the haplotypes obtained by trio phasing. These near-to-complete haplotypes define a new gold standard, which can be used for further advances in experimental and computational methods.</p>
</sec>
<sec sec-type="materials|methods">
<title>MATERIALS AND METHODS</title>
<sec>
<title>Generation of fosmid pool-based NGS data for NA12878</title>
<p>We have applied our fosmid pool-based NGS approach, which has previously been described in detail (
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
), to generate whole genome fosmid sequence data from NA12878 as the input for analyses. As indicated above, NA12878, a HapMap trio child, has undergone deep resequencing as part of the 1000 Genomes Project, and therefore provides a gold standard as reference for analysis. Independent molecular haplotype resolution of NA12878 offers potential synergy with genetic variation studies in this context, particularly to assist validation and inform development of new approaches for using shotgun short-read data, especially within complex genomic regions. NA12878 is available as a lymphoblastoid cell line (GM12878), generated from the DNA of a female donor with Northern and Western European ancestry. To haplotype-resolve the genome of NA12878, about 1.44 million fosmids were generated using a modified version of our previously described protocol (
<xref ref-type="bibr" rid="gkr1042-B14">14</xref>
,
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
). Briefly, particular modifications included selection of two distinct sizes of haploid DNA inserts (33–38 kb and 38–45 kb), which were ligated to the pCC2FOS™ Vector (Epicentre Copy Control HTP Fosmid Library Production Kit) to facilitate subsequent DNA purification. Fosmids were pooled into working units of 15 000 cfu. For sequencing with the SOLiD system, barcoded sequencing libraries were prepared from 32 pools as per standard protocol, and up to 8 pools sequenced in a single flow cell. Raw reads have been deposited in the European Nucleotide Archive (ENA) with accession number ERP000819. After sequencing, SOLiD reads were aligned to the reference genome (Hg18) with Bioscope 1.2 (
<ext-link ext-link-type="uri" xlink:href="www.solidsoftwaretools.com">www.solidsoftwaretools.com</ext-link>
) using default parameters and only reads mapping uniquely to the genome were retained. To detect fosmids we used a sliding window approach to locate suitable length regions above a coverage threshold, defined dynamically based on the total number of mapped bases. Fosmids were detected as un-gapped contigs ranging from 3 kb to 45 kb. We performed fosmid-specific allele calls for the heterozygous SNPs obtained by the 1000 Genomes Project using the SNVQ SNP caller (
<xref ref-type="bibr" rid="gkr1042-B26">26</xref>
). We finally detected events of co-occurrence of homologous fosmids by looking at heterozygous calls in individual fosmid pools; where such events were identified, fosmids were broken down and only their homozygous tails were retained to prevent chimeric fragments with switch errors.</p>
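<p>The sliding-window detection of fosmids can be sketched as follows. This is only an illustration under stated assumptions: the per-window coverage array, the window size and the fixed threshold passed as a parameter are hypothetical, and the actual rule used to derive the threshold from the total number of mapped bases is not reproduced here. Only the 3 kb to 45 kb length range is taken from the text.</p>
<preformat>
```python
def detect_fosmids(window_cov, window_size, threshold,
                   min_len=3_000, max_len=45_000):
    """Return (start, end) intervals of consecutive windows whose
    coverage exceeds the threshold and whose total length lies
    between min_len and max_len base pairs (end is exclusive)."""
    intervals = []
    run_start = None
    # A trailing sentinel closes a run that reaches the last window.
    for idx, cov in enumerate(list(window_cov) + [float("-inf")]):
        if cov > threshold:
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            length = (idx - run_start) * window_size
            if length >= min_len and max_len >= length:
                intervals.append((run_start * window_size, idx * window_size))
            run_start = None
    return intervals


# Hypothetical coverage per 1-kb window in one pool.
coverage = [0, 0, 5, 6, 7, 5, 0, 0, 9, 0]
print(detect_fosmids(coverage, window_size=1000, threshold=2))
# prints [(2000, 6000)]
```
</preformat>
<p>The isolated window at position 8000 is discarded because the resulting contig would be shorter than 3 kb.</p>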
</sec>
<sec>
<title>Genotype and trio-based haplotypes for NA12878</title>
<p>We utilized the 1000 Genomes Project genotype information for NA12878, which includes 1 704 166 heterozygous SNPs. The trio-based haplotypes for NA12878 generated by the 1000 Genomes Project contained phase information for 1 411 836 heterozygous SNP positions.</p>
</sec>
<sec>
<title>ReFHap algorithm</title>
<p>In (
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
), we introduced a novel algorithm for SIH which we called ReFHap (Reliable and Fast Haplotyping). We presented an alternative problem formulation, aiming to find the partition of fragments that maximizes an objective function which resembles the real origin of the fragments. The input for SIH is a matrix
<italic>M</italic>
with
<italic>m</italic>
rows, one for each fragment, and
<italic>n</italic>
columns, one for each heterozygous variant. Each entry
<italic>M</italic>
<sub>
<italic>ij</italic>
</sub>
 ∈ {0, 1, −} represents the allele call in the fragment
<italic>i</italic>
for the variant
<italic>j</italic>
. The character ‘−’ marks variants not covered by a fragment. For two fragments in rows
<italic>i</italic>
<sub>1</sub>
and
<italic>i</italic>
<sub>2</sub>
of the matrix
<italic>M</italic>
, we define the score
<italic>s</italic>
(
<italic>M</italic>
,
<italic>i</italic>
<sub>1</sub>
,
<italic>i</italic>
<sub>2</sub>
) as in (
<xref ref-type="bibr" rid="gkr1042-B17">17</xref>
):
<disp-formula id="gkr1042-M1">
<label>(1)</label>
<graphic xlink:href="gkr1042m1"></graphic>
</disp-formula>
where the score
<italic>s</italic>
(
<italic>a</italic>
<sub>1</sub>
,
<italic>a</italic>
<sub>2</sub>
) of two allele calls is defined by:
<disp-formula id="gkr1042-M2">
<label>(2)</label>
<graphic xlink:href="gkr1042m2"></graphic>
</disp-formula>
</p>
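<p>Equations (1) and (2) are available only as images in this record. A reading that is consistent with the surrounding text (matches contribute negatively, mismatches positively, and uncovered positions contribute nothing) would be the following; it should be treated as a reconstruction rather than as the published typesetting.</p>
<preformat>
```latex
s(M, i_1, i_2) = \sum_{j=1}^{n} s\left(M_{i_1 j},\, M_{i_2 j}\right)

s(a_1, a_2) =
\begin{cases}
-1 \quad \text{if } a_1, a_2 \in \{0, 1\} \text{ and } a_1 = a_2 \\
\phantom{-}1 \quad \text{if } a_1, a_2 \in \{0, 1\} \text{ and } a_1 \neq a_2 \\
\phantom{-}0 \quad \text{otherwise}
\end{cases}
```
</preformat>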
<p>This score works better in practice than the traditional Hamming distance because it takes into account both matches and mismatches to separate fragments. While a highly positive score indicates that the two fragments are likely to be extracted from different chromosome copies, a highly negative score indicates that the two fragments are likely to be extracted from the same chromosome copy. Inconsistencies will produce scores close to zero, which is also the score for fragments that do not have overlapping allele calls. Now, if we define a partition of the fragments as a subset
<italic>I</italic>
of the rows of
<italic>M</italic>
, we can assign a score to
<italic>I</italic>
by adding the scores of every pair of rows
<italic>i</italic>
<sub>1</sub>
,
<italic>i</italic>
<sub>2</sub>
for which
<italic>i</italic>
<sub>1</sub>
 ∈ 
<italic>I</italic>
and
<italic>i</italic>
<sub>2</sub>
 ∉ 
<italic>I</italic>
:
<disp-formula id="gkr1042-M3">
<label>(3)</label>
<graphic xlink:href="gkr1042m3"></graphic>
</disp-formula>
</p>
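<p>The pairwise score and the score of a partition can be sketched in a few lines. The sign convention follows the text (negative for fragments that agree, positive for fragments that disagree); the function names, the string encoding and the toy matrix are assumptions made for this example.</p>
<preformat>
```python
def allele_score(a1, a2):
    """-1 for matching calls, +1 for mismatching calls,
    0 if either call is missing ('-')."""
    if a1 == "-" or a2 == "-":
        return 0
    return -1 if a1 == a2 else 1


def fragment_score(matrix, i1, i2):
    """Sum of allele scores over all columns for rows i1 and i2."""
    return sum(allele_score(a, b) for a, b in zip(matrix[i1], matrix[i2]))


def cut_score(matrix, subset):
    """Sum of fragment scores over all pairs with one row inside
    the subset and the other row outside it."""
    inside = set(subset)
    outside = [i for i in range(len(matrix)) if i not in inside]
    return sum(fragment_score(matrix, i1, i2)
               for i1 in inside for i2 in outside)


# Toy matrix: rows 0 and 1 come from one copy, rows 2 and 3 from the other.
M = [
    "0101-",
    "-1010",
    "1010-",
    "--101",
]
print(fragment_score(M, 0, 1))  # prints -3 (same copy)
print(fragment_score(M, 0, 2))  # prints 4 (different copies)
print(cut_score(M, {0, 1}))     # prints 12 (correct partition)
print(cut_score(M, {0, 2}))     # prints 0 (mixed partition)
```
</preformat>
<p>The partition that separates the two chromosome copies obtains the highest score, which is the quantity that the Maximum Fragments Cut formulation seeks to maximize.</p>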
<p>Finally, we formalize the Maximum Fragments Cut (MFC) problem as finding the partition
<italic>I</italic>
maximizing
<italic>s</italic>
(
<italic>M</italic>
, 
<italic>I</italic>
). In (
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
) we showed that this formulation is NP-Complete and we introduced the following heuristic algorithm, which is based on the Max-CUT problem (
<xref ref-type="bibr" rid="gkr1042-B27">27</xref>
):
<list list-type="order">
<list-item>
<p>Build a graph
<italic>G</italic>
with fragments as vertices and edges connecting overlapping fragments. The weight of each edge is the score
<italic>s</italic>
(
<italic>M</italic>
, 
<italic>i</italic>
<sub>1</sub>
, 
<italic>i</italic>
<sub>2</sub>
)</p>
</list-item>
<list-item>
<p>Solve Max-CUT on this graph to find the subset
<italic>I</italic>
maximizing
<italic>s</italic>
(
<italic>M</italic>
, 
<italic>I</italic>
)</p>
</list-item>
<list-item>
<p>Build haplotypes consistent with
<italic>I</italic>
by generalized consensus, assuming that all variants are heterozygous</p>
</list-item>
</list>
</p>
<p>To solve Max-CUT, we implemented a heuristic algorithm similar to the one used in HapCUT (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
). We use a greedy algorithm to initialize a cut starting from a single edge and then we use common heuristics to improve the score of this cut. In contrast to HapCUT, we do not try random edges to start the cut but we sort edges from largest to smallest weight and then we start solutions from the first
<italic>K</italic>
edges, where
<italic>K</italic>
can be adjusted. The assumption is that edges with high scores are more likely to cross the cut.</p>
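<p>A minimal sketch of a Max-CUT heuristic in the spirit of the description above is given below: cuts are seeded from the K heaviest edges, the remaining vertices are placed greedily, and single vertices are then moved while this improves the cut. This is not the ReFHap implementation; the function names, the placement order and the local-improvement rule are assumptions chosen for brevity, and the edge weights are hypothetical.</p>
<preformat>
```python
def cut_weight(weights, side):
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for (u, v), w in weights.items() if side[u] != side[v])


def gain_if_moved(weights, side, vertex):
    """Change in cut weight if `vertex` switches sides."""
    gain = 0
    for (u, v), w in weights.items():
        if vertex in (u, v):
            other = v if u == vertex else u
            # An uncut edge becomes cut (+w); a cut edge becomes uncut (-w).
            gain += w if side[other] == side[vertex] else -w
    return gain


def greedy_max_cut(vertices, weights, k=3):
    """Seed a cut from each of the k heaviest edges, place the other
    vertices greedily, then improve by single-vertex moves.
    Returns (best_weight, side) where side maps vertex to 0 or 1."""
    best = None
    seeds = sorted(weights, key=weights.get, reverse=True)[:k]
    for u, v in seeds:
        side = {u: 0, v: 1}
        for x in vertices:
            if x in side:
                continue
            gain = [0, 0]  # gain[s]: cut weight added by placing x on side s
            for (a, b), w in weights.items():
                if x in (a, b):
                    other = b if a == x else a
                    if other in side:
                        gain[1 - side[other]] += w
            side[x] = 0 if gain[0] >= gain[1] else 1
        improved = True
        while improved:
            improved = False
            for x in vertices:
                if gain_if_moved(weights, side, x) > 0:
                    side[x] = 1 - side[x]
                    improved = True
        weight = cut_weight(weights, side)
        if best is None or weight > best[0]:
            best = (weight, dict(side))
    return best


# Hypothetical fragment graph: positive weights link fragments that disagree.
vertices = [0, 1, 2, 3]
weights = {(0, 1): -3, (0, 2): 4, (0, 3): 2, (1, 2): 3, (1, 3): 3, (2, 3): -2}
best_weight, side = greedy_max_cut(vertices, weights)
print(best_weight)  # prints 12
```
</preformat>
<p>In this example the best cut places fragments 0 and 1 on one side and fragments 2 and 3 on the other, so every positive edge crosses the cut and every negative edge stays within a side.</p>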
<p>For the last step, ReFHap assumes that all variants in
<italic>M</italic>
are heterozygous. Although the allele calls in
<italic>M</italic>
could be used to validate which SNPs are really heterozygous, often genotyping results are derived from different sources of information which are more reliable. In our testing data set, genotyping was performed by the 1000 Genomes Project (
<xref ref-type="bibr" rid="gkr1042-B28">28</xref>
) based on three large separate short-read sequencing experiments, so we can safely assume that the heterozygous calls are correct. Instead of calculating a separate consensus on the fragments that belong to
<italic>I</italic>
and on fragments that do not belong to
<italic>I</italic>
, which can lead to homozygous calls, we calculated a generalized consensus for a partition
<italic>I</italic>
as follows:
<list list-type="order">
<list-item>
<p>For each column
<italic>j</italic>
<list list-type="alpha-lower">
<list-item>
<p>
<italic>I</italic>
<sub>
<italic>j</italic>
,0</sub>
 ← {
<italic>i</italic>
:(
<italic>i</italic>
 ∈ 
<italic>I</italic>
 ∧ 
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
] = 0)∨(
<italic>i</italic>
 ∉ 
<italic>I</italic>
 ∧ 
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
 ] = 1)}</p>
</list-item>
<list-item>
<p>
<italic>I</italic>
<sub>
<italic>j</italic>
,1</sub>
 ← {
<italic>i</italic>
:(
<italic>i</italic>
 ∈ 
<italic>I</italic>
 ∧ 
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
] = 1)∨(
<italic>i</italic>
 ∉ 
<italic>I</italic>
 ∧ 
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
 ] = 0)}</p>
</list-item>
<list-item>
<p>If |
<italic>I</italic>
<sub>
<italic>j</italic>
,0</sub>
| > |
<italic>I</italic>
<sub>
<italic>j</italic>
,1</sub>
| then
<italic>h</italic>
<sub>
<italic>j</italic>
</sub>
 ← 0</p>
</list-item>
<list-item>
<p>If |
<italic>I</italic>
<sub>
<italic>j</italic>
,0</sub>
| < |
<italic>I</italic>
<sub>
<italic>j</italic>
,1</sub>
| then
<italic>h</italic>
<sub>
<italic>j</italic>
</sub>
 ← 1</p>
</list-item>
<list-item>
<p>Otherwise, leave
<italic>h</italic>
<sub>
<italic>j</italic>
</sub>
undefined</p>
</list-item>
</list>
</p>
</list-item>
<list-item>
<p>output
<italic>h</italic>
</p>
</list-item>
</list>
</p>
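<p>The generalized consensus listed above can be written compactly as follows. The sketch assumes that the matrix is encoded as a list of strings over '0', '1' and '-' and that the partition is given as a set of row indices; these representations and the function name are choices made for this example.</p>
<preformat>
```python
def generalized_consensus(matrix, subset):
    """Build one haplotype h from a partition of the fragments.

    A call votes for h[j] = 0 if it is a 0 inside the subset or a 1
    outside it, and for h[j] = 1 in the two opposite cases. Ties are
    left undecided ('-'). The second haplotype is the complement of h.
    """
    inside = set(subset)
    n_cols = len(matrix[0])
    haplotype = []
    for j in range(n_cols):
        votes = [0, 0]
        for i, row in enumerate(matrix):
            call = row[j]
            if call == "-":
                continue
            allele = int(call)
            # Calls outside the subset vote for the opposite allele.
            votes[allele if i in inside else 1 - allele] += 1
        if votes[0] > votes[1]:
            haplotype.append("0")
        elif votes[1] > votes[0]:
            haplotype.append("1")
        else:
            haplotype.append("-")
    return "".join(haplotype)


# Toy matrix; the last column has one vote for each allele.
M = [
    "0101-",
    "-1011",
    "1010-",
    "--101",
]
print(generalized_consensus(M, {0, 1}))  # prints 0101-
```
</preformat>
<p>Because both groups vote on a single haplotype, a heterozygous call is obtained whenever the votes are not tied, and the tied last column is reported as undecided rather than phased at random.</p>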
<p>The last step of the cycle in this algorithm is actually different from the one proposed in (
<xref ref-type="bibr" rid="gkr1042-B21">21</xref>
). This step determines what to do if the consensus assigns the same score to both alleles. The two possible options are (i) decide at random or (ii) leave the allele call undecided. The main advantage of the first option is that the output haplotypes are complete within blocks, whereas the second option leaves gaps. However, we find that in practice, even at low coverage, this situation occurs for only a small number of variants. Moreover, it is better for the quality of the haplotype to highlight difficult variants by leaving them undecided than to generate a random phase, which will be incorrect half of the time. We discuss this compromise between completeness and accuracy in detail in the ‘Results’ section.</p>
</sec>
<sec>
<title>MEC algorithms for SIH</title>
<p>Most of the algorithms that have been proposed to solve SIH try to find the haplotype for which the number of entries to correct (MEC) in the input matrix is minimized. Since this problem formulation has been shown to be NP-Complete and difficult to approximate (
<xref ref-type="bibr" rid="gkr1042-B22">22</xref>
), all proposed exact algorithms have an exponential dependency on at least one parameter. For example, the runtime of the dynamic programming approach proposed by (
<xref ref-type="bibr" rid="gkr1042-B24">24</xref>
) is exponential in the maximum number of allele calls for a fragment. Whereas this is a feasible approach for short reads that are not likely to span more than a few variants, it is not suitable for fosmids, which often span more than 100 variants. In this section we briefly discuss seven heuristic algorithms for the MEC problem formulation, which were previously reviewed by (
<xref ref-type="bibr" rid="gkr1042-B23">23</xref>
). The first published algorithm for SIH, called FastHare (
<xref ref-type="bibr" rid="gkr1042-B17">17</xref>
), sorts the fragments based on their first informative locus and then goes left to right assigning each fragment to the closest haplotype and recalculating consensus after each step. Due to its simplicity, FastHare is a very fast algorithm. The algorithms MLF (
<xref ref-type="bibr" rid="gkr1042-B29">29</xref>
), 2d-MEC (
<xref ref-type="bibr" rid="gkr1042-B30">30</xref>
) and DGS (used to assemble the Venter genome) (
<xref ref-type="bibr" rid="gkr1042-B10">10</xref>
) are variants of the same repetitive general procedure consisting of iterating until convergence the following two steps:
<list list-type="order">
<list-item>
<p>Calculate the haplotype
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
by consensus given a fixed partition
<italic>I</italic>
<sub>
<italic>i</italic>
</sub>
of the fragments</p>
</list-item>
<list-item>
<p>Calculate the partition
<italic>I</italic>
<sub>
<italic>i</italic>
+1</sub>
of fragments by assigning each fragment to the closest between the haplotype
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
and its complement.</p>
</list-item>
</list>
</p>
<p>The differences among these algorithms lie mainly in the strategies used to create the initial partition
<italic>I</italic>
<sub>1</sub>
and in the distance measures applied to decide if a fragment is close to
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
or to its complement. In MLF, since the partition is started at random, the whole procedure is repeated 100 times to enlarge the space of visited solutions.</p>
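<p>The general two-step procedure shared by these algorithms can be sketched as below. This is a generic illustration, not the implementation of MLF, 2d-MEC or DGS: the plain mismatch count used as distance, the tie-breaking rules and the supplied initial partition are assumptions made for this example.</p>
<preformat>
```python
def consensus(matrix, assign):
    """Majority haplotype for group 0; calls from group 1 vote flipped."""
    hap = []
    for j in range(len(matrix[0])):
        votes = [0, 0]
        for row, group in zip(matrix, assign):
            if row[j] != "-":
                votes[int(row[j]) ^ group] += 1
        hap.append(0 if votes[0] >= votes[1] else 1)
    return hap


def distance(row, hap, flip):
    """Mismatches between a fragment and hap (flip=0)
    or the complement of hap (flip=1)."""
    return sum(1 for c, a in zip(row, hap)
               if c != "-" and int(c) != (a ^ flip))


def iterate_partition(matrix, assign, max_rounds=100):
    """Alternate consensus and reassignment until the partition is stable."""
    for _ in range(max_rounds):
        hap = consensus(matrix, assign)
        new_assign = [
            0 if distance(row, hap, 1) >= distance(row, hap, 0) else 1
            for row in matrix
        ]
        if new_assign == assign:
            break
        assign = new_assign
    return consensus(matrix, assign), assign


M = ["0101-", "-1010", "1010-", "--101"]
# Deliberately poor starting partition.
print(iterate_partition(M, [0, 1, 0, 1]))
# prints ([0, 1, 0, 1, 0], [0, 0, 1, 1])
```
</preformat>
<p>On this small error-free example the procedure recovers the correct grouping after two reassignment rounds. With noisy data the result depends on the starting partition, which is why some of the algorithms repeat the procedure from several starts.</p>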
<p>The algorithm chosen to assemble the Gujarati haplotypes is called HapCUT (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
). HapCUT also works by improving the answer haplotype iteratively but, instead of using partitions of the fragments set, it tries to find alleles that after flipping will reduce the MEC. The improvement step can be summarized in the following steps:
<list list-type="order">
<list-item>
<p>Build a graph
<italic>G</italic>
(
<italic>M</italic>
, 
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
) with variants as vertices and weighted edges between variants linked by at least one fragment. The weight of an edge is the number of fragments inconsistent with
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
minus the number of fragments consistent with
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
.</p>
</list-item>
<list-item>
<p>Run a heuristic algorithm for Max-Cut on
<italic>G</italic>
to find a subset
<italic>V</italic>
<sub>
<italic>i</italic>
</sub>
of the variants for which if alleles are flipped in
<italic>H</italic>
<sub>
<italic>i</italic>
</sub>
, the MEC will be reduced. In practice, any cut with positive weight is enough to improve the current haplotype</p>
</list-item>
<list-item>
<p>Build
<italic>H</italic>
<sub>
<italic>i</italic>
+1</sub>
by flipping the allele calls corresponding with variants in
<italic>V</italic>
<sub>
<italic>i</italic>
</sub>
</p>
</list-item>
</list>
</p>
<p>A randomized heuristic is applied for Max-CUT to increase the number of visited solutions. The complexity of the graph on which Max-CUT is solved makes this algorithm the slowest but also the best to find close to optimal MEC solutions.</p>
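<p>The edge weights of the variant graph described in the first step can be illustrated with the following sketch. It is not code from HapCUT: the function name and data encoding are hypothetical, and a fragment is treated as consistent on a pair of variants when it agrees with the current haplotype at both positions or disagrees at both (that is, when it matches the complementary haplotype).</p>
<preformat>
```python
from itertools import combinations


def variant_edge_weights(matrix, haplotype):
    """For every pair of variants linked by at least one fragment,
    return (# inconsistent fragments) - (# consistent fragments)."""
    weights = {}
    for row in matrix:
        covered = [j for j, call in enumerate(row) if call != "-"]
        for j, k in combinations(covered, 2):
            agree_j = row[j] == haplotype[j]
            agree_k = row[k] == haplotype[k]
            delta = -1 if agree_j == agree_k else 1
            weights[(j, k)] = weights.get((j, k), 0) + delta
    return weights


M = ["010", "01-", "0-1"]
print(variant_edge_weights(M, "010"))
# prints {(0, 1): -2, (0, 2): 0, (1, 2): -1}
```
</preformat>
<p>Edges with positive weight mark pairs of variants whose relative phase in the current haplotype is contradicted by most fragments, so a cut of positive weight identifies a set of variants worth flipping.</p>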
<p>Two more algorithms are mentioned in (
<xref ref-type="bibr" rid="gkr1042-B23">23</xref>
), a randomized one called SHRThree (
<xref ref-type="bibr" rid="gkr1042-B31">31</xref>
), and SpeedHap (
<xref ref-type="bibr" rid="gkr1042-B32">32</xref>
), which first tries to build a core solution from variants and fragments in full agreement, with evidence for both alleles of each variant, and then includes the remaining fragments and variants by relaxing constraints. Among all these algorithms, HapCUT was the only one for which there was an implementation available to be applied to real data and to perform independent validation. We decided to implement all the other heuristic algorithms and made them available along with ReFHap as part of a single software package. We release this package under the GPL license at (
<ext-link ext-link-type="uri" xlink:href="http://www.molgen.mpg.de/~genetic-variation/SIH/data">http://www.molgen.mpg.de/~genetic-variation/SIH/data</ext-link>
), so that our implementations can be evaluated, improved and used for further advances in haplotyping techniques.</p>
</sec>
<sec>
<title>Quality measures</title>
<p>Until now, there has not been a conclusive study ranking SIH algorithms in terms of quality. This is mainly due to the lack of real data but also to the lack of a standard quality measure allowing the comparison of different approaches. Most previous studies use the Hamming distance between the answer haplotype and the closest of the real haplotypes as a measure of quality (
<xref ref-type="bibr" rid="gkr1042-B23">23</xref>
,
<xref ref-type="bibr" rid="gkr1042-B29">29</xref>
). However, this measure can over-penalize simple switch errors (
<xref ref-type="bibr" rid="gkr1042-B20">20</xref>
). Other studies compare MEC values mainly because that is the optimization objective in the MEC problem formulation, and because the MEC value of a solution can be calculated without requiring the real haplotype (
<xref ref-type="bibr" rid="gkr1042-B24">24</xref>
). Unfortunately, the correlation between MEC values and haplotype quality is not perfect, which makes this measure inaccurate for comparing similar solutions (see ‘Results’ section for details).</p>
<p>Another, more effective strategy to score a completely assembled haplotype is to count the number of switch errors. In general, a switch error (SE) is an inconsistency between an assembled haplotype and the real haplotype between two contiguous variants. If either the real or the assembled haplotype includes gaps, then switch errors are counted between pairs of variants for which there is no intervening variant that has allele calls in both the real and the assembled haplotype. This count is divided by the total number of overlapping variants, and the normalized count is called the switch error rate. Switch error rate is a good measure of quality but it does not provide information on the completeness of the haplotype. In an extreme case, a haplotype with just two correctly phased allele calls has a zero switch error rate.</p>
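<p>Counting switch errors in the presence of gaps can be sketched as follows. Haplotypes are assumed to be strings over '0', '1' and '-' for a single block, and the rate is normalized by the number of variants called in both haplotypes, as described in the text; the function names are hypothetical.</p>
<preformat>
```python
def switch_errors(assembled, truth):
    """Count switch errors between two haplotypes. Only variants
    called in both haplotypes are compared.
    Returns (switches, compared_variants)."""
    switches = 0
    compared = 0
    previous = None
    for a, t in zip(assembled, truth):
        if a == "-" or t == "-":
            continue
        same = a == t
        # A change of orientation between consecutive comparable
        # variants is one switch error.
        if previous is not None and same != previous:
            switches += 1
        previous = same
        compared += 1
    return switches, compared


def switch_error_rate(assembled, truth):
    switches, compared = switch_errors(assembled, truth)
    return switches / compared if compared else 0.0


truth = "0101010101"
print(switch_errors("0101101010", truth))  # prints (1, 10)
print(switch_errors("01-110-010", truth))  # prints (1, 8)
print(switch_errors("0111010101", truth))  # prints (2, 10)
```
</preformat>
<p>The first haplotype differs from the truth at six positions but contains a single switch, which illustrates why the Hamming distance over-penalizes such errors. The third haplotype has one wrong allele, which counts as two switches.</p>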
<p>An alternative measure, called adjusted N50 (AN50) was proposed by (
<xref ref-type="bibr" rid="gkr1042-B20">20</xref>
). This measure is calculated as follows:
<list list-type="order">
<list-item>
<p>Calculate span (in reference base pairs) from first to last phased variant for each block</p>
</list-item>
<list-item>
<p>Multiply each span by the proportion of phased alleles inside the block (to correct for uncalled alleles)</p>
</list-item>
<list-item>
<p>Sort blocks from largest to smallest adjusted span</p>
</list-item>
<list-item>
<p>Traverse the list counting the number of phased variants until this count is more than half of the total number of variants; the adjusted span of the block at which this happens is the AN50.</p>
</list-item>
</list>
</p>
<p>A similar measure of completeness called S50 can also be calculated by sorting the blocks by number of phased SNPs instead of adjusted span. Both measures penalize incomplete haplotypes, but do not provide information about quality.</p>
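<p>The AN50 and S50 calculations can be sketched as below. The tuple layout for blocks, the function names and the example values are assumptions made for this illustration, and the value returned is taken to be the adjusted span (or SNP count) of the block at which the running count first exceeds half of all variants.</p>
<preformat>
```python
def an50(blocks, total_variants):
    """Adjusted N50.

    blocks: list of (span_bp, phased_variants, variants_in_block).
    total_variants: number of heterozygous variants in the data set.
    Returns the adjusted span of the block at which the running count
    of phased variants first exceeds half of total_variants (0.0 if
    that never happens).
    """
    adjusted = [
        (span * phased / in_block, phased)
        for span, phased, in_block in blocks
        if in_block > 0
    ]
    adjusted.sort(key=lambda item: item[0], reverse=True)
    count = 0
    for adj_span, phased in adjusted:
        count += phased
        if 2 * count > total_variants:
            return adj_span
    return 0.0


def s50(phased_counts, total_variants):
    """Same traversal, but blocks are ranked by number of phased SNPs."""
    count = 0
    for phased in sorted(phased_counts, reverse=True):
        count += phased
        if 2 * count > total_variants:
            return phased
    return 0


# Hypothetical blocks: (span in bp, phased variants, variants in block).
blocks = [(100_000, 90, 100), (60_000, 50, 50), (30_000, 20, 40), (10_000, 10, 10)]
print(an50(blocks, total_variants=200))          # prints 60000.0
print(s50([90, 50, 20, 10], total_variants=200))  # prints 50
```
</preformat>
<p>The first block is adjusted from 100 kb to 90 kb because only 90 of its 100 variants are phased, and the third block drops from 30 kb to 15 kb for the same reason.</p>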
<p>To account for both completeness and quality, we propose the following two-step procedure to calculate an alternative measure that we call quality adjusted N50 (QAN50).
<list list-type="order">
<list-item>
<p>Break each haplotype block into the longest possible sub-blocks for which no switch error can be detected</p>
</list-item>
<list-item>
<p>Calculate AN50, as described above, for these sub-blocks.</p>
</list-item>
</list>
</p>
<p>This measure establishes a compromise between accuracy and completeness and also indicates the extent (in genomic bases) to which assembled haplotypes can be trusted. In the next sections we show how different algorithms score in terms of AN50, switch errors and QAN50.</p>
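<p>The first step of the QAN50 procedure, breaking a block wherever a switch error is detected, can be sketched as follows. The representation of a block as a list of (position, assembled allele, true allele) tuples is an assumption made for this example. Variants without a call in the reference haplotype cannot reveal a switch and are kept in the current sub-block.</p>
<preformat>
```python
def split_at_switches(block):
    """Split a phased block into maximal sub-blocks without
    detectable switch errors.

    block: list of (position_bp, assembled_allele, true_allele),
    sorted by position; alleles are '0', '1' or '-' (uncalled).
    """
    sub_blocks = [[]]
    previous = None
    for variant in block:
        _, assembled, truth = variant
        if assembled != "-" and truth != "-":
            same = assembled == truth
            if previous is not None and same != previous:
                sub_blocks.append([])  # a switch starts a new sub-block
            previous = same
        sub_blocks[-1].append(variant)
    return [sb for sb in sub_blocks if sb]


# Hypothetical block with two switch errors.
block = [
    (100, "0", "0"),
    (200, "1", "1"),
    (300, "0", "-"),
    (400, "0", "1"),
    (500, "1", "0"),
    (600, "1", "1"),
]
parts = split_at_switches(block)
print([[pos for pos, _, _ in sb] for sb in parts])
# prints [[100, 200, 300], [400, 500], [600]]
print([sb[-1][0] - sb[0][0] for sb in parts])
# prints [200, 100, 0]
```
</preformat>
<p>The spans of the resulting sub-blocks would then be adjusted and ranked exactly as for AN50 to obtain the QAN50.</p>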
</sec>
</sec>
<sec>
<title>RESULTS</title>
<sec>
<title>Fosmid pool-based NGS input data and NA12878 haplotype assembly</title>
<p>Sequencing of 32 fosmid pools of NA12878 (see ‘Materials and Methods’ section for details) resulted in 941 793 498 mapped reads, equivalent to a median 10× genome coverage after duplicated reads had been removed. Over 81% of the genome was covered at 2× or greater. Heterozygous SNP positions from the 1000 Genomes Project data set for NA12878 (1 704 166 SNPs) were used to define the positions where alleles were called within each fosmid, yielding a total of 5 145 474 allele calls across all fosmids. For comparison, this average of 18.03 calls per fosmid is six times larger than the corresponding average number of calls in the Venter genome. Only fosmids that contain two or more SNPs are informative for phasing, and our data set contained 285 341 phase-informative fosmids (hereafter termed fragments). From the input matrix for SIH, the total number of blocks containing variants that can be linked together by one or more fragments was 17 839, covering 2.04 Gb of the genome.
<xref ref-type="fig" rid="gkr1042-F2">Figure 2</xref>
shows the distribution of blocks per number of SNPs. Even though the fragment coverage is just 3.02 on average, long overlapping fragments allow the phasing of up to 1 582 652 (92.9% of the total) SNPs into blocks with an S50 of 215 SNPs. It is worth noting that this percentage of SNPs seems to be inconsistent with the percentage of the genome included in blocks (about 64%). The reason for this difference is the existence of large repetitive regions in the genome, such as the centromeres, in which it is very difficult to map reads and reliably call SNPs. The largest block contains 3921 SNPs and it is located in the MHC region, which is known to have higher variability than other regions in the genome. These blocks were used as the input for eight SIH algorithms (namely ReFHap, HapCUT, FastHare, DGS, MLF, 2d-MEC, SHRThree and SpeedHap). Input matrices and assembled haplotypes are available for download at (
<ext-link ext-link-type="uri" xlink:href="http://www.molgen.mpg.de/~genetic-variation/SIH/data">http://www.molgen.mpg.de/~genetic-variation/SIH/data</ext-link>
).
<fig id="gkr1042-F2" position="float">
<label>Figure 2.</label>
<caption>
<p>Distribution of blocks per different number of phased SNPs.</p>
</caption>
<graphic xlink:href="gkr1042f2"></graphic>
</fig>
</p>
<p>We verified that our coverage results were consistent with other similar studies. The N50 phased block length achieved for sequencing of the Gujarati individual (
<xref ref-type="bibr" rid="gkr1042-B15">15</xref>
) was 386 kb, mainly because they sequenced about 600 000 fosmids (25% more), but also because they considered 1.9 million predicted heterozygous variants (about 10% more), which produces a greater number of overlaps between fosmids. The N50 for Venter's genome (
<xref ref-type="bibr" rid="gkr1042-B10">10</xref>
) was about 350 kb after performing Sanger sequencing of 103 356 fragments covering 1.85 million predicted heterozygous variants. A large N50 value was achieved in this case by sequencing 1 kb ends of fragments larger than 100 kb, which allowed distant variants to be connected. Using our fosmid pool-based NGS approach, we have generated the most comprehensively haplotype-resolved individual genome to date, ‘Max Planck One’ (MP1), achieving an N50 phased block length of almost 1 Mb containing 99% of SNPs (
<xref ref-type="bibr" rid="gkr1042-B16">16</xref>
). This level of completeness required sequencing of 67 pools of 15 000 fosmids which resulted in 1.16 million phase informative fosmids, equivalent to 6.38× fosmid coverage of each haplotype.</p>
<p>Unfortunately, for all of these haplotyped genomes it is not feasible to make a direct assessment of quality. Validation for these haplotypes was performed by comparison to HapMap haplotypes assembled by statistical phasing, on regions known to have high linkage disequilibrium. Although comparisons with HapMap haplotypes provide some general sense of reliability, they are not informative enough to produce an accurate estimation of the switch error rate and to investigate potential causes of errors. When compared to haplotypes of 83 HapMap trio children from the CEU population, the percentage of concordance in the phasing of consecutive variants for NA12878, MP1 and the Gujarati individual is consistent with the demographic origin of the samples (
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr1042/DC1">Supplementary Figure S1</ext-link>
). In the following sections we use the trio-phased haplotype for NA12878 as a reference to make a direct assessment of quality of our whole genome haplotype assembly and compare different algorithms for SIH.</p>
</sec>
<sec>
<title>Overall quality assessment</title>
<p>A comparison between all heuristic algorithms for SIH across four different measures is shown in
<xref ref-type="fig" rid="gkr1042-F3">Figure 3</xref>
, A–D: (A) AN50, (B) switch error rate, (C) QAN50 (described above) and (D) runtime for our dataset. Using ReFHap, 91.7% of SNPs were phased and the QAN50 block size was 117.8 kb. ReFHap had the lowest switch error rate (1.69%) and the highest QAN50 of the eight SIH algorithms. DGS and FastHare phase about the same number of SNPs as ReFHap but with slightly larger switch error rates (1.82 and 1.74%, respectively). HapCUT, for which we ran 10 iterations, phased slightly more SNPs than any other algorithm, phasing 1068 (0.06% of input SNPs) more SNPs than ReFHap. HapCUT also covered the largest fraction of the genome after adding up the lengths of the blocks for which no switch error can be detected and adjusting for unphased SNPs (1.82 Gb). ReFHap, FastHare and DGS were close with 1.8 Gb (1.79 Gb for DGS). However, as expected, HapCUT also had significantly longer running time than the other methods (
<xref ref-type="fig" rid="gkr1042-F3">Figure 3</xref>
D). While ReFHap, DGS and FastHare were all able to phase full chromosomes within a few seconds, HapCUT can take hours for a single iteration. This happens because the runtime for the first three methods mainly depends on the number of overlapping fragments in one block, while for HapCUT it depends on the maximum number of SNPs connected in one block. Fosmids are able to connect large numbers of SNPs at low coverage, so algorithms such as ReFHap require significantly fewer computational resources. Chromosome 6 is an extreme case, with HapCUT taking more than 10 h to complete a single iteration compared to 3.29 s for ReFHap; this is mainly due to the large blocks of connected SNPs within the MHC region. As fosmid coverage and the number of heterozygous variants analyzed increase, the number of connected components also increases, making the instances more difficult to solve for HapCUT.
<fig id="gkr1042-F3" position="float">
<label>Figure 3.</label>
<caption>
<p>Comparison of algorithms for SIH on NA12878 whole genome fosmid sequence data. (
<bold>A</bold>
) Adjusted N50 which takes into consideration block length and number of phased SNPs but not quality; (
<bold>B</bold>
) Switch error rate, calculated using comparison with gold-standard trio haplotypes; (
<bold>C</bold>
) Quality adjusted N50 which combined measures of completeness and quality; (
<bold>D</bold>
) Runtimes of each algorithm on this data set (log scale); (
<bold>E</bold>
) QAN50 for ReFHap, DGS, FastHare and HapCUT on subsets of the data built by varying the number of fosmid pools considered; (
<bold>F</bold>
) QAN50 for ReFHap, DGS, FastHare and HapCUT for different heterozygosity rates obtained by varying the percentages of SNPs considered.</p>
</caption>
<graphic xlink:href="gkr1042f3"></graphic>
</fig>
</p>
<p>We investigated the correlation between switch error rate and different properties of the blocks such as size, span, number of fragments, average fragment length and coverage. We did not find positive or negative correlation of switch errors with any of the analyzed characteristics. As we show with the MEC analysis performed in the next section, switch errors are directly correlated with the allele calling error rate. The distribution of switch error rates and correlation coefficients for each characteristic of the input are included in the
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr1042/DC1">Supplementary Figure S2</ext-link>
.</p>
<p>Given the lack of correlation between coverage and switch error rate, we might wrongly infer that it is not worth increasing the number of fragments sequenced to improve the quality of the haplotypes. In practice, however, increasing the number of variants and fragments will generally change the number and composition of the blocks, affecting the overall quality. To determine how different input sizes change the quality of the haplotypes and to assess how different algorithms are affected by the input size, we ran the pipeline on subsets of the fragments (8, 16 and 24 fosmid pools), and on subsets of the SNPs (25, 50 and 75%) and we calculated the QAN50 of haplotypes assembled by ReFHap, DGS, FastHare and HapCUT. We found that the results of the comparison with the whole dataset were consistent across the different datasets.
<xref ref-type="fig" rid="gkr1042-F3">Figure 3</xref>
E shows that the QAN50 grows linearly with the number of fosmids, being zero for eight pools because fewer than half of the total SNPs are phased, and growing up to 111 395 bp, which is the maximum value achieved by ReFHap for the whole data set. Taking subsets of the total number of SNPs is a way to simulate variation in the heterozygosity rate of the individual. Low heterozygosity rates reduce the number of variants linked in blocks and reduce the size of the blocks, which affects the QAN50. As the heterozygosity rate increases, the length of the blocks also increases, but more switch errors can also be detected.
<xref ref-type="fig" rid="gkr1042-F3">Figure 3</xref>
F shows that the QAN50 grows with the number of SNPs, up to 75% of the data set. After this point, the effect of switch errors matches and even outweighs the increase in block length, reducing the final QAN50. HapCUT seems to be more strongly affected by this than the other algorithms.</p>
</sec>
<sec>
<title>MEC as a measure of quality</title>
<p>Previous studies compare algorithms based on the minimum number of entries to correct to make the input matrix consistent with the assembled haplotypes (MEC) (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
,
<xref ref-type="bibr" rid="gkr1042-B24">24</xref>
). If the complete real haplotypes were available, it would be easy to align each fragment to the closest haplotype and to identify exactly the allele calls to be corrected. Unfortunately, our gold-standard is not complete, and hence it cannot be determined whether allele calls in variants uncalled by the gold-standard should be corrected or not. To overcome this issue, we ran SIH using only the allele calls from the subset of SNPs that are present in the trio-phased gold-standard. Since in this case the gold-standard haplotype is complete, we were able to calculate the real MEC. It is interesting to note that the MEC percentage of the gold-standard, which was 2.89%, is the exact allele calling error rate for this experiment.</p>
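<p>As a minimal sketch of the MEC computation described above, assuming bi-allelic SNPs coded 0/1 and fragments stored as mappings from SNP index to allele call (the function names are hypothetical and are not taken from any of the compared tools): each fragment is aligned to the closer of the two complementary haplotypes, and the mismatching calls are counted.</p>
<preformat>
```python
from typing import Dict, List

Fragment = Dict[int, int]  # SNP index -> allele call (0 or 1)


def mismatches(fragment: Fragment, haplotype: List[int]) -> int:
    """Number of allele calls in the fragment that disagree with the haplotype."""
    return sum(1 for i, allele in fragment.items() if haplotype[i] != allele)


def mec(fragments: List[Fragment], haplotype: List[int]) -> int:
    """Minimum number of entries to correct so every fragment fits one haplotype."""
    complement = [1 - allele for allele in haplotype]
    return sum(
        min(mismatches(f, haplotype), mismatches(f, complement))
        for f in fragments
    )


def mec_percentage(fragments: List[Fragment], haplotype: List[int]) -> float:
    """MEC as a percentage of all allele calls in the input matrix."""
    total_calls = sum(len(f) for f in fragments)
    return 100.0 * mec(fragments, haplotype) / total_calls
```
</preformat>
<p>When the haplotype passed in is the true one, every counted mismatch is a wrong allele call, which is why the MEC percentage of a complete gold-standard can be read as the allele calling error rate.</p>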
<p>We can also use the real MEC values of each block to make direct comparisons with MEC values of assembled haplotypes and check if optimizing MEC increases quality. We compared the MEC of the gold-standard with the MEC of HapCUT haplotypes, taking into account that HapCUT is the algorithm achieving the lowest MEC values. We found that the MEC values of the gold-standard are consistently higher across all blocks than those of the HapCUT haplotypes (see
<xref ref-type="fig" rid="gkr1042-F4">Figure 4</xref>
). This means that solutions with optimal, or close to optimal, MEC values are likely to correct fewer erroneous calls than actually need to be corrected and, in general, are not guaranteed to have better quality. To confirm this statement, we calculated correlation coefficients between MEC percentages and switch error rates for both the gold-standard and HapCUT. We found a high correlation (Pearson correlation = 0.84) between the MEC percentage of the gold-standard and the switch error rate. However, we found that the correlation of the HapCUT MEC percentage with the switch error rate decreased to just 0.11. This means that predicted MEC values are skewed and hence are not good predictors of the switch error rate. Finally, we divided the set of blocks into bins of allele calling error rates to look for another determinant of switch errors. Surprisingly, we found a consistent negative correlation (between −0.5 and −0.4) between HapCUT MEC values and switch error rates, which means that solutions with lower MEC values are more likely to increase the number of switch errors.
<fig id="gkr1042-F4" position="float">
<label>Figure 4.</label>
<caption>
<p>Comparison of MEC values predicted by HapCUT with real MEC values. The dark grey bars show the increase of MEC percentage for the gold-standard as the switch error rate increases. However, MEC percentages predicted by HapCUT (light grey bars) do not increase as they should because HapCUT tries to find the solution minimizing MEC. The number of blocks analyzed for each bin (medium grey bars) is shown in the right
<italic>Y</italic>
axis.</p>
</caption>
<graphic xlink:href="gkr1042f4"></graphic>
</fig>
</p>
</sec>
<sec>
<title>Construction of a new gold-standard haplotype</title>
<p>The current gold-standard haplotypes only contain phase information for ~80% of SNP positions due to the fact that trio phasing cannot resolve SNPs that are heterozygous in both parents and the child. Fosmid pool-based phasing is theoretically able to resolve the phase of all SNPs. Therefore we decided to create a new gold-standard for NA12878 combining all SNPs from both methods. We assembled these new gold-standard haplotypes by combining both data sets and correcting the switch errors as follows. We initially selected one of the trio haplotypes as the template and then, for the fosmid-based haplotypes, we built blocks of maximal length within which no switch error is detected (the same blocks built to calculate QAN50). Inside each of these blocks, we augmented the template by filling uncalled variants with calls of the assembled haplotype consistent with the template. Between blocks, by definition, we know that a switch error occurred in one of the variants. To correct this error, we ranked called variants by consensus value and selected the variant with lowest consensus as the position
<italic>i</italic>
where it is most likely that the switch error was produced. We filled the uncalled variants in the template before
<italic>i</italic>
with the haplotype selected for the left-hand block and then we filled the uncalled variants after
<italic>i</italic>
with the haplotype selected for the right-hand block.</p>
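<p>The within-block filling step can be illustrated as follows. This is a hedged sketch, not the code used to build the new gold-standard: it assumes 0/1 alleles, uses <italic>None</italic> for uncalled variants, assumes the block is already free of detectable switch errors, and the function name <italic>fill_block</italic> is hypothetical. The choice of the break position between blocks by lowest consensus value is not covered.</p>
<preformat>
```python
from typing import List, Optional

Allele = Optional[int]  # 0, 1, or None when the variant is uncalled


def fill_block(template: List[Allele], assembled: List[Allele]) -> List[Allele]:
    """Fill uncalled template variants with assembled calls consistent with the template.

    Both lists cover the same SNPs of one switch-error-free block.
    """
    informative = [
        (t, a) for t, a in zip(template, assembled)
        if t is not None and a is not None
    ]
    if not informative:
        # No shared call: the orientation is unknown, so leave the template as is.
        return list(template)
    # The assembled haplotype may be the complement of the template haplotype.
    flip = informative[0][0] != informative[0][1]
    oriented = [
        None if a is None else (1 - a if flip else a)
        for a in assembled
    ]
    return [o if t is None else t for t, o in zip(template, oriented)]
```
</preformat>
<p>Existing template calls are never overwritten, so the trio-phased alleles take precedence and the fosmid-based calls only extend them.</p>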
<p>
<italic>A priori</italic>
there is no reason to think that the accuracy of fosmid-based haplotyping would decrease in SNPs not phased by the trio, and hence this procedure should correct a large percentage of switch errors in the assembled solutions. However, to assess the accuracy of phasing for the SNPs not verified by the parental genotypes, we compared our results with haplotypes assembled with statistical phasing. We downloaded the latest genotype calls released by the 1000 Genomes Project for a collection of 288 individuals with European ancestry (EUR). This collection groups samples from the following populations: Utah residents (CEPH) with Northern and Western European ancestry (CEU), Toscani in Italia (TSI), British from England and Scotland (GBR), Finnish from Finland (FIN) and Iberian populations in Spain (IBS). We used FastPHASE (
<xref ref-type="bibr" rid="gkr1042-B6">6</xref>
) to predict the most likely haplotypes for 21 878 SNPs in chromosome 22 of NA12878, which has 22 801 SNPs in total (the remaining 923 SNPs were not present in the 1000 Genomes Project genotypes). We also predicted haplotypes based on subsets of the reference population of size 50, 100, 150 and 200 to test how the concordance with the new gold-standard haplotypes changes as the number of individuals in the sample increases. We calculated separately the concordance for adjacent SNPs phased with the parental information (16 346) and SNPs phased with fosmid-based NGS (5255), which add up to the 21 601 SNPs shared between the statistical and the new gold-standard haplotypes.
<xref ref-type="fig" rid="gkr1042-F5">Figure 5</xref>
shows that the concordance is lower for the adjacent SNPs phased with fosmid-based NGS, but it is still larger than 95% after 100 or more individuals are included. The concordance always grows with the number of individuals which means that, as the quality of the haplotypes derived with statistical phasing improves, the concordance with the new gold-standard haplotypes increases. Even if the differences between the new gold-standard haplotype and the haplotypes predicted by FastPHASE in adjacent SNPs phased with fosmid sequencing (207) were all due to errors in the new gold-standard, the overall switch error rate would be &lt;1%.
<fig id="gkr1042-F5" position="float">
<label>Figure 5.</label>
<caption>
<p>Comparison of the new gold-standard haplotype (“Overall”) with haplotypes predicted by statistical phasing using different numbers of individuals in the reference panel. The concordance was calculated separately for pairs of adjacent SNPs phased using parental genotypes (trio phased) and pairs phased using fosmid-based haplotyping (non-trio phased).</p>
</caption>
<graphic xlink:href="gkr1042f5"></graphic>
</fig>
</p>
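<p>The concordance of adjacent SNP pairs used in this comparison can be sketched as below, assuming each haplotype is given as a list of 0/1 alleles over the same ordered set of shared SNPs (the function name is hypothetical). A pair is counted as concordant when both haplotypes agree on whether the two neighbouring alleles are in the same relative phase, so the result does not depend on which of the two strands is reported.</p>
<preformat>
```python
from typing import List


def adjacent_concordance(hap_a: List[int], hap_b: List[int]) -> float:
    """Fraction of adjacent SNP pairs whose relative phase agrees in both haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    pairs = len(hap_a) - 1
    if not pairs >= 1:
        raise ValueError("at least two shared SNPs are required")
    concordant = sum(
        1 for i in range(pairs)
        if (hap_a[i] == hap_a[i + 1]) == (hap_b[i] == hap_b[i + 1])
    )
    return concordant / pairs
```
</preformat>
<p>With the counts reported above, 207 discordant pairs among the 21 601 shared SNPs correspond to at most about 0.96% of adjacent pairs, which is the bound quoted for the overall switch error rate.</p>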
<p>We were able to phase an additional 257 245 SNPs that were not resolved in the trio phased haplotypes to achieve a new total of 1 669 081 phased SNPs. The haplotypes combining parental information with fosmid-based haplotyping resolve the phase of 97.9% of SNPs in NA12878 (compared to 82.8% previously) producing almost complete SNP haplotypes in this individual. The corrected haplotypes are available for download (
<ext-link ext-link-type="uri" xlink:href="http://www.molgen.mpg.de/~genetic-variation/SIH/data">http://www.molgen.mpg.de/~genetic-variation/SIH/data</ext-link>
). These new haplotypes increase the phase information within various important functional units or disease-related regions (see
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr1042/DC1">Supplementary Table</ext-link>
). For example an additional 96 849 SNPs are phased within genes, including 816 SNPs that cause non-synonymous mutations or splice site mutations. In particular 847 of the newly phased SNPs produce an amino acid exchange in proteins, 108 of which are predicted to be damaging by PolyPhen-2 (
<xref ref-type="bibr" rid="gkr1042-B33">33</xref>
) and 184 are predicted to be damaging by SIFT (56 predictions overlap). The new gold-standard also contains the phase of an additional 245 GWA SNPs across the genome, a useful addition as it has been shown that haplotype information increases the power of genome-wide association studies (GWAS) (
<xref ref-type="bibr" rid="gkr1042-B34">34</xref>
). An additional 11 395 phased SNPs were contained within genes annotated by the Genetic Association Database (GAD), with single genes containing hundreds of newly phased SNPs (
<xref ref-type="table" rid="gkr1042-T1">Table 1</xref>
). Some specific examples of GAD genes containing many additional phased SNPs and including at least one GWA SNP are shown in
<xref ref-type="fig" rid="gkr1042-F6">Figure 6</xref>
. These examples are associated with various cancers (
<italic>UGT1A</italic>
genes and
<italic>CDH1</italic>
), drug sensitivity (
<italic>UGT1A9</italic>
), and hypertension and osteoporosis (
<italic>COL1A2</italic>
).
<fig id="gkr1042-F6" position="float">
<label>Figure 6.</label>
<caption>
<p>Examples of GAD genes containing many additional phased SNPs. Fosmid-based phasing allows resolution of the phase of significant numbers of additional SNPs which may be particularly useful within disease-associated genes and SNPs detected in genome-wide association studies (GWA SNPs). Here, we show three examples of disease-relevant genes that contain many additional phased SNPs:
<italic>UGT1A</italic>
genes associated with various cancers;
<italic>CDH1</italic>
which plays a role in drug sensitivity and
<italic>COL1A2</italic>
associated with hypertension and osteoporosis. Tracks are taken from the UCSC Genome Browser. SNPs resolved by trio phasing are shown in the top track with SNPs resolved using fosmid-based phasing shown below. SNPs from the GWAS Catalog are shown as green bars in a separate track and those GWA SNPs that are resolved by fosmid-based phasing are indicated by pink arrows. Annotation from the Genetic Association Database (GAD) and OMIM are shown in the lower tracks.</p>
</caption>
<graphic xlink:href="gkr1042f6"></graphic>
</fig>
<table-wrap id="gkr1042-T1" position="float">
<label>Table 1.</label>
<caption>
<p>Comparison of numbers of phased SNPs in functional units or disease-related regions between trio gold-standard and new gold-standard haplotypes</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Trio gold-standard</th>
<th rowspan="1" colspan="1">New gold-standard</th>
<th rowspan="1" colspan="1">Additional SNPs phased</th>
<th rowspan="1" colspan="1">Increase (%)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Total SNPs phased</td>
<td rowspan="1" colspan="1">1 411 836</td>
<td rowspan="1" colspan="1">1 669 081</td>
<td rowspan="1" colspan="1">257 245</td>
<td rowspan="1" colspan="1">18.2</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Genes</td>
<td rowspan="1" colspan="1">506 276</td>
<td rowspan="1" colspan="1">603 125</td>
<td rowspan="1" colspan="1">96 849</td>
<td rowspan="1" colspan="1">19.1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Missense, nonsense, splice variants</td>
<td rowspan="1" colspan="1">4650</td>
<td rowspan="1" colspan="1">5466</td>
<td rowspan="1" colspan="1">816</td>
<td rowspan="1" colspan="1">17.5</td>
</tr>
<tr>
<td rowspan="1" colspan="1">GWA SNPs</td>
<td rowspan="1" colspan="1">1323</td>
<td rowspan="1" colspan="1">1568</td>
<td rowspan="1" colspan="1">245</td>
<td rowspan="1" colspan="1">18.5</td>
</tr>
<tr>
<td rowspan="1" colspan="1">GAD disease genes</td>
<td rowspan="1" colspan="1">63 085</td>
<td rowspan="1" colspan="1">74 480</td>
<td rowspan="1" colspan="1">11 395</td>
<td rowspan="1" colspan="1">18.1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">ENCODE regions</td>
<td rowspan="1" colspan="1">13 140</td>
<td rowspan="1" colspan="1">16 207</td>
<td rowspan="1" colspan="1">3067</td>
<td rowspan="1" colspan="1">23.3</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</sec>
<sec>
<title>DISCUSSION</title>
<p>Haplotyping has been identified as one of the most difficult steps toward full genome completion (
<xref ref-type="bibr" rid="gkr1042-B13">13</xref>
) and therefore the development of an accurate and scalable technique for direct haplotyping of diploid samples is of great interest for researchers in both theoretical and applied genetics and genomics (
<xref ref-type="bibr" rid="gkr1042-B4">4</xref>
). Fosmid pool-based haplotyping, utilizing NGS, is a scalable and cost-effective method for the assembly of whole genomes into large contiguous haplotypes, and in this study we have undertaken a comprehensive assessment of quality for this method. We confirm for the first time that this method allows assembly of highly accurate haplotypes, and we also show that this accuracy is correlated with the allele calling error rate. Hence, improvements in quality and analysis of NGS reads will also increase the accuracy of fosmid pool-based haplotyping. We believe that this makes fosmid pool-based haplotyping a valuable approach for a wide variety of applications of human genome haplotyping, such as cancer genome sequencing, and it can even be applied to the sequencing of other types of organisms. The SIH problem is at the core of the bioinformatics analysis needed for any haplotyping technique based on shotgun sequencing. Although this problem has been studied for a long time, novel experimental approaches, such as fosmid pool-based haplotyping, provide the real data needed to find new directions for improvement. We have compared a wide variety of algorithms for SIH, specifically assessing accuracy, completeness and runtime for eight different algorithms using real sequence data from fosmid pools. Utilizing the genome of an individual for which a gold-standard haplotype already exists has allowed us to comprehensively assess the quality of different methods. For this quality assessment, we have proposed a new metric, which we call quality adjusted N50 (QAN50), that takes into consideration both the completeness and the accuracy of the haplotypes. We find that, according to both switch error rate and QAN50, ReFHap yields the best compromise between completeness, accuracy and computational resources.
We also show that the MEC-based problem formulation used in most of the recently proposed algorithms for SIH can lead to suboptimal haplotypes, even if the MEC problem is solved optimally. This finding justifies the use of heuristic methods not only because of their better efficiency, but also because they yield higher accuracy, and leaves an open door for novel bioinformatics solutions to SIH.</p>
<p>Despite the accuracy of the current gold-standard haplotypes, the phase of almost 20% of the SNPs remained unresolved by trio phasing. Here, utilizing our fosmid-based phasing data in conjunction with the trio haplotypes, we provide a nearly complete new gold-standard haplotype for NA12878, covering 97.9% of heterozygous SNPs that have been genotyped in this widely studied HapMap individual. This has generated phase information for almost all potentially disease-predisposing SNPs, allowing them to be analyzed in their molecular context, an indispensable prerequisite for exploring their potential functional implications and pathophysiology. Furthermore, we were able to place a notable fraction of GWA SNPs into phase context, an important step toward tracking the underlying causative variants. This phase information is particularly useful in NA12878 given that this individual (in the form of a stable lymphoblastoid cell line) has been extensively analyzed in a variety of projects, such as the ENCODE project (
<xref ref-type="bibr" rid="gkr1042-B35">35</xref>
) and the 1000 Genomes Project (
<xref ref-type="bibr" rid="gkr1042-B28">28</xref>
). As data accumulates for this individual on gene expression, histone modifications, transcription factor binding sites and other such functional assays (
<xref ref-type="bibr" rid="gkr1042-B36">36</xref>
), avenues for examining the effect of phase at a functional level are opened. Our molecular phase data from NA12878 can be integrated with data from all other omics levels to develop a more coherent picture of ‘phase-sensitive’ functional genomics (
<xref ref-type="bibr" rid="gkr1042-B37">37</xref>
). These improved haplotypes will be of use to the scientific community when analyzing functional genomic data for NA12878, allowing new insights into the importance of phase.</p>
</sec>
<sec>
<title>ACCESSION NUMBER</title>
<p>European Nucleotide Archive: ERP000819.</p>
</sec>
<sec>
<title>SUPPLEMENTARY DATA</title>
<p>
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr1042/DC1">Supplementary Data</ext-link>
are available at NAR Online: Supplementary Table 1 and Supplementary Figures 1 and 2.</p>
</sec>
<sec id="SEC6">
<title>FUNDING</title>
<p>
<funding-source>The German Federal Ministry of Science and Education (BMBF)</funding-source>
, through the NGFN-2 program and the NGFN-Plus program grants (
<award-id>201GR0414</award-id>
,
<award-id>01GS0863</award-id>
to M.R.H.);
<funding-source>European Research Council Young Investigator</funding-source>
grant (
<award-id>241426</award-id>
to K.V.);
<funding-source>Vlaams Instituut voor Biotechnologie</funding-source>
;
<funding-source>Katholieke Universiteit Leuven</funding-source>
;
<funding-source>Fonds Wetenschappelijk Onderzoek Vlaanderen</funding-source>
; and the
<funding-source>European Molecular Biology Organization through the Odysseus program and the YIP program</funding-source>
. Funding for open access charge:
<funding-source>Max Planck Institute for Molecular Genetics</funding-source>
.</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>Supplementary Data</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_40_5_2041__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_gkr1042_nar-01884-n-2011-File009.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>We acknowledge the authors of (
<xref ref-type="bibr" rid="gkr1042-B19">19</xref>
) for making available an implementation of HapCUT.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="gkr1042-B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Drysdale</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>McGraw</surname>
<given-names>DW</given-names>
</name>
<name>
<surname>Stack</surname>
<given-names>CB</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Judson</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Nandabalan</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Arnold</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ruano</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Liggett</surname>
<given-names>SB</given-names>
</name>
</person-group>
<article-title>Complex promoter and coding region β
<sub>2</sub>
-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2000</year>
<volume>97</volume>
<fpage>10483</fpage>
<lpage>10488</lpage>
<pub-id pub-id-type="pmid">10984540</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hoehe</surname>
<given-names>MR</given-names>
</name>
</person-group>
<article-title>Haplotypes and the systematic analysis of genetic variation in genes and genomes</article-title>
<source>Pharmacogenomics</source>
<year>2003</year>
<volume>4</volume>
<fpage>547</fpage>
<lpage>570</lpage>
<pub-id pub-id-type="pmid">12943464</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hoehe</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Köpke</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wendel</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Rohde</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Flachmeier</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kidd</surname>
<given-names>KK</given-names>
</name>
<name>
<surname>Berrettini</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>GM</given-names>
</name>
</person-group>
<article-title>Sequence variability and candidate gene analysis in complex disease: association of μ opioid receptor gene variation with substance dependence</article-title>
<source>Hum. Mol. Genet.</source>
<year>2000</year>
<volume>9</volume>
<fpage>2895</fpage>
<lpage>2908</lpage>
<pub-id pub-id-type="pmid">11092766</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tewhey</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bansal</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Torkamani</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Topol</surname>
<given-names>EJ</given-names>
</name>
<name>
<surname>Schork</surname>
<given-names>NJ</given-names>
</name>
</person-group>
<article-title>The importance of phase information for human genomics</article-title>
<source>Nat. Rev. Genet.</source>
<year>2011</year>
<volume>12</volume>
<fpage>215</fpage>
<lpage>223</lpage>
<pub-id pub-id-type="pmid">21301473</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marchini</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cutler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Halperin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Qin</surname>
<given-names>ZS</given-names>
</name>
<name>
<surname>Munro</surname>
<given-names>HM</given-names>
</name>
<name>
<surname>Abecasis</surname>
<given-names>GR</given-names>
</name>
<name>
<surname>Donnelly</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A comparison of phasing algorithms for trios and unrelated individuals</article-title>
<source>Am. J. Hum. Genet.</source>
<year>2006</year>
<volume>78</volume>
<fpage>437</fpage>
<lpage>450</lpage>
<pub-id pub-id-type="pmid">16465620</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scheet</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase</article-title>
<source>Am. J. Hum. Genet.</source>
<year>2006</year>
<volume>78</volume>
<fpage>629</fpage>
<lpage>644</lpage>
<pub-id pub-id-type="pmid">16532393</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brinza</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Zelikovsky</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>2SNP: scalable phasing method for trios and unrelated individuals</article-title>
<source>IEEE/ACM Trans. Comput. Biol. Bioinform.</source>
<year>2008</year>
<volume>5</volume>
<fpage>313</fpage>
<lpage>318</lpage>
<pub-id pub-id-type="pmid">18451440</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Rao</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>Q</given-names>
</name>
</person-group>
<article-title>Direct determination of molecular haplotypes by chromosome microdissection</article-title>
<source>Nat. Methods</source>
<year>2010</year>
<volume>7</volume>
<fpage>299</fpage>
<lpage>301</lpage>
<pub-id pub-id-type="pmid">20305652</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fan</surname>
<given-names>HC</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Potanina</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Quake</surname>
<given-names>SR</given-names>
</name>
</person-group>
<article-title>Whole-genome molecular haplotyping of single cells</article-title>
<source>Nat. Biotechnol.</source>
<year>2011</year>
<volume>29</volume>
<fpage>51</fpage>
<lpage>57</lpage>
<pub-id pub-id-type="pmid">21170043</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Levy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>PC</given-names>
</name>
<name>
<surname>Feuk</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Halpern</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Walenz</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Axelrod</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kirkness</surname>
<given-names>EF</given-names>
</name>
<name>
<surname>Denisov</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The diploid genome sequence of an individual human</article-title>
<source>PLoS Biol.</source>
<year>2007</year>
<volume>5</volume>
<fpage>e254</fpage>
<pub-id pub-id-type="pmid">17803354</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bentley</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Balasubramanian</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Swerdlow</surname>
<given-names>HP</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Milton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>KP</given-names>
</name>
<name>
<surname>Evers</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Barnes</surname>
<given-names>CL</given-names>
</name>
<name>
<surname>Bignell</surname>
<given-names>HR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Accurate whole human genome sequencing using reversible terminator chemistry</article-title>
<source>Nature</source>
<year>2008</year>
<volume>456</volume>
<fpage>53</fpage>
<lpage>59</lpage>
<pub-id pub-id-type="pmid">18987734</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McKernan</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Peckham</surname>
<given-names>HE</given-names>
</name>
<name>
<surname>Costa</surname>
<given-names>GL</given-names>
</name>
<name>
<surname>McLaughlin</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tsung</surname>
<given-names>EF</given-names>
</name>
<name>
<surname>Clouser</surname>
<given-names>CR</given-names>
</name>
<name>
<surname>Duncan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Ichikawa</surname>
<given-names>JK</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>CC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding</article-title>
<source>Genome Res.</source>
<year>2009</year>
<volume>19</volume>
<fpage>1527</fpage>
<lpage>1541</lpage>
<pub-id pub-id-type="pmid">19546169</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Snyder</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Personal genome sequencing: current approaches and challenges</article-title>
<source>Genes Dev.</source>
<year>2010</year>
<volume>24</volume>
<fpage>423</fpage>
<lpage>431</lpage>
<pub-id pub-id-type="pmid">20194435</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Burgtorf</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kepper</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hoehe</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Schmitt</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Reinhardt</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lehrach</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Sauer</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Clone-based systematic haplotyping (CSH): a procedure for physical haplotyping of whole genomes</article-title>
<source>Genome Res.</source>
<year>2003</year>
<volume>13</volume>
<fpage>2717</fpage>
<lpage>2724</lpage>
<pub-id pub-id-type="pmid">14656974</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kitzman</surname>
<given-names>JO</given-names>
</name>
<name>
<surname>MacKenzie</surname>
<given-names>AP</given-names>
</name>
<name>
<surname>Adey</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hiatt</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Patwardhan</surname>
<given-names>RP</given-names>
</name>
<name>
<surname>Sudmant</surname>
<given-names>PH</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Qiu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Eichler</surname>
<given-names>EE</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Haplotype-resolved genome sequencing of a Gujarati Indian individual</article-title>
<source>Nat. Biotechnol.</source>
<year>2011</year>
<volume>29</volume>
<fpage>59</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="pmid">21170042</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Suk</surname>
<given-names>E</given-names>
</name>
<name>
<surname>McEwen</surname>
<given-names>GK</given-names>
</name>
<name>
<surname>Duitama</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Nowick</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Schulz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Palczewski</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Schreiber</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Holloway</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>McLaughlin</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Peckham</surname>
<given-names>HE</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A comprehensively molecular haplotype-resolved genome of a European individual</article-title>
<source>Genome Res.</source>
<year>2011</year>
<volume>21</volume>
<fpage>1672</fpage>
<lpage>1685</lpage>
<pub-id pub-id-type="pmid">21813624</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B17">
<label>17</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Panconesi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Sozio</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Fast Hare: a fast heuristic for single individual SNP haplotype reconstruction</article-title>
<source>Lecture Notes in Computer Science</source>
<year>2004</year>
<publisher-loc>Berlin/Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
<comment>3240/2004, pp. 266–277</comment>
</element-citation>
</ref>
<ref id="gkr1042-B18">
<label>18</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Rizzi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bafna</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Istrail</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lancia</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem</article-title>
<source>Proceedings of the Second International Workshop on Algorithms in Bioinformatics</source>
<year>2002</year>
<publisher-loc>London</publisher-loc>
<publisher-name>Springer</publisher-name>
<comment>2452, 29–43</comment>
</element-citation>
</ref>
<ref id="gkr1042-B19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bansal</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Bafna</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>HapCUT: an efficient and accurate algorithm for the haplotype assembly problem</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>i153</fpage>
<lpage>i159</lpage>
<pub-id pub-id-type="pmid">18689818</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lo</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Bashir</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bansal</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Bafna</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Strobe sequence design for haplotype assembly</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<issue>Suppl. 1</issue>
<fpage>S24</fpage>
<pub-id pub-id-type="pmid">21342554</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B21">
<label>21</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Duitama</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Huebsch</surname>
<given-names>T</given-names>
</name>
<name>
<surname>McEwen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Suk</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hoehe</surname>
<given-names>MR</given-names>
</name>
</person-group>
<article-title>ReFHap: a reliable and fast algorithm for single individual haplotyping</article-title>
<source>BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology</source>
<year>2010</year>
<publisher-loc>Niagara Falls, NY, USA</publisher-loc>
<fpage>160</fpage>
<lpage>169</lpage>
</element-citation>
</ref>
<ref id="gkr1042-B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cilibrasi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Iersel</surname>
<given-names>LV</given-names>
</name>
<name>
<surname>Kelk</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Tromp</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>On the complexity of the single individual SNP haplotyping problem</article-title>
<source>Algorithmica</source>
<year>2005</year>
<volume>49</volume>
<fpage>13</fpage>
<lpage>36</lpage>
</element-citation>
</ref>
<ref id="gkr1042-B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Geraci</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>2217</fpage>
<lpage>2225</lpage>
<pub-id pub-id-type="pmid">20624781</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pipatsrisawat</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Darwiche</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Optimal algorithms for haplotype assembly from whole-genome sequence data</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>i183</fpage>
<lpage>i190</lpage>
<pub-id pub-id-type="pmid">20529904</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B25">
<label>25</label>
<element-citation publication-type="book">
<comment>The International HapMap Consortium. (2007) A second generation human haplotype map of over 3.1 million SNPs.
<italic>Nature</italic>
,
<bold>449</bold>
, 851–861</comment>
</element-citation>
</ref>
<ref id="gkr1042-B26">
<label>26</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Duitama</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Srivastava</surname>
<given-names>PK</given-names>
</name>
<name>
<surname>Măndoiu</surname>
<given-names>II</given-names>
</name>
</person-group>
<article-title>Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data</article-title>
<source>Proceedings of 1st IEEE International Conference on Computational Advances in Bio and Medical Sciences</source>
<year>2011</year>
<publisher-loc>Orlando, FL, USA</publisher-loc>
<fpage>87</fpage>
<lpage>92</lpage>
</element-citation>
</ref>
<ref id="gkr1042-B27">
<label>27</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sahni</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gonzalez</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>P-complete problems and approximate solutions</article-title>
<source>Proceedings of the 15th Annual Symposium on Switching and Automata Theory</source>
<year>1974</year>
<publisher-name>IEEE</publisher-name>
<comment>October 1974, pp.14–16</comment>
</element-citation>
</ref>
<ref id="gkr1042-B28">
<label>28</label>
<element-citation publication-type="book">
<comment>The 1000 Genomes Project Consortium. (2010) A map of human genome variation from population-scale sequencing.
<italic>Nature</italic>
,
<bold>467</bold>
, 1061–1073</comment>
</element-citation>
</ref>
<ref id="gkr1042-B29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Haplotype assembly from aligned weighted SNP fragments</article-title>
<source>Comput. Biol. Chem.</source>
<year>2005</year>
<volume>29</volume>
<fpage>281</fpage>
<lpage>287</lpage>
<pub-id pub-id-type="pmid">16051522</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>A clustering algorithm based on two distance functions for MEC model</article-title>
<source>Comput. Biol. Chem.</source>
<year>2007</year>
<volume>31</volume>
<fpage>148</fpage>
<lpage>150</lpage>
<pub-id pub-id-type="pmid">17363329</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Schweller</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Linear time probabilistic algorithms for the singular haplotype reconstruction problem from SNP fragments</article-title>
<source>J. Comput. Biol.</source>
<year>2008</year>
<volume>15</volume>
<fpage>535</fpage>
<lpage>546</lpage>
<pub-id pub-id-type="pmid">18549306</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Genovese</surname>
<given-names>LM</given-names>
</name>
<name>
<surname>Geraci</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Pellegrini</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>SpeedHap: a fast and accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage</article-title>
<source>IEEE/ACM Trans. Comput. Biol. Bioinform.</source>
<year>2008</year>
<volume>5</volume>
<fpage>492</fpage>
<lpage>502</lpage>
<pub-id pub-id-type="pmid">18989037</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adzhubei</surname>
<given-names>IA</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Peshkin</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ramensky</surname>
<given-names>VE</given-names>
</name>
<name>
<surname>Gerasimova</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kondrashov</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Sunyaev</surname>
<given-names>SR</given-names>
</name>
</person-group>
<article-title>A method and server for predicting damaging missense mutations</article-title>
<source>Nat. Methods</source>
<year>2010</year>
<volume>7</volume>
<fpage>248</fpage>
<lpage>249</lpage>
<pub-id pub-id-type="pmid">20354512</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schaid</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Evaluating associations of haplotypes with traits</article-title>
<source>Genetic Epidemiology</source>
<year>2004</year>
<volume>27</volume>
<fpage>348</fpage>
<lpage>364</lpage>
<pub-id pub-id-type="pmid">15543638</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rosenbloom</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Dreszer</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Pheasant</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Barber</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>LR</given-names>
</name>
<name>
<surname>Pohl</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Raney</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hinrichs</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Zweig</surname>
<given-names>AS</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ENCODE whole-genome data in the UCSC genome browser</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<issue>Suppl. 1</issue>
<fpage>D620</fpage>
<lpage>D625</lpage>
<pub-id pub-id-type="pmid">19920125</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huda</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bowen</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Conley</surname>
<given-names>AB</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>IK</given-names>
</name>
</person-group>
<article-title>Epigenetic regulation of transposable element derived human gene promoters</article-title>
<source>Gene</source>
<year>2010</year>
<volume>475</volume>
<fpage>39</fpage>
<lpage>48</lpage>
<pub-id pub-id-type="pmid">21215797</pub-id>
</element-citation>
</ref>
<ref id="gkr1042-B37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lunshof</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Bobe</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Aach</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Angrist</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Thakuria</surname>
<given-names>JV</given-names>
</name>
<name>
<surname>Vorhaus</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Hoehe</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>GM</given-names>
</name>
</person-group>
<article-title>Personal genomes in progress: from the human genome project to the personal genome project</article-title>
<source>Dialogues in Clin. Neurosci.</source>
<year>2010</year>
<volume>12</volume>
<fpage>47</fpage>
<lpage>60</lpage>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Belgique/explor/OpenAccessBelV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000409  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000409  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Belgique
   |area=    OpenAccessBelV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.25.
Data generation: Thu Dec 1 00:43:49 2016. Site generation: Wed Mar 6 14:51:30 2024