Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000295 ( Pmc/Corpus ); précédent : 0002949; suivant : 0002960 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Positional bias in variant calls against draft reference assemblies</title>
<author>
<name sortKey="Briskine, Roman V" sort="Briskine, Roman V" uniqKey="Briskine R" first="Roman V." last="Briskine">Roman V. Briskine</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1937 0650</institution-id>
<institution-id institution-id-type="GRID">grid.7400.3</institution-id>
<institution>Department of Evolutionary Biology and Environmental Studies,</institution>
<institution>University of Zurich,</institution>
</institution-wrap>
Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Functional Genomics Center Zurich, Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shimizu, Kentaro K" sort="Shimizu, Kentaro K" uniqKey="Shimizu K" first="Kentaro K." last="Shimizu">Kentaro K. Shimizu</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1937 0650</institution-id>
<institution-id institution-id-type="GRID">grid.7400.3</institution-id>
<institution>Department of Evolutionary Biology and Environmental Studies,</institution>
<institution>University of Zurich,</institution>
</institution-wrap>
Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 1033 6139</institution-id>
<institution-id institution-id-type="GRID">grid.268441.d</institution-id>
<institution>Kihara Institute for Biological Research,</institution>
<institution>Yokohama City University,</institution>
</institution-wrap>
641-12 Maioka, Totsuka-ward, Yokohama, 244-0813 Japan</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28351369</idno>
<idno type="pmc">5368935</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5368935</idno>
<idno type="RBID">PMC:5368935</idno>
<idno type="doi">10.1186/s12864-017-3637-2</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000295</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000295</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Positional bias in variant calls against draft reference assemblies</title>
<author>
<name sortKey="Briskine, Roman V" sort="Briskine, Roman V" uniqKey="Briskine R" first="Roman V." last="Briskine">Roman V. Briskine</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1937 0650</institution-id>
<institution-id institution-id-type="GRID">grid.7400.3</institution-id>
<institution>Department of Evolutionary Biology and Environmental Studies,</institution>
<institution>University of Zurich,</institution>
</institution-wrap>
Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Functional Genomics Center Zurich, Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shimizu, Kentaro K" sort="Shimizu, Kentaro K" uniqKey="Shimizu K" first="Kentaro K." last="Shimizu">Kentaro K. Shimizu</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1937 0650</institution-id>
<institution-id institution-id-type="GRID">grid.7400.3</institution-id>
<institution>Department of Evolutionary Biology and Environmental Studies,</institution>
<institution>University of Zurich,</institution>
</institution-wrap>
Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 1033 6139</institution-id>
<institution-id institution-id-type="GRID">grid.268441.d</institution-id>
<institution>Kihara Institute for Biological Research,</institution>
<institution>Yokohama City University,</institution>
</institution-wrap>
641-12 Maioka, Totsuka-ward, Yokohama, 244-0813 Japan</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Whole genome resequencing projects may implement variant calling using draft reference genomes assembled de novo from short-read libraries. Despite lower quality of such assemblies, they allowed researchers to extend a wide range of population genetic and genome-wide association analyses to non-model species. As the variant calling pipelines are complex and involve many software packages, it is important to understand inherent biases and limitations at each step of the analysis.</p>
</sec>
<sec>
<title>Results</title>
<p>In this article, we report a positional bias present in variant calling performed against draft reference assemblies constructed from de Bruijn or string overlap graphs. We assessed how frequently variants appeared at each position counted from ends of a contig or scaffold sequence, and discovered unexpectedly high number of variants at the positions related to the length of either k-mers or reads used for the assembly. We detected the bias in both publicly available draft assemblies from Assemblathon 2 competition as well as in the assemblies we generated from our simulated short-read data. Simulations confirmed that the bias causing variants are predominantly false positives induced by reads from spatially distant repeated sequences. The bias is particularly strong in contig assemblies. Scaffolding does not eliminate the bias but tends to mitigate it because of the changes in variants’ relative positions and alterations in read alignments. The bias can be effectively reduced by filtering out the variants that reside in repetitive elements.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Draft genome sequences generated by several popular assemblers appear to be susceptible to the positional bias potentially affecting many resequencing projects in non-model species. The bias is inherent to the assembly algorithms and arises from their particular handling of repeated sequences. It is recommended to reduce the bias by filtering especially if higher-quality genome assembly cannot be achieved. Our findings can help other researchers to improve the quality of their variant data sets and reduce artefactual findings in downstream analyses.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12864-017-3637-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2164</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28351369</article-id>
<article-id pub-id-type="pmc">5368935</article-id>
<article-id pub-id-type="publisher-id">3637</article-id>
<article-id pub-id-type="doi">10.1186/s12864-017-3637-2</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Positional bias in variant calls against draft reference assemblies</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-6831-3914</contrib-id>
<name>
<surname>Briskine</surname>
<given-names>Roman V.</given-names>
</name>
<address>
<email>roman.briskine@ieu.uzh.ch</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shimizu</surname>
<given-names>Kentaro K.</given-names>
</name>
<address>
<email>kentaro.shimizu@ieu.uzh.ch</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1937 0650</institution-id>
<institution-id institution-id-type="GRID">grid.7400.3</institution-id>
<institution>Department of Evolutionary Biology and Environmental Studies,</institution>
<institution>University of Zurich,</institution>
</institution-wrap>
Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</aff>
<aff id="Aff2">
<label>2</label>
Functional Genomics Center Zurich, Winterthurerstrasse 190, Zurich, CH-8057 Switzerland</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 1033 6139</institution-id>
<institution-id institution-id-type="GRID">grid.268441.d</institution-id>
<institution>Kihara Institute for Biological Research,</institution>
<institution>Yokohama City University,</institution>
</institution-wrap>
641-12 Maioka, Totsuka-ward, Yokohama, 244-0813 Japan</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>28</day>
<month>3</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>28</day>
<month>3</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>18</volume>
<elocation-id>263</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>6</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>3</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2017</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Whole genome resequencing projects may implement variant calling using draft reference genomes assembled de novo from short-read libraries. Despite lower quality of such assemblies, they allowed researchers to extend a wide range of population genetic and genome-wide association analyses to non-model species. As the variant calling pipelines are complex and involve many software packages, it is important to understand inherent biases and limitations at each step of the analysis.</p>
</sec>
<sec>
<title>Results</title>
<p>In this article, we report a positional bias present in variant calling performed against draft reference assemblies constructed from de Bruijn or string overlap graphs. We assessed how frequently variants appeared at each position counted from ends of a contig or scaffold sequence, and discovered unexpectedly high number of variants at the positions related to the length of either k-mers or reads used for the assembly. We detected the bias in both publicly available draft assemblies from Assemblathon 2 competition as well as in the assemblies we generated from our simulated short-read data. Simulations confirmed that the bias causing variants are predominantly false positives induced by reads from spatially distant repeated sequences. The bias is particularly strong in contig assemblies. Scaffolding does not eliminate the bias but tends to mitigate it because of the changes in variants’ relative positions and alterations in read alignments. The bias can be effectively reduced by filtering out the variants that reside in repetitive elements.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Draft genome sequences generated by several popular assemblers appear to be susceptible to the positional bias potentially affecting many resequencing projects in non-model species. The bias is inherent to the assembly algorithms and arises from their particular handling of repeated sequences. It is recommended to reduce the bias by filtering especially if higher-quality genome assembly cannot be achieved. Our findings can help other researchers to improve the quality of their variant data sets and reduce artefactual findings in downstream analyses.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12864-017-3637-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Reseqencing</kwd>
<kwd>Variants</kwd>
<kwd>Polymorphisms</kwd>
<kwd>SNPs</kwd>
<kwd>Positional bias</kwd>
<kwd>Draft reference genome</kwd>
<kwd>Repetitive elements</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100001711</institution-id>
<institution>Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung</institution>
</institution-wrap>
</funding-source>
</award-group>
<award-group>
<funding-source>
<institution>Seventh Framework Programme - PLANT FELLOWS</institution>
</funding-source>
</award-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100006447</institution-id>
<institution>Universität Zürich</institution>
</institution-wrap>
</funding-source>
</award-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100001700</institution-id>
<institution>Ministry of Education, Culture, Sports, Science, and Technology</institution>
</institution-wrap>
</funding-source>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>Plummeting cost of high-throughput sequencing (HTS) allowed population geneticists to analyse hundreds of individuals on the whole genome level (e.g. [
<xref ref-type="bibr" rid="CR1">1</xref>
<xref ref-type="bibr" rid="CR3">3</xref>
]). Moreover, the researchers are no longer limited to the model species as draft genome sequences can be assembled de novo from HTS data (e.g. [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR5">5</xref>
]). The quality of these draft genomes is generally lower than that of the traditionally sequenced genomes [
<xref ref-type="bibr" rid="CR6">6</xref>
] but they are still considered adequate for various types of analysis in population genetics as well as genome-wide association studies.</p>
<p>The core algorithm of a modern genome assembler usually implements either a de Bruijn graph [
<xref ref-type="bibr" rid="CR7">7</xref>
] or a string graph [
<xref ref-type="bibr" rid="CR8">8</xref>
]. Both approaches involve constructing a graph based on sequence overlaps and finding the optimal path through the graph. Such path would correspond to a contiguous assembled sequence known as contig. In a string graph, vertices are represented by reads while a sufficiently long overlap between two reads forms an edge. To construct a de Bruijn graph, each read is split into all possible sequences of length
<italic>k</italic>
(k-mers). The k-mers form vertices of the graph while overlaps between k-mers that are
<italic>k</italic>
−1 bases long become edges. Among the assemblers that implement the string graph approach are SGA [
<xref ref-type="bibr" rid="CR9">9</xref>
] and SAGE [
<xref ref-type="bibr" rid="CR10">10</xref>
] while de Bruijn graphs are incorporated into ABySS [
<xref ref-type="bibr" rid="CR11">11</xref>
], Meraculous [
<xref ref-type="bibr" rid="CR12">12</xref>
], ALLPATHS-LG [
<xref ref-type="bibr" rid="CR13">13</xref>
], and SOAPdenovo2 [
<xref ref-type="bibr" rid="CR14">14</xref>
] among many others.</p>
<p>Pipeline for genome assembly normally includes a rigorous error correction of the reads, which can be done either before the assembly with another tool or during the assembly by the assembler itself. Contig assembly may be followed by scaffolding whereby mate-pair reads that map to two different contigs are used to splice these contigs together into a scaffold. Finally, some pipelines for genome assembly involve merging of the overlapping paired-end reads into longer sequences before the construction of de Bruijn graphs. In fact, ALLPATHS-LG requires that sufficient number of overlapping paired-end reads are present in the input data [
<xref ref-type="bibr" rid="CR13">13</xref>
]. Such approach allows the selection of longer k-mer size for the de Bruijn graph construction, which in turn improves the assembly of repetitive regions [
<xref ref-type="bibr" rid="CR14">14</xref>
].</p>
<p>Here, we report a stark pattern that appears when calling variants against assemblies generated from de Bruijn or string graphs. If paired-end reads are aligned to the assembled contigs, an unexpectedly high number of variants will be called at a certain position from the end of a contig or scaffold. Depending on the assembler’s implementation, this position matches either the k-mer length used for de Bruijn graph construction or the length of reads used in string overlap graphs. Our analyses suggests that the bias is caused by repeated sequences that cannot be successfully resolved by assemblers. While scaffolding mitigates the bias, it does not remove the bias completely and variants are still more likely to appear at the same relative position within contigs incorporated into scaffolds. The most effective approach to attenuate the bias is to remove all variants present in repetitive elements. Since the bias-causing variants are mostly false positives, the bias may have serious implications on downstream analyses performed in resequencing projects [
<xref ref-type="bibr" rid="CR15">15</xref>
,
<xref ref-type="bibr" rid="CR16">16</xref>
].</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<sec id="Sec3">
<title>Whole genome assemblies</title>
<p>A subset of
<italic>Maylandia zebra</italic>
(fish) and
<italic>Boa constrictor</italic>
(snake) whole genome assemblies (before and after scaffolding) submitted by various teams (Table
<xref rid="Tab1" ref-type="table">1</xref>
) as entries to the Assemblathon 2 competition [
<xref ref-type="bibr" rid="CR17">17</xref>
] were downloaded from the official repository. To reduce the extent of post-processing that could potentially obscure the problem, only the teams representing the original assembler developers were chosen. Since the competitive SOAPdenovo2 assembly of the snake genome was generated using mislabelled mate-pair libraries, we downloaded the corrected version that the team made available after the competition (See Additional file 3 in Bradnam et al. [
<xref ref-type="bibr" rid="CR17">17</xref>
]).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>List of analysed assemblies</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Reads</th>
<th align="left">Identifier</th>
<th align="left">Assembler</th>
<th align="left">Pos</th>
<th align="left">k</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Bcon</td>
<td align="left">abyss_9C</td>
<td align="left">ABySS</td>
<td align="right">80</td>
<td align="right">80</td>
</tr>
<tr>
<td align="left">Bcon</td>
<td align="left">merac_6C</td>
<td align="left">Meraculous</td>
<td align="right">71</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Bcon</td>
<td align="left">phus_5C</td>
<td align="left">Phusion</td>
<td align="right">78</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Bcon</td>
<td align="left">sga_7C</td>
<td align="left">SGA</td>
<td align="right">121</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Bcon</td>
<td align="left">soap*</td>
<td align="left">SOAPdenovo2</td>
<td align="right">36</td>
<td align="right">36</td>
</tr>
<tr>
<td align="left">Mzeb</td>
<td align="left">abyss_7C</td>
<td align="left">ABySS</td>
<td align="right">56</td>
<td align="right">56</td>
</tr>
<tr>
<td align="left">Mzeb</td>
<td align="left">allp_6C</td>
<td align="left">ALLPATHS-LG</td>
<td align="right">96</td>
<td align="right">96</td>
</tr>
<tr>
<td align="left">Mzeb</td>
<td align="left">soap_11E</td>
<td align="left">SOAPdenovo2</td>
<td align="right">46</td>
<td align="right">46</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">allp</td>
<td align="left">ALLPATHS-LG</td>
<td align="right">96</td>
<td align="right">96</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">sga_m75</td>
<td align="left">SGA</td>
<td align="right">100</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">sga_m77</td>
<td align="left">SGA</td>
<td align="right">100</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">soap_K69</td>
<td align="left">SOAPdenovo2</td>
<td align="right">70</td>
<td align="right">70</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">soap_K71</td>
<td align="left">SOAPdenovo2</td>
<td align="right">72</td>
<td align="right">72</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>This study focused on a subset of the
<italic>B. constrictor</italic>
(Bcon) and
<italic>M. zebra</italic>
(Mzeb) genome assemblies submitted by the assembler developers to the Assemblathon 2 competition. In addition, we simulated reads from
<italic>A. thaliana</italic>
chromosomes 1 and 2 (Sim) and constructed several assemblies with varying parameters. For SGA, we varied the minimum string overlap (-m 75 and -m 77 for sga_m75 and sga_m77 respectively). For SOAPdenovo2, we set the -K parameter to 69 and 71, which corresponded to k = 70 and 72 for soap_K69 and soap_K71 respectively. ‘Pos’ column shows the position (counted from ends of contigs or scaffolds) where variants occur most frequently. ‘NA’ in the ‘k’ column indicates that the choice of k was not reported and could not be determined from other sources</p>
<p>
<sup>*</sup>
The Bcon assembly by the SOAPdenovo2 team submitted for the competition was assembled using an incorrectly labelled library. We analysed the corrected version that was constructed after the competition [
<xref ref-type="bibr" rid="CR17">17</xref>
]</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>For the alignment against these assemblies, we randomly selected a 400 bp insert library for
<italic>B. constrictor</italic>
(ERR234373) and 180 bp insert library for
<italic>M. zebra</italic>
(SRR077290). Each library was aligned only against the assemblies of its respective species. Both libraries were downloaded from the NCBI Sequence Read Archive.</p>
</sec>
<sec id="Sec4">
<title>Simulated data set</title>
<p>Sequences for chromosomes 1 and 2 of
<italic>Arabidopsis thaliana</italic>
(TAIR10) were downloaded from The Arabidopsis Information Resource website [
<xref ref-type="bibr" rid="CR18">18</xref>
].</p>
<p>To simulate the reads we used SimSeq application that aims to reproduce the biases present in normal Illumina data sets [
<xref ref-type="bibr" rid="CR19">19</xref>
]. We ran the application with default parameters to simulate 15 mln 100 bp paired-end reads with the mean insert size of 180 bp and 5 mln 100 bp mate-pair reads with the mean insert size of 3 kb. Since the combined size of the chosen chromosomes is about 50 Mb, the simulated libraries yielded 60× and 20× coverage respectively.</p>
<p>We assembled the short reads from these libraries using ALLPATHS-LG v52293 [
<xref ref-type="bibr" rid="CR13">13</xref>
] with default parameters. Henceforth, we will refer to this assembly as Sim_allp where ‘Sim’ indicates simulated libraries and ‘allp’ denotes the assembler. We also constructed two assemblies (Sim_soap_K69 and Sim_soap_K71) with SOAPdenovo2 v2.04 [
<xref ref-type="bibr" rid="CR14">14</xref>
] specifying different k-mer lengths. The optimal length (
<italic>K</italic>
=69) was determined by running KmerGenie [
<xref ref-type="bibr" rid="CR20">20</xref>
] for the range of lengths between 39 and 99 with the step of 2. Another length value (
<italic>K</italic>
=71) was selected as the next best length according to the KmerGenie output. In both cases, we ran SOAPdenovo-127mer with the options to resolve repeats (-R) and to drop low-frequency k-mers (-d 1). Finally, we constructed two assemblies (Sim_sga_m75 and Sim_sga_m77) using a string overlap assembler, SGA v0.10.14 [
<xref ref-type="bibr" rid="CR9">9</xref>
]. For both SGA assemblies, we ran error correction with k-mer length set to 41 (-k 41) and the minimum overlap of 55 (-m 55) for the overlap command. The minimum overlap in the assemble command was set to either 75 or 77 (-m 75 or -m 77). Subsequent scaffolding was performed using default parameters as described in the SGA documentation.</p>
<p>To investigate variant calling for resequencing analysis, we downloaded a short-insert library (SRX144851) of
<italic>A. thaliana</italic>
Bs-1 genotype from DNA Data Bank of Japan (DDBJ). The reads were sequenced with Illumina HiSeq 2000 and have the insert size of 202 bp with 101 bp read length [
<xref ref-type="bibr" rid="CR21">21</xref>
].</p>
</sec>
<sec id="Sec5">
<title>Variant calling</title>
<p>Variants were called with GATK v3.4-0 [
<xref ref-type="bibr" rid="CR22">22</xref>
] following the established best practices [
<xref ref-type="bibr" rid="CR23">23</xref>
,
<xref ref-type="bibr" rid="CR24">24</xref>
]. Briefly, the corresponding short insert library was aligned against the assembled sequences using BWA v0.7.12 [
<xref ref-type="bibr" rid="CR25">25</xref>
]. After marking the duplicates, the reads were locally realigned around insertions/deletions (indels) and variants were called with HaplotypeCaller. The obtained raw calls were filtered using the criteria recommended for cases when variant calibration was not possible (Table
<xref rid="Tab2" ref-type="table">2</xref>
).
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>SNP statistics reported by GATK and thresholds used for filtering</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Abbreviation</th>
<th align="left">Rel</th>
<th align="left">Threshold</th>
<th align="left">Full name</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">QD</td>
<td align="left"><</td>
<td align="left">2.0</td>
<td align="justify">Quality by Depth</td>
</tr>
<tr>
<td align="left">MQ</td>
<td align="left"><</td>
<td align="left">40.0</td>
<td align="justify">Root mean square of Mapping Quality</td>
</tr>
<tr>
<td align="left">MQRankSum</td>
<td align="left"><</td>
<td align="left">−12.5</td>
<td align="justify">Mapping Quality Rank Sum test</td>
</tr>
<tr>
<td align="left">FS</td>
<td align="left">></td>
<td align="left">60.0</td>
<td align="justify">Fisher’s exact test for Strand bias</td>
</tr>
<tr>
<td align="left">SOR</td>
<td align="left">></td>
<td align="left">4.0</td>
<td align="justify">Strand bias Odds Ratio</td>
</tr>
<tr>
<td align="left">ReadPosRankSum</td>
<td align="left"><</td>
<td align="left">−8.0</td>
<td align="justify">Read Position Rank Sum test</td>
</tr>
<tr>
<td align="left">DP</td>
<td align="left">></td>
<td align="left">200.0</td>
<td align="justify">Depth of Coverage</td>
</tr>
<tr>
<td align="left">GQ</td>
<td align="left"><</td>
<td align="left">20.0</td>
<td align="justify">Genotype Quality</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>SNPs with statistics above or below (Rel) the corresponding threshold were removed from consideration. For detailed description of these statistics and justification for the threshold selection, see Van der Auwera et al. [
<xref ref-type="bibr" rid="CR24">24</xref>
] and GATK documentation at
<ext-link ext-link-type="uri" xlink:href="https://www.broadinstitute.org/gatk/">https://www.broadinstitute.org/gatk/</ext-link>
</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>To make sure that variants in the regions with excessively high coverage do not affect the results, we separately applied more restrictive coverage filters to the variants called in the fish and snake genomes. The thresholds were set to the expected coverage calculated using the Lander-Waterman equation
<italic>C</italic>
=
<italic>L</italic>
<italic>N</italic>
/
<italic>G</italic>
[
<xref ref-type="bibr" rid="CR26">26</xref>
], where
<italic>C</italic>
is the expected coverage,
<italic>L</italic>
is the read length,
<italic>N</italic>
is the number of reads, and
<italic>G</italic>
is the estimated haploid genome length. Based on the reported genome lengths of 1.6 and 1 Gb [
<xref ref-type="bibr" rid="CR17">17</xref>
], the expected coverage was 5 × and 8 × for the snake and fish genomes respectively.</p>
</sec>
<sec id="Sec6">
<title>Alternative read alignment and variant calling tools</title>
<p>To exclude the possibility that the bias was caused by the tools we used for read alignment and variant calling (BWA and GATK), we also analysed the variants detected against simulated contig assemblies with alternative tools. We ran GATK on the alignments produced by each NextGenMap [
<xref ref-type="bibr" rid="CR27">27</xref>
], GSNAP [
<xref ref-type="bibr" rid="CR28">28</xref>
], and Bowtie2 [
<xref ref-type="bibr" rid="CR29">29</xref>
]. We accepted the default parameters for each of these read alignment applications. Since we suspected that Bowtie2 might have lower sensitivity than the other aligners, we also ran Bowtie2 with the default parameters to align Bs-1 reads against the simulated contig assemblies.</p>
<p>We relied on BWA alignments to test the FreeBayes [
<xref ref-type="bibr" rid="CR30">30</xref>
] and Samtools mpileup [
<xref ref-type="bibr" rid="CR31">31</xref>
] variant callers. We ran the multithreaded version of FreeBayes and specified the same maximum coverage threshold (200) as with GATK. Following the recommendations from the FreeBayes documentation, variant calls were subsequently filtered using a minimum quality threshold (QUAL < 20). We also used the default parameters for Samtools mpileup except for the maximum indel coverage, which we set to 200. The resultant data was processed with bcftools [
<xref ref-type="bibr" rid="CR31">31</xref>
] to produce a VCF file and filter out the variants with low quality (QUAL < 20) and abnormally high coverage (DP > 200).</p>
</sec>
<sec id="Sec7">
<title>Scaffold position transformation (coordinate mapping)</title>
<p>To transform variant scaffold positions to contig positions in ALLPATHS-LG assemblies, we employed the information from the final.summary file that ALLPATHS-LG generates by default. For each scaffold, the file reports scaffold length, list of included contigs with their respective lengths, and gap sizes. Overlapping contigs have negative gap sizes. We noticed that occasionally a scaffold in the final assembly extends beyond the length specified in the summary file. In such cases, we reported the SNPs located beyond the reported scaffold length as ‘untransformed’ because their coordinates could not be mapped to any contigs.</p>
<p>For the transformation of scaffold positions in SOAPdenovo2 assemblies, we parsed the file with the contigPosInscaff extension. The file is automatically generated by SOAPdenovo2 during scaffolding. For each scaffold, the file lists one or more contig entries. Each entry specifies contig id, starting position within scaffold (origin 0), contig orientation, and ending position within scaffold. We also used the contig length information from the file with contig extension because SOAPdenovo2 often inserted gaps between overlapping contigs in a scaffold. Such a gap would effectively split one of the contigs into two parts making it impossible to derive a transformation map exclusively from the contigPosInscaff file.</p>
<p>Our scaffold coordinate transformation would produce two contig coordinates if a variant is located in a scaffold region where two contigs overlap. We chose this approach because such scaffold variants in principle should have two corresponding contig variants, one on each of the overlapping contigs.</p>
</sec>
<sec id="Sec8">
<title>Identification and filtering of repetitive elements</title>
<p>We identified repetitive elements and low complexity sequences in the simulated assemblies using RepeatMasker v4.0.5 [
<xref ref-type="bibr" rid="CR32">32</xref>
] with the library version 20140131 [
<xref ref-type="bibr" rid="CR33">33</xref>
], NCBI search engine, and ‘viridiplantae’ species filter. To calculate the number of position
<italic>k</italic>
SNPs appearing in repetitive elements, we checked the SNP coordinates against the repetitive sequence ranges reported by RepeatMasker. If a repetitive element spanned position
<italic>k</italic>
at both ends of a contig and contained two position
<italic>k</italic>
SNPs (one at each end), we counted it as a single occurrence, i.e. for each repetitive element sequence the count was either 0 or 1. To adjust for family frequency, we divided the number of position
<italic>k</italic>
SNPs appearing within that family by the total number of the family sequences present in the assembly. SNP filtering process entailed the removal of all variants located within any of the identified repetitive elements or low complexity sequences.</p>
</sec>
</sec>
<sec id="Sec9" sec-type="results">
<title>Results</title>
<sec id="Sec10">
<title>Positional bias in variant distribution within contigs</title>
<p>In our analysis, we used the publicly available data from the Assemblathon 2 competition [
<xref ref-type="bibr" rid="CR17">17</xref>
]. For each of the two analysed species, we randomly selected a single short-insert library among those provided to the teams for assembly and aligned the reads against each of the chosen contig assemblies submitted for the competition. In each case, both the aligned reads and the assembly came from the same individual (species). Hence, any variant calls would be false positives and the distribution of their positions within contigs should be approximately uniform.</p>
<p>After calling the variants, we calculated how frequently they appeared at each position counted from both ends of contigs (Table
<xref rid="Tab1" ref-type="table">1</xref>
). Frequencies were estimated separately for single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). All of the tested assemblies showed positional bias in the distribution of variant calls (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S1–S3). For some assemblies the bias was evident in the distribution of both SNPs and indels while others exhibited only SNP distribution bias. Since indels are typically less frequent, more difficult to call and, therefore, less reliable than SNPs, we will focus on the SNP distribution bias.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Distribution of SNP positions at the 5
<sup></sup>
end of contigs. The analysis includes a subset of five
<italic>B. constrictor</italic>
and three
<italic>M. zebra</italic>
assemblies submitted to Assemblathon 2 [
<xref ref-type="bibr" rid="CR17">17</xref>
]. The description of assembly identifiers is given in Table
<xref rid="Tab1" ref-type="table">1</xref>
</p>
</caption>
<graphic xlink:href="12864_2017_3637_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>In addition to the Assemblathon 2 entries, we simulated 180 bp paired-end library and 3 kb mate-pair library using chromosomes 1 and 2 of
<italic>Arabidopsis thaliana</italic>
. We assembled the simulated paired-end library into contigs separately with several assemblers. We called variants and analysed the results using the same approach as with Assemblathon 2 data. The simulated data set provided several advantages. First, it excluded the possibility that aligning additional paired-end libraries used for assembly would affect variant calling. Assemblathon 2 teams had access to several paired-end libraries while we only aligned a single one to call variants against those assemblies. The simulated data set contained only a single short-insert paired-end library, which was subsequently aligned to the de novo assemblies. Second, we knew the exact origin of each simulated read, which helped us explain why some variants were called. Third, available HTS data for a different
<italic>A. thaliana</italic>
genotype enabled us to explore the effects on variant calling for resequencing analysis. Finally, we were able to run more analyses because of the smaller data set size.</p>
<p>In the literature describing de Bruijn graph approaches to assembly, k-mer length may refer to either the length of sequences at graph vertices [
<xref ref-type="bibr" rid="CR11">11</xref>
,
<xref ref-type="bibr" rid="CR34">34</xref>
] or the length of sequence overlaps at graph edges [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR14">14</xref>
]. To avoid the confusion, we will use the first definition and denote such length as
<italic>k</italic>
. The length of sequence overlaps at graph edges will be denoted as
<italic>K</italic>
, i.e.
<italic>K</italic>
=
<italic>k</italic>
−1 for de Bruijn graphs.</p>
<p>Only three teams (ABySS, ALLPATHS-LG, and SOAPdenovo2) reported k-mer lengths used for assembly (See Additional file 3 in Bradnam et al. [
<xref ref-type="bibr" rid="CR17">17</xref>
]). In all cases, the position where SNPs occurred most frequently matched the reported k-mer length (Table
<xref rid="Tab1" ref-type="table">1</xref>
). This was independent of the tool and the actual
<italic>k</italic>
value used in the assembly (ABySS and SOAPdenovo2 teams each specified different
<italic>k</italic>
values for their respective
<italic>B. constrictor</italic>
and
<italic>M. zebra</italic>
assemblies). When assembling our simulated paired-end reads with SOAPdenovo2, we changed the
<italic>K</italic>
configuration parameter from -K 69 to -K 71 and the most frequent SNP position shifted from 70 to 72 (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4).
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Distribution of SNP positions at the 5
<sup></sup>
end of contigs in the simulated data set. The description of assembly identifiers is given in Table
<xref rid="Tab1" ref-type="table">1</xref>
</p>
</caption>
<graphic xlink:href="12864_2017_3637_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p>Among the assemblers we analysed, SGA [
<xref ref-type="bibr" rid="CR9">9</xref>
] was based on string graphs rather than de Bruijn graphs. In this assembler, the parameter equivalent to the k-mer length would be the string overlap length used for graph construction. It can be specified as -m parameter for the assemble command. Changing this parameter did not cause a shift in the peak position (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4). Instead, the peak position was linked to read lengths. For the Assemblathon 2 entry and the simulated data sets, the read lengths were 121 and 100 bp respectively. Both values matched the most frequent SNP position of their respective assembly (Table
<xref rid="Tab1" ref-type="table">1</xref>
). Hereafter, we will use the read length to establish the
<italic>k</italic>
position in SGA assemblies.</p>
<p>To make sure that the positional bias is not limited to the very short contigs with potentially poor quality, we removed all contigs shorter than 500 bp and repeated the analysis. The bias was still clearly visible in all cases (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S5–S6) indicating that such filtering is not effective for bias reduction.</p>
<p>To remove the variants called in the regions with abnormally high coverage, we used the 200 × threshold (DP in Table
<xref rid="Tab2" ref-type="table">2</xref>
). The value was higher than the expected coverage calculated with Lander-Waterman equation [
<xref ref-type="bibr" rid="CR26">26</xref>
]. However, the bias was still apparent even among the variants with coverage that did not exceed the expected levels (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S7–S8).</p>
<p>We also compared the distributions of various quality statistics reported by GATK (Table
<xref rid="Tab2" ref-type="table">2</xref>
) for SNPs at position
<italic>k</italic>
to those reported for SNPs at other positions, and found them very similar (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S9–S16). While the differences might be statistically significant due to large sample size, the effect size is minimal and considerable overlap between distributions makes SNP discrimination unfeasible. The only statistic that could be possibly used to reduce the bias without substantial effect on other SNPs is mapping quality (MQ). Even then, the results would be largely dependent on the assembler choice and the underlying data set. In particular, the distribution of MQ for SNPs in non-
<italic>k</italic>
positions had fairly long left tails in
<italic>M. zebra</italic>
(especially Mzeb_allp_6C; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S10) while the MQ distributions for
<italic>k</italic>
and non-
<italic>k</italic>
SNPs in the simulated data set were hardly separable regardless of the assembler (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S17).</p>
</sec>
<sec id="Sec11">
<title>Scaffolding does not eliminate the positional bias</title>
<p>Assembly pipelines generally include a step to concatenate contigs into longer scaffold sequences based on mate-pair read alignments. We discovered that the positional bias was attenuated but still persisted in the
<italic>B. constrictor</italic>
and
<italic>M. zebra</italic>
assemblies after scaffolding (Table
<xref rid="Tab3" ref-type="table">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S18–S19). The only assembly where the bias appeared less evident was Mzeb_allp_6C but even there the peaks at the position
<italic>k</italic>
were notably high. Similar results, including the greatest reduction of the bias appearance in the ALLPATHS-LG assembly, were observed with the simulated data set (Table
<xref rid="Tab3" ref-type="table">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S20–S21). Removal of the short scaffolds (less than 500 bp) from the
<italic>B. constrictor</italic>
and
<italic>M. zebra</italic>
assemblies did not completely eliminate the bias either (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S22–S23).
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>SNP counts at position
<italic>k</italic>
in the simulated data sets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Reads</th>
<th align="left">Assembler</th>
<th align="left">Contig</th>
<th align="left">Scaffold</th>
<th align="left">Transf</th>
<th align="left">Untransf</th>
<th align="left">Shared</th>
<th align="left">Masked</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Sim</td>
<td align="left">allp</td>
<td align="right">57</td>
<td align="right">7</td>
<td align="right">21</td>
<td align="right">1</td>
<td align="right">18</td>
<td align="right">12</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">sga_m75</td>
<td align="right">504</td>
<td align="right">463</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">sga_m77</td>
<td align="right">513</td>
<td align="right">479</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">soap_K69</td>
<td align="right">1481</td>
<td align="right">698</td>
<td align="right">255</td>
<td align="right">451</td>
<td align="right">137</td>
<td align="right">44</td>
</tr>
<tr>
<td align="left">Sim</td>
<td align="left">soap_K71</td>
<td align="right">1469</td>
<td align="right">769</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Bs-1</td>
<td align="left">allp</td>
<td align="right">66</td>
<td align="right">9</td>
<td align="right">35</td>
<td align="right">0</td>
<td align="right">23</td>
<td align="right">15</td>
</tr>
<tr>
<td align="left">Bs-1</td>
<td align="left">sga_m75</td>
<td align="right">899</td>
<td align="right">711</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Bs-1</td>
<td align="left">sga_m77</td>
<td align="right">871</td>
<td align="right">687</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
</tr>
<tr>
<td align="left">Bs-1</td>
<td align="left">soap_K69</td>
<td align="right">1916</td>
<td align="right">670</td>
<td align="right">365</td>
<td align="right">429</td>
<td align="right">210</td>
<td align="right">95</td>
</tr>
<tr>
<td align="left">Bs-1</td>
<td align="left">soap_K71</td>
<td align="right">1909</td>
<td align="right">692</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
<td align="right">NA</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Reads column indicates the origin of aligned reads: ‘Sim’ refers to the simulated paired-end reads while ‘Bs-1’ denotes the actual
<italic>A. thaliana</italic>
Bs-1 short-insert library [
<xref ref-type="bibr" rid="CR21">21</xref>
]. Contig and Scaffold columns show the number of SNPs at position
<italic>k</italic>
in the respective contig and scaffold assemblies. ‘Transf’ column shows the number of SNPs at position
<italic>k</italic>
called against scaffolds after the scaffold coordinates were transformed to contig coordinates. Only Sim_allp and Sim_soap_K69 scaffold coordinates were transformed. ‘Untransf’ column indicates the number of SNPs that failed to transform because of contig length threshold (Sim_soap_K69) or scaffold being extended beyond the length specified in the assembler’s scaffold map (Sim_allp). ‘Shared’ column reports the number of SNPs present in both Contig and Tranformed sets. ‘Masked’ column shows the SNPs that appear in Contig and Transformed but not in Scaffold because of the change in their relative positions. All counts except ‘Untransf’ are for SNPs in position
<italic>k</italic>
</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>There are two mechanisms that may cause the bias reduction after scaffolding. First, many contigs will be placed in the middle of scaffolds. Thus, many SNPs that were previously present near contig ends would appear in the middle of scaffold sequences as well. Consequently, some SNPs that contributed to the bias before would emerge as SNPs that do not cause the bias because they would not be in position
<italic>k</italic>
relative to scaffold ends. This mechanism makes the bias less apparent but it does not actually decrease it because corresponding SNPs still persist in the scaffold assembly (hereafter, we will refer to this phenomenon as “bias masking”). The second mechanism is triggered by alterations in read alignments. Scaffolding typically involves concatenation of overlapping contigs and gap filling between adjacent contigs. Both actions may alter read alignments in the affected regions. In particular, reads that previously caused SNP calls on individual contigs would not align sufficiently well to the same contigs or would align better elsewhere after scaffolding. Thus, the number of SNPs in both
<italic>k</italic>
and non-
<italic>k</italic>
positions may diminish. Such bias reduction would be real because it effectively eliminates the bias causing SNPs.</p>
<p>We can measure the effects of these two mechanisms by analysing the intersection between SNPs called before and after scaffolding. To compute the intersection, we have to transform SNP scaffold coordinates into contig coordinates using a scaffold map reported by the assembler. Bias masking occurs when a scaffold SNP from a non-
<italic>k</italic>
position appears in the position
<italic>k</italic>
after transformation and there is a contig SNP at the same position. Genuine bias reduction takes place when a contig SNP in position
<italic>k</italic>
does not have a corresponding scaffold SNP.</p>
<p>We transformed scaffold coordinates into contig coordinates in two assemblies constructed from the simulated reads. In the Sim_allp assembly, scaffolding reduced the number of SNPs in position
<italic>k</italic>
from 57 to 7 (Table
<xref rid="Tab3" ref-type="table">3</xref>
). After transforming scaffold coordinates to contig coordinates, the position bias was clearly visible (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S24–S25) as the number of scaffold SNPs in position
<italic>k</italic>
increased to 21 (Table
<xref rid="Tab3" ref-type="table">3</xref>
). Out of those, 18 were also called against the contig assembly (shared SNPs; panel ‘Both’ in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S26) while 3 were unique to the scaffold assembly (panel ‘Scaffold Only’ in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S26; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S27). Coordinates for one SNP could not be transformed because the scaffold was longer than specified in the assembler’s scaffold map. Out of the 57 contig SNPs present in position
<italic>k</italic>
, 12 (21%) emerged in the scaffold assembly at non-
<italic>k</italic>
positions relative to scaffold ends (‘Masked’ column in Table
<xref rid="Tab3" ref-type="table">3</xref>
) while 39 (68%) SNPs completely disappeared due to altered read alignments.</p>
<p>In the Sim_soap_K69 assembly, scaffolding also reduced the number of position
<italic>k</italic>
SNPs (Table
<xref rid="Tab3" ref-type="table">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S28–S29) but the underlying processes were different. A large number of SNP coordinates (451 at position
<italic>k</italic>
) were not transformed (‘Untransf’ column in Table
<xref rid="Tab3" ref-type="table">3</xref>
) because they corresponded to locations on contigs shorter than 200 bp. Those contigs were excluded from the Sim_soap_K69 contig assembly for SNP calling while the assembler still used them for scaffolding. Therefore, the ‘untransformed’ scaffold SNPs could not overlap with the contig SNPs in principle and we ignored them when calculating the overlap between contig and transformed scaffold SNPs (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S28–S29). We also noticed that SOAPdenovo2 tends to introduce gaps within overlapping contigs. A gap is placed between the end of one contig and the non-overlapping part of the other contig, which breaks the second contig into two non-contiguous parts. This leads to considerable changes in read alignments and consequently yields many SNPs unique to the contig assembly (Table
<xref rid="Tab3" ref-type="table">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S28–S29). These two reasons also explain the large difference between the number of contig and coordinate-transformed scaffold SNPs (‘transf’ column) but much smaller difference between contig and scaffold SNP counts (Table
<xref rid="Tab3" ref-type="table">3</xref>
). It also explains why the bias masking level is so low; only 44 (3%) scaffold SNPs in non-
<italic>k</italic>
positions could be matched to contig SNPs in position
<italic>k</italic>
.</p>
</sec>
<sec id="Sec12">
<title>Positional bias with alternative tools</title>
<p>To make sure that the positional bias was not caused by one of the selected tools (BWA and GATK), we executed variant calling pipelines with alternative read alignment or variant discovery applications on the simulated contig assemblies. Running GATK with either NextGenMap [
<xref ref-type="bibr" rid="CR27">27</xref>
] or GSNAP [
<xref ref-type="bibr" rid="CR28">28</xref>
] still resulted in clearly visible peaks at the expected locations (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S30–S33). When using GATK with Bowtie2 [
<xref ref-type="bibr" rid="CR29">29</xref>
], the peaks were much smaller in SGA and SOAPdenovo2 assemblies and the bias was completely absent in the ALLPATHS-LG assembly (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S34–S35). However, the number of variants at other positions was much smaller as well.</p>
<p>Previous reports indicated that Bowtie2 was less sensitive to sequence mismatches [
<xref ref-type="bibr" rid="CR35">35</xref>
,
<xref ref-type="bibr" rid="CR36">36</xref>
] and may have higher error rates [
<xref ref-type="bibr" rid="CR37">37</xref>
]. To check the sensitivity of Bowtie2 on our simulated data set, we ran the Bowtie2 – GATK pipeline using resequencing data (see the next section for additional results). Compared to the BWA – GATK pipeline, we saw a considerable reduction in the total number of variant calls (Table
<xref rid="Tab4" ref-type="table">4</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S36–S37). This is in contrast to Cornish and Guda [
<xref ref-type="bibr" rid="CR38">38</xref>
] who reported only a minor decrease in SNPs between similar pipelines. These results should be interpreted with caution as we do not know the actual number of errors in each case.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>SNPs called by GATK when using BWA or Bowtie2 (Bt2) to align Bs-1 reads against simulated contig assemblies</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Assembly</th>
<th align="left">Bt2 Total</th>
<th align="left">Bt2 Pos k</th>
<th align="left">BWA Total</th>
<th align="left">BWA Pos k</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Sim_allp</td>
<td align="right">97,020</td>
<td align="right">3</td>
<td align="right">488,839</td>
<td align="right">69</td>
</tr>
<tr>
<td align="left">Sim_sga_m75</td>
<td align="right">97,740</td>
<td align="right">28</td>
<td align="right">500,829</td>
<td align="right">948</td>
</tr>
<tr>
<td align="left">Sim_sga_m77</td>
<td align="right">97,727</td>
<td align="right">27</td>
<td align="right">501,055</td>
<td align="right">917</td>
</tr>
<tr>
<td align="left">Sim_soap_K69</td>
<td align="right">96,485</td>
<td align="right">35</td>
<td align="right">497,002</td>
<td align="right">2016</td>
</tr>
<tr>
<td align="left">Sim_soap_K71</td>
<td align="right">96,003</td>
<td align="right">44</td>
<td align="right">497,486</td>
<td align="right">2004</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Total columns show the total number of SNPs in any position while ‘Pos k’ columns show the number of SNPs at position
<italic>k</italic>
</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Variants detected by each FreeBayes [
<xref ref-type="bibr" rid="CR30">30</xref>
] and Samtools mpileup [
<xref ref-type="bibr" rid="CR31">31</xref>
] using BWA alignments also exhibited strong positional bias (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S38–S41). Since Samtools skips anomalous read pairs and orphan reads by default, filtering out improperly-paired reads would not remove the positional bias.</p>
</sec>
<sec id="Sec13">
<title>Positional bias is present in resequencing analysis</title>
<p>In the previous subsections, reads used for variant calling came from the same individual as the reads used for the reference genome assembly. Therefore, any called variants would be considered false positives. In this subsection, we investigate the alignment of reads from a different individual to the reference genome. In this case, some of the called variants should be real but there could also be false positives that would potentially manifest themselves as the positional bias described in this study. This analysis imitates the practical use of draft de novo assemblies in resequencing projects that focus on non-model species.</p>
<p>To verify that the positional bias would still be present in variant calling performed with reads from a different individual, we aligned a short-insert library of
<italic>A. thaliana</italic>
Bs-1 genotype [
<xref ref-type="bibr" rid="CR21">21</xref>
] against our assemblies constructed from the simulated read data. The bias was present in all contig assemblies (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S42–S43). It was also clearly visible in both Sim_sga and both Sim_soap scaffold assemblies at the expected locations (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S44–S45). Meanwhile, the bias essentially disappeared in the Sim_allp scaffold assembly, probably because of the assembly’s high quality (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S44–S45). As before, we transformed scaffold positions into contig positions for all scaffold SNPs and observed partial recovery of the bias in Sim_allp (Table
<xref rid="Tab3" ref-type="table">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S46–S47). We traced a large number of SNPs on Sim_soap_K69 scaffolds to contigs shorter than 200 bp (‘Untransf’ column in Table
<xref rid="Tab3" ref-type="table">3</xref>
). We also found that 15 and 95 SNPs (23 and 5%) from non-
<italic>k</italic>
positions in Sim_allp and Sim_soap_K69 scaffolds respectively corresponded to position
<italic>k</italic>
SNPs in the contig assemblies (bias masking). These results are consistent with those uncovered through the alignment of our simulated reads.</p>
<p>We tested whether the variant positions in the Bs-1 alignments overlapped with the variant positions in the simulated read alignments. If they overlap well, removing these positions may provide a useful solution for the bias reduction. Despite the partial overlap (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S48–S49; ‘Both’ row in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S50–S51), Sim_allp and Sim_soap_K69 data sets possessed abnormally high number of position
<italic>k</italic>
SNPs that were unique to the Bs-1 alignments (‘Bs-1’ row in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S50–S51). These SNPs comprise the variants that would remain after the filtering of the shared SNPs. The remaining bias was particularly strong in case of Sim_soap_K69. Similar pattern appeared when positions of scaffold SNPs were transformed into contig coordinates (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S52–S55). Even though the filtering essentially eliminated the bias from the 5
<sup></sup>
end of the Sim_allp assembly (‘Bs-1’ row in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S54), the results were not universal and other solutions would be needed.</p>
</sec>
<sec id="Sec14">
<title>Considerable reduction in bias achieved by repetitive element filtering</title>
<p>For each assembly created from the simulated read data, we traced the origins of all reads that had primary alignments to SNPs at position
<italic>k</italic>
with the mapping quality of at least 40 and without any insertions or deletions compared to the reference (CIGAR = 100M). Such conservative filtering ensured that the analysed reads aligned well only to a single location in the assembly and the alignments were not coerced by read clipping. If all such reads had a single origin per variant, it would suggest that the corresponding variants were caused by either simulated sequencing errors or poorly aligned reads from remote genomic regions with modest similarity. If the reads had multiple origins, the variants were likely caused by highly similar repeated or homologous sequences in the genome.</p>
<p>Fewer than 6% of the SNPs had the selected reads coming from the same origin. It suggests that the bias is not caused by poor alignments or sequencing errors but rather by repetitive or homologous sequences. Such sequences would lead to multiple potential path extensions through de Bruijn or string graphs. The extensions would also have equally high support and may form either a junction or a so-called “bubble” in the path [
<xref ref-type="bibr" rid="CR11">11</xref>
]. Further investigation is needed to determine why the assemblers tend to terminate the path extension shortly after a bubble or a junction is formed.</p>
<p>We used RepeatMasker [
<xref ref-type="bibr" rid="CR32">32</xref>
] to identify repetitive elements in the assemblies constructed from the simulated reads and discovered that the vast majority of SNPs near contig or scaffold ends were within those sequences (Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S56–S66). The pattern persisted whether the SNPs were called against contigs (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S56–S57) or scaffolds (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S58–S61) and whether the SNPs were called from the alignment of our simulated reads (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S56–S61) or from the alignment of the actual Bs-1 reads (Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S62–S66).
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Distribution of SNP positions at the 5
<sup></sup>
end of contigs in the Bs-1 data set with repetitive element annotation. Colour indicates whether the SNPs are within repetitive sequences (
<italic>blue</italic>
) or not (
<italic>orange</italic>
). SNPs were called from the Bs-1 read alignments. Repetitive elements included all sequences reported by RepeatMasker</p>
</caption>
<graphic xlink:href="12864_2017_3637_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>In the simulated contig data set, position
<italic>k</italic>
SNPs appeared within diverse repetitive elements. The largest number of SNPs were in the DNA and long terminal repeat (LTR) families (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S67), which had the highest representation in the
<italic>A. thaliana</italic>
genome [
<xref ref-type="bibr" rid="CR39">39</xref>
]. When adjusted for family frequency, SNPs were more likely to appear in the satellites (ALLPATHS and SOAPdenovo2 assemblies) or DNA repeats (SGA assemblies). However, none of the families were strongly overrepresented.</p>
<p>When the SNPs located in repetitive elements were filtered out from the set obtained through the alignment of the simulated reads, the positional bias was either completely eliminated as in the case of Sim_allp or reduced to negligible levels as in the case of Sim_soap and Sim_sga assemblies (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S68–S73). The remaining bias might be due to unannotated repetitive elements or other homologous sequences. The same tendency was observed with the SNPs called from the alignment of the actual Bs-1 reads (Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
; Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S74–S78) except that more SNPs remained overall. This is expected because the alignment of the simulated reads can only produce false positives while the alignment of Bs-1 reads should additionally yield real SNPs.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Distribution of SNP positions at the 5
<sup></sup>
end of contigs in the Bs-1 data set after repetitive element filtering. SNPs were called from the Bs-1 read alignments and SNPs located in the annotated repetitive elements were removed</p>
</caption>
<graphic xlink:href="12864_2017_3637_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>Finally, we checked whether it would be possible to reduce the bias even further by removing both SNPs in repetitive elements and SNPs produced with the simulated read alignments from the Bs-1 SNP set. Since the bias after the repetitive element filtering was already absent in Sim_allp and barely detectable in Sim_sga assemblies, we only report the results for Sim_soap_K69. After transforming scaffold coordinates to contig coordinates, there were 25 SNPs called at position
<italic>k</italic>
from Bs-1 read alignments. Out of them, 6 SNPs were also produced by the simulated read alignments (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S79). Overall effect of the additional filtering is fairly minor (row ‘Both’ in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S80) but the bias almost entirely disappears (row ‘Bs-1 Only’ in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S80).</p>
</sec>
</sec>
<sec id="Sec15" sec-type="discussion">
<title>Discussion</title>
<p>Resequencing projects rely on a wide range of complex applications that often need to be tuned for the best performance. To achieve high quality results, it is essential to know limitations and biases inherent to each employed application. We have shown that variants obtained from the alignment of short reads against assemblies constructed with de Bruijn or string graphs suffer from positional bias. While the degree of bias varied depending on the input data and setup, it was clearly visible in most tested configurations that encompassed several popular aligners and assemblers. If not addressed, the bias may trigger confounding effects in downstream analyses.</p>
<p>To confirm the bias, we designed two types of analyses. First, we called variants after aligning the reads from a short-insert library that was previously used to construct the reference assembly. In this case, all variants would be false positives. However, they would also be likely to appear when aligning reads from a different individual as long as the corresponding regions are conserved. This type of analyses was performed on both publicly available assemblies and our assemblies constructed from simulated reads. Second, we performed variant analyses with actual reads coming from a genotype different from the reference. In this case, the variants would contain a mix of true positives and false positives. Due to the limited availability of data, we only performed these analyses with the assemblies constructed from the simulated reads.</p>
<p>Our results suggest that the positional bias was caused by the alignment of reads from repetitive or homologous sequences that often had fewer copies included in the assembly compared to the actual genome. The most effective method to reduce the bias is to remove the variants located in repeated sequences. Such variants are very likely to be false positives even when they are not located at position
<italic>k</italic>
. Depending on the assembler configuration and input data, such filtering may either eliminate the bias completely or reduce it to almost negligible level. The remaining bias-causing SNPs probably reside in unannotated repetitive elements or other homologous sequences that have multiple copies in the genome but only a single copy in the assembly. While some of the variants could have been caused by heterozygosity in
<italic>B. constrictor</italic>
and
<italic>M. zebra</italic>
genomes, the simulated data set essentially represented a haploid individual without any heterozygous regions. Therefore, heterozygosity is unlikely to constitute a major factor in positional bias.</p>
<p>We also observed the positional bias while using alternative variant callers (FreeBayes and Samtools) and read aligners (NextGenMap and GSNAP). The only exception was Bowtie2 whose read alignments yielded very weak bias in the SGA and SOAPdenovo2 assemblies and no bias in the ALLPATHS-LG assembly when calling variants with GATK. However, the alignment of the resequencing data with Bowtie2 against the simulated contig assemblies generated considerably fewer SNPs in all positions compared to BWA (Table
<xref rid="Tab4" ref-type="table">4</xref>
) suggesting decreased sensitivity of the aligner.</p>
<p>Previous reports also showed the reduced sensitivity of Bowtie2 [
<xref ref-type="bibr" rid="CR35">35</xref>
,
<xref ref-type="bibr" rid="CR36">36</xref>
] while others revealed only minor [
<xref ref-type="bibr" rid="CR38">38</xref>
] or inconclusive [
<xref ref-type="bibr" rid="CR40">40</xref>
] differences between Bowtie2 and BWA. Interestingly, Li attributes the differences between variants called from BWA and Bowtie2 alignments to lower mapping scores that Bowtie2 assigns to reads with additional suboptimal alignments [
<xref ref-type="bibr" rid="CR36">36</xref>
]. Since similar repetitive sequences are likely to appear multiple times in the assembly, the respective primary alignments would have very low scores and they would rarely produce variants. On the other hand, such drastic reduction in the number of SNPs between BWA and Bowtie2 alignments is unlikely to stem exclusively from the false positives called by BWA. Therefore, further research is needed to determine whether Bowtie2 actually outperforms BWA in terms of both sensitivity and specificity when calling variants against draft genome assemblies.</p>
<p>Theoretically, it may be beneficial for resequencing projects to align reads previously used to construct the reference assembly, call variants and exclude them from the variants called in resequenced individuals. If the involved regions are completely conserved between the reference individual and a resequenced individual, those SNPs would be called for the resequenced individual as well and they would be false positives. However, this type of filtering produced very minor effects on the positional bias in the data sets we analysed, especially after the removal of SNPs located in repetitive elements.</p>
<p>The extent of positional bias obviously depends on the quality of the reference genome assembly. High quality assemblies have reduced number of contigs or scaffolds. Thus, they will have fewer locations to call the bias-causing SNPs. For many projects dealing with non-model organisms, it may not be realistic to construct genome assemblies with adequate quality in sufficiently short time. Moreover, even when a high quality assembly does not readily exhibit a strong positional bias, it may still appear after mapping scaffold coordinates to contig coordinates. We found the evidence for such bias masking in our analyses. Increased number of repetitive elements in a genome would also yield higher positional bias while lowering assembly quality. Therefore, regardless of the assembly quality it would be important to filter SNPs that appear in repetitive elements in order to avoid potential complications.</p>
<p>A more conservative approach would involve filtering out all variants near contig ends. When using a sufficiently large threshold (greater than read length), this would remove the positional bias completely. However, any true variants in those regions will also be lost. To apply such a filter, it would be necessary to know contig coordinates within scaffolds and this information may not be readily available. In any case, this kind of filtering should be done in addition to the filtering of variants in repetitive elements because some repetitive sequences would not be located near contig ends.</p>
</sec>
<sec id="Sec16" sec-type="conclusion">
<title>Conclusions</title>
<p>This study describes positional bias in variant calls that are made against draft genomes constructed with several popular assemblers. The variant calls that cause the bias are mostly false positives that arise from aligning the reads originated in spatially remote repeated sequences or homologous regions. The degree of the bias depends on the choice of tools, configuration, and underlying data set. However, the bias is likely to affect many projects that rely on de novo draft assemblies generated from short read data. The bias can be mitigated by removing variants located in repetitive elements that can be identified by programs such as RepeatMasker [
<xref ref-type="bibr" rid="CR32">32</xref>
]. More conservatively, the bias can also be removed by filtering out all variants located near contig ends. Our findings will help researchers who work on resequencing projects to recognise and reduce the position bias, which will result in higher quality variant data sets.</p>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec17">
<title>Additional file</title>
<p>
<media position="anchor" xlink:href="12864_2017_3637_MOESM1_ESM.pdf" id="MOESM1">
<label>Additional file 1</label>
<caption>
<p>Supplementary figures. (PDF 1499 kb)</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>DDBJ</term>
<def>
<p>DNA Data Bank of Japan</p>
</def>
</def-item>
<def-item>
<term>DP</term>
<def>
<p>depth of coverage</p>
</def>
</def-item>
<def-item>
<term>HTS</term>
<def>
<p>high-throughput sequencing</p>
</def>
</def-item>
<def-item>
<term>indel</term>
<def>
<p>insertion or deletion</p>
</def>
</def-item>
<def-item>
<term>LTR</term>
<def>
<p>long terminal repeat</p>
</def>
</def-item>
<def-item>
<term>MQ</term>
<def>
<p>mapping quality</p>
</def>
</def-item>
<def-item>
<term>NCBI</term>
<def>
<p>National Center for Biotechnology Information</p>
</def>
</def-item>
<def-item>
<term>QUAL</term>
<def>
<p>variant quality</p>
</def>
</def-item>
<def-item>
<term>SNP</term>
<def>
<p>single nucleotide polymorphism</p>
</def>
</def-item>
<def-item>
<term>TAIR</term>
<def>
<p>The Arabidopsis Information Resource</p>
</def>
</def-item>
<def-item>
<term>VCF</term>
<def>
<p>variant call format</p>
</def>
</def-item>
</def-list>
</glossary>
<ack>
<title>Acknowledgements</title>
<p>We thank Vladimir Briskin for critiquing the manuscript, Jun Sese for discussion, the Functional Genomic Center Zurich for technical support.</p>
<sec id="d29e2117">
<title>Funding</title>
<p>The study was supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no GA-2010-267243 – PLANT FELLOWS to RVB, Swiss National Science Foundation, the University Research Priority Program of Evolution in Action of the University of Zurich, Human Frontier Science Program, JST CREST, KAKENHI Grant Number 16H06469 and 26113709 to KKS.</p>
</sec>
<sec id="d29e2122">
<title>Availability of data and materials</title>
<p>The pipeline description including all scripts and commands used to process and analyse the data have been deposited to GitLab (
<ext-link ext-link-type="uri" xlink:href="https://gitlab.com/rbrisk/posbias">https://gitlab.com/rbrisk/posbias</ext-link>
).</p>
</sec>
<sec id="d29e2132">
<title>Authors’ contributions</title>
<p>RVB conceived and designed the study, performed simulations, analysed the data, and wrote the paper with inputs from KKS. Both authors read and approved the final manuscript for publication.</p>
</sec>
<sec id="d29e2137">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="d29e2142">
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec id="d29e2147">
<title>Ethics approval and consent to participate</title>
<p>
<italic>B. constrictor</italic>
(ERR234373) and
<italic>M. zebra</italic>
(SRR077290) short read libraries were downloaded from the publicly available NCBI Sequence Read Archive.</p>
</sec>
<sec id="d29e2158">
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<mixed-citation publication-type="other">Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, et al. Maize HapMap2 identifies extant variation from a genome in flux. Nature Genetics. 2012; 44(7):803–807. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ng.2313">http://dx.doi.org/10.1038/ng.2313</ext-link>
.</mixed-citation>
</ref>
<ref id="CR2">
<label>2</label>
<mixed-citation publication-type="other">Stanton-Geddes J, Paape T, Epstein B, Briskine R, Yoder J, Mudge J, et al. Candidate genes and genetic architecture of symbiotic and agronomic traits revealed by whole-genome, sequence-based association genetics in
<italic>Medicago truncatula</italic>
. PLoS ONE. 2013; 8(5):Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0065688">http://dx.doi.org/10.1371/journal.pone.0065688</ext-link>
.</mixed-citation>
</ref>
<ref id="CR3">
<label>3</label>
<mixed-citation publication-type="other">Gudbjartsson DF, Helgason H, Gudjonsson SA, Zink F, Oddson A, Gylfason A, et al. Large-scale whole-genome sequencing of the Icelandic population. Nature Genetics. 2015; 47(5):435–444. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ng.3247">http://dx.doi.org/10.1038/ng.3247</ext-link>
.</mixed-citation>
</ref>
<ref id="CR4">
<label>4</label>
<mixed-citation publication-type="other">Rondeau EB, Minkley DR, Leong JS, Messmer AM, Jantzen JR, von Schalburg KR, et al. The genome and linkage map of the northern pike (Esox lucius): conserved synteny revealed between the salmonid sister group and the Neoteleostei. PLoS ONE. 2014; Jul;9(7):e102089. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0102089">http://dx.doi.org/10.1371/journal.pone.0102089</ext-link>
.</mixed-citation>
</ref>
<ref id="CR5">
<label>5</label>
<mixed-citation publication-type="other">Nowak M, Russo G, Schlapbach R, Huu C, Lenhard M, Conti E. The draft genome of Primula veris yields insights into the molecular basis of heterostyly. Genome Biology. 2015; 16(1):12. Available from:
<ext-link ext-link-type="uri" xlink:href="http://genomebiology.com/2015/16/1/12">http://genomebiology.com/2015/16/1/12</ext-link>
.</mixed-citation>
</ref>
<ref id="CR6">
<label>6</label>
<mixed-citation publication-type="other">Mavromatis K, Land ML, Brettin TS, Quest DJ, Copeland A, Clum A, et al. The fast changing landscape of sequencing technologies and their impact on microbial genome assemblies and annotation. PLoS ONE. 2012; 7(12):e48837. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0048837">http://dx.doi.org/10.1371/journal.pone.0048837</ext-link>
.</mixed-citation>
</ref>
<ref id="CR7">
<label>7</label>
<mixed-citation publication-type="other">Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001; 98(17):9748–9753. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/content/98/17/9748.abstract">http://www.pnas.org/content/98/17/9748.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR8">
<label>8</label>
<mixed-citation publication-type="other">Myers EW. The fragment assembly string graph. Bioinformatics. 2005; 21(suppl 2):ii79–ii85. Available from:
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract">http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR9">
<label>9</label>
<mixed-citation publication-type="other">Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Research. 2012; 22(3):549–556. Available from:
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/22/3/549.abstract">http://genome.cshlp.org/content/22/3/549.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR10">
<label>10</label>
<mixed-citation publication-type="other">Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics. 2014; 15(1):Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/15/302">http://www.biomedcentral.com/1471-2105/15/302</ext-link>
.</mixed-citation>
</ref>
<ref id="CR11">
<label>11</label>
<mixed-citation publication-type="other">Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: A parallel assembler for short read sequence data. Genome Research. 2009; 19(6):1117–1123. Available from:
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/19/6/1117.abstract">http://genome.cshlp.org/content/19/6/1117.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR12">
<label>12</label>
<mixed-citation publication-type="other">Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, Rokhsar DS. Meraculous: de novo genome assembly with short paired-end reads. PLoS ONE. 2011; 6(8):e23501. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0023501">http://dx.doi.org/10.1371/journal.pone.0023501</ext-link>
.</mixed-citation>
</ref>
<ref id="CR13">
<label>13</label>
<mixed-citation publication-type="other">Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences. 2011; 108(4):1513–1518. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/content/108/4/1513.abstract">http://www.pnas.org/content/108/4/1513.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR14">
<label>14</label>
<mixed-citation publication-type="other">Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012; 1(1):Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.gigasciencejournal.com/content/1/1/18">http://www.gigasciencejournal.com/content/1/1/18</ext-link>
.</mixed-citation>
</ref>
<ref id="CR15">
<label>15</label>
<mixed-citation publication-type="other">Hohenlohe PA, Catchen J, Cresko WA. Population genomic analysis of model and nonmodel organisms using sequenced RAD tags In: Pompanon F, Bonin A, editors. Data Production and, Analysis in Population Genomics: Methods and Protocols. Totowa, NJ: Humana Press: 2012. p. 235–260. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1007/978-1-61779-870-2_14">http://dx.doi.org/10.1007/978-1-61779-870-2_14</ext-link>
.</mixed-citation>
</ref>
<ref id="CR16">
<label>16</label>
<mixed-citation publication-type="other">Xu X, Zeng L, Tao Y, Vuong T, Wan J, Boerma R, et al.Pinpointing genes underlying the quantitative trait loci for root-knot nematode resistance in palaeopolyploid soybean by whole genome resequencing. Proceedings of the National Academy of Sciences. 2013; 110(33):13469–13474. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/content/110/33/13469.abstract">http://www.pnas.org/content/110/33/13469.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR17">
<label>17</label>
<mixed-citation publication-type="other">Bradnam K, Fass J, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013; 2(1):10. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.gigasciencejournal.com/content/2/1/10">http://www.gigasciencejournal.com/content/2/1/10</ext-link>
.</mixed-citation>
</ref>
<ref id="CR18">
<label>18</label>
<mixed-citation publication-type="other">Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research. 2012; 40(D1):D1202–D1210. Available from:
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/content/40/D1/D1202.abstract">http://nar.oxfordjournals.org/content/40/D1/D1202.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR19">
<label>19</label>
<mixed-citation publication-type="other">Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, et al. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research. 2011; 21(12):2224–2241. Available from:
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/21/12/2224.abstract">http://genome.cshlp.org/content/21/12/2224.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR20">
<label>20</label>
<mixed-citation publication-type="other">Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2014; 30(1):31–37. Available from:
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/30/1/31.abstract">http://bioinformatics.oxfordjournals.org/content/30/1/31.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR21">
<label>21</label>
<mixed-citation publication-type="other">Schmitz RJ, Schultz MD, Urich MA, Nery JR, Pelizzola M, Libiger O, et al. Patterns of population epigenomic diversity. Nature. 2013; 495(7440):193–198. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nature11968">http://dx.doi.org/10.1038/nature11968</ext-link>
.</mixed-citation>
</ref>
<ref id="CR22">
<label>22</label>
<mixed-citation publication-type="other">McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al.The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research. 2010; 20(9):1297–1303. Available from:
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/20/9/1297.abstract">http://genome.cshlp.org/content/20/9/1297.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR23">
<label>23</label>
<mixed-citation publication-type="other">DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al.A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 2011; 43(5):491–498. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ng.806">http://dx.doi.org/10.1038/ng.806</ext-link>
.</mixed-citation>
</ref>
<ref id="CR24">
<label>24</label>
<mixed-citation publication-type="other">Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, et al.From FastQ data to high-confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013; 43:11.10.1–11.10.33. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1002/0471X00000.250953.bi1110s43">http://dx.doi.org/10.1002/0471X00000.250953.bi1110s43</ext-link>
.</mixed-citation>
</ref>
<ref id="CR25">
<label>25</label>
<mixed-citation publication-type="other">Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25(14):1754–1760. Available from:
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/25/14/1754.abstract">http://bioinformatics.oxfordjournals.org/content/25/14/1754.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR26">
<label>26</label>
<mixed-citation publication-type="other">Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics. 1988; 2(3):231–239. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.sciencedirect.com/science/article/pii/0888754388900079">http://www.sciencedirect.com/science/article/pii/0888754388900079</ext-link>
.</mixed-citation>
</ref>
<ref id="CR27">
<label>27</label>
<mixed-citation publication-type="other">Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics. 2013; 29(21):2790–2791. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btt468">http://dx.doi.org/10.1093/bioinformatics/btt468</ext-link>
.</mixed-citation>
</ref>
<ref id="CR28">
<label>28</label>
<mixed-citation publication-type="other">Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010 Feb; 26(7):873–881. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btq057">http://dx.doi.org/10.1093/bioinformatics/btq057</ext-link>
.</mixed-citation>
</ref>
<ref id="CR29">
<label>29</label>
<mixed-citation publication-type="other">Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012; 9(4):357–359. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nmeth.1923">http://dx.doi.org/10.1038/nmeth.1923</ext-link>
.</mixed-citation>
</ref>
<ref id="CR30">
<label>30</label>
<mixed-citation publication-type="other">Garrison E, Marth GT. Haplotype-based variant detection from short-read sequencing. arXiv. 2012:1207.3907. Available from:
<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1207.3907">https://arxiv.org/abs/1207.3907</ext-link>
.</mixed-citation>
</ref>
<ref id="CR31">
<label>31</label>
<mixed-citation publication-type="other">Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–2079. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/</ext-link>
.</mixed-citation>
</ref>
<ref id="CR32">
<label>32</label>
<mixed-citation publication-type="other">Smit A, Hubley R, Green P. RepeatMasker Open-4.0. 2013. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.repeatmasker.org">http://www.repeatmasker.org</ext-link>
.</mixed-citation>
</ref>
<ref id="CR33">
<label>33</label>
<mixed-citation publication-type="other">Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research. 2005; 110(1-4):462–467. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.karger.com/doi/10.1159/000084979">http://www.karger.com/doi/10.1159/000084979</ext-link>
.</mixed-citation>
</ref>
<ref id="CR34">
<label>34</label>
<mixed-citation publication-type="other">Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, et al.ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research. 2008; 18(5):810–820. Available from:
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/18/5/810.abstract">http://genome.cshlp.org/content/18/5/810.abstract</ext-link>
.</mixed-citation>
</ref>
<ref id="CR35">
<label>35</label>
<mixed-citation publication-type="other">Hatem A, Bozdag D, Toland AE, Catalyiurek UV. Benchmarking short sequence mapping tools. BMC Bioinformatics. 2013; 14(1):184. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1186/1471-2105-14-184">http://dx.doi.org/10.1186/1471-2105-14-184</ext-link>
.</mixed-citation>
</ref>
<ref id="CR36">
<label>36</label>
<mixed-citation publication-type="other">Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014; 30(20):2843–2851. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btu356">http://dx.doi.org/10.1093/bioinformatics/btu356</ext-link>
.</mixed-citation>
</ref>
<ref id="CR37">
<label>37</label>
<mixed-citation publication-type="other">Highnam G, Wang JJ, Kusler D, Zook J, Vijayan V, Leibovich N, et al.An analytical framework for optimizing variant discovery from personal genomes. Nature Communications. 2015; 6:6275. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ncomms7275">http://dx.doi.org/10.1038/ncomms7275</ext-link>
.</mixed-citation>
</ref>
<ref id="CR38">
<label>38</label>
<mixed-citation publication-type="other">Cornish A, Guda C. A comparison of variant calling pipelines using genome in a bottle as a reference. BioMed Research International. 2015; 2015:11. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1155/2015/456479">http://dx.doi.org/10.1155/2015/456479</ext-link>
.</mixed-citation>
</ref>
<ref id="CR39">
<label>39</label>
<mixed-citation publication-type="other">de la Chaux N, Tsuchimatsu T, Shimizu KK, Wagner A. The predominantly selfing plant
<italic>Arabidopsis thaliana</italic>
experienced a recent reduction in transposable element abundance compared to its outcrossing relative
<italic>Arabidopsis lyrata</italic>
. Mobile DNA. 2012; 3(1):2. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1186/1759-8753-3-2">http://dx.doi.org/10.1186/1759-8753-3-2</ext-link>
.</mixed-citation>
</ref>
<ref id="CR40">
<label>40</label>
<mixed-citation publication-type="other">Hwang S, Kim E, Lee I, Marcotte EM. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific Reports. 2015 Dec; 5:17875. Available from:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/srep17875">http://dx.doi.org/10.1038/srep17875</ext-link>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000295  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000295  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021