MERS exploration server

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">RepARK—
<italic>de novo</italic>
creation of repeat libraries from whole-genome NGS reads</title>
<author>
<name sortKey="Koch, Philipp" sort="Koch, Philipp" uniqKey="Koch P" first="Philipp" last="Koch">Philipp Koch</name>
</author>
<author>
<name sortKey="Platzer, Matthias" sort="Platzer, Matthias" uniqKey="Platzer M" first="Matthias" last="Platzer">Matthias Platzer</name>
</author>
<author>
<name sortKey="Downie, Bryan R" sort="Downie, Bryan R" uniqKey="Downie B" first="Bryan R." last="Downie">Bryan R. Downie</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">24634442</idno>
<idno type="pmc">4027187</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4027187</idno>
<idno type="RBID">PMC:4027187</idno>
<idno type="doi">10.1093/nar/gku210</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000F37</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F37</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">RepARK—
<italic>de novo</italic>
creation of repeat libraries from whole-genome NGS reads</title>
<author>
<name sortKey="Koch, Philipp" sort="Koch, Philipp" uniqKey="Koch P" first="Philipp" last="Koch">Philipp Koch</name>
</author>
<author>
<name sortKey="Platzer, Matthias" sort="Platzer, Matthias" uniqKey="Platzer M" first="Matthias" last="Platzer">Matthias Platzer</name>
</author>
<author>
<name sortKey="Downie, Bryan R" sort="Downie, Bryan R" uniqKey="Downie B" first="Bryan R." last="Downie">Bryan R. Downie</name>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Generation of repeat libraries is a critical step for analysis of complex genomes. In the era of next-generation sequencing (NGS), such libraries are usually produced using a whole-genome shotgun (WGS) derived reference sequence whose completeness greatly influences the quality of derived repeat libraries. We describe here a
<italic>de novo</italic>
repeat assembly method—RepARK (Repetitive motif detection by Assembly of Repetitive K-mers)—which avoids potential biases by using abundant k-mers of NGS WGS reads without requiring a reference genome. For validation, repeat consensuses derived from simulated and real
<italic>Drosophila melanogaster</italic>
NGS WGS reads were compared to repeat libraries generated by four established methods. RepARK is orders of magnitude faster than the other methods and generates libraries that are: (i) composed almost entirely of repetitive motifs, (ii) more comprehensive and (iii) almost completely annotated by TEclass. Additionally, we show that the RepARK method is applicable to complex genomes like human and can even serve as a diagnostic tool to identify repetitive sequences contaminating NGS datasets.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Lander, E S" uniqKey="Lander E">E.S. Lander</name>
</author>
<author>
<name sortKey="Linton, L M" uniqKey="Linton L">L.M. Linton</name>
</author>
<author>
<name sortKey="Birren, B" uniqKey="Birren B">B. Birren</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C. Nusbaum</name>
</author>
<author>
<name sortKey="Zody, M C" uniqKey="Zody M">M.C. Zody</name>
</author>
<author>
<name sortKey="Baldwin, J" uniqKey="Baldwin J">J. Baldwin</name>
</author>
<author>
<name sortKey="Devon, K" uniqKey="Devon K">K. Devon</name>
</author>
<author>
<name sortKey="Dewar, K" uniqKey="Dewar K">K. Dewar</name>
</author>
<author>
<name sortKey="Doyle, M" uniqKey="Doyle M">M. Doyle</name>
</author>
<author>
<name sortKey="Fitzhugh, W" uniqKey="Fitzhugh W">W. FitzHugh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mayer, K F" uniqKey="Mayer K">K.F. Mayer</name>
</author>
<author>
<name sortKey="Waugh, R" uniqKey="Waugh R">R. Waugh</name>
</author>
<author>
<name sortKey="Brown, J W" uniqKey="Brown J">J.W. Brown</name>
</author>
<author>
<name sortKey="Schulman, A" uniqKey="Schulman A">A. Schulman</name>
</author>
<author>
<name sortKey="Langridge, P" uniqKey="Langridge P">P. Langridge</name>
</author>
<author>
<name sortKey="Platzer, M" uniqKey="Platzer M">M. Platzer</name>
</author>
<author>
<name sortKey="Fincher, G B" uniqKey="Fincher G">G.B. Fincher</name>
</author>
<author>
<name sortKey="Muehlbauer, G J" uniqKey="Muehlbauer G">G.J. Muehlbauer</name>
</author>
<author>
<name sortKey="Sato, K" uniqKey="Sato K">K. Sato</name>
</author>
<author>
<name sortKey="Close, T J" uniqKey="Close T">T.J. Close</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Treangen, T J" uniqKey="Treangen T">T.J. Treangen</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yandell, M" uniqKey="Yandell M">M. Yandell</name>
</author>
<author>
<name sortKey="Ence, D" uniqKey="Ence D">D. Ence</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feschotte, C" uniqKey="Feschotte C">C. Feschotte</name>
</author>
<author>
<name sortKey="Pritham, E J" uniqKey="Pritham E">E.J. Pritham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orr, H T" uniqKey="Orr H">H.T. Orr</name>
</author>
<author>
<name sortKey="Zoghbi, H Y" uniqKey="Zoghbi H">H.Y. Zoghbi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hancks, D C" uniqKey="Hancks D">D.C. Hancks</name>
</author>
<author>
<name sortKey="Kazazian, H H" uniqKey="Kazazian H">H.H. Kazazian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lim, K G" uniqKey="Lim K">K.G. Lim</name>
</author>
<author>
<name sortKey="Kwoh, C K" uniqKey="Kwoh C">C.K. Kwoh</name>
</author>
<author>
<name sortKey="Hsu, L Y" uniqKey="Hsu L">L.Y. Hsu</name>
</author>
<author>
<name sortKey="Wirawan, A" uniqKey="Wirawan A">A. Wirawan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jurka, J" uniqKey="Jurka J">J. Jurka</name>
</author>
<author>
<name sortKey="Kapitonov, V V" uniqKey="Kapitonov V">V.V. Kapitonov</name>
</author>
<author>
<name sortKey="Kohany, O" uniqKey="Kohany O">O. Kohany</name>
</author>
<author>
<name sortKey="Jurka, M V" uniqKey="Jurka M">M.V. Jurka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eichler, E E" uniqKey="Eichler E">E.E. Eichler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, G" uniqKey="Benson G">G. Benson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jurka, J" uniqKey="Jurka J">J. Jurka</name>
</author>
<author>
<name sortKey="Kapitonov, V V" uniqKey="Kapitonov V">V.V. Kapitonov</name>
</author>
<author>
<name sortKey="Pavlicek, A" uniqKey="Pavlicek A">A. Pavlicek</name>
</author>
<author>
<name sortKey="Klonowski, P" uniqKey="Klonowski P">P. Klonowski</name>
</author>
<author>
<name sortKey="Kohany, O" uniqKey="Kohany O">O. Kohany</name>
</author>
<author>
<name sortKey="Walichiewicz, J" uniqKey="Walichiewicz J">J. Walichiewicz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bao, Z" uniqKey="Bao Z">Z. Bao</name>
</author>
<author>
<name sortKey="Eddy, S R" uniqKey="Eddy S">S.R. Eddy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Price, A L" uniqKey="Price A">A.L. Price</name>
</author>
<author>
<name sortKey="Jones, N C" uniqKey="Jones N">N.C. Jones</name>
</author>
<author>
<name sortKey="Pevzner, P A" uniqKey="Pevzner P">P.A. Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S. Kurtz</name>
</author>
<author>
<name sortKey="Choudhuri, J V" uniqKey="Choudhuri J">J.V. Choudhuri</name>
</author>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E. Ohlebusch</name>
</author>
<author>
<name sortKey="Schleiermacher, C" uniqKey="Schleiermacher C">C. Schleiermacher</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J. Stoye</name>
</author>
<author>
<name sortKey="Giegerich, R" uniqKey="Giegerich R">R. Giegerich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Achaz, G" uniqKey="Achaz G">G. Achaz</name>
</author>
<author>
<name sortKey="Boyer, F" uniqKey="Boyer F">F. Boyer</name>
</author>
<author>
<name sortKey="Rocha, E P" uniqKey="Rocha E">E.P. Rocha</name>
</author>
<author>
<name sortKey="Viari, A" uniqKey="Viari A">A. Viari</name>
</author>
<author>
<name sortKey="Coissac, E" uniqKey="Coissac E">E. Coissac</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Koning, A P" uniqKey="De Koning A">A.P. de Koning</name>
</author>
<author>
<name sortKey="Gu, W" uniqKey="Gu W">W. Gu</name>
</author>
<author>
<name sortKey="Castoe, T A" uniqKey="Castoe T">T.A. Castoe</name>
</author>
<author>
<name sortKey="Batzer, M A" uniqKey="Batzer M">M.A. Batzer</name>
</author>
<author>
<name sortKey="Pollock, D D" uniqKey="Pollock D">D.D. Pollock</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
<author>
<name sortKey="Ye, J" uniqKey="Ye J">J. Ye</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S. Li</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J. Wang</name>
</author>
<author>
<name sortKey="Han, Y" uniqKey="Han Y">Y. Han</name>
</author>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C. Ye</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J. Wang</name>
</author>
<author>
<name sortKey="Yang, H" uniqKey="Yang H">H. Yang</name>
</author>
<author>
<name sortKey="Yu, J" uniqKey="Yu J">J. Yu</name>
</author>
<author>
<name sortKey="Wong, G K" uniqKey="Wong G">G.K. Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clark, A G" uniqKey="Clark A">A.G. Clark</name>
</author>
<author>
<name sortKey="Eisen, M B" uniqKey="Eisen M">M.B. Eisen</name>
</author>
<author>
<name sortKey="Smith, D R" uniqKey="Smith D">D.R. Smith</name>
</author>
<author>
<name sortKey="Bergman, C M" uniqKey="Bergman C">C.M. Bergman</name>
</author>
<author>
<name sortKey="Oliver, B" uniqKey="Oliver B">B. Oliver</name>
</author>
<author>
<name sortKey="Markow, T A" uniqKey="Markow T">T.A. Markow</name>
</author>
<author>
<name sortKey="Kaufman, T C" uniqKey="Kaufman T">T.C. Kaufman</name>
</author>
<author>
<name sortKey="Kellis, M" uniqKey="Kellis M">M. Kellis</name>
</author>
<author>
<name sortKey="Gelbart, W" uniqKey="Gelbart W">W. Gelbart</name>
</author>
<author>
<name sortKey="Iyer, V N" uniqKey="Iyer V">V.N. Iyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S. Kurtz</name>
</author>
<author>
<name sortKey="Narechania, A" uniqKey="Narechania A">A. Narechania</name>
</author>
<author>
<name sortKey="Stein, J C" uniqKey="Stein J">J.C. Stein</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D. Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, E W" uniqKey="Myers E">E.W. Myers</name>
</author>
<author>
<name sortKey="Sutton, G G" uniqKey="Sutton G">G.G. Sutton</name>
</author>
<author>
<name sortKey="Delcher, A L" uniqKey="Delcher A">A.L. Delcher</name>
</author>
<author>
<name sortKey="Dew, I M" uniqKey="Dew I">I.M. Dew</name>
</author>
<author>
<name sortKey="Fasulo, D P" uniqKey="Fasulo D">D.P. Fasulo</name>
</author>
<author>
<name sortKey="Flanigan, M J" uniqKey="Flanigan M">M.J. Flanigan</name>
</author>
<author>
<name sortKey="Kravitz, S A" uniqKey="Kravitz S">S.A. Kravitz</name>
</author>
<author>
<name sortKey="Mobarry, C M" uniqKey="Mobarry C">C.M. Mobarry</name>
</author>
<author>
<name sortKey="Reinert, K H" uniqKey="Reinert K">K.H. Reinert</name>
</author>
<author>
<name sortKey="Remington, K A" uniqKey="Remington K">K.A. Remington</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S. Koren</name>
</author>
<author>
<name sortKey="Treangen, T J" uniqKey="Treangen T">T.J. Treangen</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M. Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z. Iqbal</name>
</author>
<author>
<name sortKey="Caccamo, M" uniqKey="Caccamo M">M. Caccamo</name>
</author>
<author>
<name sortKey="Turner, I" uniqKey="Turner I">I. Turner</name>
</author>
<author>
<name sortKey="Flicek, P" uniqKey="Flicek P">P. Flicek</name>
</author>
<author>
<name sortKey="Mcvean, G" uniqKey="Mcvean G">G. McVean</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bailey, J A" uniqKey="Bailey J">J.A. Bailey</name>
</author>
<author>
<name sortKey="Yavor, A M" uniqKey="Yavor A">A.M. Yavor</name>
</author>
<author>
<name sortKey="Massa, H F" uniqKey="Massa H">H.F. Massa</name>
</author>
<author>
<name sortKey="Trask, B J" uniqKey="Trask B">B.J. Trask</name>
</author>
<author>
<name sortKey="Eichler, E E" uniqKey="Eichler E">E.E. Eichler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jiang, Z" uniqKey="Jiang Z">Z. Jiang</name>
</author>
<author>
<name sortKey="Hubley, R" uniqKey="Hubley R">R. Hubley</name>
</author>
<author>
<name sortKey="Smit, A" uniqKey="Smit A">A. Smit</name>
</author>
<author>
<name sortKey="Eichler, E E" uniqKey="Eichler E">E.E. Eichler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, P A" uniqKey="Pevzner P">P.A. Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H. Tang</name>
</author>
<author>
<name sortKey="Waterman, M S" uniqKey="Waterman M">M.S. Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, D R" uniqKey="Zerbino D">D.R. Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E. Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, J T" uniqKey="Simpson J">J.T. Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K. Wong</name>
</author>
<author>
<name sortKey="Jackman, S D" uniqKey="Jackman S">S.D. Jackman</name>
</author>
<author>
<name sortKey="Schein, J E" uniqKey="Schein J">J.E. Schein</name>
</author>
<author>
<name sortKey="Jones, S J" uniqKey="Jones S">S.J. Jones</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I. Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
<author>
<name sortKey="Zhu, H" uniqKey="Zhu H">H. Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J. Ruan</name>
</author>
<author>
<name sortKey="Qian, W" uniqKey="Qian W">W. Qian</name>
</author>
<author>
<name sortKey="Fang, X" uniqKey="Fang X">X. Fang</name>
</author>
<author>
<name sortKey="Shi, Z" uniqKey="Shi Z">Z. Shi</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y. Li</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S. Li</name>
</author>
<author>
<name sortKey="Shan, G" uniqKey="Shan G">G. Shan</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K. Kristiansen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gnerre, S" uniqKey="Gnerre S">S. Gnerre</name>
</author>
<author>
<name sortKey="Maccallum, I" uniqKey="Maccallum I">I. Maccallum</name>
</author>
<author>
<name sortKey="Przybylski, D" uniqKey="Przybylski D">D. Przybylski</name>
</author>
<author>
<name sortKey="Ribeiro, F J" uniqKey="Ribeiro F">F.J. Ribeiro</name>
</author>
<author>
<name sortKey="Burton, J N" uniqKey="Burton J">J.N. Burton</name>
</author>
<author>
<name sortKey="Walker, B J" uniqKey="Walker B">B.J. Walker</name>
</author>
<author>
<name sortKey="Sharpe, T" uniqKey="Sharpe T">T. Sharpe</name>
</author>
<author>
<name sortKey="Hall, G" uniqKey="Hall G">G. Hall</name>
</author>
<author>
<name sortKey="Shea, T P" uniqKey="Shea T">T.P. Shea</name>
</author>
<author>
<name sortKey="Sykes, S" uniqKey="Sykes S">S. Sykes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Phillippy, A M" uniqKey="Phillippy A">A.M. Phillippy</name>
</author>
<author>
<name sortKey="Schatz, M C" uniqKey="Schatz M">M.C. Schatz</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M. Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, D R" uniqKey="Kelley D">D.R. Kelley</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zimin, A V" uniqKey="Zimin A">A.V. Zimin</name>
</author>
<author>
<name sortKey="Kelley, D R" uniqKey="Kelley D">D.R. Kelley</name>
</author>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M. Roberts</name>
</author>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marcais</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
<author>
<name sortKey="Yorke, J A" uniqKey="Yorke J">J.A. Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, D R" uniqKey="Kelley D">D.R. Kelley</name>
</author>
<author>
<name sortKey="Schatz, M C" uniqKey="Schatz M">M.C. Schatz</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, X" uniqKey="Li X">X. Li</name>
</author>
<author>
<name sortKey="Waterman, M S" uniqKey="Waterman M">M.S. Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langley, C H" uniqKey="Langley C">C.H. Langley</name>
</author>
<author>
<name sortKey="Crepeau, M" uniqKey="Crepeau M">M. Crepeau</name>
</author>
<author>
<name sortKey="Cardeno, C" uniqKey="Cardeno C">C. Cardeno</name>
</author>
<author>
<name sortKey="Corbett Detig, R" uniqKey="Corbett Detig R">R. Corbett-Detig</name>
</author>
<author>
<name sortKey="Stevens, K" uniqKey="Stevens K">K. Stevens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C. Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abrusan, G" uniqKey="Abrusan G">G. Abrusan</name>
</author>
<author>
<name sortKey="Grundmann, N" uniqKey="Grundmann N">N. Grundmann</name>
</author>
<author>
<name sortKey="Demester, L" uniqKey="Demester L">L. DeMester</name>
</author>
<author>
<name sortKey="Makalowski, W" uniqKey="Makalowski W">W. Makalowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, W J" uniqKey="Kent W">W.J. Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Adams, M D" uniqKey="Adams M">M.D. Adams</name>
</author>
<author>
<name sortKey="Celniker, S E" uniqKey="Celniker S">S.E. Celniker</name>
</author>
<author>
<name sortKey="Holt, R A" uniqKey="Holt R">R.A. Holt</name>
</author>
<author>
<name sortKey="Evans, C A" uniqKey="Evans C">C.A. Evans</name>
</author>
<author>
<name sortKey="Gocayne, J D" uniqKey="Gocayne J">J.D. Gocayne</name>
</author>
<author>
<name sortKey="Amanatides, P G" uniqKey="Amanatides P">P.G. Amanatides</name>
</author>
<author>
<name sortKey="Scherer, S E" uniqKey="Scherer S">S.E. Scherer</name>
</author>
<author>
<name sortKey="Li, P W" uniqKey="Li P">P.W. Li</name>
</author>
<author>
<name sortKey="Hoskins, R A" uniqKey="Hoskins R">R.A. Hoskins</name>
</author>
<author>
<name sortKey="Galle, R F" uniqKey="Galle R">R.F. Galle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Celniker, S E" uniqKey="Celniker S">S.E. Celniker</name>
</author>
<author>
<name sortKey="Wheeler, D A" uniqKey="Wheeler D">D.A. Wheeler</name>
</author>
<author>
<name sortKey="Kronmiller, B" uniqKey="Kronmiller B">B. Kronmiller</name>
</author>
<author>
<name sortKey="Carlson, J W" uniqKey="Carlson J">J.W. Carlson</name>
</author>
<author>
<name sortKey="Halpern, A" uniqKey="Halpern A">A. Halpern</name>
</author>
<author>
<name sortKey="Patel, S" uniqKey="Patel S">S. Patel</name>
</author>
<author>
<name sortKey="Adams, M" uniqKey="Adams M">M. Adams</name>
</author>
<author>
<name sortKey="Champe, M" uniqKey="Champe M">M. Champe</name>
</author>
<author>
<name sortKey="Dugan, S P" uniqKey="Dugan S">S.P. Dugan</name>
</author>
<author>
<name sortKey="Frise, E" uniqKey="Frise E">E. Frise</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dohm, J C" uniqKey="Dohm J">J.C. Dohm</name>
</author>
<author>
<name sortKey="Lottaz, C" uniqKey="Lottaz C">C. Lottaz</name>
</author>
<author>
<name sortKey="Borodina, T" uniqKey="Borodina T">T. Borodina</name>
</author>
<author>
<name sortKey="Himmelbauer, H" uniqKey="Himmelbauer H">H. Himmelbauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Macas, J" uniqKey="Macas J">J. Macas</name>
</author>
<author>
<name sortKey="Neumann, P" uniqKey="Neumann P">P. Neumann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kohany, O" uniqKey="Kohany O">O. Kohany</name>
</author>
<author>
<name sortKey="Gentles, A J" uniqKey="Gentles A">A.J. Gentles</name>
</author>
<author>
<name sortKey="Hankus, L" uniqKey="Hankus L">L. Hankus</name>
</author>
<author>
<name sortKey="Jurka, J" uniqKey="Jurka J">J. Jurka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bergman, C M" uniqKey="Bergman C">C.M. Bergman</name>
</author>
<author>
<name sortKey="Quesneville, H" uniqKey="Quesneville H">H. Quesneville</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, C D" uniqKey="Smith C">C.D. Smith</name>
</author>
<author>
<name sortKey="Shu, S" uniqKey="Shu S">S. Shu</name>
</author>
<author>
<name sortKey="Mungall, C J" uniqKey="Mungall C">C.J. Mungall</name>
</author>
<author>
<name sortKey="Karpen, G H" uniqKey="Karpen G">G.H. Karpen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bennett, E A" uniqKey="Bennett E">E.A. Bennett</name>
</author>
<author>
<name sortKey="Keller, H" uniqKey="Keller H">H. Keller</name>
</author>
<author>
<name sortKey="Mills, R E" uniqKey="Mills R">R.E. Mills</name>
</author>
<author>
<name sortKey="Schmidt, S" uniqKey="Schmidt S">S. Schmidt</name>
</author>
<author>
<name sortKey="Moran, J V" uniqKey="Moran J">J.V. Moran</name>
</author>
<author>
<name sortKey="Weichenrieder, O" uniqKey="Weichenrieder O">O. Weichenrieder</name>
</author>
<author>
<name sortKey="Devine, S E" uniqKey="Devine S">S.E. Devine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morissette, G" uniqKey="Morissette G">G. Morissette</name>
</author>
<author>
<name sortKey="Flamand, L" uniqKey="Flamand L">L. Flamand</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="iso-abbrev">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">24634442</article-id>
<article-id pub-id-type="pmc">4027187</article-id>
<article-id pub-id-type="doi">10.1093/nar/gku210</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methods Online</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>RepARK—
<italic>de novo</italic>
creation of repeat libraries from whole-genome NGS reads</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Koch</surname>
<given-names>Philipp</given-names>
</name>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Platzer</surname>
<given-names>Matthias</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Downie</surname>
<given-names>Bryan R.</given-names>
</name>
</contrib>
<aff id="AFF1">Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany</aff>
</contrib-group>
<author-notes>
<corresp id="COR1">
<label>*</label>
To whom correspondence should be addressed. Tel: +49 3641 65 6053; Fax: +49 3641 65 6255; Email:
<email>philippk@fli-leibniz.de</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>01</day>
<month>5</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>14</day>
<month>3</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>14</day>
<month>3</month>
<year>2014</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>42</volume>
<issue>9</issue>
<fpage>e80</fpage>
<lpage>e80</lpage>
<history>
<date date-type="accepted">
<day>28</day>
<month>2</month>
<year>2014</year>
</date>
<date date-type="rev-recd">
<day>21</day>
<month>2</month>
<year>2014</year>
</date>
<date date-type="received">
<day>13</day>
<month>6</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.</copyright-statement>
<copyright-year>2014</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by/3.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/3.0/">http://creativecommons.org/licenses/by/3.0/</ext-link>
), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:title="pdf" xlink:type="simple" xlink:href="gku210.pdf"></self-uri>
<abstract>
<p>Generation of repeat libraries is a critical step for analysis of complex genomes. In the era of next-generation sequencing (NGS), such libraries are usually produced using a whole-genome shotgun (WGS) derived reference sequence whose completeness greatly influences the quality of derived repeat libraries. We describe here a
<italic>de novo</italic>
repeat assembly method—RepARK (Repetitive motif detection by Assembly of Repetitive K-mers)—which avoids potential biases by using abundant k-mers of NGS WGS reads without requiring a reference genome. For validation, repeat consensuses derived from simulated and real
<italic>Drosophila melanogaster</italic>
NGS WGS reads were compared to repeat libraries generated by four established methods. RepARK is orders of magnitude faster than the other methods and generates libraries that are: (i) composed almost entirely of repetitive motifs, (ii) more comprehensive and (iii) almost completely annotated by TEclass. Additionally, we show that the RepARK method is applicable to complex genomes like human and can even serve as a diagnostic tool to identify repetitive sequences contaminating NGS datasets.</p>
</abstract>
<counts>
<page-count count="12"></page-count>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>cover-date</meta-name>
<meta-value>2014</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="SEC1">
<title>INTRODUCTION</title>
<p>Repetitive DNA is widespread among eukaryotes and generation of accurate repeat libraries is critical for genomic analyses: >50% of the human genome is composed of repeats (
<xref rid="B1" ref-type="bibr">1</xref>
), while some important agricultural crops such as barley have more than 80% repetitive sequence (
<xref rid="B2" ref-type="bibr">2</xref>
). In many sequence and genome analyses such as read alignment,
<italic>de novo</italic>
genome assembly and genome annotation, repeats can present major challenges (
<xref rid="B3" ref-type="bibr">3</xref>
). Identification and classification of repeats is one of the first steps in genome annotation as transposons can contain features such as protein-coding regions that complicate subsequent analyses (e.g. gene annotation) if repeats are not properly marked (
<xref rid="B4" ref-type="bibr">4</xref>
). Additionally, repeats are believed to play significant roles in genome evolution (
<xref rid="B5" ref-type="bibr">5</xref>
) and disease (
<xref rid="B6" ref-type="bibr">6–7</xref>
).</p>
<p>Depending on their size and distribution, repetitive elements are categorized into different types. Tandem repeats are composed of highly conserved sequence motifs located directly adjacent to each other, have unit sizes from 1 to more than 100 bp, and are categorized into microsatellites, minisatellites or satellites based on their unit size (
<xref rid="B8" ref-type="bibr">8</xref>
). Dispersed repeats range between 50 bp and 30 kb in size and are scattered throughout the entire genome (
<xref rid="B9" ref-type="bibr">9</xref>
). Segmental duplications (SDs) are low-copy repetitive regions of between 1 kb and several Mb in size with an identity ≥90%, and can occur either intra- or interchromosomally (
<xref rid="B10" ref-type="bibr">10</xref>
).</p>
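<p>The size and identity criteria above can be written down as a minimal sketch. The function names are hypothetical, and the 6 bp and 100 bp boundaries between microsatellites, minisatellites and satellites are assumptions chosen for illustration, since the text only states that unit sizes run from 1 to more than 100 bp. The segmental duplication thresholds (at least 1 kb, identity of at least 90%) are taken from the paragraph above.</p>
<preformat preformat-type="code">
```python
# Assumed cut-offs (NOT from the article): the text only says that tandem
# repeat unit sizes range from 1 to more than 100 bp.
MICROSATELLITE_MAX_UNIT_BP = 6
MINISATELLITE_MAX_UNIT_BP = 100

# Criteria stated in the text for segmental duplications (SDs).
SD_MIN_LENGTH_BP = 1000
SD_MIN_IDENTITY = 0.90


def classify_tandem_repeat(unit_size_bp: int) -> str:
    """Name a tandem repeat class from its unit size (assumed thresholds)."""
    if 1 > unit_size_bp:
        raise ValueError("unit size must be at least 1 bp")
    if unit_size_bp > MINISATELLITE_MAX_UNIT_BP:
        return "satellite"
    if unit_size_bp > MICROSATELLITE_MAX_UNIT_BP:
        return "minisatellite"
    return "microsatellite"


def is_segmental_duplication(length_bp: int, identity: float) -> bool:
    """Apply the size and identity criteria given in the text for SDs."""
    return length_bp >= SD_MIN_LENGTH_BP and identity >= SD_MIN_IDENTITY


if __name__ == "__main__":
    print(classify_tandem_repeat(2))
    print(is_segmental_duplication(5000, 0.95))
```
</preformat>
<p>Keeping the thresholds as named constants makes it easy to substitute whichever class boundaries a given annotation project uses.</p>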
<p>In the genomics era, repeat libraries are usually derived from a draft genome sequence. Following genome assembly, low complexity repeats such as tandem repeats are first predicted with Tandem Repeats Finder (
<xref rid="B11" ref-type="bibr">11</xref>
). RepeatMasker (
<ext-link ext-link-type="uri" xlink:href="http://www.repeatmasker.org">http://www.repeatmasker.org</ext-link>
) identifies and masks dispersed repeats using consensuses from RepBase Update (
<xref rid="B12" ref-type="bibr">12</xref>
), which contains manually curated repeat consensuses from hundreds of species. Both false positives (due to sequence similarities) and false negatives (when repeats are highly divergent) can emerge at this stage. Species-specific repeat families can be identified
<italic>ab initio</italic>
from reference genomes using RECON (
<xref rid="B13" ref-type="bibr">13</xref>
), which evaluates pair-wise similarities to build repeat consensuses, or RepeatScout (
<xref rid="B14" ref-type="bibr">14</xref>
) which identifies and uses highly frequent k-mers as seeds that are extended based on multiple sequence alignments. Both of these programs rely on either a high-quality reference sequence or long Sanger-length sequencing reads. REPuter (
<xref rid="B15" ref-type="bibr">15</xref>
) and Repseek (
<xref rid="B16" ref-type="bibr">16</xref>
) both adopt a seed-and-extend paradigm to identify identical and degenerate repetitive sequence. P-clouds (
<xref rid="B17" ref-type="bibr">17</xref>
) determines repetitive motifs by clustering similar but divergent sequences together. ReAS (
<xref rid="B18" ref-type="bibr">18</xref>
) generates repeat libraries based on identification and extension of seeds directly from shotgun reads rather than assembled sequences, but is limited to reads larger than 100 bp (the seed size) and has seen only limited usage (e.g. Drosophila 12 genomes project (
<xref rid="B19" ref-type="bibr">19</xref>
)). Tallymer predicts repeats based on k-mer counting in reference genomes and has identified repeats in the maize genome (
<xref rid="B20" ref-type="bibr">20</xref>
), but also relies on Sanger-length reads. Moreover, ‘surrogates’ generated as a side product when running ‘wgs-assembler’ (also known as Celera assembler) (
<xref rid="B21" ref-type="bibr">21</xref>
) represent sequences predicted to be repetitive based on depth of coverage statistics. Bambus 2, a scaffolder specifically adapted to metagenome datasets, can identify ‘variant motifs’ independent of coverage (
<xref rid="B22" ref-type="bibr">22</xref>
). Graph-based variation detection tools such as Cortex (
<xref rid="B23" ref-type="bibr">23</xref>
) can also be used to identify genomic repeats
<italic>de novo</italic>
, but require multiple samples or a finished reference genome. Finally, SDs may be detected via genome-wide all-versus-all alignments that are filtered to fulfill the requirements of ≥1 kb size and ≥90% identity (
<xref rid="B24" ref-type="bibr">24</xref>
). DupMasker uses information from a pre-defined SD-library to automatically detect SDs, but the SD-library limits this application to the human and other primate genomes (
<xref rid="B25" ref-type="bibr">25</xref>
). To date, there exist no resources to identify short length (<1 kb) or more divergent (<90% identity) SD events.</p>
<p>New opportunities in genome analysis have emerged with the advent of high-throughput short-read next-generation sequencing (NGS) technologies (
<xref rid="B3" ref-type="bibr">3</xref>
). However, complex, repeat-rich genomes still present major challenges for modern
<italic>de novo</italic>
assembly algorithms such as EULER (
<xref rid="B26" ref-type="bibr">26</xref>
), Velvet (
<xref rid="B27" ref-type="bibr">27</xref>
), ABySS (
<xref rid="B28" ref-type="bibr">28</xref>
), SOAPdenovo (
<xref rid="B29" ref-type="bibr">29</xref>
), ALLPATHS-LG (
<xref rid="B30" ref-type="bibr">30</xref>
) and CLC Assembly Cell (CLCbio,
<ext-link ext-link-type="uri" xlink:href="http://www.clc-bio.com">http://www.clc-bio.com</ext-link>
). In the de Bruijn graph paradigm that dominates assembly algorithms for such genomes, reads are broken into sub-strings of k nucleotides (k-mers) and used to construct a directed graph. A genome assembly is derived from a path through this graph and repetitive genomic sequences lead to ambiguities while traversing the graph (
<xref rid="B3" ref-type="bibr">3</xref>
) and introduce structural assembly errors such as chimeric or mis-assembled contigs. Highly repetitive genomes usually lead to fragmented genome assemblies with an underrepresentation of repetitive content in the final assembly (
<xref rid="B31" ref-type="bibr">31</xref>
), but can also lead to false assembly repeats in the form of SDs (
<xref rid="B32" ref-type="bibr">32–33</xref>
).</p>
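<p>The ambiguity that repeats introduce into a de Bruijn graph can be illustrated with a minimal Python sketch. It is not taken from any of the assemblers cited above; the function names and the toy reads are hypothetical, and real assemblers additionally handle reverse complements, sequencing errors and coverage information.</p>
<preformat>
```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one outgoing edge, where graph traversal is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Toy example: the shared motif "GATTACA" is followed by different sequence in each read.
reads = ["CCGATTACATT", "AAGATTACAGG"]
graph = de_bruijn_edges(reads, k=4)
ambiguous = branching_nodes(graph)
```
</preformat>
<p>In this toy example the graph branches at the end of the shared motif, which is the kind of node at which an assembler must either stop extending a contig or risk joining sequences that are not adjacent in the genome.</p>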
<p>To address these challenges, k-mer analysis is an important first step in most genome assembly projects. At this stage, k-mers of NGS reads are counted and plotted on a histogram. Such a histogram can be used to predict sequencing errors (
<xref rid="B34" ref-type="bibr">34</xref>
), genome size (
<xref rid="B35" ref-type="bibr">35</xref>
) or repetitive sequences in reads for purposes such as repeat content assessment (
<xref rid="B20" ref-type="bibr">20</xref>
) or scaffolding and gap filling (B. R. Downie, P. Koch, N. Jahn, J. Schumacher and M. Platzer, unpublished results). K-mers derived from the unique fraction of the genome will accumulate in a Poisson-like curve with a peak near the genome coverage, while sequences that occur more than once genome wide are progressively enriched among k-mers with higher coverages.</p>
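<p>As a rough illustration of such a k-mer analysis, the following Python sketch counts k-mers on both strands and summarizes them as a histogram. It is a simplified stand-in for a dedicated k-mer counter, suitable only for toy inputs; all function names are hypothetical.</p>
<preformat>
```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers, skipping k-mers that contain non-ACGT characters."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer).issubset("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """histogram[c] is the number of distinct k-mers observed exactly c times."""
    return Counter(counts.values())


counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
histogram = kmer_histogram(counts)
```
</preformat>
<p>Plotting the histogram values against coverage gives the curve described above: a peak formed by unique k-mers and a tail of higher coverages enriched for repetitive sequence.</p>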
<p>We postulated that de Bruijn graph assemblers could create a repeat library using only ‘abundant’ k-mers (those k-mers that are predicted to occur more than once genome wide). As a proof of principle, we used both simulated and real NGS data from the
<italic>Drosophila melanogaster</italic>
genome to create, validate and annotate
<italic>de novo</italic>
repeat libraries. Velvet, a widely used de Bruijn graph-based
<italic>de novo</italic>
genome assembler, assembled the NGS sequences from which RepeatScout predicted repeat consensuses, and wgs-assembler surrogates were extracted after a
<italic>de novo</italic>
genome assembly of the same NGS data. These repeat libraries were compared to that of RepBase Update and to the ReAS
<italic>de novo</italic>
repeat library (ReASLib) from the Drosophila 12 genomes project (
<xref rid="B19" ref-type="bibr">19</xref>
). Finally, we validated ‘Repetitive motif detection by Assembly of Repetitive K-mers’ (RepARK) on a human Illumina DNA dataset produced for the ALLPATHS-LG publication (
<xref rid="B30" ref-type="bibr">30</xref>
) to ensure its applicability to larger, more complex genomes.</p>
</sec>
<sec sec-type="materials|methods" id="SEC2">
<title>MATERIALS AND METHODS</title>
<sec id="SEC2-1">
<title>The
<italic>Drosophila melanogaster</italic>
genome</title>
<p>The
<italic>D. melanogaster</italic>
R5.43 assembly (170 Mb) is distributed across 15 sequence entries: the left and right arms of chromosomes 2 and 3, chromosome X, the corresponding heterochromatin content of these chromosomes, chromosome Y only as heterochromatin, the mini chromosome 4, the mitochondrial genome, and 40 Mb in two additional pseudo-chromosomes (U and Uextra). Currently, 412 repeat consensuses in RepBase Update (release 20120418) can be extracted with the term ‘
<italic>drosophila melanogaster</italic>
’, of which 249 are non-low-complexity repeats including 26 that are
<italic>D. melanogaster</italic>
-specific repeats (i.e. non-ancestral). We also downloaded the
<italic>D. melanogaster</italic>
repeat library created in the 12
<italic>Drosophila</italic>
genomes project using ReAS (
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.genomics.org.cn/pub/ReAS/drosophila/v2/consensus_fasta/dmel.con.fa.gz">ftp://ftp.genomics.org.cn/pub/ReAS/drosophila/v2/consensus_fasta/dmel.con.fa.gz</ext-link>
) (391 consensuses).</p>
</sec>
<sec id="SEC2-2">
<title>Sequencing data</title>
<p>Sixty-eight million 101 bp reads (‘simulated’; 27 average quality, 40× genome coverage, insert sizes 400 bp and 2500 bp) were simulated with MAQ (version 0.7.1,
<ext-link ext-link-type="uri" xlink:href="http://maq.sourceforge.net">http://maq.sourceforge.net</ext-link>
) without mutations or indels using an Illumina training dataset on the
<italic>D. melanogaster</italic>
genome release R5.43 (including the U and Uextra chromosomes). Additionally, two sets of experimentally obtained Illumina reads (‘real’; ycnbwsp_2: SRX040484; ycnbwsp_7-HE: SRX040486; 83 million reads, 82 nt avg. length, 30 average quality, 40× genome coverage) were downloaded from the Short Read Archive (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/sra">http://www.ncbi.nlm.nih.gov/sra</ext-link>
). They are derived from an individual of the stock (
<ext-link ext-link-type="uri" xlink:href="http://flybase.org/reports/FBst0002057.html">http://flybase.org/reports/FBst0002057.html</ext-link>
) that was used in the release 5 of the
<italic>D. melanogaster</italic>
genome assembly (
<xref rid="B36" ref-type="bibr">36</xref>
). Both simulated and real datasets were error-corrected with QUAKE (
<xref rid="B34" ref-type="bibr">34</xref>
) (version 0.3.4, using default settings and
<italic>k</italic>
= 17). Human Illumina reads derived from a lymphoblastoid cell line (Coriell Institute, GM12878) (101 bp length, 132 Gb total, ∼40× coverage) were downloaded from SRA (SRR067780, SRR067784, SRR067785, SRR067787, SRR067789, SRR067791, SRR067792 and SRR067793) and used directly without error correction.</p>
</sec>
<sec id="SEC2-3">
<title>Building the RepARK repeat libraries</title>
<p>For NGS
<italic>de novo</italic>
repeat library creation, k-mers of NGS whole-genome shotgun (WGS) datasets were first counted with Jellyfish (
<xref rid="B37" ref-type="bibr">37</xref>
) (version 1.1.6) using the highest supported k-mer size of 31 (-m 31, --both-strands). The threshold for ‘abundant’ k-mers (those occurring more than once genome wide) was predicted for each dataset. A histogram of k-mer frequencies was calculated and a linear function was fitted to the descending segment of the Poisson-like unique k-mer fraction. k-mers with a frequency above the point at which the projected linear function crosses the x-axis are expected to occur more than once genome wide. To further guard against contamination of the abundant k-mer set by unique sequences, this value was doubled, and k-mers with a frequency above the doubled value were classified as abundant (simulated: k-mer coverage >60, real: >84, human: >76; Supplementary Figure S1). Abundant k-mers were isolated and independently
<italic>de novo</italic>
assembled using CLC Assembly Cell (CLC) (version 4.0) or Velvet (version 1.2.08) with default settings and k-mer size of 29, resulting in four RepARK
<italic>de novo</italic>
repeat libraries.</p>
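<p>The threshold estimation described above can be sketched in Python as follows. This is one possible reading of the procedure, not the RepARK implementation: the way the unique peak is located (<italic>min_coverage</italic>) and the portion of the descending flank used for the fit (20% to 80% of the peak height) are assumptions made for illustration, and all names are hypothetical.</p>
<preformat>
```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def abundant_threshold(histogram, min_coverage=5, safety_factor=2.0):
    """Estimate the coverage above which k-mers are treated as abundant.

    histogram maps k-mer coverage to the number of distinct k-mers.
    """
    # Assumption: the unique peak is the most populated coverage at or above
    # min_coverage, which skips the low-coverage spike of error k-mers.
    candidates = [c for c in sorted(histogram) if c >= min_coverage]
    peak = max(candidates, key=lambda c: histogram[c])
    height = histogram[peak]
    # Assumption: the descending flank is taken as the points to the right of
    # the peak whose counts lie between 20% and 80% of the peak height.
    flank = [c for c in candidates
             if c > peak and 0.8 * height >= histogram[c] >= 0.2 * height]
    if 2 > len(flank):
        raise ValueError("not enough points on the descending flank")
    slope, intercept = linear_fit(flank, [histogram[c] for c in flank])
    x_intercept = -intercept / slope  # where the fitted line crosses the x-axis
    return safety_factor * x_intercept  # doubled by default, as in the text


# Synthetic triangular histogram: error spike at coverage 1, unique peak at 20.
toy = {1: 5000}
toy.update({c: 1000 - 100 * abs(c - 20) for c in range(11, 30)})
threshold = abundant_threshold(toy)
```
</preformat>
<p>For this synthetic histogram the fitted line crosses the x-axis at a coverage of 30, so k-mers with a coverage above 60 would be classified as abundant. Real histograms are noisier, and the choice of flank points would need to be checked against a plot such as Supplementary Figure S1.</p>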
<p>Additionally, repeat libraries for both real and simulated datasets were
<italic>de novo</italic>
generated using two established methods. First, we applied RepeatScout to predict repetitive consensuses based on a
<italic>de novo</italic>
genome assembly generated by Velvet. Second, we used wgs-assembler to assemble the same datasets and thereby generate surrogates representing those contigs determined to be repetitive. The respective genome assembly statistics can be found in Supplementary Table S1.</p>
<p>The repeat consensuses were annotated with TEclass (
<xref rid="B38" ref-type="bibr">38</xref>
) (version 2.1) using the default training set that contains oligomer frequencies of all RepBase (release 15.07) repeats. For the purposes of subsequent analyses, a sequence was considered a repeat if it aligned more than once to the genome with at least 80% identity.</p>
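<p>The repeat criterion stated above (more than one alignment to the genome at a minimum identity of 80%) can be expressed as a short Python sketch. It assumes that alignments have already been parsed into (consensus name, percent identity) pairs; that input layout and the function name are hypothetical and do not reflect a particular alignment file format.</p>
<preformat>
```python
from collections import Counter


def repetitive_consensuses(hits, min_identity=80.0):
    """Return names of consensuses with more than one hit at or above min_identity.

    hits is an iterable of (consensus_name, percent_identity) pairs,
    one pair per alignment to the reference genome.
    """
    hit_counts = Counter(name for name, identity in hits if identity >= min_identity)
    return {name for name, n_hits in hit_counts.items() if n_hits > 1}


hits = [
    ("c1", 95.0), ("c1", 82.5),  # two qualifying hits: repetitive
    ("c2", 99.0),                # single hit: not repetitive
    ("c3", 85.0), ("c3", 70.0),  # second hit is below the identity cutoff
]
repeats = repetitive_consensuses(hits)
```
</preformat>
<p>Raising <italic>min_identity</italic> to 90 or 95 gives the stricter variants of the same criterion.</p>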
</sec>
<sec id="SEC2-4">
<title>Mapping and repeat masking</title>
<p>All mappings were performed with BLAT (
<xref rid="B39" ref-type="bibr">39</xref>
) (version .34) with default options plus ‘-extendThroughN’ to map over stretches of Ns and ‘-minIdentity=50’ to retain lower-identity hits. The resulting psl files were further filtered for minimum identity where mentioned in the text. Repeat masking was performed with RepeatMasker (version 4.0.0) with the default parameters and either
<italic>D. melanogaster</italic>
repeats from RepBase (DmRepBase, release 20120418) or the specified repeat library. For analysis of Alu repeats in the human genome, we extracted 51 Alu consensus sequences from RepBase (release 18.07) categorized as ‘Homo sapiens and Ancestral’, and determined completeness by masking extracted Alu sequences using the RepARK repeat library.</p>
</sec>
<sec id="SEC2-5">
<title>Retrieving known segmental duplications and comparison to the
<italic>de novo</italic>
repeat consensuses</title>
<p>We downloaded the positions of SDs identified in release 5 of the
<italic>D. melanogaster</italic>
reference sequence (
<ext-link ext-link-type="uri" xlink:href="http://humanparalogy.gs.washington.edu/dm3/dm3wgac.html">http://humanparalogy.gs.washington.edu/dm3/dm3wgac.html</ext-link>
). SDs were retrieved from the reference genome and masked with DmRepBase, leaving 3.09 Mb of SD regions free of RepBase repeats. Each repeat library was also masked separately with DmRepBase. The remaining SD sequences were subsequently masked with each masked repeat library to calculate the fraction of SDs each library can identify.</p>
</sec>
</sec>
<sec sec-type="results" id="SEC3">
<title>RESULTS</title>
<p>A summary of the method to create
<italic>de novo</italic>
repeat libraries from NGS WGS reads (RepARK) is depicted in Figure 
<xref ref-type="fig" rid="F1">1</xref>
. To benchmark our approach for the
<italic>de novo</italic>
creation of repeat libraries, we used the
<italic>D. melanogaster</italic>
genome due to the availability of a high-quality reference genome (
<xref rid="B40" ref-type="bibr">40–41</xref>
) (version R5.43), an advanced, manually curated repeat library (RepBase Update version 20120418), and NGS WGS reads. For this study, we analyzed both simulated (‘simulated’) and experimentally derived (‘real’) datasets. With simulated data, we know the genomic sequence from which the data is derived, and can therefore rule out, as sources of error in our analyses, both mis-assemblies in the reference sequence and sequencing biases of the Illumina technology (e.g. underrepresentation of G+C-rich regions (
<xref rid="B42" ref-type="bibr">42</xref>
)). With real data, we can determine whether the method remains valid in the face of real-world confounding factors such as technical biases or contamination.</p>
<fig id="F1" position="float">
<label>Figure 1.</label>
<caption>
<p>Workflow of the repeat library creation pipeline RepARK. WGS sequencing reads (
<bold>a</bold>
) contain unique (black) and repetitive (red) fractions of the genome. K-mers of all reads (
<bold>b</bold>
) are counted and the threshold for abundant k-mers is determined. These abundant k-mers are isolated (
<bold>c</bold>
) and assembled by a
<italic>de novo</italic>
genome assembly program (such as Velvet) into repeat consensus sequences (
<bold>d</bold>
).</p>
</caption>
<graphic xlink:href="gku210fig1"></graphic>
</fig>
<p>RepARK libraries were compared against both established repeat libraries and those generated using state-of-the-art methods (Tables 
<xref ref-type="table" rid="T1">1</xref>
and
<xref ref-type="table" rid="T2">2</xref>
). The
<italic>D. melanogaster</italic>
repeat library of RepBase (DmRepBase) and the ReAS
<italic>de novo</italic>
repeat library (ReASLib) from the Drosophila 12 genomes project (
<xref rid="B19" ref-type="bibr">19</xref>
) were downloaded as established repeat libraries. RepeatScout was used to generate repeat libraries based on Velvet
<italic>de novo</italic>
genome assemblies of both simulated and real datasets, while wgs-assembler surrogates are those which have been identified as repeats during assembly graph resolution. Generation of RepARK libraries using either Velvet (RepARK Velvet) or CLC Assembly Cell (RepARK CLC) was one to two orders of magnitude (14×–465×) faster than when using
<italic>de novo</italic>
state-of-the-art methods. Notably, the N50 values (the consensus size above which half the total size of the library is represented) of the repeat libraries generated by RepARK, RepeatScout or wgs-assembler are one to two orders of magnitude (16×–93×) smaller than those of the RepBase or ReASLib repeat libraries, indicating extensive fragmentation of the consensuses. The larger total length of libraries created by wgs-assembler and RepARK (2×–7×) relative to DmRepBase hints at higher redundancy.</p>
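<p>The N50 and N90 metrics reported in the tables follow the definition given above and can be computed with a few lines of Python. This is a generic sketch; the function name and the <italic>fraction</italic> parameter are illustrative choices.</p>
<preformat>
```python
def nx(lengths, fraction=0.5):
    """Length L such that sequences of length >= L hold at least `fraction`
    of the total library length (fraction=0.5 gives N50, 0.9 gives N90)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be non-empty")


consensus_lengths = [2, 3, 4, 5, 6]  # toy library, total length 20
n50 = nx(consensus_lengths)          # 6 + 5 = 11 reaches half of 20
n90 = nx(consensus_lengths, 0.9)     # 6 + 5 + 4 + 3 = 18 reaches 90% of 20
```
</preformat>
<p>A library made up of many short consensuses therefore has a low N50 even when its total length is large, which is why the metric is read here as an indicator of fragmentation.</p>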
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<title>
<italic>D. melanogaster</italic>
repeat library metrics from simulated NGS reads</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">RepeatScout</th>
<th align="left" rowspan="1" colspan="1">wgs-assembler</th>
<th align="left" rowspan="1" colspan="1">RepARK CLC</th>
<th align="left" rowspan="1" colspan="1">RepARK Velvet</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Identification method</td>
<td align="left" rowspan="1" colspan="1">Velvet + RepeatScout</td>
<td align="left" rowspan="1" colspan="1">wgs-assembler surrogates</td>
<td align="left" rowspan="1" colspan="1">CLC</td>
<td align="left" rowspan="1" colspan="1">Velvet</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Number of consensuses</td>
<td align="left" rowspan="1" colspan="1">1239</td>
<td align="left" rowspan="1" colspan="1">18 203</td>
<td align="left" rowspan="1" colspan="1">67 968</td>
<td align="left" rowspan="1" colspan="1">14 147</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Total length (Mb)</td>
<td align="left" rowspan="1" colspan="1">0.174</td>
<td align="left" rowspan="1" colspan="1">4.3</td>
<td align="left" rowspan="1" colspan="1">4.3</td>
<td align="left" rowspan="1" colspan="1">1.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Min./max. length (bp)</td>
<td align="left" rowspan="1" colspan="1">51/2565</td>
<td align="left" rowspan="1" colspan="1">66/6446</td>
<td align="left" rowspan="1" colspan="1">30/6945</td>
<td align="left" rowspan="1" colspan="1">57/6943</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">N50 (bp)</td>
<td align="left" rowspan="1" colspan="1">78</td>
<td align="left" rowspan="1" colspan="1">147</td>
<td align="left" rowspan="1" colspan="1">58</td>
<td align="left" rowspan="1" colspan="1">149</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">N90 (bp)</td>
<td align="left" rowspan="1" colspan="1">64</td>
<td align="left" rowspan="1" colspan="1">116</td>
<td align="left" rowspan="1" colspan="1">36</td>
<td align="left" rowspan="1" colspan="1">59</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Time to create (h)</td>
<td align="left" rowspan="1" colspan="1">8.75</td>
<td align="left" rowspan="1" colspan="1">284</td>
<td align="left" rowspan="1" colspan="1">0.61</td>
<td align="left" rowspan="1" colspan="1">0.61</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" position="float">
<label>Table 2.</label>
<caption>
<title>
<italic>D. melanogaster</italic>
repeat library metrics from real data</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">DmRepBase</th>
<th align="left" rowspan="1" colspan="1">ReASLib</th>
<th align="left" rowspan="1" colspan="1">RepeatScout</th>
<th align="left" rowspan="1" colspan="1">wgs-assembler</th>
<th align="left" rowspan="1" colspan="1">RepARK CLC</th>
<th align="left" rowspan="1" colspan="1">RepARK Velvet</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Source data</td>
<td align="left" rowspan="1" colspan="1">N/A</td>
<td align="left" rowspan="1" colspan="1">Sanger reads</td>
<td colspan="1" rowspan="1">Illumina reads</td>
<td align="left" rowspan="1" colspan="1">Illumina reads</td>
<td align="left" rowspan="1" colspan="1">Illumina reads</td>
<td align="left" rowspan="1" colspan="1">Illumina reads</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Identification method</td>
<td align="left" rowspan="1" colspan="1">Manual curation</td>
<td align="left" rowspan="1" colspan="1">Seed based</td>
<td align="left" rowspan="1" colspan="1">Velvet + RepeatScout</td>
<td align="left" rowspan="1" colspan="1">wgs-assembler surrogates</td>
<td align="left" rowspan="1" colspan="1">CLC</td>
<td align="left" rowspan="1" colspan="1">Velvet</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Number of consensuses</td>
<td align="left" rowspan="1" colspan="1">249</td>
<td align="left" rowspan="1" colspan="1">391</td>
<td align="left" rowspan="1" colspan="1">414</td>
<td align="left" rowspan="1" colspan="1">14 296</td>
<td align="left" rowspan="1" colspan="1">19 677</td>
<td align="left" rowspan="1" colspan="1">4284</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Total length (Mb)</td>
<td align="left" rowspan="1" colspan="1">0.7</td>
<td align="left" rowspan="1" colspan="1">0.96</td>
<td align="left" rowspan="1" colspan="1">0.035</td>
<td align="left" rowspan="1" colspan="1">2.2</td>
<td align="left" rowspan="1" colspan="1">1.6</td>
<td align="left" rowspan="1" colspan="1">0.87</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Min./max. length (bp)</td>
<td align="left" rowspan="1" colspan="1">52/14 477</td>
<td align="left" rowspan="1" colspan="1">101/12 876</td>
<td align="left" rowspan="1" colspan="1">51/616</td>
<td align="left" rowspan="1" colspan="1">64/25 962</td>
<td align="left" rowspan="1" colspan="1">30/7589</td>
<td align="left" rowspan="1" colspan="1">57/7587</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">N50 (bp)</td>
<td align="left" rowspan="1" colspan="1">5402</td>
<td align="left" rowspan="1" colspan="1">4757</td>
<td align="left" rowspan="1" colspan="1">83</td>
<td align="left" rowspan="1" colspan="1">158</td>
<td align="left" rowspan="1" colspan="1">87</td>
<td align="left" rowspan="1" colspan="1">290</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">N90 (bp)</td>
<td align="left" rowspan="1" colspan="1">1750</td>
<td align="left" rowspan="1" colspan="1">1247</td>
<td align="left" rowspan="1" colspan="1">56</td>
<td align="left" rowspan="1" colspan="1">76</td>
<td align="left" rowspan="1" colspan="1">38</td>
<td align="left" rowspan="1" colspan="1">89</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Time to create (h)</td>
<td align="left" rowspan="1" colspan="1">N/A</td>
<td align="left" rowspan="1" colspan="1">N/A</td>
<td align="left" rowspan="1" colspan="1">5.75</td>
<td align="left" rowspan="1" colspan="1">101</td>
<td align="left" rowspan="1" colspan="1">0.28</td>
<td align="left" rowspan="1" colspan="1">0.28</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="T1TFN1">
<p>N/A: not applicable</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>To evaluate specificity, each repeat library was mapped onto the
<italic>D. melanogaster</italic>
genome using BLAT and filtered for minimum identity of 80%. Consensuses encompassing the bulk of each repeat library length (84–99%) mapped multiple times to the reference sequence (henceforth called ‘repetitive consensuses’) (Figure 
<xref ref-type="fig" rid="F2">2</xref>
, black), while the remaining sequence aligned only once or not at all (Figure 
<xref ref-type="fig" rid="F2">2</xref>
, gray). Similar fractions of repetitive consensuses were measured at identity thresholds of 90% and 95% for all libraries (Supplementary Table S2). The largest fraction of non-repetitive consensuses was observed in the wgs-assembler library created from real data. Although composed almost entirely of repetitive consensuses, the RepeatScout library was considerably shorter overall than the other libraries (Tables 
<xref ref-type="table" rid="T1">1</xref>
and
<xref ref-type="table" rid="T2">2</xref>
, Figure 
<xref ref-type="fig" rid="F2">2</xref>
). Repeat masking the two assemblies used by RepeatScout revealed that only 6.5% (simulated) and 4.7% (real) of each assembly could be identified as repeats. The vast majority (>99%) of consensuses from RepARK libraries had an average nucleotide coverage >10× (Supplementary Figure S2), and most repetitive consensuses align fewer than 100 times to the reference (Supplementary Figure S3).</p>
<fig id="F2" position="float">
<label>Figure 2.</label>
<caption>
<p>Cumulative length of repetitive and non-repetitive consensuses within each library. Black: repetitive consensuses (i.e. align more than once to the reference); gray: non-repetitive consensuses (i.e. singly mapping or not at all); Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.</p>
</caption>
<graphic xlink:href="gku210fig2"></graphic>
</fig>
<p>To evaluate the potential of each library for masking genomic repeats, the
<italic>D. melanogaster</italic>
reference was masked using RepeatMasker with the corresponding library (Figure 
<xref ref-type="fig" rid="F3">3</xref>
, black). More of the reference sequence was identified as repetitive when using either the RepARK libraries or ReASLib than when using RepBase. Of the state-of-the-art methods, wgs-assembler-based repeat libraries provided comparable results only with simulated reads, while the two RepeatScout-derived libraries could mask only a small fraction of the reference. Moreover, when the masked reference was subsequently masked with DmRepBase, only a small fraction of the unmasked genome sequence was identified as repetitive for RepARK libraries (0.18–1.18%) and ReASLib (0.56%) (Figure 
<xref ref-type="fig" rid="F3">3</xref>
, gray), while wgs-assembler (2.3–8.5%) and RepeatScout (17–20%) derived libraries left much of the repeat fraction of the genome unmasked.</p>
<fig id="F3" position="float">
<label>Figure 3.</label>
<caption>
<p>Repeat fractions identified in the
<italic>D. melanogaster</italic>
reference sequence. Black: fraction of the reference masked by RepeatMasker using the respective repeat library; gray: fraction of the reference that was subsequently masked by RepeatMasker using RepBase; Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.</p>
</caption>
<graphic xlink:href="gku210fig3"></graphic>
</fig>
<p>DmRepBase contains 249 annotated repeat consensuses. Completeness of each of these consensuses in the other repeat libraries was determined by masking them using RepeatMasker and DmRepBase (Figure 
<xref ref-type="fig" rid="F4">4</xref>
, Supplementary Table S3) and evaluating what fraction of each DmRepBase consensus was used for masking. In general, LTR and non-LTR retrotransposons showed a higher median completeness than DNA transposons. Across repeat classes, RepARK libraries consistently showed completeness equal or superior to that of the other libraries investigated.</p>
<fig id="F4" position="float">
<label>Figure 4.</label>
<caption>
<p>Boxplot of DmRepBase repeat class completeness in the
<italic>de novo</italic>
repeat libraries. DNA: 33 DNA transposons; LTR: 138 LTR retrotransposons; non-LTR: 41 non-LTR retrotransposons; Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads; box: first and third quartiles; horizontal line: median; whiskers: most extreme value within 1.5× of inter-quartile range; dots: outliers. A full table of repeat family representation in the RepARK libraries can be found in Supplementary Table S3.</p>
</caption>
<graphic xlink:href="gku210fig4"></graphic>
</fig>
<p>Next, we explored potentially novel repeats in each of the
<italic>de novo</italic>
libraries by mapping the consensuses not recognized as RepBase repeats by RepeatMasker to the
<italic>D. melanogaster</italic>
reference. Using this approach, we found consensuses that map with high identity proximal to one another on the same chromosome (Supplementary Figure S4) and/or to the corresponding heterochromatin entry (Supplementary Figure S5), patterns characteristic of SDs (
<xref rid="B10" ref-type="bibr">10</xref>
). We therefore retrieved a list of known
<italic>D. melanogaster</italic>
SDs and determined the fraction identified by those
<italic>de novo</italic>
library consensuses that were not recognized as DmRepBase repeats. The largest fraction of the SDs could be identified by the RepARK libraries compared to the other
<italic>de novo</italic>
repeat libraries studied (Figure 
<xref ref-type="fig" rid="F5">5</xref>
), with the exception of wgs-assembler surrogates using simulated data.</p>
<fig id="F5" position="float">
<label>Figure 5.</label>
<caption>
<p>Fractions of known
<italic>D. melanogaster</italic>
segmental duplications identified by the
<italic>de novo</italic>
repeat libraries. Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.</p>
</caption>
<graphic xlink:href="gku210fig5"></graphic>
</fig>
<p>TEclass, commonly used to annotate repeat libraries, requires consensuses ≥50 bp for classification. In each library analyzed in this study, more than 90% of such consensuses were successfully classified by TEclass. A greater proportion of consensuses in the RepARK libraries were annotated as DNA transposons and fewer as retrotransposons as compared to ReASLib or DmRepBase (Supplementary Table S4), and more of the reference sequence was annotated as DNA transposons at the expense of retrotransposons using the RepARK libraries (Supplementary Table S5). This bias could be due to the extensive fragmentation of the RepARK libraries, to which the TEclass algorithm may not be adapted. Consequently, we restricted the TEclass annotation to consensuses >100 bp, which considerably reduced the bias toward DNA transposons in the repeat annotation of the genome using these RepARK libraries (Figure 
<xref ref-type="fig" rid="F6">6</xref>
).</p>
<fig id="F6" position="float">
<label>Figure 6.</label>
<caption>
<p>Fractions of the
<italic>D. melanogaster</italic>
genome reference classified according to annotated repeat libraries. Black: DNA transposon sequence; dark gray: retrotransposon sequence; light gray: unclear; Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.</p>
</caption>
<graphic xlink:href="gku210fig6"></graphic>
</fig>
<p>Finally, we wanted to determine whether the findings of RepARK as applied to the
<italic>D. melanogaster</italic>
datasets could be extended to larger, more complex genomes. To this end, we downloaded Illumina read libraries used in the
<italic>de novo</italic>
assembly of a human genome and generated a RepARK repeat library using the same parameters described previously (Table 
<xref ref-type="table" rid="T3">3</xref>
). In this case, we utilized Velvet due to its frequent use in academic environments. The RepARK library (7.9 Mb) was again substantially longer than the human RepBase repeat library (HsRepBase, 1.6 Mb), and a similar fraction of the cumulative length of the human RepARK library was found to be composed of repetitive consensuses (93%) as in that for
<italic>D. melanogaster</italic>
(Figure 
<xref ref-type="fig" rid="F2">2</xref>
). Additionally, 37 of the 51 highly abundant and mobile Alu families were at least 50% represented within the RepARK library (Supplementary Table S6).</p>
<table-wrap id="T3" position="float">
<label>Table 3.</label>
<caption>
<title>Human repeat library metrics and mapping results against the human reference sequence</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">HsRepBase</th>
<th align="left" rowspan="1" colspan="1">RepARK Velvet</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Number of consensuses</td>
<td align="left" rowspan="1" colspan="1">1439</td>
<td align="left" rowspan="1" colspan="1">62 425</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Total length (kb)</td>
<td align="left" rowspan="1" colspan="1">1566</td>
<td align="left" rowspan="1" colspan="1">7882</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Min./max. length (bp)</td>
<td align="left" rowspan="1" colspan="1">63/9044</td>
<td align="left" rowspan="1" colspan="1">57/42 518</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">N50 (bp)</td>
<td align="left" rowspan="1" colspan="1">2822</td>
<td align="left" rowspan="1" colspan="1">143</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">N90 (bp)</td>
<td align="left" rowspan="1" colspan="1">471</td>
<td align="left" rowspan="1" colspan="1">57</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Time to create (hrs)</td>
<td align="left" rowspan="1" colspan="1">N/A</td>
<td align="left" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Number of consensuses with multiple hits</td>
<td align="left" rowspan="1" colspan="1">1167 (81%
<sup>a</sup>
)</td>
<td align="left" rowspan="1" colspan="1">57 239 (92%
<sup>a</sup>
)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Total length of consensuses with multiple hits (kb)</td>
<td align="left" rowspan="1" colspan="1">1471 (94%
<sup>b</sup>
)</td>
<td align="left" rowspan="1" colspan="1">7318 (93%
<sup>b</sup>
)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="T2TFN1">
<p>
<sup>a</sup>
Ratio to the total number of consensuses of the library.</p>
</fn>
<fn id="T2TFN2">
<p>
<sup>b</sup>
Ratio to the total length of the library.</p>
</fn>
</table-wrap-foot>
</table-wrap>
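<p>The N50 and N90 values in Table 3 describe how the total library length is distributed across consensuses. The following is a minimal sketch of how such a metric can be computed, not the code used in this study: the function name nx and the toy lengths are illustrative assumptions. Nx is taken here as the length of the shortest consensus in the smallest set of longest consensuses that together cover at least x% of the total length.</p>

```python
def nx(lengths, percent):
    """Return Nx: the length L such that sequences of length >= L
    together cover at least `percent` % of the total length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]


# Toy consensus lengths in bp (hypothetical, not taken from the paper).
toy_lengths = [8, 8, 4, 3, 3, 2, 2, 2]
print(nx(toy_lengths, 50))
print(nx(toy_lengths, 90))
```

<p>Read this way, the N50 of 143 bp reported for the RepARK Velvet library, against a maximum consensus length of 42 518 bp, indicates that most of the library length resides in short consensuses.</p>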
<p>Surprisingly, RepARK also generated a number of very long consensuses from the human NGS data, the longest being 42 518 bp (almost twice as long as the longest known LTR retrotransposon
<italic>ogre</italic>
with 25 kb (
<xref rid="B43" ref-type="bibr">43</xref>
)). Aligning this consensus with BLAST against ‘Nucleotide collection (nt/nr)’ (
<ext-link ext-link-type="uri" xlink:href="http://blast.ncbi.nlm.nih.gov">http://blast.ncbi.nlm.nih.gov</ext-link>
) identified a highly significant match to the Epstein-Barr virus (EBV, also known as human herpesvirus 4, HHV-4), which was used to establish the sequenced human cell line (Coriell Institute, GM12878). After further investigation, 23 repeat consensuses were identified with >90% of their bases mapping and
<italic>p</italic>
< 10
<sup>−60</sup>
to the EBV genome. The majority (90.5%) of the 171 kb virus genome is covered by at least one of these consensuses under these criteria (Figure 
<xref ref-type="fig" rid="F7">7</xref>
), and the remaining 9.5% is covered by consensuses using more relaxed criteria.</p>
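<p>A covered fraction such as the 90.5% reported above can be derived from alignment coordinates by merging overlapping intervals before summing their lengths, so that positions hit by several consensuses are counted once. The sketch below illustrates this under stated assumptions: the function name covered_fraction and the toy coordinates are hypothetical, intervals are 1-based and inclusive, and all intervals are assumed to lie within the genome. It is not the procedure used in this study.</p>

```python
def covered_fraction(intervals, genome_length):
    """Fraction of positions 1..genome_length covered by at least one
    closed interval (start, end), using 1-based inclusive coordinates."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Disjoint from the running block: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


# Toy alignments on a hypothetical 100 bp genome.
toy_alignments = [(1, 50), (40, 80), (91, 100)]
print(covered_fraction(toy_alignments, 100))
```

<p>Sorting the intervals first means each alignment is visited once, so the merge itself is linear in the number of alignments.</p>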
<fig id="F7" position="float">
<label>Figure 7.</label>
<caption>
<p>High confidence alignments of human RepARK consensuses (right half) to the Epstein-Barr virus genome (left half, HHV-4). Each ribbon represents a consensus alignment with >90% mapping and
<italic>p</italic>
< 10
<sup>−60</sup>
, encompassing 90.5% of the Epstein-Barr virus genome. Lower confidence consensuses align to the remaining 9.5% with more relaxed criteria. Three consensuses map multiple times to the virus genome sequence (NODE_48265, NODE_888, NODE_5085; dark red). Created with Circoletto (
<ext-link ext-link-type="uri" xlink:href="http://bat.ina.certh.gr/tools/circoletto/">http://bat.ina.certh.gr/tools/circoletto/</ext-link>
).</p>
</caption>
<graphic xlink:href="gku210fig7"></graphic>
</fig>
</sec>
<sec sec-type="discussion" id="SEC4">
<title>DISCUSSION</title>
<p>Generation of repeat libraries is an important step for accurate analyses of genomes, but has historically relied heavily upon manual curation (
<xref rid="B44" ref-type="bibr">44</xref>
). With the availability of genome assemblies and NGS, new prediction models came into practice (reviewed in (
<xref rid="B45" ref-type="bibr">45</xref>
)). These approaches are dependent on the quality of the genome sequence analyzed, and assemblers using short reads from NGS technologies are notoriously poor at resolving repetitive genomic segments due to the length and complexity of genomic repeats. As an alternative to a reference-based approach, we describe here RepARK, a novel, NGS-based method for building and annotating a library of repeat consensuses without a reference genome. This method relies on k-mer counting, a routine step in sequence analysis (
<xref rid="B26" ref-type="bibr">26</xref>
). After counting, k-mers predicted to occur more than once genome-wide (‘abundant’) are
<italic>de novo</italic>
assembled with a de Bruijn graph assembler and a comprehensive repeat library is generated.</p>
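<p>The counting and selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the RepARK implementation: the function names canonical, count_kmers and abundant_kmers are hypothetical, the abundance cutoff is passed in as a plain parameter rather than derived from the k-mer histogram, and a production pipeline would use a dedicated k-mer counter instead of an in-memory dictionary. Each k-mer is counted together with its reverse complement, since reads originate from both strands.</p>

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer).issubset("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def abundant_kmers(counts, threshold):
    """Keep k-mers observed more often than `threshold` (an assumed cutoff)."""
    return {kmer: n for kmer, n in counts.items() if n > threshold}


# Toy reads (hypothetical); ACGTA and its reverse complement TACGT
# are counted under the same canonical key.
reads = ["TTACGTAGG", "CCACGTACC"]
counts = count_kmers(reads, k=5)
print(abundant_kmers(counts, threshold=1))
```

<p>In this sketch the selected k-mers would then be handed to a de Bruijn graph assembler; how the cutoff is chosen for a given coverage is left open here.</p>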
<p>For the proof-of-principle, we selected the
<italic>D. melanogaster</italic>
genome for its moderate size and repeat content and for the high-quality reference sequence available (
<xref rid="B40" ref-type="bibr">40–41</xref>
). We validated the method on both simulated and experimentally derived data using both commercial (CLC) and open-source (Velvet) de Bruijn graph assemblers. The RepARK repeat libraries are longer overall than the RepBase library (0.87–4.3 Mb versus 0.7 Mb), and >90% of consensuses in all RepARK libraries are repetitive. Moreover, only a small fraction of the reference masked with a RepARK library can subsequently be identified by RepBase as a repeat (0.18–1.18%), indicating that the bulk of RepBase repeats in the genome can be identified using the RepARK method. Although we required a sequence identity of >80% for mapping of the consensuses to the reference (the standard threshold for the identification of a repeat motif), the number of repetitive consensuses in the RepARK libraries did not change even with a threshold of >90% or >95% (Supplementary Table S2), most likely due to the sequence fragmentation in the de Bruijn graph. In the RepARK libraries, both the high fraction of consensus length that maps more than once to the reference and the greater overall length of such consensuses indicate that the presented method may generate genome-specific repeat libraries with comparable or even higher sensitivity and specificity than the RepBase approach, which focuses on the identification and reconstruction of genome-wide dispersed transposons and does not address, e.g., SDs.</p>
<p>Although wgs-assembler applied to simulated data produced a comprehensive repeat library by almost all metrics examined in this study, these positive results were not reproduced with a real dataset. In particular, while the repeat library derived from real data contained repetitive consensuses with a longer total length than the other libraries, it was substantially less effective in masking the reference genome (22% versus 27–32% for the other non-RepeatScout libraries). This discrepancy between the length of repetitive consensuses and the length of the reference masked could be due to consensus redundancy. It is also important to note that the RepeatScout-based method, arguably the most popular state-of-the-art method for
<italic>de novo</italic>
generation of repeat libraries, was the least effective at generating comprehensive repeat libraries of all the methods examined. The fact that a low completeness of repeats could be identified in the Velvet-based genome assemblies only underscores the reliance of RepeatScout on a high-quality draft reference assembly that is frequently difficult to obtain using only NGS libraries. In the course of preparing this publication, a novel
<italic>D. melanogaster</italic>
assembly was reported that has been derived from >90× coverage by reads obtained using the PacBio technology with an average length of 10 kb (
<ext-link ext-link-type="uri" xlink:href="http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html">http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html</ext-link>
). In this assembly, PacBio reads resolve unique repetitive transposable elements up to ∼10 kb in size, indicating that long reads may also provide new opportunities for
<italic>de novo</italic>
repeat prediction. Finally, the RepARK method is orders of magnitude faster than the state-of-the-art methods due to assembly graph simplification, making RepARK a useful tool for prototyping reference repeat libraries as well as generating repeat libraries for individual samples. While the ReAS library was comparable to the RepARK libraries in almost every metric evaluated and ReAS uses a similar method to generate repeat libraries, it requires labor- and cost-intensive Sanger-type long sequences and cannot handle short NGS reads. Indeed, we were not able to evaluate ReAS using either our simulated or real read data due to the limitations of the program.</p>
<p>More consensuses were found in RepARK libraries from the simulated dataset than from the real data (Table 
<xref ref-type="table" rid="T1">1</xref>
). Such a discrepancy could result from assembly errors in the reference sequence leading to an artificial overrepresentation of certain motifs. This explanation is supported by noting that the U and Uextra chromosomes, included as templates for read simulation, are hotspots for assembly errors (
<xref rid="B46" ref-type="bibr">46</xref>
). Alternatively, real sequencing data are subjected to various technological biases leading to the underrepresentation of particular motifs (e.g. GC-rich or heterochromatin sequence, both regions of high repeat content (
<xref rid="B42" ref-type="bibr">42</xref>
)). Finally, it is possible that this discrepancy is due to actual genomic differences between the reference and the DNA sample sequenced such as copy number variation or SDs.</p>
<p>Although we observe RepBase consensuses with a completeness of <50%, only ∼1% of the RepARK library-masked
<italic>D. melanogaster</italic>
reference genome could be subsequently masked using RepBase (Figure 
<xref ref-type="fig" rid="F3">3</xref>
, gray). It is particularly telling that one-third of such consensuses belong to the RepBase group ‘remaining’, which contains consensus annotations such as ‘ARTEFACT’. Such consensuses are derived from cloning artifacts and would therefore not be detected using cloning-free NGS methods. Moreover, the DmRepBase library contains ancestral repeat consensuses that may not be repetitive or represented at all in the reference genome and therefore could not be detected as repeats by RepARK. Alternatively, some of the very short RepARK consensuses may not be usable by RepeatMasker when masking the DmRepBase library, resulting in an underestimation of completeness. Finally, highly divergent repeat motifs may cause excessive fragmentation of the assembly graph, and the resulting consensuses may be lost to our size cutoff of 50 bp. This seems a likely scenario given the high fraction of short consensuses within the RepARK libraries and could be at least partially rectified by using a
<italic>de novo</italic>
assembler that uses more relaxed criteria for calling consensus sequences.</p>
<p>More of the genome is masked by RepeatMasker using the RepARK libraries than with DmRepBase (1.6–4.5% additional sequence). Part of this additional masked sequence can be explained by the observation that a portion of the RepARK consensuses represents SDs, which can be specific for individual genomes. Such a finding is compatible with the fact that RepBase libraries contain only simple and genome-wide dispersed repeats. To date, SDs are detected using traditional whole-genome alignment methods based on criteria that exclude shorter, more divergent sequences (<90% identity, <1 kb). This limitation could explain some of the putative novel SD events identified using the RepARK libraries, such as that observed for chromosome X (Supplementary Figure S4). Additionally, the use of whole-genome alignments to detect SDs runs the risk of false positives/negatives due to assembly errors in the reference sequence. Together with the high ratio of fully mappable consensuses, these data further underpin the conclusion that the consensuses produced by RepARK are both highly specific and sensitive for detection of repetitive elements of a given genome.</p>
<p>The bias toward DNA transposon annotation by TEclass for the NGS
<italic>de novo</italic>
libraries represents a limitation for accurately annotating repeat classes in a genome. This behavior is most likely due to the highly fragmented nature of such libraries, which may present a challenge for some of the annotation models implemented in TEclass. Revising these models may produce more accurate annotation of highly fragmented repeat libraries such as those investigated in this study. Alternatively, the creation of longer repeat consensuses (such as those found in the RepARK library generated by Velvet) or the restriction of the TEclass library annotation to longer consensuses (>100 bp) can also improve repeat annotation. Regardless of further improvements, precise examination of repeat evolution in newly assembled genomes will require closer, manual examination. Nevertheless, the consensuses of NGS
<italic>de novo</italic>
libraries can be used to identify and isolate repetitive genomic elements with high accuracy and to provide a first pass annotation.</p>
<p>The high rate of true positives and long overall length seen for
<italic>D. melanogaster</italic>
RepARK libraries were also found in the human RepARK library, indicating that this method is readily extensible to larger and more complex genomes. Alu repeat elements are high-frequency retrotransposons that are still mobile within the human genome (
<xref rid="B47" ref-type="bibr">47</xref>
), and a majority of Alu families were represented by more than 50% in the RepARK repeat library. Unexpectedly, the entire EBV genome was found within the RepARK library, a finding that can be readily explained by noting that EBV was used to establish the cell line from which the human DNA was isolated and sequenced. As EBV generally does not integrate into the host chromosomes, it exists as a circular episome within the nucleus (see review (
<xref rid="B48" ref-type="bibr">48</xref>
)). This finding suggests that RepARK may also represent a novel method to quickly identify contaminants within a DNA dataset and may find future application not only as a repeat library generator, but also as a diagnostic tool.</p>
<p content-type="F7 T3">Taken together, our k-mer-based method can use sequences as short as 31 bp, is independent of an assembled genome sequence, can utilize any de Bruijn graph assembler, and generates consensuses of which the vast majority are repetitive and can be annotated by TEclass. It can be applied to genomes at least as large and complex as the human genome. Construction of these libraries is orders of magnitude faster than with the state-of-the-art methods examined and represents a new approach to identify SDs, multi-copy contaminations or pathogens directly from NGS datasets. Finally, we showed that RepARK repeat libraries are as good as or better than those of the state-of-the-art methods examined.</p>
</sec>
<sec id="SEC5">
<title>SUPPLEMENTARY DATA</title>
<p>
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/lookup/suppl/doi:10.1093/nar/gku179/-/DC1">Supplementary Data</ext-link>
are available at NAR Online.</p>
</sec>
<sec id="SEC6">
<title>DATA ACCESS</title>
<p>The generated repeat libraries can be downloaded from
<ext-link ext-link-type="ftp" xlink:href="ftp://genome.fli-leibniz.de/pub/repeat-assemblies/">ftp://genome.fli-leibniz.de/pub/repeat-assemblies/</ext-link>
and the RepARK script via
<ext-link ext-link-type="uri" xlink:href="https://github.com/PhKoch/RepARK">https://github.com/PhKoch/RepARK</ext-link>
.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>SUPPLEMENTARY DATA</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_42_9_e80__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_gku210_nar-01653-met-k-2013-File010.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGMENTS</title>
<p>We would like to thank Casey Bergman for proof-reading the manuscript and suggesting the comparison to ReAS, pointing us to the NGS dataset of
<italic>D. melanogaster</italic>
used in this study, and giving numerous helpful comments on the manuscript. We would also like to thank Jens Schumacher for helpful discussions regarding k-mer histogram analysis.</p>
</ack>
<sec id="SEC7">
<title>FUNDING</title>
<p>Klaus Tschira Stiftung [00.179.2011 to P.K.].</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
<sec id="SEC9">
<title>REFERENCES</title>
</sec>
<ref-list>
<ref id="B1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lander</surname>
<given-names>E.S.</given-names>
</name>
<name>
<surname>Linton</surname>
<given-names>L.M.</given-names>
</name>
<name>
<surname>Birren</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zody</surname>
<given-names>M.C.</given-names>
</name>
<name>
<surname>Baldwin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Devon</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dewar</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Doyle</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>FitzHugh</surname>
<given-names>W.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Initial sequencing and analysis of the human genome</article-title>
<source>Nature</source>
<year>2001</year>
<volume>409</volume>
<fpage>860</fpage>
<lpage>921</lpage>
<pub-id pub-id-type="pmid">11237011</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mayer</surname>
<given-names>K.F.</given-names>
</name>
<name>
<surname>Waugh</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>J.W.</given-names>
</name>
<name>
<surname>Schulman</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Langridge</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Platzer</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fincher</surname>
<given-names>G.B.</given-names>
</name>
<name>
<surname>Muehlbauer</surname>
<given-names>G.J.</given-names>
</name>
<name>
<surname>Sato</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Close</surname>
<given-names>T.J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A physical, genetic and functional sequence assembly of the barley genome</article-title>
<source>Nature</source>
<year>2012</year>
<volume>491</volume>
<fpage>711</fpage>
<lpage>716</lpage>
<pub-id pub-id-type="pmid">23075845</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Treangen</surname>
<given-names>T.J.</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
</person-group>
<article-title>Repetitive DNA and next-generation sequencing: computational challenges and solutions</article-title>
<source>Nat. Rev. Genet.</source>
<year>2012</year>
<volume>13</volume>
<fpage>36</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="pmid">22124482</pub-id>
</element-citation>
</ref>
<ref id="B4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yandell</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ence</surname>
<given-names>D.</given-names>
</name>
</person-group>
<article-title>A beginner's guide to eukaryotic genome annotation</article-title>
<source>Nat. Rev. Genet.</source>
<year>2012</year>
<volume>13</volume>
<fpage>329</fpage>
<lpage>342</lpage>
<pub-id pub-id-type="pmid">22510764</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feschotte</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Pritham</surname>
<given-names>E.J.</given-names>
</name>
</person-group>
<article-title>DNA transposons and the evolution of eukaryotic genomes</article-title>
<source>Annu. Rev. Genet.</source>
<year>2007</year>
<volume>41</volume>
<fpage>331</fpage>
<lpage>368</lpage>
<pub-id pub-id-type="pmid">18076328</pub-id>
</element-citation>
</ref>
<ref id="B6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Orr</surname>
<given-names>H.T.</given-names>
</name>
<name>
<surname>Zoghbi</surname>
<given-names>H.Y.</given-names>
</name>
</person-group>
<article-title>Trinucleotide repeat disorders</article-title>
<source>Annu. Rev. Neurosci.</source>
<year>2007</year>
<volume>30</volume>
<fpage>575</fpage>
<lpage>621</lpage>
<pub-id pub-id-type="pmid">17417937</pub-id>
</element-citation>
</ref>
<ref id="B7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hancks</surname>
<given-names>D.C.</given-names>
</name>
<name>
<surname>Kazazian</surname>
<given-names>H.H.</given-names>
<suffix>Jr</suffix>
</name>
</person-group>
<article-title>Active human retrotransposons: variation and disease</article-title>
<source>Curr. Opin. Genet. Dev.</source>
<year>2012</year>
<volume>22</volume>
<fpage>191</fpage>
<lpage>203</lpage>
<pub-id pub-id-type="pmid">22406018</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lim</surname>
<given-names>K.G.</given-names>
</name>
<name>
<surname>Kwoh</surname>
<given-names>C.K.</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>L.Y.</given-names>
</name>
<name>
<surname>Wirawan</surname>
<given-names>A.</given-names>
</name>
</person-group>
<article-title>Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance</article-title>
<source>Brief. Bioinform.</source>
<year>2012</year>
<volume>14</volume>
<fpage>67</fpage>
<lpage>81</lpage>
<pub-id pub-id-type="pmid">22648964</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jurka</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kapitonov</surname>
<given-names>V.V.</given-names>
</name>
<name>
<surname>Kohany</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Jurka</surname>
<given-names>M.V.</given-names>
</name>
</person-group>
<article-title>Repetitive sequences in complex genomes: structure and evolution</article-title>
<source>Annu. Rev. Genomics Hum. Genet.</source>
<year>2007</year>
<volume>8</volume>
<fpage>241</fpage>
<lpage>259</lpage>
<pub-id pub-id-type="pmid">17506661</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eichler</surname>
<given-names>E.E.</given-names>
</name>
</person-group>
<article-title>Recent duplication, domain accretion and the dynamic mutation of the human genome</article-title>
<source>Trends Genet.</source>
<year>2001</year>
<volume>17</volume>
<fpage>661</fpage>
<lpage>669</lpage>
<pub-id pub-id-type="pmid">11672867</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benson</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>Tandem repeats finder: a program to analyze DNA sequences</article-title>
<source>Nucleic Acids Res.</source>
<year>1999</year>
<volume>27</volume>
<fpage>573</fpage>
<lpage>580</lpage>
<pub-id pub-id-type="pmid">9862982</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jurka</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kapitonov</surname>
<given-names>V.V.</given-names>
</name>
<name>
<surname>Pavlicek</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Klonowski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kohany</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Walichiewicz</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Repbase Update, a database of eukaryotic repetitive elements</article-title>
<source>Cytogenet. Genome Res.</source>
<year>2005</year>
<volume>110</volume>
<fpage>462</fpage>
<lpage>467</lpage>
<pub-id pub-id-type="pmid">16093699</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Eddy</surname>
<given-names>S.R.</given-names>
</name>
</person-group>
<article-title>Automated de novo identification of repeat sequence families in sequenced genomes</article-title>
<source>Genome Res.</source>
<year>2002</year>
<volume>12</volume>
<fpage>1269</fpage>
<lpage>1276</lpage>
<pub-id pub-id-type="pmid">12176934</pub-id>
</element-citation>
</ref>
<ref id="B14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Price</surname>
<given-names>A.L.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>N.C.</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P.A.</given-names>
</name>
</person-group>
<article-title>De novo identification of repeat families in large genomes</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<issue>Suppl. 1</issue>
<fpage>i351</fpage>
<lpage>i358</lpage>
<pub-id pub-id-type="pmid">15961478</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurtz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Choudhuri</surname>
<given-names>J.V.</given-names>
</name>
<name>
<surname>Ohlebusch</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Schleiermacher</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Giegerich</surname>
<given-names>R.</given-names>
</name>
</person-group>
<article-title>REPuter: the manifold applications of repeat analysis on a genomic scale</article-title>
<source>Nucleic Acids Res.</source>
<year>2001</year>
<volume>29</volume>
<fpage>4633</fpage>
<lpage>4642</lpage>
<pub-id pub-id-type="pmid">11713313</pub-id>
</element-citation>
</ref>
<ref id="B16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Achaz</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Boyer</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Rocha</surname>
<given-names>E.P.</given-names>
</name>
<name>
<surname>Viari</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Coissac</surname>
<given-names>E.</given-names>
</name>
</person-group>
<article-title>Repseek, a tool to retrieve approximate repeats from large DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>119</fpage>
<lpage>121</lpage>
<pub-id pub-id-type="pmid">17038345</pub-id>
</element-citation>
</ref>
<ref id="B17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Koning</surname>
<given-names>A.P.</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Castoe</surname>
<given-names>T.A.</given-names>
</name>
<name>
<surname>Batzer</surname>
<given-names>M.A.</given-names>
</name>
<name>
<surname>Pollock</surname>
<given-names>D.D.</given-names>
</name>
</person-group>
<article-title>Repetitive elements may comprise over two-thirds of the human genome</article-title>
<source>PLoS Genet.</source>
<year>2011</year>
<volume>7</volume>
<fpage>e1002384</fpage>
<pub-id pub-id-type="pmid">22144907</pub-id>
</element-citation>
</ref>
<ref id="B18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>G.K.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ReAS: recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun</article-title>
<source>PLoS Comput. Biol.</source>
<year>2005</year>
<volume>1</volume>
<fpage>e43</fpage>
<pub-id pub-id-type="pmid">16184192</pub-id>
</element-citation>
</ref>
<ref id="B19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>A.G.</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>M.B.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>D.R.</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>C.M.</given-names>
</name>
<name>
<surname>Oliver</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Markow</surname>
<given-names>T.A.</given-names>
</name>
<name>
<surname>Kaufman</surname>
<given-names>T.C.</given-names>
</name>
<name>
<surname>Kellis</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gelbart</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Iyer</surname>
<given-names>V.N.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Evolution of genes and genomes on the Drosophila phylogeny</article-title>
<source>Nature</source>
<year>2007</year>
<volume>450</volume>
<fpage>203</fpage>
<lpage>218</lpage>
<pub-id pub-id-type="pmid">17994087</pub-id>
</element-citation>
</ref>
<ref id="B20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurtz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Narechania</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>J.C.</given-names>
</name>
<name>
<surname>Ware</surname>
<given-names>D.</given-names>
</name>
</person-group>
<article-title>A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</article-title>
<source>BMC Genomics</source>
<year>2008</year>
<volume>9</volume>
<fpage>517</fpage>
<pub-id pub-id-type="pmid">18976482</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Myers</surname>
<given-names>E.W.</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G.G.</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>A.L.</given-names>
</name>
<name>
<surname>Dew</surname>
<given-names>I.M.</given-names>
</name>
<name>
<surname>Fasulo</surname>
<given-names>D.P.</given-names>
</name>
<name>
<surname>Flanigan</surname>
<given-names>M.J.</given-names>
</name>
<name>
<surname>Kravitz</surname>
<given-names>S.A.</given-names>
</name>
<name>
<surname>Mobarry</surname>
<given-names>C.M.</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>K.H.</given-names>
</name>
<name>
<surname>Remington</surname>
<given-names>K.A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A whole-genome assembly of Drosophila</article-title>
<source>Science</source>
<year>2000</year>
<volume>287</volume>
<fpage>2196</fpage>
<lpage>2204</lpage>
<pub-id pub-id-type="pmid">10731133</pub-id>
</element-citation>
</ref>
<ref id="B22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Treangen</surname>
<given-names>T.J.</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Bambus 2: scaffolding metagenomes</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>2964</fpage>
<lpage>2971</lpage>
<pub-id pub-id-type="pmid">21926123</pub-id>
</element-citation>
</ref>
<ref id="B23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iqbal</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Caccamo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Turner</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Flicek</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>McVean</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>De novo assembly and genotyping of variants using colored de Bruijn graphs</article-title>
<source>Nat. Genet.</source>
<year>2012</year>
<volume>44</volume>
<fpage>226</fpage>
<lpage>232</lpage>
<pub-id pub-id-type="pmid">22231483</pub-id>
</element-citation>
</ref>
<ref id="B24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bailey</surname>
<given-names>J.A.</given-names>
</name>
<name>
<surname>Yavor</surname>
<given-names>A.M.</given-names>
</name>
<name>
<surname>Massa</surname>
<given-names>H.F.</given-names>
</name>
<name>
<surname>Trask</surname>
<given-names>B.J.</given-names>
</name>
<name>
<surname>Eichler</surname>
<given-names>E.E.</given-names>
</name>
</person-group>
<article-title>Segmental duplications: organization and impact within the current human genome project assembly</article-title>
<source>Genome Res.</source>
<year>2001</year>
<volume>11</volume>
<fpage>1005</fpage>
<lpage>1017</lpage>
<pub-id pub-id-type="pmid">11381028</pub-id>
</element-citation>
</ref>
<ref id="B25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Hubley</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Smit</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Eichler</surname>
<given-names>E.E.</given-names>
</name>
</person-group>
<article-title>DupMasker: a tool for annotating primate segmental duplications</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>1362</fpage>
<lpage>1368</lpage>
<pub-id pub-id-type="pmid">18502942</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pevzner</surname>
<given-names>P.A.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M.S.</given-names>
</name>
</person-group>
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
<source>Proc. Natl. Acad. Sci. U.S.A.</source>
<year>2001</year>
<volume>98</volume>
<fpage>9748</fpage>
<lpage>9753</lpage>
<pub-id pub-id-type="pmid">11504945</pub-id>
</element-citation>
</ref>
<ref id="B27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>D.R.</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E.</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="B28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simpson</surname>
<given-names>J.T.</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>S.D.</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>J.E.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>S.J.</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I.</given-names>
</name>
</person-group>
<article-title>ABySS: a parallel assembler for short read sequence data</article-title>
<source>Genome Res.</source>
<year>2009</year>
<volume>19</volume>
<fpage>1117</fpage>
<lpage>1123</lpage>
<pub-id pub-id-type="pmid">19251739</pub-id>
</element-citation>
</ref>
<ref id="B29">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shan</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Res.</source>
<year>2010</year>
<volume>20</volume>
<fpage>265</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="pmid">20019144</pub-id>
</element-citation>
</ref>
<ref id="B30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gnerre</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Maccallum</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Przybylski</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ribeiro</surname>
<given-names>F.J.</given-names>
</name>
<name>
<surname>Burton</surname>
<given-names>J.N.</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>B.J.</given-names>
</name>
<name>
<surname>Sharpe</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Shea</surname>
<given-names>T.P.</given-names>
</name>
<name>
<surname>Sykes</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>High-quality draft assemblies of mammalian genomes from massively parallel sequence data</article-title>
<source>Proc. Natl. Acad. Sci. U.S.A.</source>
<year>2011</year>
<volume>108</volume>
<fpage>1513</fpage>
<lpage>1518</lpage>
<pub-id pub-id-type="pmid">21187386</pub-id>
</element-citation>
</ref>
<ref id="B31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Phillippy</surname>
<given-names>A.M.</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>M.C.</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Genome assembly forensics: finding the elusive mis-assembly</article-title>
<source>Genome Biol.</source>
<year>2008</year>
<volume>9</volume>
<fpage>R55</fpage>
<pub-id pub-id-type="pmid">18341692</pub-id>
</element-citation>
</ref>
<ref id="B32">
<label>32.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>D.R.</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
</person-group>
<article-title>Detection and correction of false segmental duplications caused by genome mis-assembly</article-title>
<source>Genome Biol.</source>
<year>2010</year>
<volume>11</volume>
<fpage>R28</fpage>
<pub-id pub-id-type="pmid">20219098</pub-id>
</element-citation>
</ref>
<ref id="B33">
<label>33.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zimin</surname>
<given-names>A.V.</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>D.R.</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Marcais</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
<name>
<surname>Yorke</surname>
<given-names>J.A.</given-names>
</name>
</person-group>
<article-title>Mis-assembled “segmental duplications” in two versions of the Bos taurus genome</article-title>
<source>PLoS One</source>
<year>2012</year>
<volume>7</volume>
<fpage>e42680</fpage>
<pub-id pub-id-type="pmid">22880081</pub-id>
</element-citation>
</ref>
<ref id="B34">
<label>34.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>D.R.</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>M.C.</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol.</source>
<year>2010</year>
<volume>11</volume>
<fpage>R116</fpage>
<pub-id pub-id-type="pmid">21114842</pub-id>
</element-citation>
</ref>
<ref id="B35">
<label>35.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M.S.</given-names>
</name>
</person-group>
<article-title>Estimating the repeat structure and length of DNA sequences using L-tuples</article-title>
<source>Genome Res.</source>
<year>2003</year>
<volume>13</volume>
<fpage>1916</fpage>
<lpage>1922</lpage>
<pub-id pub-id-type="pmid">12902383</pub-id>
</element-citation>
</ref>
<ref id="B36">
<label>36.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langley</surname>
<given-names>C.H.</given-names>
</name>
<name>
<surname>Crepeau</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cardeno</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Corbett-Detig</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Stevens</surname>
<given-names>K.</given-names>
</name>
</person-group>
<article-title>Circumventing heterozygosity: sequencing the amplified genome of a single haploid Drosophila melanogaster embryo</article-title>
<source>Genetics</source>
<year>2011</year>
<volume>188</volume>
<fpage>239</fpage>
<lpage>246</lpage>
<pub-id pub-id-type="pmid">21441209</pub-id>
</element-citation>
</ref>
<ref id="B37">
<label>37.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcais</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C.</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="B38">
<label>38.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abrusan</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Grundmann</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>DeMester</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Makalowski</surname>
<given-names>W.</given-names>
</name>
</person-group>
<article-title>TEclass—a tool for automated classification of unknown eukaryotic transposable elements</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1329</fpage>
<lpage>1330</lpage>
<pub-id pub-id-type="pmid">19349283</pub-id>
</element-citation>
</ref>
<ref id="B39">
<label>39.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>W.J.</given-names>
</name>
</person-group>
<article-title>BLAT—the BLAST-like alignment tool</article-title>
<source>Genome Res.</source>
<year>2002</year>
<volume>12</volume>
<fpage>656</fpage>
<lpage>664</lpage>
<pub-id pub-id-type="pmid">11932250</pub-id>
</element-citation>
</ref>
<ref id="B40">
<label>40.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adams</surname>
<given-names>M.D.</given-names>
</name>
<name>
<surname>Celniker</surname>
<given-names>S.E.</given-names>
</name>
<name>
<surname>Holt</surname>
<given-names>R.A.</given-names>
</name>
<name>
<surname>Evans</surname>
<given-names>C.A.</given-names>
</name>
<name>
<surname>Gocayne</surname>
<given-names>J.D.</given-names>
</name>
<name>
<surname>Amanatides</surname>
<given-names>P.G.</given-names>
</name>
<name>
<surname>Scherer</surname>
<given-names>S.E.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>P.W.</given-names>
</name>
<name>
<surname>Hoskins</surname>
<given-names>R.A.</given-names>
</name>
<name>
<surname>Galle</surname>
<given-names>R.F.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The genome sequence of
<italic>Drosophila melanogaster</italic>
</article-title>
<source>Science</source>
<year>2000</year>
<volume>287</volume>
<fpage>2185</fpage>
<lpage>2195</lpage>
<pub-id pub-id-type="pmid">10731132</pub-id>
</element-citation>
</ref>
<ref id="B41">
<label>41.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Celniker</surname>
<given-names>S.E.</given-names>
</name>
<name>
<surname>Wheeler</surname>
<given-names>D.A.</given-names>
</name>
<name>
<surname>Kronmiller</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Carlson</surname>
<given-names>J.W.</given-names>
</name>
<name>
<surname>Halpern</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Adams</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Champe</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dugan</surname>
<given-names>S.P.</given-names>
</name>
<name>
<surname>Frise</surname>
<given-names>E.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Finishing a whole-genome shotgun: release 3 of the
<italic>Drosophila melanogaster</italic>
euchromatic genome sequence</article-title>
<source>Genome Biol.</source>
<year>2002</year>
<volume>3</volume>
<comment>RESEARCH0079</comment>
</element-citation>
</ref>
<ref id="B42">
<label>42.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dohm</surname>
<given-names>J.C.</given-names>
</name>
<name>
<surname>Lottaz</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Borodina</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Himmelbauer</surname>
<given-names>H.</given-names>
</name>
</person-group>
<article-title>Substantial biases in ultra-short read data sets from high-throughput DNA sequencing</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>e105</fpage>
<pub-id pub-id-type="pmid">18660515</pub-id>
</element-citation>
</ref>
<ref id="B43">
<label>43.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Macas</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Neumann</surname>
<given-names>P.</given-names>
</name>
</person-group>
<article-title>Ogre elements—a distinct group of plant Ty3/gypsy-like retrotransposons</article-title>
<source>Gene</source>
<year>2007</year>
<volume>390</volume>
<fpage>108</fpage>
<lpage>116</lpage>
<pub-id pub-id-type="pmid">17052864</pub-id>
</element-citation>
</ref>
<ref id="B44">
<label>44.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kohany</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Gentles</surname>
<given-names>A.J.</given-names>
</name>
<name>
<surname>Hankus</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Jurka</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>474</fpage>
<pub-id pub-id-type="pmid">17064419</pub-id>
</element-citation>
</ref>
<ref id="B45">
<label>45.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bergman</surname>
<given-names>C.M.</given-names>
</name>
<name>
<surname>Quesneville</surname>
<given-names>H.</given-names>
</name>
</person-group>
<article-title>Discovering and detecting transposable elements in genome sequences</article-title>
<source>Brief Bioinform.</source>
<year>2007</year>
<volume>8</volume>
<fpage>382</fpage>
<lpage>392</lpage>
<pub-id pub-id-type="pmid">17932080</pub-id>
</element-citation>
</ref>
<ref id="B46">
<label>46.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Smith</surname>
<given-names>C.D.</given-names>
</name>
<name>
<surname>Shu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mungall</surname>
<given-names>C.J.</given-names>
</name>
<name>
<surname>Karpen</surname>
<given-names>G.H.</given-names>
</name>
</person-group>
<article-title>The Release 5.1 annotation of
<italic>Drosophila melanogaster</italic>
heterochromatin</article-title>
<source>Science</source>
<year>2007</year>
<volume>316</volume>
<fpage>1586</fpage>
<lpage>1591</lpage>
<pub-id pub-id-type="pmid">17569856</pub-id>
</element-citation>
</ref>
<ref id="B47">
<label>47.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bennett</surname>
<given-names>E.A.</given-names>
</name>
<name>
<surname>Keller</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Mills</surname>
<given-names>R.E.</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Moran</surname>
<given-names>J.V.</given-names>
</name>
<name>
<surname>Weichenrieder</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Devine</surname>
<given-names>S.E.</given-names>
</name>
</person-group>
<article-title>Active Alu retrotransposons in the human genome</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>1875</fpage>
<lpage>1883</lpage>
<pub-id pub-id-type="pmid">18836035</pub-id>
</element-citation>
</ref>
<ref id="B48">
<label>48.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morissette</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Flamand</surname>
<given-names>L.</given-names>
</name>
</person-group>
<article-title>Herpesviruses and chromosomal integration</article-title>
<source>J. Virol.</source>
<year>2010</year>
<volume>84</volume>
<fpage>12100</fpage>
<lpage>12109</lpage>
<pub-id pub-id-type="pmid">20844040</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F370 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000F370 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021