OCR Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A Cross-Language Approach to Historic Document Retrieval

Internal identifier: 000D24 (Istex/Corpus); previous: 000D23; next: 000D25

Authors: Marijn Koolen; Frans Adriaans; Jaap Kamps; Maarten de Rijke

RBID : ISTEX:F7E92D83042FD1B63D36DE5B5BF79368F41831EA

Abstract

Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th-century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows. First, we are able to automatically construct rules for modernizing historic language by comparing historic and modern corpora on (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
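
As a rough illustration of finding (c) in the abstract, the sketch below compares the relative frequency of character n-grams in a historic and a modern word sample and lists the n-grams that are much more typical of the historic sample. It is not the authors' implementation: the function names, the ratio threshold, the smoothing floor and the three-word toy samples are all assumptions made for this example, and the paper's actual rule construction is not described on this page.

```python
from collections import Counter


def ngram_rel_freq(words, n):
    """Relative frequency of each character n-gram over a list of words."""
    counts = Counter(
        word[i:i + n]
        for word in words
        for i in range(len(word) - n + 1)
    )
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def typically_historic(historic_words, modern_words, n=2, ratio=2.0, floor=1e-6):
    """N-grams at least `ratio` times more frequent in the historic sample.

    `floor` (an assumed smoothing constant) stands in for the modern
    frequency of n-grams that never occur in the modern sample.
    """
    hist = ngram_rel_freq(historic_words, n)
    mod = ngram_rel_freq(modern_words, n)
    scored = {g: f / max(mod.get(g, 0.0), floor) for g, f in hist.items()}
    # Strongest candidates first; ties are broken alphabetically.
    return sorted(
        (g for g, s in scored.items() if s >= ratio),
        key=lambda g: (-scored[g], g),
    )


# Toy samples, hypothetical and not taken from the paper's collection.
historic = ["coninck", "volck", "dinck"]
modern = ["koning", "volk", "ding"]
candidates = typically_historic(historic, modern)
print(candidates)
```

N-grams flagged this way are only candidates for the left-hand side of a rewrite rule. Choosing the modern replacement for each one would need a further step, such as aligning historic and modern word pairs.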

DOI: 10.1007/11735106_36

Links to Exploration step

ISTEX:F7E92D83042FD1B63D36DE5B5BF79368F41831EA

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A Cross-Language Approach to Historic Document Retrieval</title>
<author>
<name sortKey="Koolen, Marijn" sort="Koolen, Marijn" uniqKey="Koolen M" first="Marijn" last="Koolen">Marijn Koolen</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Adriaans, Frans" sort="Adriaans, Frans" uniqKey="Adriaans F" first="Frans" last="Adriaans">Frans Adriaans</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Kamps, Jaap" sort="Kamps, Jaap" uniqKey="Kamps J" first="Jaap" last="Kamps">Jaap Kamps</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="De Rijke, Maarten" sort="De Rijke, Maarten" uniqKey="De Rijke M" first="Maarten" last="De Rijke">Maarten De Rijke</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:F7E92D83042FD1B63D36DE5B5BF79368F41831EA</idno>
<date when="2006" year="2006">2006</date>
<idno type="doi">10.1007/11735106_36</idno>
<idno type="url">https://api.istex.fr/document/F7E92D83042FD1B63D36DE5B5BF79368F41831EA/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000D24</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">A Cross-Language Approach to Historic Document Retrieval</title>
<author>
<name sortKey="Koolen, Marijn" sort="Koolen, Marijn" uniqKey="Koolen M" first="Marijn" last="Koolen">Marijn Koolen</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Adriaans, Frans" sort="Adriaans, Frans" uniqKey="Adriaans F" first="Frans" last="Adriaans">Frans Adriaans</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Kamps, Jaap" sort="Kamps, Jaap" uniqKey="Kamps J" first="Jaap" last="Kamps">Jaap Kamps</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="De Rijke, Maarten" sort="De Rijke, Maarten" uniqKey="De Rijke M" first="Maarten" last="De Rijke">Maarten De Rijke</name>
<affiliation>
<mods:affiliation>ISLA, University of Amsterdam, The Netherlands</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2006</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">F7E92D83042FD1B63D36DE5B5BF79368F41831EA</idno>
<idno type="DOI">10.1007/11735106_36</idno>
<idno type="ChapterID">36</idno>
<idno type="ChapterID">Chap36</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).</div>
</front>
</TEI>
<istex>
<corpusName>springer</corpusName>
<author>
<json:item>
<name>Marijn Koolen</name>
<affiliations>
<json:string>ISLA, University of Amsterdam, The Netherlands</json:string>
<json:string>Archives and Information Studies, University of Amsterdam, The Netherlands</json:string>
</affiliations>
</json:item>
<json:item>
<name>Frans Adriaans</name>
<affiliations>
<json:string>ISLA, University of Amsterdam, The Netherlands</json:string>
<json:string>Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands</json:string>
</affiliations>
</json:item>
<json:item>
<name>Jaap Kamps</name>
<affiliations>
<json:string>ISLA, University of Amsterdam, The Netherlands</json:string>
<json:string>Archives and Information Studies, University of Amsterdam, The Netherlands</json:string>
</affiliations>
</json:item>
<json:item>
<name>Maarten de Rijke</name>
<affiliations>
<json:string>ISLA, University of Amsterdam, The Netherlands</json:string>
</affiliations>
</json:item>
</author>
<language>
<json:string>eng</json:string>
</language>
<abstract>Abstract: Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).</abstract>
<qualityIndicators>
<score>7.592</score>
<pdfVersion>1.3</pdfVersion>
<pdfPageSize>430 x 660 pts</pdfPageSize>
<refBibsNative>false</refBibsNative>
<keywordCount>0</keywordCount>
<abstractCharCount>1556</abstractCharCount>
<pdfWordCount>5219</pdfWordCount>
<pdfCharCount>31260</pdfCharCount>
<pdfPageCount>13</pdfPageCount>
<abstractWordCount>216</abstractWordCount>
</qualityIndicators>
<title>A Cross-Language Approach to Historic Document Retrieval</title>
<genre.original>
<json:string>OriginalPaper</json:string>
</genre.original>
<chapterId>
<json:string>36</json:string>
<json:string>Chap36</json:string>
</chapterId>
<genre>
<json:string>conference [eBooks]</json:string>
</genre>
<serie>
<editor>
<json:item>
<name>David Hutchison</name>
<affiliations>
<json:string>Lancaster University, UK</json:string>
</affiliations>
</json:item>
<json:item>
<name>Takeo Kanade</name>
<affiliations>
<json:string>Carnegie Mellon University, Pittsburgh, PA, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Josef Kittler</name>
<affiliations>
<json:string>University of Surrey, Guildford, UK</json:string>
</affiliations>
</json:item>
<json:item>
<name>Jon M. Kleinberg</name>
<affiliations>
<json:string>Cornell University, Ithaca, NY, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Friedemann Mattern</name>
<affiliations>
<json:string>ETH Zurich, Switzerland</json:string>
</affiliations>
</json:item>
<json:item>
<name>John C. Mitchell</name>
<affiliations>
<json:string>Stanford University, CA, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Moni Naor</name>
<affiliations>
<json:string>Weizmann Institute of Science, Rehovot, Israel</json:string>
</affiliations>
</json:item>
<json:item>
<name>Oscar Nierstrasz</name>
<affiliations>
<json:string>University of Bern, Switzerland</json:string>
</affiliations>
</json:item>
<json:item>
<name>C. Pandu Rangan</name>
<affiliations>
<json:string>Indian Institute of Technology, Madras, India</json:string>
</affiliations>
</json:item>
<json:item>
<name>Bernhard Steffen</name>
<affiliations>
<json:string>University of Dortmund, Germany</json:string>
</affiliations>
</json:item>
<json:item>
<name>Madhu Sudan</name>
<affiliations>
<json:string>Massachusetts Institute of Technology, MA, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Demetri Terzopoulos</name>
<affiliations>
<json:string>University of California, Los Angeles, CA, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Doug Tygar</name>
<affiliations>
<json:string>University of California, Berkeley, CA, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Moshe Y. Vardi</name>
<affiliations>
<json:string>Rice University, Houston, TX, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Gerhard Weikum</name>
<affiliations>
<json:string>Max-Planck Institute of Computer Science, Saarbruecken, Germany</json:string>
</affiliations>
</json:item>
</editor>
<issn>
<json:string>0302-9743</json:string>
</issn>
<language>
<json:string>unknown</json:string>
</language>
<eissn>
<json:string>1611-3349</json:string>
</eissn>
<title>Lecture Notes in Computer Science</title>
<copyrightDate>2006</copyrightDate>
</serie>
<host>
<editor>
<json:item>
<name>Mounia Lalmas</name>
<affiliations>
<json:string>Queen Mary, University of London, London, UK</json:string>
<json:string>E-mail: mounia@dcs.qmul.ac.uk</json:string>
</affiliations>
</json:item>
<json:item>
<name>Andy MacFarlane</name>
<affiliations>
<json:string>Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK</json:string>
<json:string>E-mail: andym@soi.city.ac.uk</json:string>
</affiliations>
</json:item>
<json:item>
<name>Stefan Rüger</name>
<affiliations>
<json:string>Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK</json:string>
<json:string>E-mail: s.rueger@open.ac.uk</json:string>
</affiliations>
</json:item>
<json:item>
<name>Anastasios Tombros</name>
<affiliations>
<json:string>Queen Mary University of London, UK</json:string>
<json:string>E-mail: tassos@dcs.qmul.ac.uk</json:string>
</affiliations>
</json:item>
<json:item>
<name>Theodora Tsikrika</name>
<affiliations>
<json:string>CWI, Amsterdam, The Netherlands</json:string>
<json:string>E-mail: theodora@dcs.qmul.ac.uk</json:string>
</affiliations>
</json:item>
<json:item>
<name>Alexei Yavlinsky</name>
<affiliations>
<json:string>Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK</json:string>
<json:string>E-mail: alexei.yavlinsky@imperial.ac.uk</json:string>
</affiliations>
</json:item>
</editor>
<subject>
<json:item>
<value>Computer Science</value>
</json:item>
<json:item>
<value>Computer Science</value>
</json:item>
<json:item>
<value>Information Storage and Retrieval</value>
</json:item>
<json:item>
<value>Database Management</value>
</json:item>
<json:item>
<value>Artificial Intelligence (incl. Robotics)</value>
</json:item>
<json:item>
<value>Information Systems Applications (incl. Internet)</value>
</json:item>
<json:item>
<value>Multimedia Information Systems</value>
</json:item>
<json:item>
<value>Document Preparation and Text Processing</value>
</json:item>
</subject>
<isbn>
<json:string>978-3-540-33347-0</json:string>
</isbn>
<language>
<json:string>unknown</json:string>
</language>
<eissn>
<json:string>1611-3349</json:string>
</eissn>
<title>Advances in Information Retrieval</title>
<genre.original>
<json:string>Proceedings</json:string>
</genre.original>
<bookId>
<json:string>978-3-540-33348-7</json:string>
</bookId>
<volume>3936</volume>
<pages>
<last>419</last>
<first>407</first>
</pages>
<issn>
<json:string>0302-9743</json:string>
</issn>
<genre>
<json:string>Book Series</json:string>
</genre>
<eisbn>
<json:string>978-3-540-33348-7</json:string>
</eisbn>
<copyrightDate>2006</copyrightDate>
<doi>
<json:string>10.1007/11735106</json:string>
</doi>
</host>
<publicationDate>2006</publicationDate>
<copyrightDate>2006</copyrightDate>
<doi>
<json:string>10.1007/11735106_36</json:string>
</doi>
<id>F7E92D83042FD1B63D36DE5B5BF79368F41831EA</id>
<fulltext>
<json:item>
<original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/F7E92D83042FD1B63D36DE5B5BF79368F41831EA/fulltext/pdf</uri>
</json:item>
<json:item>
<original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/F7E92D83042FD1B63D36DE5B5BF79368F41831EA/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/F7E92D83042FD1B63D36DE5B5BF79368F41831EA/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a" type="main" xml:lang="en">A Cross-Language Approach to Historic Document Retrieval</title>
<respStmt xml:id="ISTEX-API" resp="Références bibliographiques récupérées via GROBID" name="ISTEX-API (INIST-CNRS)"></respStmt>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher>Springer Berlin Heidelberg</publisher>
<pubPlace>Berlin, Heidelberg</pubPlace>
<availability>
<p>SPRINGER</p>
</availability>
<date>2006</date>
</publicationStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a" type="main" xml:lang="en">A Cross-Language Approach to Historic Document Retrieval</title>
<author>
<persName>
<forename type="first">Marijn</forename>
<surname>Koolen</surname>
</persName>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</affiliation>
</author>
<author>
<persName>
<forename type="first">Frans</forename>
<surname>Adriaans</surname>
</persName>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<affiliation>Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands</affiliation>
</author>
<author>
<persName>
<forename type="first">Jaap</forename>
<surname>Kamps</surname>
</persName>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</affiliation>
</author>
<author>
<persName>
<forename type="first">Maarten</forename>
<surname>de Rijke</surname>
</persName>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
</author>
</analytic>
<monogr>
<title level="m">Advances in Information Retrieval</title>
<title level="m" type="sub">28th European Conference on IR Research, ECIR 2006, London, UK, April 10-12, 2006. Proceedings</title>
<idno type="pISBN">978-3-540-33347-0</idno>
<idno type="eISBN">978-3-540-33348-7</idno>
<idno type="pISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="DOI">10.1007/11735106</idno>
<idno type="BookID">978-3-540-33348-7</idno>
<idno type="BookTitleID">138131</idno>
<idno type="BookSequenceNumber">3936</idno>
<idno type="BookVolumeNumber">3936</idno>
<idno type="BookChapterCount">69</idno>
<editor>
<persName>
<forename type="first">Mounia</forename>
<surname>Lalmas</surname>
</persName>
<email>mounia@dcs.qmul.ac.uk</email>
<affiliation>Queen Mary, University of London, London, UK</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Andy</forename>
<surname>MacFarlane</surname>
</persName>
<email>andym@soi.city.ac.uk</email>
<affiliation>Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Stefan</forename>
<surname>Rüger</surname>
</persName>
<email>s.rueger@open.ac.uk</email>
<affiliation>Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Anastasios</forename>
<surname>Tombros</surname>
</persName>
<email>tassos@dcs.qmul.ac.uk</email>
<affiliation>Queen Mary University of London, UK</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Theodora</forename>
<surname>Tsikrika</surname>
</persName>
<email>theodora@dcs.qmul.ac.uk</email>
<affiliation>CWI, Amsterdam, The Netherlands</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Alexei</forename>
<surname>Yavlinsky</surname>
</persName>
<email>alexei.yavlinsky@imperial.ac.uk</email>
<affiliation>Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK</affiliation>
</editor>
<imprint>
<publisher>Springer Berlin Heidelberg</publisher>
<pubPlace>Berlin, Heidelberg</pubPlace>
<date type="published" when="2006"></date>
<biblScope unit="volume">3936</biblScope>
<biblScope unit="page" from="407">407</biblScope>
<biblScope unit="page" to="419">419</biblScope>
</imprint>
</monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<editor>
<persName>
<forename type="first">David</forename>
<surname>Hutchison</surname>
</persName>
<affiliation>Lancaster University, UK</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Takeo</forename>
<surname>Kanade</surname>
</persName>
<affiliation>Carnegie Mellon University, Pittsburgh, PA, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Josef</forename>
<surname>Kittler</surname>
</persName>
<affiliation>University of Surrey, Guildford, UK</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Jon</forename>
<forename type="first">M.</forename>
<surname>Kleinberg</surname>
</persName>
<affiliation>Cornell University, Ithaca, NY, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Friedemann</forename>
<surname>Mattern</surname>
</persName>
<affiliation>ETH Zurich, Switzerland</affiliation>
</editor>
<editor>
<persName>
<forename type="first">John</forename>
<forename type="first">C.</forename>
<surname>Mitchell</surname>
</persName>
<affiliation>Stanford University, CA, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Moni</forename>
<surname>Naor</surname>
</persName>
<affiliation>Weizmann Institute of Science, Rehovot, Israel</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Oscar</forename>
<surname>Nierstrasz</surname>
</persName>
<affiliation>University of Bern, Switzerland</affiliation>
</editor>
<editor>
<persName>
<forename type="first">C.</forename>
<surname>Pandu Rangan</surname>
</persName>
<affiliation>Indian Institute of Technology, Madras, India</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Bernhard</forename>
<surname>Steffen</surname>
</persName>
<affiliation>University of Dortmund, Germany</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Madhu</forename>
<surname>Sudan</surname>
</persName>
<affiliation>Massachusetts Institute of Technology, MA, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Demetri</forename>
<surname>Terzopoulos</surname>
</persName>
<affiliation>University of California, Los Angeles, CA, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Doug</forename>
<surname>Tygar</surname>
</persName>
<affiliation>University of California, Berkeley, CA, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Moshe</forename>
<forename type="first">Y.</forename>
<surname>Vardi</surname>
</persName>
<affiliation>Rice University, Houston, TX, USA</affiliation>
</editor>
<editor>
<persName>
<forename type="first">Gerhard</forename>
<surname>Weikum</surname>
</persName>
<affiliation>Max-Planck Institute of Computer Science, Saarbruecken, Germany</affiliation>
</editor>
<biblScope>
<date>2006</date>
</biblScope>
<idno type="pISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="seriesId">558</idno>
</series>
<idno type="istex">F7E92D83042FD1B63D36DE5B5BF79368F41831EA</idno>
<idno type="DOI">10.1007/11735106_36</idno>
<idno type="ChapterID">36</idno>
<idno type="ChapterID">Chap36</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2006</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>Abstract: Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).</p>
</abstract>
<textClass>
<keywords scheme="Book Subject Collection">
<list>
<label>SUCO11645</label>
<item>
<term>Computer Science</term>
</item>
</list>
</keywords>
</textClass>
<textClass>
<keywords scheme="Book Subject Group">
<list>
<label>I</label>
<label>I18032</label>
<label>I18024</label>
<label>I21017</label>
<label>I18040</label>
<label>I18059</label>
<label>I21033</label>
<item>
<term>Computer Science</term>
</item>
<item>
<term>Information Storage and Retrieval</term>
</item>
<item>
<term>Database Management</term>
</item>
<item>
<term>Artificial Intelligence (incl. Robotics)</term>
</item>
<item>
<term>Information Systems Applications (incl. Internet)</term>
</item>
<item>
<term>Multimedia Information Systems</term>
</item>
<item>
<term>Document Preparation and Text Processing</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="2006">Published</change>
<change xml:id="refBibs-istex" who="#ISTEX-API" when="2016-3-20">References added</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/F7E92D83042FD1B63D36DE5B5BF79368F41831EA/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="Springer, Publisher found" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//Springer-Verlag//DTD A++ V2.4//EN" URI="http://devel.springer.de/A++/V2.4/DTD/A++V2.4.dtd" name="istex:docType"></istex:docType>
<istex:document>
<Publisher>
<PublisherInfo>
<PublisherName>Springer Berlin Heidelberg</PublisherName>
<PublisherLocation>Berlin, Heidelberg</PublisherLocation>
</PublisherInfo>
<Series>
<SeriesInfo SeriesType="Series" TocLevels="0">
<SeriesID>558</SeriesID>
<SeriesPrintISSN>0302-9743</SeriesPrintISSN>
<SeriesElectronicISSN>1611-3349</SeriesElectronicISSN>
<SeriesTitle Language="En">Lecture Notes in Computer Science</SeriesTitle>
</SeriesInfo>
<SeriesHeader>
<EditorGroup>
<Editor AffiliationIDS="Aff1">
<EditorName DisplayOrder="Western">
<GivenName>David</GivenName>
<FamilyName>Hutchison</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff2">
<EditorName DisplayOrder="Western">
<GivenName>Takeo</GivenName>
<FamilyName>Kanade</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff3">
<EditorName DisplayOrder="Western">
<GivenName>Josef</GivenName>
<FamilyName>Kittler</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff4">
<EditorName DisplayOrder="Western">
<GivenName>Jon</GivenName>
<GivenName>M.</GivenName>
<FamilyName>Kleinberg</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff5">
<EditorName DisplayOrder="Western">
<GivenName>Friedemann</GivenName>
<FamilyName>Mattern</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff6">
<EditorName DisplayOrder="Western">
<GivenName>John</GivenName>
<GivenName>C.</GivenName>
<FamilyName>Mitchell</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff7">
<EditorName DisplayOrder="Western">
<GivenName>Moni</GivenName>
<FamilyName>Naor</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff8">
<EditorName DisplayOrder="Western">
<GivenName>Oscar</GivenName>
<FamilyName>Nierstrasz</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff9">
<EditorName DisplayOrder="Western">
<GivenName>C.</GivenName>
<FamilyName>Pandu Rangan</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff10">
<EditorName DisplayOrder="Western">
<GivenName>Bernhard</GivenName>
<FamilyName>Steffen</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff11">
<EditorName DisplayOrder="Western">
<GivenName>Madhu</GivenName>
<FamilyName>Sudan</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff12">
<EditorName DisplayOrder="Western">
<GivenName>Demetri</GivenName>
<FamilyName>Terzopoulos</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff13">
<EditorName DisplayOrder="Western">
<GivenName>Doug</GivenName>
<FamilyName>Tygar</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff14">
<EditorName DisplayOrder="Western">
<GivenName>Moshe</GivenName>
<GivenName>Y.</GivenName>
<FamilyName>Vardi</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff15">
<EditorName DisplayOrder="Western">
<GivenName>Gerhard</GivenName>
<FamilyName>Weikum</FamilyName>
</EditorName>
</Editor>
<Affiliation ID="Aff1">
<OrgName>Lancaster University</OrgName>
<OrgAddress>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff2">
<OrgName>Carnegie Mellon University</OrgName>
<OrgAddress>
<City>Pittsburgh</City>
<State>PA</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff3">
<OrgName>University of Surrey</OrgName>
<OrgAddress>
<City>Guildford</City>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff4">
<OrgName>Cornell University</OrgName>
<OrgAddress>
<City>Ithaca</City>
<State>NY</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff5">
<OrgName>ETH Zurich</OrgName>
<OrgAddress>
<Country>Switzerland</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff6">
<OrgName>Stanford University</OrgName>
<OrgAddress>
<City>CA</City>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff7">
<OrgName>Weizmann Institute of Science</OrgName>
<OrgAddress>
<City>Rehovot</City>
<Country>Israel</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff8">
<OrgName>University of Bern</OrgName>
<OrgAddress>
<Country>Switzerland</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff9">
<OrgName>Indian Institute of Technology</OrgName>
<OrgAddress>
<City>Madras</City>
<Country>India</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff10">
<OrgName>University of Dortmund</OrgName>
<OrgAddress>
<Country>Germany</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff11">
<OrgName>Massachusetts Institute of Technology</OrgName>
<OrgAddress>
<State>MA</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff12">
<OrgName>University of California</OrgName>
<OrgAddress>
<City>Los Angeles</City>
<State>CA</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff13">
<OrgName>University of California</OrgName>
<OrgAddress>
<City>Berkeley</City>
<State>CA</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff14">
<OrgName>Rice University</OrgName>
<OrgAddress>
<City>Houston</City>
<State>TX</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff15">
<OrgName>Max-Planck Institute of Computer Science</OrgName>
<OrgAddress>
<City>Saarbruecken</City>
<Country>Germany</Country>
</OrgAddress>
</Affiliation>
</EditorGroup>
</SeriesHeader>
<Book Language="En">
<BookInfo BookProductType="Proceedings" ContainsESM="No" Language="En" MediaType="eBook" NumberingDepth="2" NumberingStyle="ContentOnly" OutputMedium="All" TocLevels="0">
<BookID>978-3-540-33348-7</BookID>
<BookTitle>Advances in Information Retrieval</BookTitle>
<BookSubTitle>28th European Conference on IR Research, ECIR 2006, London, UK, April 10-12, 2006. Proceedings</BookSubTitle>
<BookVolumeNumber>3936</BookVolumeNumber>
<BookSequenceNumber>3936</BookSequenceNumber>
<BookDOI>10.1007/11735106</BookDOI>
<BookTitleID>138131</BookTitleID>
<BookPrintISBN>978-3-540-33347-0</BookPrintISBN>
<BookElectronicISBN>978-3-540-33348-7</BookElectronicISBN>
<BookChapterCount>69</BookChapterCount>
<BookCopyright>
<CopyrightHolderName>Springer Berlin Heidelberg</CopyrightHolderName>
<CopyrightYear>2006</CopyrightYear>
</BookCopyright>
<BookSubjectGroup>
<BookSubject Code="I" Type="Primary">Computer Science</BookSubject>
<BookSubject Code="I18032" Priority="1" Type="Secondary">Information Storage and Retrieval</BookSubject>
<BookSubject Code="I18024" Priority="2" Type="Secondary">Database Management</BookSubject>
<BookSubject Code="I21017" Priority="3" Type="Secondary">Artificial Intelligence (incl. Robotics)</BookSubject>
<BookSubject Code="I18040" Priority="4" Type="Secondary">Information Systems Applications (incl.Internet)</BookSubject>
<BookSubject Code="I18059" Priority="5" Type="Secondary">Multimedia Information Systems</BookSubject>
<BookSubject Code="I21033" Priority="6" Type="Secondary">Document Preparation and Text Processing</BookSubject>
<SubjectCollection Code="SUCO11645">Computer Science</SubjectCollection>
</BookSubjectGroup>
<BookContext>
<SeriesID>558</SeriesID>
</BookContext>
</BookInfo>
<BookHeader>
<EditorGroup>
<Editor AffiliationIDS="Aff16">
<EditorName DisplayOrder="Western">
<GivenName>Mounia</GivenName>
<FamilyName>Lalmas</FamilyName>
</EditorName>
<Contact>
<Email>mounia@dcs.qmul.ac.uk</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff17">
<EditorName DisplayOrder="Western">
<GivenName>Andy</GivenName>
<FamilyName>MacFarlane</FamilyName>
</EditorName>
<Contact>
<Email>andym@soi.city.ac.uk</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff18">
<EditorName DisplayOrder="Western">
<GivenName>Stefan</GivenName>
<FamilyName>Rüger</FamilyName>
</EditorName>
<Contact>
<Email>s.rueger@open.ac.uk</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff19">
<EditorName DisplayOrder="Western">
<GivenName>Anastasios</GivenName>
<FamilyName>Tombros</FamilyName>
</EditorName>
<Contact>
<Email>tassos@dcs.qmul.ac.uk</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff20">
<EditorName DisplayOrder="Western">
<GivenName>Theodora</GivenName>
<FamilyName>Tsikrika</FamilyName>
</EditorName>
<Contact>
<Email>theodora@dcs.qmul.ac.uk</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff21">
<EditorName DisplayOrder="Western">
<GivenName>Alexei</GivenName>
<FamilyName>Yavlinsky</FamilyName>
</EditorName>
<Contact>
<Email>alexei.yavlinsky@imperial.ac.uk</Email>
</Contact>
</Editor>
<Affiliation ID="Aff16">
<OrgDivision>Queen Mary</OrgDivision>
<OrgName>University of London</OrgName>
<OrgAddress>
<City>London</City>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff17">
<OrgDivision>Department of Information Science</OrgDivision>
<OrgName>City University</OrgName>
<OrgAddress>
<Street>Northampton Square</Street>
<Postcode>EC1V 0HB</Postcode>
<City>London</City>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff18">
<OrgDivision>Knowledge Media Institute</OrgDivision>
<OrgName>The Open University</OrgName>
<OrgAddress>
<Postcode>MK7 6AA</Postcode>
<City>Milton Keynes</City>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff19">
<OrgName>Queen Mary University of London</OrgName>
<OrgAddress>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff20">
<OrgName>CWI</OrgName>
<OrgAddress>
<City>Amsterdam</City>
<Country>The Netherlands</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff21">
<OrgDivision>Department of Computing</OrgDivision>
<OrgName>Imperial College London</OrgName>
<OrgAddress>
<Street>South Kensington Campus</Street>
<Postcode>SW7 2AZ</Postcode>
<City>London</City>
<Country>UK</Country>
</OrgAddress>
</Affiliation>
</EditorGroup>
</BookHeader>
<Part ID="Part12">
<PartInfo TocLevels="0">
<PartID>12</PartID>
<PartSequenceNumber>12</PartSequenceNumber>
<PartTitle>Cross-Language Retrieval</PartTitle>
<PartChapterCount>3</PartChapterCount>
<PartContext>
<SeriesID>558</SeriesID>
<BookTitle>Advances in Information Retrieval</BookTitle>
</PartContext>
</PartInfo>
<Chapter ID="Chap36" Language="En">
<ChapterInfo ChapterType="OriginalPaper" ContainsESM="No" NumberingDepth="2" NumberingStyle="ContentOnly" TocLevels="0">
<ChapterID>36</ChapterID>
<ChapterDOI>10.1007/11735106_36</ChapterDOI>
<ChapterSequenceNumber>36</ChapterSequenceNumber>
<ChapterTitle Language="En">A Cross-Language Approach to Historic Document Retrieval</ChapterTitle>
<ChapterFirstPage>407</ChapterFirstPage>
<ChapterLastPage>419</ChapterLastPage>
<ChapterCopyright>
<CopyrightHolderName>Springer-Verlag Berlin Heidelberg</CopyrightHolderName>
<CopyrightYear>2006</CopyrightYear>
</ChapterCopyright>
<ChapterGrants Type="Regular">
<MetadataGrant Grant="OpenAccess"></MetadataGrant>
<AbstractGrant Grant="OpenAccess"></AbstractGrant>
<BodyPDFGrant Grant="Restricted"></BodyPDFGrant>
<BodyHTMLGrant Grant="Restricted"></BodyHTMLGrant>
<BibliographyGrant Grant="Restricted"></BibliographyGrant>
<ESMGrant Grant="Restricted"></ESMGrant>
</ChapterGrants>
<ChapterContext>
<SeriesID>558</SeriesID>
<PartID>12</PartID>
<BookID>978-3-540-33348-7</BookID>
<BookTitle>Advances in Information Retrieval</BookTitle>
</ChapterContext>
</ChapterInfo>
<ChapterHeader>
<AuthorGroup>
<Author AffiliationIDS="Aff22 Aff23">
<AuthorName DisplayOrder="Western">
<GivenName>Marijn</GivenName>
<FamilyName>Koolen</FamilyName>
</AuthorName>
</Author>
<Author AffiliationIDS="Aff22 Aff24">
<AuthorName DisplayOrder="Western">
<GivenName>Frans</GivenName>
<FamilyName>Adriaans</FamilyName>
</AuthorName>
</Author>
<Author AffiliationIDS="Aff22 Aff23">
<AuthorName DisplayOrder="Western">
<GivenName>Jaap</GivenName>
<FamilyName>Kamps</FamilyName>
</AuthorName>
</Author>
<Author AffiliationIDS="Aff22">
<AuthorName DisplayOrder="Western">
<GivenName>Maarten</GivenName>
<Particle>de</Particle>
<FamilyName>Rijke</FamilyName>
</AuthorName>
</Author>
<Affiliation ID="Aff22">
<OrgDivision>ISLA</OrgDivision>
<OrgName>University of Amsterdam</OrgName>
<OrgAddress>
<Country>The Netherlands</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff23">
<OrgDivision>Archives and Information Studies</OrgDivision>
<OrgName>University of Amsterdam</OrgName>
<OrgAddress>
<Country>The Netherlands</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff24">
<OrgDivision>Utrecht Institute of Linguistics OTS</OrgDivision>
<OrgName>Utrecht University</OrgName>
<OrgAddress>
<Country>The Netherlands</Country>
</OrgAddress>
</Affiliation>
</AuthorGroup>
<Abstract ID="Abs1" Language="En">
<Heading>Abstract</Heading>
<Para>Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).</Para>
</Abstract>
</ChapterHeader>
<NoBody></NoBody>
</Chapter>
</Part>
</Book>
</Series>
</Publisher>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo lang="en">
<title>A Cross-Language Approach to Historic Document Retrieval</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA" lang="en">
<title>A Cross-Language Approach to Historic Document Retrieval</title>
</titleInfo>
<name type="personal">
<namePart type="given">Marijn</namePart>
<namePart type="family">Koolen</namePart>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Frans</namePart>
<namePart type="family">Adriaans</namePart>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<affiliation>Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jaap</namePart>
<namePart type="family">Kamps</namePart>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<affiliation>Archives and Information Studies, University of Amsterdam, The Netherlands</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maarten</namePart>
<namePart type="family">de Rijke</namePart>
<affiliation>ISLA, University of Amsterdam, The Netherlands</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="conference [eBooks]" displayLabel="OriginalPaper"></genre>
<originInfo>
<publisher>Springer Berlin Heidelberg</publisher>
<place>
<placeTerm type="text">Berlin, Heidelberg</placeTerm>
</place>
<dateIssued encoding="w3cdtf">2006</dateIssued>
<copyrightDate encoding="w3cdtf">2006</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
<physicalDescription>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract lang="en">Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).</abstract>
<relatedItem type="host">
<titleInfo>
<title>Advances in Information Retrieval</title>
<subTitle>28th European Conference on IR Research, ECIR 2006, London, UK, April 10-12, 2006. Proceedings</subTitle>
</titleInfo>
<name type="personal">
<namePart type="given">Mounia</namePart>
<namePart type="family">Lalmas</namePart>
<affiliation>Queen Mary, University of London, London, UK</affiliation>
<affiliation>E-mail: mounia@dcs.qmul.ac.uk</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Andy</namePart>
<namePart type="family">MacFarlane</namePart>
<affiliation>Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK</affiliation>
<affiliation>E-mail: andym@soi.city.ac.uk</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Stefan</namePart>
<namePart type="family">Rüger</namePart>
<affiliation>Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK</affiliation>
<affiliation>E-mail: s.rueger@open.ac.uk</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Anastasios</namePart>
<namePart type="family">Tombros</namePart>
<affiliation>Queen Mary University of London, UK</affiliation>
<affiliation>E-mail: tassos@dcs.qmul.ac.uk</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Theodora</namePart>
<namePart type="family">Tsikrika</namePart>
<affiliation>CWI, Amsterdam, The Netherlands</affiliation>
<affiliation>E-mail: theodora@dcs.qmul.ac.uk</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alexei</namePart>
<namePart type="family">Yavlinsky</namePart>
<affiliation>Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK</affiliation>
<affiliation>E-mail: alexei.yavlinsky@imperial.ac.uk</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<genre type="Book Series" displayLabel="Proceedings"></genre>
<originInfo>
<copyrightDate encoding="w3cdtf">2006</copyrightDate>
<issuance>monographic</issuance>
</originInfo>
<subject>
<genre>Book Subject Collection</genre>
<topic authority="SpringerSubjectCodes" authorityURI="SUCO11645">Computer Science</topic>
</subject>
<subject>
<genre>Book Subject Group</genre>
<topic authority="SpringerSubjectCodes" authorityURI="I">Computer Science</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I18032">Information Storage and Retrieval</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I18024">Database Management</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I21017">Artificial Intelligence (incl. Robotics)</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I18040">Information Systems Applications (incl.Internet)</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I18059">Multimedia Information Systems</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I21033">Document Preparation and Text Processing</topic>
</subject>
<identifier type="DOI">10.1007/11735106</identifier>
<identifier type="ISBN">978-3-540-33347-0</identifier>
<identifier type="eISBN">978-3-540-33348-7</identifier>
<identifier type="ISSN">0302-9743</identifier>
<identifier type="eISSN">1611-3349</identifier>
<identifier type="BookTitleID">138131</identifier>
<identifier type="BookID">978-3-540-33348-7</identifier>
<identifier type="BookChapterCount">69</identifier>
<identifier type="BookVolumeNumber">3936</identifier>
<identifier type="BookSequenceNumber">3936</identifier>
<identifier type="PartChapterCount">3</identifier>
<part>
<date>2006</date>
<detail type="part">
<title>Cross-Language Retrieval</title>
</detail>
<detail type="volume">
<number>3936</number>
<caption>vol.</caption>
</detail>
<extent unit="pages">
<start>407</start>
<end>419</end>
</extent>
</part>
<recordInfo>
<recordOrigin>Springer Berlin Heidelberg, 2006</recordOrigin>
</recordInfo>
</relatedItem>
<relatedItem type="series">
<titleInfo>
<title>Lecture Notes in Computer Science</title>
</titleInfo>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Hutchison</namePart>
<affiliation>Lancaster University, UK</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Takeo</namePart>
<namePart type="family">Kanade</namePart>
<affiliation>Carnegie Mellon University, Pittsburgh, PA, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Josef</namePart>
<namePart type="family">Kittler</namePart>
<affiliation>University of Surrey, Guildford, UK</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jon</namePart>
<namePart type="given">M.</namePart>
<namePart type="family">Kleinberg</namePart>
<affiliation>Cornell University, Ithaca, NY, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Friedemann</namePart>
<namePart type="family">Mattern</namePart>
<affiliation>ETH Zurich, Switzerland</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">John</namePart>
<namePart type="given">C.</namePart>
<namePart type="family">Mitchell</namePart>
<affiliation>Stanford University, CA, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Moni</namePart>
<namePart type="family">Naor</namePart>
<affiliation>Weizmann Institute of Science, Rehovot, Israel</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Oscar</namePart>
<namePart type="family">Nierstrasz</namePart>
<affiliation>University of Bern, Switzerland</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">C.</namePart>
<namePart type="family">Pandu Rangan</namePart>
<affiliation>Indian Institute of Technology, Madras, India</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bernhard</namePart>
<namePart type="family">Steffen</namePart>
<affiliation>University of Dortmund, Germany</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Madhu</namePart>
<namePart type="family">Sudan</namePart>
<affiliation>Massachusetts Institute of Technology, MA, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Demetri</namePart>
<namePart type="family">Terzopoulos</namePart>
<affiliation>University of California, Los Angeles, CA, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Doug</namePart>
<namePart type="family">Tygar</namePart>
<affiliation>University of California, Berkeley, CA, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Moshe</namePart>
<namePart type="given">Y.</namePart>
<namePart type="family">Vardi</namePart>
<affiliation>Rice University, Houston, TX, USA</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gerhard</namePart>
<namePart type="family">Weikum</namePart>
<affiliation>Max-Planck Institute of Computer Science, Saarbruecken, Germany</affiliation>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<copyrightDate encoding="w3cdtf">2006</copyrightDate>
<issuance>serial</issuance>
</originInfo>
<identifier type="ISSN">0302-9743</identifier>
<identifier type="eISSN">1611-3349</identifier>
<identifier type="SeriesID">558</identifier>
<recordInfo>
<recordOrigin>Springer Berlin Heidelberg, 2006</recordOrigin>
</recordInfo>
</relatedItem>
<identifier type="istex">F7E92D83042FD1B63D36DE5B5BF79368F41831EA</identifier>
<identifier type="DOI">10.1007/11735106_36</identifier>
<identifier type="ChapterID">36</identifier>
<identifier type="ChapterID">Chap36</identifier>
<accessCondition type="use and reproduction" contentType="copyright">Springer Berlin Heidelberg, 2006</accessCondition>
<recordInfo>
<recordContentSource>SPRINGER</recordContentSource>
<recordOrigin>Springer-Verlag Berlin Heidelberg, 2006</recordOrigin>
</recordInfo>
</mods>
</metadata>
<enrichments>
<istex:refBibTEI uri="https://api.istex.fr/document/F7E92D83042FD1B63D36DE5B5BF79368F41831EA/enrichments/refBib">
<teiHeader></teiHeader>
<text>
<front></front>
<body></body>
<back>
<listBibl>
<biblStruct xml:id="b0">
<monogr>
<title level="m" type="main">Brabants Recht. Costumen van Antwerpen</title>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b1">
<analytic>
<title level="a" type="main">Cross-language evaluation forum: Objectives, results, achievements</title>
<author>
<persName>
<forename type="first">M</forename>
<surname>Braschler</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">C</forename>
<surname>Peters</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Information Retrieval</title>
<imprint>
<biblScope unit="volume">7</biblScope>
<biblScope unit="page" from="7" to="31"></biblScope>
<date type="published" when="2004"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b2">
<monogr>
<title level="m" type="main">Information retrieval from Dutch historical corpora</title>
<author>
<persName>
<forename type="first">L</forename>
<surname>Braun</surname>
</persName>
</author>
<imprint>
<date type="published" when="2002"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b3">
<monogr>
<title level="m" type="main">Cross language evaluation forum</title>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b4">
<analytic>
<title level="a" type="main">Overview of the TREC 2004 web track</title>
<author>
<persName>
<forename type="first">N</forename>
<surname>Craswell</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">D</forename>
<surname>Hawking</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="m">The Thirteenth Text REtrieval Conference</title>
<imprint>
<publisher>National Institute for Standards and Technology</publisher>
<date type="published" when="2004"></date>
<biblScope unit="page" from="500" to="251"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b5">
<monogr>
<title level="m" type="main">Digitale bibliotheek voor de Nederlandse letteren</title>
<author>
<persName>
<surname>DBNL</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b6">
<monogr>
<title level="m" type="main">Technology challenges for digital culture</title>
<author>
<persName>
<surname>Digicult</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b7">
<analytic>
<title level="a" type="main">Bootstrap methods: Another look at the jackknife</title>
<author>
<persName>
<forename type="first">B</forename>
<surname>Efron</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Annals of Statistics</title>
<imprint>
<biblScope unit="volume">7</biblScope>
<biblScope unit="page" from="1" to="26"></biblScope>
<date type="published" when="1979"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b8">
<monogr>
<title level="m" type="main">Gelders Recht. Gelders Land- en Stadsrecht</title>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b9">
<analytic>
<title level="a" type="main">Monolingual document retrieval for European languages</title>
<author>
<persName>
<forename type="first">V</forename>
<surname>Hollink</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">J</forename>
<surname>Kamps</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">C</forename>
<surname>Monz</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">M</forename>
<surname>De Rijke</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Information Retrieval</title>
<imprint>
<biblScope unit="volume">7</biblScope>
<biblScope unit="page" from="33" to="52"></biblScope>
<date type="published" when="2004"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b10">
<monogr>
<title level="m" type="main">Geschiedenis van het Nederlands</title>
<author>
<persName>
<forename type="first">M</forename>
<surname>Hüning</surname>
</persName>
</author>
<imprint>
<date type="published" when="1996"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b11">
<analytic>
<title level="a" type="main">Techniques for automatically correcting words in text</title>
<author>
<persName>
<forename type="first">K</forename>
<surname>Kukich</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">ACM Computing Surveys</title>
<imprint>
<biblScope unit="volume">24</biblScope>
<biblScope unit="page" from="377" to="439"></biblScope>
<date type="published" when="1992"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b12">
<monogr>
<title level="m" type="main">Understanding Digital Libraries. The Morgan Kaufmann series in multimedia information and systems</title>
<author>
<persName>
<forename type="first">M</forename>
<surname>Lesk</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
<publisher>Morgan Kaufmann</publisher>
</imprint>
</monogr>
<note>second edition</note>
</biblStruct>
<biblStruct xml:id="b13">
<monogr>
<title level="m" type="main">The Lucene search engine</title>
<author>
<persName>
<surname>Lucene</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b14">
<monogr>
<title level="m" type="main">Text-to-speech for Dutch</title>
<author>
<persName>
<surname>Nextens</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b15">
<analytic>
<title level="a" type="main">Word variant identification in Old French</title>
<author>
<persName>
<forename type="first">A</forename>
<forename type="middle">J</forename>
<surname>O'Rourke</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">A</forename>
<forename type="middle">M</forename>
<surname>Robertson</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">P</forename>
<surname>Willett</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">P</forename>
<surname>Eley</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">P</forename>
<surname>Simons</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Information Research</title>
<imprint>
<biblScope unit="volume">2</biblScope>
<date type="published" when="1996"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b16">
<analytic>
<title level="a" type="main">Searching for historical word-forms in a database of 17th-century English text using spelling-correction methods</title>
<author>
<persName>
<forename type="first">A</forename>
<forename type="middle">M</forename>
<surname>Robertson</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">P</forename>
<surname>Willett</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="m">Proceedings ACM SIGIR '92</title>
<meeting>ACM SIGIR '92
<address>
<addrLine>New York, NY, USA</addrLine>
</address>
</meeting>
<imprint>
<publisher>ACM Press</publisher>
<date type="published" when="1992"></date>
<biblScope unit="page" from="256" to="265"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b17">
<analytic>
<title level="a" type="main">Searching for historical word forms in text databases using spelling-correction methods</title>
<author>
<persName>
<forename type="first">H</forename>
<forename type="middle">J</forename>
<surname>Rogers</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">P</forename>
<surname>Willett</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Journal of Documentation</title>
<imprint>
<biblScope unit="volume">47</biblScope>
<biblScope unit="page" from="333" to="353"></biblScope>
<date type="published" when="1991"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b18">
<monogr>
<title level="m" type="main">Specification of Letters</title>
<author>
<persName>
<forename type="first">R</forename>
<forename type="middle">C</forename>
<surname>Russell</surname>
</persName>
</author>
<imprint>
<date type="published" when="1918"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b19">
<monogr>
<title level="m" type="main">Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison</title>
<author>
<persName>
<forename type="first">D</forename>
<surname>Sankoff</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">J</forename>
<surname>Kruskal</surname>
</persName>
</author>
<imprint>
<date type="published" when="1983"></date>
<publisher>Addison-Wesley Publishing Co</publisher>
<pubPlace>Reading, Massachusetts, USA</pubPlace>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b20">
<analytic>
<title level="a" type="main">Statistical inference in retrieval effectiveness evaluation</title>
<author>
<persName>
<forename type="first">J</forename>
<surname>Savoy</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Information Processing and Management</title>
<imprint>
<biblScope unit="volume">33</biblScope>
<biblScope unit="page" from="495" to="512"></biblScope>
<date type="published" when="1997"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b21">
<analytic>
<title level="a" type="main">Combining multiple strategies for effective monolingual and cross-language retrieval</title>
<author>
<persName>
<forename type="first">J</forename>
<surname>Savoy</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Information Retrieval</title>
<imprint>
<biblScope unit="volume">7</biblScope>
<biblScope unit="page" from="121" to="148"></biblScope>
<date type="published" when="2004"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b22">
<monogr>
<title level="m" type="main">A language for stemming algorithms</title>
<author>
<persName>
<surname>Snowball</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b23">
<analytic>
<title level="a" type="main">The string-to-string correction problem</title>
<author>
<persName>
<forename type="first">R</forename>
<forename type="middle">A</forename>
<surname>Wagner</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">M</forename>
<forename type="middle">J</forename>
<surname>Fischer</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Journal of the ACM</title>
<imprint>
<biblScope unit="volume">21</biblScope>
<biblScope unit="page" from="168" to="173"></biblScope>
<date type="published" when="1974"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b24">
<monogr>
<title level="m" type="main">Indo-European languages</title>
<author>
<persName>
<surname>Wikipedia</surname>
</persName>
</author>
<imprint>
<date type="published" when="2005"></date>
</imprint>
</monogr>
</biblStruct>
</listBibl>
</back>
</text>
</istex:refBibTEI>
</enrichments>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D24 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000D24 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:F7E92D83042FD1B63D36DE5B5BF79368F41831EA
   |texte=   A Cross-Language Approach to Historic Document Retrieval
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024