Exploration server for the University of Trier

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

ScienceTreks: an autonomous digital library system

Internal identifier: 001B89 (Istex/Corpus); previous: 001B88; next: 001B90

ScienceTreks: an autonomous digital library system

Authors: A. R. D. Prasad; Alexander Ivanyukovich; Maurizio Marchese; Fausto Giunchiglia

Source:

RBID : ISTEX:8DB9DDC074E1353C99D633F07A936EE6057B2B99

Abstract

Purpose: The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content. Design/methodology/approach: The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed. Findings: The proposed pipeline is implemented in a working prototype of an autonomous digital library (ADL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources and is solely based on the existing information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline. Practical implications: The proposed ADL system can be used in automating end-to-end information retrieval and processing, supporting the control and elimination of error-prone human intervention in the process. Originality/value: High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value-added services within existing and future semantic-enabled digital library systems.

Url:
DOI: 10.1108/14684520810897368

Links to Exploration step

ISTEX:8DB9DDC074E1353C99D633F07A936EE6057B2B99

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">ScienceTreks: an autonomous digital library system</title>
<author wicri:is="90%">
<name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
</author>
<author>
<name sortKey="Ivanyukovich, Alexander" sort="Ivanyukovich, Alexander" uniqKey="Ivanyukovich A" first="Alexander" last="Ivanyukovich">Alexander Ivanyukovich</name>
<affiliation>
<mods:affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Marchese, Maurizio" sort="Marchese, Maurizio" uniqKey="Marchese M" first="Maurizio" last="Marchese">Maurizio Marchese</name>
<affiliation>
<mods:affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Giunchiglia, Fausto" sort="Giunchiglia, Fausto" uniqKey="Giunchiglia F" first="Fausto" last="Giunchiglia">Fausto Giunchiglia</name>
<affiliation>
<mods:affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:8DB9DDC074E1353C99D633F07A936EE6057B2B99</idno>
<date when="2008" year="2008">2008</date>
<idno type="doi">10.1108/14684520810897368</idno>
<idno type="url">https://api.istex.fr/document/8DB9DDC074E1353C99D633F07A936EE6057B2B99/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001B89</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001B89</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">ScienceTreks: an autonomous digital library system</title>
<author wicri:is="90%">
<name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
</author>
<author>
<name sortKey="Ivanyukovich, Alexander" sort="Ivanyukovich, Alexander" uniqKey="Ivanyukovich A" first="Alexander" last="Ivanyukovich">Alexander Ivanyukovich</name>
<affiliation>
<mods:affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Marchese, Maurizio" sort="Marchese, Maurizio" uniqKey="Marchese M" first="Maurizio" last="Marchese">Maurizio Marchese</name>
<affiliation>
<mods:affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Giunchiglia, Fausto" sort="Giunchiglia, Fausto" uniqKey="Giunchiglia F" first="Fausto" last="Giunchiglia">Fausto Giunchiglia</name>
<affiliation>
<mods:affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Online Information Review</title>
<idno type="ISSN">1468-4527</idno>
<imprint>
<publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2008-08-08">2008-08-08</date>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="488">488</biblScope>
<biblScope unit="page" to="499">499</biblScope>
</imprint>
<idno type="ISSN">1468-4527</idno>
</series>
<idno type="istex">8DB9DDC074E1353C99D633F07A936EE6057B2B99</idno>
<idno type="DOI">10.1108/14684520810897368</idno>
<idno type="filenameID">2640320403</idno>
<idno type="original-pdf">2640320403.pdf</idno>
<idno type="href">14684520810897368.pdf</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">1468-4527</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract">Purpose: The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content. Design/methodology/approach: The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed. Findings: The proposed pipeline is implemented in a working prototype of an autonomous digital library (ADL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources and is solely based on the existing information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline. Practical implications: The proposed ADL system can be used in automating end-to-end information retrieval and processing, supporting the control and elimination of error-prone human intervention in the process. Originality/value: High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value-added services within existing and future semantic-enabled digital library systems.</div>
</front>
</TEI>
<istex>
<corpusName>emerald</corpusName>
<editor>
<json:item>
<name>A.R.D. Prasad</name>
</json:item>
</editor>
<author>
<json:item>
<name>A.R.D. Prasad</name>
</json:item>
<json:item>
<name>Alexander Ivanyukovich</name>
<affiliations>
<json:string>Department of Information and Communication Technology, University of Trento, Trento, Italy</json:string>
</affiliations>
</json:item>
<json:item>
<name>Maurizio Marchese</name>
<affiliations>
<json:string>Department of Information and Communication Technology, University of Trento, Trento, Italy</json:string>
</affiliations>
</json:item>
<json:item>
<name>Fausto Giunchiglia</name>
<affiliations>
<json:string>Department of Information and Communication Technology, University of Trento, Trento, Italy</json:string>
</affiliations>
</json:item>
</author>
<subject>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Digital libraries</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Information retrieval</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Library systems</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Automation</value>
</json:item>
</subject>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>research-article</json:string>
</originalGenre>
<abstract>Purpose: The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content. Design/methodology/approach: The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed. Findings: The proposed pipeline is implemented in a working prototype of an autonomous digital library (ADL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources and is solely based on the existing information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline. Practical implications: The proposed ADL system can be used in automating end-to-end information retrieval and processing, supporting the control and elimination of error-prone human intervention in the process. Originality/value: High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value-added services within existing and future semantic-enabled digital library systems.</abstract>
<qualityIndicators>
<score>9.307</score>
<pdfVersion>1.3</pdfVersion>
<pdfPageSize>519 x 680 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<keywordCount>4</keywordCount>
<abstractCharCount>1495</abstractCharCount>
<pdfWordCount>4763</pdfWordCount>
<pdfCharCount>32055</pdfCharCount>
<pdfPageCount>12</pdfPageCount>
<abstractWordCount>212</abstractWordCount>
</qualityIndicators>
<title>ScienceTreks: an autonomous digital library system</title>
<refBibs>
<json:item>
<author>
<json:item>
<name>S. Brin</name>
</json:item>
<json:item>
<name>L. Page</name>
</json:item>
</author>
<host>
<volume>30</volume>
<issue>17</issue>
<author></author>
<title>Proceedings of the 7th World Wide Web Conference, Computer Networks and ISDN Systems</title>
</host>
<title>The anatomy of a large-scale hypertextual web search engine</title>
</json:item>
<json:item>
<author>
<json:item>
<name>J. Cho</name>
</json:item>
<json:item>
<name>H. Garcia-Molina</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the WWW2002, Honolulu, Hawaii, 7-11 May</title>
</host>
<title>Parallel crawlers</title>
</json:item>
<json:item>
<author>
<json:item>
<name>J.G. Conrad</name>
</json:item>
<json:item>
<name>C.P. Schriber</name>
</json:item>
</author>
<host>
<volume>57</volume>
<issue>7</issue>
<author></author>
<title>Journal of the American Society for Information Science and Technology</title>
</host>
<title>Managing déjà vu: collection building for the identification of nonidentical duplicate documents</title>
</json:item>
<json:item>
<author>
<json:item>
<name>J. Cordy</name>
</json:item>
</author>
<host>
<volume>110</volume>
<pages>
<last>31</last>
<first>3</first>
</pages>
<author></author>
<title>Proceedings of 4th International Workshop on Language Descriptions, Tools and Applications. Electronic Notes in Theoretical Computer Science</title>
</host>
<title>TXL - a language for programming language tools and applications</title>
</json:item>
<json:item>
<author>
<json:item>
<name>M. Diligenti</name>
</json:item>
<json:item>
<name>F.M. Coetzee</name>
</json:item>
<json:item>
<name>S. Lawrence</name>
</json:item>
<json:item>
<name>C.L. Giles</name>
</json:item>
<json:item>
<name>M. Gori</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 26th International Conference on Very Large Data Bases</title>
</host>
<title>Focused crawling using context graphs</title>
</json:item>
<json:item>
<author>
<json:item>
<name>C.L. Giles</name>
</json:item>
<json:item>
<name>K.D. Bollacker</name>
</json:item>
<json:item>
<name>S. Lawrence</name>
</json:item>
</author>
<host>
<pages>
<last>98</last>
<first>89</first>
</pages>
<author></author>
<title>Proceedings of the 3rd ACM Conference on Digital Libraries</title>
</host>
<title>CiteSeer: an automatic citation indexing system</title>
</json:item>
<json:item>
<author>
<json:item>
<name>H. Han</name>
</json:item>
<json:item>
<name>C.L. Giles</name>
</json:item>
<json:item>
<name>E. Manavoglu</name>
</json:item>
<json:item>
<name>H. Zha</name>
</json:item>
<json:item>
<name>Z. Zhang</name>
</json:item>
<json:item>
<name>E.A. Fox</name>
</json:item>
</author>
<host>
<pages>
<last>48</last>
<first>37</first>
</pages>
<author></author>
<title>Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries</title>
</host>
<title>Automatic document metadata extraction using support vector machines</title>
</json:item>
<json:item>
<author>
<json:item>
<name>B.P. Heath</name>
</json:item>
<json:item>
<name>D.J. McArthur</name>
</json:item>
<json:item>
<name>M.K. McClelland</name>
</json:item>
<json:item>
<name>R.J. Vetter</name>
</json:item>
</author>
<host>
<volume>49</volume>
<pages>
<last>74</last>
<first>68</first>
</pages>
<issue>7</issue>
<author></author>
<title>Communications of the ACM</title>
</host>
<title>Metadata lessons from the iLumina digital library</title>
</json:item>
<json:item>
<author>
<json:item>
<name>Y. Ioannidis</name>
</json:item>
<json:item>
<name>D. Maier</name>
</json:item>
<json:item>
<name>S. Abiteboul</name>
</json:item>
<json:item>
<name>P. Buneman</name>
</json:item>
<json:item>
<name>S. Davidson</name>
</json:item>
<json:item>
<name>E. Fox</name>
</json:item>
<json:item>
<name>A. Halevy</name>
</json:item>
<json:item>
<name>C. Knoblock</name>
</json:item>
<json:item>
<name>F. Rabitti</name>
</json:item>
<json:item>
<name>H. Schek</name>
</json:item>
<json:item>
<name>G. Weikum</name>
</json:item>
</author>
<host>
<volume>5</volume>
<issue>4</issue>
<author></author>
<title>International Journal on Digital Libraries</title>
</host>
<title>Digital library information-technology infrastructures</title>
</json:item>
<json:item>
<author>
<json:item>
<name>A. Ivanyukovich</name>
</json:item>
<json:item>
<name>M. Marchese</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 1st International Conference on Multidisciplinary Information Sciences and Technologies, InScit2006, Merida, Spain</title>
</host>
<title>Unsupervised free-text processing and structuring in digital archives</title>
</json:item>
<json:item>
<author>
<json:item>
<name>A. Ivanyukovich</name>
</json:item>
<json:item>
<name>M. Marchese</name>
</json:item>
</author>
<host>
<author></author>
<title>SWAP 2006 Semantic Web Applications and Perspectives, Proceedings of the 3rd Italian Semantic Web Workshop, Pisa, Italy</title>
</host>
<title>Unsupervised metadata extraction in scientific digital libraries using a-priori domain-specific knowledge</title>
</json:item>
<json:item>
<author>
<json:item>
<name>A. Ivanyukovich</name>
</json:item>
<json:item>
<name>M. Marchese</name>
</json:item>
<json:item>
<name>P. Reuther</name>
</json:item>
</author>
<host>
<volume>4675</volume>
<author></author>
<title>Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science</title>
</host>
<title>Assessing quality dynamics in unsupervised metadata extraction for digital libraries</title>
</json:item>
<json:item>
<author>
<json:item>
<name>N. Kiyavitskaya</name>
</json:item>
<json:item>
<name>N. Zeni</name>
</json:item>
<json:item>
<name>J.R. Cordy</name>
</json:item>
<json:item>
<name>L. Mich</name>
</json:item>
<json:item>
<name>J. Mylopoulos</name>
</json:item>
</author>
<host>
<author></author>
<title>Advanced Information Systems Engineering 18th International Conference, CAiSE 2006, Luxembourg, Luxembourg, Proceedings, Lecture Notes in Computer Science</title>
</host>
<title>Semiautomatic semantic annotations for next generation information systems</title>
</json:item>
<json:item>
<author>
<json:item>
<name>S. Klink</name>
</json:item>
<json:item>
<name>P. Reuther</name>
</json:item>
<json:item>
<name>A. Weber</name>
</json:item>
<json:item>
<name>B. Walter</name>
</json:item>
<json:item>
<name>M. Ley</name>
</json:item>
</author>
<host>
<volume>4080</volume>
<author></author>
<title>Database and Expert Systems Applications, 17th International Conference, DEXA 2006, Kraków, Poland, Proceedings, Lecture Notes in Computer Science</title>
</host>
<title>Analysing social networks within bibliographical data</title>
</json:item>
<json:item>
<author>
<json:item>
<name>S.R. Kruk</name>
</json:item>
<json:item>
<name>S. Decker</name>
</json:item>
<json:item>
<name>L. Zieborak</name>
</json:item>
</author>
<host>
<volume>3588</volume>
<author></author>
<title>Database and Expert Systems Applications, Lecture Notes in Computer Science</title>
</host>
<title>JeromeDL - adding semantic web technologies to digital libraries</title>
</json:item>
<json:item>
<author>
<json:item>
<name>C. Lagoze</name>
</json:item>
<json:item>
<name>D. Krafft</name>
</json:item>
<json:item>
<name>T. Cornwell</name>
</json:item>
<json:item>
<name>N. Dushay</name>
</json:item>
<json:item>
<name>D. Eckstrom</name>
</json:item>
<json:item>
<name>J. Saylor</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries</title>
</host>
<title>Metadata aggregation and automated digital libraries: a retrospective on the NSDL experience</title>
</json:item>
<json:item>
<author>
<json:item>
<name>M. Ley</name>
</json:item>
<json:item>
<name>P. Reuther</name>
</json:item>
</author>
<host>
<volume>RNTI-E-6</volume>
<pages>
<last>10</last>
<first>5</first>
</pages>
<author></author>
<title>Extraction et Gestion des Connaissances EGC'2006, Revue des Nouvelles Technologies de l'Information</title>
</host>
<title>Maintaining an online bibliographical database: the problem of data quality</title>
</json:item>
<json:item>
<author>
<json:item>
<name>M.E.J. Newman</name>
</json:item>
</author>
<host>
<volume>64</volume>
<author></author>
<title>Physical Review E</title>
</host>
<title>Scientific collaboration networks. I. Network construction and fundamental results</title>
</json:item>
<json:item>
<author>
<json:item>
<name>E.C.M. Noyons</name>
</json:item>
<json:item>
<name>H.F. Moed</name>
</json:item>
<json:item>
<name>M. Luwel</name>
</json:item>
</author>
<host>
<volume>50</volume>
<issue>2</issue>
<author></author>
<title>Journal of the American Society for Information Science</title>
</host>
<title>Combining mapping and citation analysis for evaluative bibliometric purposes: a bibliometric study</title>
</json:item>
<json:item>
<author>
<json:item>
<name>L. Peshkin</name>
</json:item>
<json:item>
<name>A. Pfeffer</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico</title>
</host>
<title>Bayesian information extraction network</title>
</json:item>
<json:item>
<author>
<json:item>
<name>Y. Petinot</name>
</json:item>
<json:item>
<name>C.L. Giles</name>
</json:item>
<json:item>
<name>V. Bhatnagar</name>
</json:item>
<json:item>
<name>P.B. Teregowda</name>
</json:item>
<json:item>
<name>H. Han</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries</title>
</host>
<title>Enabling interoperability for autonomous digital libraries: an API to CiteSeer services</title>
</json:item>
<json:item>
<author>
<json:item>
<name>Y. Petinot</name>
</json:item>
<json:item>
<name>C.L. Giles</name>
</json:item>
<json:item>
<name>V. Bhatnagar</name>
</json:item>
<json:item>
<name>P.B. Teregowda</name>
</json:item>
<json:item>
<name>H. Han</name>
</json:item>
<json:item>
<name>I. Councill</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 13th ACM International Conference on Information and Knowledge Management</title>
</host>
<title>CiteSeer-API: towards seamless resource location and interlinking for digital libraries</title>
</json:item>
<json:item>
<author>
<json:item>
<name>P. Reuther</name>
</json:item>
<json:item>
<name>B. Walter</name>
</json:item>
</author>
<host>
<volume>1</volume>
<pages>
<last>99</last>
<first>89</first>
</pages>
<issue>2</issue>
<author></author>
<title>International Journal of Metadata, Semantics and Ontologies</title>
</host>
<title>Survey on test collections and techniques for personal name matching</title>
</json:item>
<json:item>
<author>
<json:item>
<name>G. Salton</name>
</json:item>
<json:item>
<name>A. Singhal</name>
</json:item>
<json:item>
<name>M. Mitra</name>
</json:item>
<json:item>
<name>C. Buckley</name>
</json:item>
</author>
<host>
<volume>33</volume>
<pages>
<last>207</last>
<first>193</first>
</pages>
<issue>2</issue>
<author></author>
<title>Information Processing &amp; Management</title>
</host>
<title>Automatic text structuring and summarization</title>
</json:item>
<json:item>
<author>
<json:item>
<name>H. Suleman</name>
</json:item>
<json:item>
<name>E.A. Fox</name>
</json:item>
</author>
<host>
<volume>2458</volume>
<author></author>
<title>Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science</title>
</host>
<title>Designing protocols in support of digital library componentization</title>
</json:item>
<json:item>
<author>
<json:item>
<name>C. Tryfonopoulos</name>
</json:item>
<json:item>
<name>S. Idreos</name>
</json:item>
<json:item>
<name>M. Koubarakis</name>
</json:item>
</author>
<host>
<author></author>
<title>Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries ECDL 2005, Vienna, Austria</title>
</host>
<title>LibraRing: an architecture for distributed digital libraries based on DHT</title>
</json:item>
<json:item>
<author>
<json:item>
<name>A. van Raan</name>
</json:item>
</author>
<host>
<volume>38</volume>
<issue>1</issue>
<author></author>
<title>Scientometrics</title>
</host>
<title>Scientometrics: state-of-the-art</title>
</json:item>
<json:item>
<author>
<json:item>
<name>H. Yang</name>
</json:item>
<json:item>
<name>J. Callan</name>
</json:item>
<json:item>
<name>S. Shulman</name>
</json:item>
</author>
<host>
<volume>151</volume>
<author></author>
<title>Proceedings of the 2006 National Conference on Digital Government Research, ACM International Conference Proceeding Series</title>
</host>
<title>Next steps in near-duplicate detection for eRulemaking</title>
</json:item>
</refBibs>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<volume>32</volume>
<publisherId>
<json:string>oir</json:string>
</publisherId>
<pages>
<last>499</last>
<first>488</first>
</pages>
<issn>
<json:string>1468-4527</json:string>
</issn>
<issue>4</issue>
<subject>
<json:item>
<value>Information &amp; knowledge management</value>
</json:item>
<json:item>
<value>Information &amp; communications technology</value>
</json:item>
<json:item>
<value>Internet</value>
</json:item>
<json:item>
<value>Library &amp; information science</value>
</json:item>
<json:item>
<value>Collection building &amp; management</value>
</json:item>
<json:item>
<value>Information behaviour &amp; retrieval</value>
</json:item>
<json:item>
<value>Records management &amp; preservation</value>
</json:item>
<json:item>
<value>Bibliometrics</value>
</json:item>
<json:item>
<value>Databases</value>
</json:item>
<json:item>
<value>Document management</value>
</json:item>
</subject>
<genre>
<json:string>journal</json:string>
</genre>
<language>
<json:string>unknown</json:string>
</language>
<title>Online Information Review</title>
<doi>
<json:string>10.1108/oir</json:string>
</doi>
</host>
<categories>
<wos>
<json:string>social science</json:string>
<json:string>information science &amp; library science</json:string>
<json:string>science</json:string>
<json:string>computer science, information systems</json:string>
</wos>
<scienceMetrix>
<json:string>economic &amp; social sciences</json:string>
<json:string>social sciences</json:string>
<json:string>information &amp; library sciences</json:string>
</scienceMetrix>
</categories>
<publicationDate>2008</publicationDate>
<copyrightDate>2008</copyrightDate>
<doi>
<json:string>10.1108/14684520810897368</json:string>
</doi>
<id>8DB9DDC074E1353C99D633F07A936EE6057B2B99</id>
<score>0.02403277</score>
<fulltext>
<json:item>
<extension>pdf</extension>
<original>true</original>
<mimetype>application/pdf</mimetype>
<uri>https://api.istex.fr/document/8DB9DDC074E1353C99D633F07A936EE6057B2B99/fulltext/pdf</uri>
</json:item>
<json:item>
<extension>zip</extension>
<original>false</original>
<mimetype>application/zip</mimetype>
<uri>https://api.istex.fr/document/8DB9DDC074E1353C99D633F07A936EE6057B2B99/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/8DB9DDC074E1353C99D633F07A936EE6057B2B99/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a" type="main" xml:lang="en">ScienceTreks: an autonomous digital library system</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher>Emerald Group Publishing Limited</publisher>
<availability>
<p>© Emerald Group Publishing Limited</p>
</availability>
<date>2008</date>
</publicationStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a" type="main" xml:lang="en">ScienceTreks: an autonomous digital library system</title>
<author xml:id="author-1">
<persName>
<forename type="first">A.R.D.</forename>
<surname>Prasad</surname>
</persName>
</author>
<editor>
<persName>
<forename type="first">A.R.D.</forename>
<surname>Prasad</surname>
</persName>
</editor>
<author xml:id="author-3">
<persName>
<forename type="first">Alexander</forename>
<surname>Ivanyukovich</surname>
</persName>
<affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</affiliation>
</author>
<author xml:id="author-4">
<persName>
<forename type="first">Maurizio</forename>
<surname>Marchese</surname>
</persName>
<affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</affiliation>
</author>
<author xml:id="author-5">
<persName>
<forename type="first">Fausto</forename>
<surname>Giunchiglia</surname>
</persName>
<affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</affiliation>
</author>
</analytic>
<monogr>
<title level="j">Online Information Review</title>
<idno type="pISSN">1468-4527</idno>
<idno type="DOI">10.1108/oir</idno>
<imprint>
<publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2008-08-08"></date>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="488">488</biblScope>
<biblScope unit="page" to="499">499</biblScope>
</imprint>
</monogr>
<idno type="istex">8DB9DDC074E1353C99D633F07A936EE6057B2B99</idno>
<idno type="DOI">10.1108/14684520810897368</idno>
<idno type="filenameID">2640320403</idno>
<idno type="original-pdf">2640320403.pdf</idno>
<idno type="href">14684520810897368.pdf</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2008</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract>
<p>Purpose: The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content. Design/methodology/approach: The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed. Findings: The proposed pipeline is implemented in a working prototype of an autonomous digital library (ADL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources and is solely based on the existing information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline. Practical implications: The proposed ADL system can be used in automating end-to-end information retrieval and processing, supporting the control and elimination of error-prone human intervention in the process. Originality/value: High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value-added services within existing and future semantic-enabled digital library systems.</p>
</abstract>
<textClass>
<keywords scheme="keyword">
<list>
<head>keywords</head>
<item>
<term>Digital libraries</term>
</item>
<item>
<term>Information retrieval</term>
</item>
<item>
<term>Library systems</term>
</item>
<item>
<term>Automation</term>
</item>
</list>
</keywords>
</textClass>
<textClass>
<keywords scheme="Emerald Subject Group">
<list>
<label>cat-IKM</label>
<item>
<term>Information &amp; knowledge management</term>
</item>
<label>cat-ICT</label>
<item>
<term>Information &amp; communications technology</term>
</item>
<label>cat-INT</label>
<item>
<term>Internet</term>
</item>
</list>
</keywords>
</textClass>
<textClass>
<keywords scheme="Emerald Subject Group">
<list>
<label>cat-LISC</label>
<item>
<term>Library &amp; information science</term>
</item>
<label>cat-CBM</label>
<item>
<term>Collection building &amp; management</term>
</item>
<label>cat-IBRT</label>
<item>
<term>Information behaviour &amp; retrieval</term>
</item>
<label>cat-RMP</label>
<item>
<term>Records management &amp; preservation</term>
</item>
<label>cat-BIB</label>
<item>
<term>Bibliometrics</term>
</item>
<label>cat-DAT</label>
<item>
<term>Databases</term>
</item>
<label>cat-DOCM</label>
<item>
<term>Document management</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="2008-08-08">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<extension>txt</extension>
<original>false</original>
<mimetype>text/plain</mimetype>
<uri>https://api.istex.fr/document/8DB9DDC074E1353C99D633F07A936EE6057B2B99/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="corpus emerald not found" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:document><!-- Auto generated NISO JATS XML created by Atypon out of MCB DTD source files. Do Not Edit! -->
<article dtd-version="1.0" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">oir</journal-id>
<journal-id journal-id-type="doi">10.1108/oir</journal-id>
<journal-title-group>
<journal-title>Online Information Review</journal-title>
</journal-title-group>
<issn pub-type="ppub">1468-4527</issn>
<publisher>
<publisher-name>Emerald Group Publishing Limited</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.1108/14684520810897368</article-id>
<article-id pub-id-type="original-pdf">2640320403.pdf</article-id>
<article-id pub-id-type="filename">2640320403</article-id>
<article-categories>
<subj-group subj-group-type="type-of-publication">
<compound-subject>
<compound-subject-part content-type="code">research-article</compound-subject-part>
<compound-subject-part content-type="label">Research paper</compound-subject-part>
</compound-subject>
</subj-group>
<subj-group subj-group-type="subject">
<compound-subject>
<compound-subject-part content-type="code">cat-IKM</compound-subject-part>
<compound-subject-part content-type="label">Information & knowledge management</compound-subject-part>
</compound-subject>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-ICT</compound-subject-part>
<compound-subject-part content-type="label">Information & communications technology</compound-subject-part>
</compound-subject>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-INT</compound-subject-part>
<compound-subject-part content-type="label">Internet</compound-subject-part>
</compound-subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="subject">
<compound-subject>
<compound-subject-part content-type="code">cat-LISC</compound-subject-part>
<compound-subject-part content-type="label">Library & information science</compound-subject-part>
</compound-subject>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-CBM</compound-subject-part>
<compound-subject-part content-type="label">Collection building & management</compound-subject-part>
</compound-subject>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-BIB</compound-subject-part>
<compound-subject-part content-type="label">Bibliometrics</compound-subject-part>
</compound-subject>
<compound-subject>
<compound-subject-part content-type="code">cat-DAT</compound-subject-part>
<compound-subject-part content-type="label">Databases</compound-subject-part>
</compound-subject>
</subj-group>
</subj-group>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-IBRT</compound-subject-part>
<compound-subject-part content-type="label">Information behaviour & retrieval</compound-subject-part>
</compound-subject>
</subj-group>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-RMP</compound-subject-part>
<compound-subject-part content-type="label">Records management & preservation</compound-subject-part>
</compound-subject>
<subj-group>
<compound-subject>
<compound-subject-part content-type="code">cat-DOCM</compound-subject-part>
<compound-subject-part content-type="label">Document management</compound-subject-part>
</compound-subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>ScienceTreks: an autonomous digital library system</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="editor">
<string-name>
<given-names>A.R.D.</given-names>
<surname>Prasad</surname>
</string-name>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="author">
<string-name>
<given-names>Alexander</given-names>
<surname>Ivanyukovich</surname>
</string-name>
<aff>Department of Information and Communication Technology, University of Trento, Trento, Italy</aff>
</contrib>
<x></x>
<contrib contrib-type="author">
<string-name>
<given-names>Maurizio</given-names>
<surname>Marchese</surname>
</string-name>
<aff>Department of Information and Communication Technology, University of Trento, Trento, Italy</aff>
</contrib>
<x></x>
<contrib contrib-type="author">
<string-name>
<given-names>Fausto</given-names>
<surname>Giunchiglia</surname>
</string-name>
<aff>Department of Information and Communication Technology, University of Trento, Trento, Italy</aff>
</contrib>
</contrib-group>
<pub-date pub-type="ppub">
<day>08</day>
<month>08</month>
<year>2008</year>
</pub-date>
<volume>32</volume>
<issue>4</issue>
<issue-title>The Semantic Web and Web Design</issue-title>
<issue-title content-type="short">Semantic Web and Web Design</issue-title>
<fpage>488</fpage>
<lpage>499</lpage>
<permissions>
<copyright-statement>© Emerald Group Publishing Limited</copyright-statement>
<copyright-year>2008</copyright-year>
<license license-type="publisher">
<license-p></license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="14684520810897368.pdf"></self-uri>
<abstract>
<sec>
<title content-type="abstract-heading">Purpose</title>
<x></x>
<p>The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content.</p>
</sec>
<sec>
<title content-type="abstract-heading">Design/methodology/approach</title>
<x></x>
<p>The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed.</p>
</sec>
<sec>
<title content-type="abstract-heading">Findings</title>
<x></x>
<p>The proposed pipeline is implemented in a working prototype of an autonomous digital library (A‐DL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources and is solely based on the existing information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline.</p>
</sec>
<sec>
<title content-type="abstract-heading">Practical implications</title>
<x></x>
<p>The proposed A‐DL system can be used in automating end‐to‐end information retrieval and processing, supporting the control and elimination of error‐prone human intervention in the process.</p>
</sec>
<sec>
<title content-type="abstract-heading">Originality/value</title>
<x></x>
<p>High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value‐added services within existing and future semantic‐enabled digital library systems.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Digital libraries</kwd>
<x>, </x>
<kwd>Information retrieval</kwd>
<x>, </x>
<kwd>Library systems</kwd>
<x>, </x>
<kwd>Automation</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>peer-reviewed</meta-name>
<meta-value>no</meta-value>
</custom-meta>
<custom-meta>
<meta-name>academic-content</meta-name>
<meta-value>yes</meta-value>
</custom-meta>
<custom-meta>
<meta-name>rightslink</meta-name>
<meta-value>included</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<ack>
<p>The authors acknowledge C. Lee Giles for useful comments and advice during initial brainstorming on the system architecture, and Patrick Reuther for his advice on the definition of the evaluation set and for information exchange on DBLP datasets.</p>
</ack>
</front>
<body>
<sec>
<title>Introduction</title>
<p>We are experiencing exponential information growth and facing the problem of its management. Some of the critical issues are the management of very large repositories of digital objects, the existence of many standards to encode the same information in natural language and the complexity of identification of information relevance (within the user's request, within the digital object and within a collection of digital objects). Distinct elements of the outlined problem have long been under investigation, such as library systems, search engines, natural language processing (NLP) techniques, statistical methods of information analysis, etc. In our view the area has only recently matured enough to shift the research attention from individual issues to a global approach to the problem, at least in specific, vertical domains. We focus on the vertical domain of scholarly/scientific content.</p>
<p>Different systems are currently available online, from commercial digital libraries (like Scopus (
<ext-link ext-link-type="uri" xlink:href="http://www.scopus.com/">www.scopus.com/</ext-link>
), Web of Knowledge (
<ext-link ext-link-type="uri" xlink:href="http://www.isiwebofknowledge.com/">www.isiwebofknowledge.com/</ext-link>
), IEEEXplore (
<ext-link ext-link-type="uri" xlink:href="http://ieeexplore.ieee.org/">http://ieeexplore.ieee.org/</ext-link>
), ACM Digital Library (
<ext-link ext-link-type="uri" xlink:href="http://portal.acm.org">http://portal.acm.org</ext-link>
)) to non‐commercial digital libraries (CiteSeer.IST (
<ext-link ext-link-type="uri" xlink:href="http://citeseer.ist.psu.edu/">http://citeseer.ist.psu.edu/</ext-link>
), DBLP (
<ext-link ext-link-type="uri" xlink:href="http://dblp.uni-trier.de/">http://dblp.uni‐trier.de/</ext-link>
)) and current versions of commercially‐managed systems that explore novel business models for academic search engines (like Google Scholar (
<ext-link ext-link-type="uri" xlink:href="http://scholar.google.com/">http://scholar.google.com/</ext-link>
) and Microsoft's Live Search Academic (
<ext-link ext-link-type="uri" xlink:href="http://academic.live.com/">http://academic.live.com/</ext-link>
)).</p>
<p>The existence of such variety and size of content as well as increasing accessibility opens the way to semantic‐enabled services like unsupervised document clustering, author profiling (
<ext-link ext-link-type="uri" xlink:href="http://www.rexa.info">www.rexa.info</ext-link>
), scientometrics (
<xref ref-type="bibr" rid="b27">van Raan, 1997</xref>
), science domains mapping (
<xref ref-type="bibr" rid="b19">Noyons
<italic>et al.</italic>
, 1999</xref>
), scientific social networks analysis (
<xref ref-type="bibr" rid="b18">Newman, 2001</xref>
;
<xref ref-type="bibr" rid="b14">Klink
<italic>et al.</italic>
, 2006</xref>
), etc. However, the implementation of such semantic‐aware services requires the annotation of the available content with high‐quality metadata.</p>
<p>The two different information sources of scientific content – traditional/journal‐based sources and internet sources – present important differences in the approach to metadata annotation. On the one hand, traditional sources are usually based on manually prepared information (from certified authorities such as professional associations like ACM and IEEE, and commercial publishers like Elsevier, Springer, etc.). On the other hand the exponential increase of digital scientific publishing models requires support for the automation of all human‐dependent parts of such annotation.</p>
<p>In this paper, we address the problem of automation of all steps in the creation of a semantically‐enriched scientific digital library. We propose an “information extraction pipeline” from digital document acquisition to format transformation and high‐quality automatic information extraction and annotation. An unsupervised implementation of this pipeline creates what we call an autonomous digital library (A‐DL) system. We present an overall architecture for such a system and describe in some detail a first prototype – the ScienceTreks system. In particular, our prototype:
<list list-type="bullet">
<list-item>
<label></label>
<p>Supports a broad range of methods for document acquisition, from local file system repository and generic internet crawling to focused internet crawling.</p>
</list-item>
<list-item>
<label></label>
<p>Does not rely on any external information sources, but is solely based on the existing information in the document itself and in the overall set of documents currently present in a given digital archive.</p>
</list-item>
<list-item>
<label></label>
<p>Provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline. It is thus open to extension and potential improvements in metadata extraction and processing by other methods and tools.</p>
</list-item>
</list>
Furthermore, we present preliminary results on the evaluation of the quality of our novel approach to metadata extraction, where the emphasis is on the exploitation of the knowledge available within the available document collection.</p>
<p>In this paper, we discuss related work before presenting the overall system architecture of an A‐DL system. We describe in some detail the implementation of the individual information extraction pipeline steps in a prototype system called ScienceTreks. We describe the proposed methodology to evaluate the dynamics of metadata extraction quality, and present and analyse the preliminary results obtained in our evaluation procedure. Finally, we summarise the results and discuss our future work.</p>
</sec>
<sec>
<title>Related work</title>
<p>Wide adoption of open standards for inter‐exchange in the digital library domain (such as Dublin Core (The Dublin Core Metadata Element Set, ISO 15836: 2003,
<ext-link ext-link-type="uri" xlink:href="http://www.niso.org/international/SC4/n515.pdf">www.niso.org/international/SC4/n515.pdf</ext-link>
), IEEE Learning Objects Metadata (LOM) (IEEE Standard 1484.12.1,
<ext-link ext-link-type="uri" xlink:href="http://ltsc.ieee.org/wg12/">http://ltsc.ieee.org/wg12/</ext-link>
) and OAI‐PMH (
<ext-link ext-link-type="uri" xlink:href="http://openarchives.org">http://openarchives.org</ext-link>
)) and the recent appearance of a number of commercial digital library systems from big market players (like Google (Google Scholar,
<ext-link ext-link-type="uri" xlink:href="http://scholar.google.com/">http://scholar.google.com/</ext-link>
) and Microsoft (Live Search Academic,
<ext-link ext-link-type="uri" xlink:href="http://academic.live.com">http://academic.live.com</ext-link>
)) may serve as an indicator of the growth of the overall digital library domain. Also, existing academic digital library systems are enlarging their content size. The CiteSeer.IST autonomous citation indexing system (
<xref ref-type="bibr" rid="b6">Giles
<italic>et al.</italic>
, 1998</xref>
) has recently reached 730,000 scientific articles. Specialised academic pre‐print archives, like ArXiv (ArXiv,
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org">http://arxiv.org</ext-link>
) in physics, Cogprints (Cogprints,
<ext-link ext-link-type="uri" xlink:href="http://cogprints.org">http://cogprints.org</ext-link>
) in cognitive science, and RePEc (RePEc,
<ext-link ext-link-type="uri" xlink:href="http://repec.org">http://repec.org</ext-link>
) in economics among others, are in constant growth.</p>
<p>Most of the recent research on digital libraries can be summarised under a number of topics – metadata description schemes and their application, interoperability schemes, large‐scale digital library systems and distributed architectures, identification and handling of near‐duplicates (revisions, corrections, etc.), and the application of semantic‐enabled services (classification, personalised digital library systems, etc.).</p>
<p>In particular, recent feasibility studies of Dublin Core, OAI‐PMH and LOM metadata description standards by
<xref ref-type="bibr" rid="b8">Heath
<italic>et al.</italic>
(2005)</xref>
and
<xref ref-type="bibr" rid="b16">Lagoze
<italic>et al.</italic>
(2006)</xref>
have reported difficulties with the applicability of these standards in live digital library systems and can be considered a reference for eventual standards review. The studies were based on 3‐5 years of experiments, and the reported difficulties were mainly connected with the high cost of deployment and maintenance. Similar studies have been carried out for interoperability protocols (OAI/XOAI/ODL) between different digital library systems as well as between components inside single digital library systems (
<xref ref-type="bibr" rid="b25">Suleman and Fox, 2002</xref>
;
<xref ref-type="bibr" rid="b22">Petinot
<italic>et al.</italic>
, 2004b</xref>
). The studies did not report any standards shortcomings, but were rather focused on the architectural patterns in digital library systems.</p>
<p>In the context of continuous data growth, the design of distributed architectures for digital library systems and the analysis of user requirements were recently addressed by
<xref ref-type="bibr" rid="b9">Ioannidis
<italic>et al.</italic>
(2005)</xref>
and
<xref ref-type="bibr" rid="b26">Tryfonopoulos
<italic>et al.</italic>
(2005)</xref>
. These works have proposed a global evolution scheme for digital library systems and outlined existing problems such as data organisation, results presentation, requests evaluation and others. The related problem of data versioning and duplicates processing was recently reviewed and new methods for near‐duplicates elimination were proposed (
<xref ref-type="bibr" rid="b28">Yang
<italic>et al.</italic>
, 2006</xref>
;
<xref ref-type="bibr" rid="b3">Conrad and Schriber, 2006</xref>
). Research into the future evolution of digital library systems applying semantic web methods (
<xref ref-type="bibr" rid="b15">Kruk
<italic>et al.</italic>
, 2005</xref>
) has shown the possibility of improving user experience in content search and navigation.</p>
<p>Altogether these topics reflect a more global goal toward process automation in digital library systems and indicate possible application areas. Our work contributes towards this goal with a proposed information extraction pipeline architecture and the corresponding implementation in a prototype of an A‐DL system.</p>
</sec>
<sec>
<title>Autonomous digital library system</title>
<p>Simplifying information gathering, processing and extraction is a challenging problem. In the traditional approach most of the real work is done by a human “information engineer” who possesses specific knowledge about the content and has special training in the information processing methods.</p>
<p>In this paper, we propose and analyse an “information extraction pipeline”, from digital document acquisition and format transformation to high‐quality automatic information extraction and annotation. Such an information extraction pipeline can be separated into a number of operational steps:
<list list-type="order">
<list-item>
<label>1. </label>
<p>
<italic>Crawling.</italic>
The sources of initial raw data (digital scientific documents) for the pipeline input are collected.</p>
</list-item>
<list-item>
<label>2. </label>
<p>
<italic>Parsing and harmonisation.</italic>
Document format transformation (e.g. from PDF to text) and pre‐processing operations are undertaken.</p>
</list-item>
<list-item>
<label>3. </label>
<p>
<italic>Metadata extraction.</italic>
A number of sequential operations are supported. First, logical structures within single documents (i.e. header, abstract, introduction, etc.) are identified; then single entities (references, etc.) within single documents are recognised; finally, metadata (authors, title, publication authority, affiliations, etc.) within single entities are extracted.</p>
</list-item>
<list-item>
<label>4. </label>
<p>
<italic>Metadata processing.</italic>
Relations between the identified metadata are created. Examples include the creation of a network of interlinked documents (e.g. a citation graph), the identification of topics and co‐author analysis.</p>
</list-item>
<list-item>
<label>5. </label>
<p>
<italic>Searching.</italic>
The gathered data (documents) and extracted metadata are indexed and mapped into a searchable database to deliver fast, scalable and reliable access with search and browse functionalities for human as well as non‐human (typically web services) users.</p>
</list-item>
</list>
At the end of this process further recognition and formalisation of the relevant metadata in proper semantic concepts can be performed to enable semantic‐aware innovative services.</p>
<p>An A‐DL system aims toward unsupervised execution of the information extraction pipeline steps outlined above. To this end we have designed and implemented a prototype of a scalable and distributed A‐DL system that covers the identified digital library archive functionalities and is in the process of expansion to the semantic‐based functionalities. The logical architecture of such an A‐DL system can be described in eight layers:
<list list-type="order">
<list-item>
<label>1. </label>
<p>internal data structure;</p>
</list-item>
<list-item>
<label>2. </label>
<p>information retrieval;</p>
</list-item>
<list-item>
<label>3. </label>
<p>parsing and harmonisation;</p>
</list-item>
<list-item>
<label>4. </label>
<p>metadata extraction;</p>
</list-item>
<list-item>
<label>5. </label>
<p>metadata processing;</p>
</list-item>
<list-item>
<label>6. </label>
<p>information management (search and retrieval);</p>
</list-item>
<list-item>
<label>7. </label>
<p>application management; and</p>
</list-item>
<list-item>
<label>8. </label>
<p>interfaces.</p>
</list-item>
</list>
Layers (1), (7) and (8) represent infrastructural functionalities, while layers (2)‐(6) represent the implementation of the information extraction pipeline. Each block in the presented pipeline is loosely coupled with the rest through a common data representation scheme. It is important to note that, unlike other digital library systems that provide external systems only with an API for final metadata querying (
<xref ref-type="bibr" rid="b21">Petinot
<italic>et al.</italic>
, 2004a</xref>
), our system architecture allows easy integration of external systems and tools in the existing pipeline.</p>
<p>Schematically, each document goes through a number of transactions covering document retrieval, text parsing and harmonisation, metadata extraction, metadata processing and indexing.</p>
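<p>The loose coupling through a common data representation described above can be illustrated with a minimal sketch. It is not the ScienceTreks implementation: the Document record, the step functions and the run_pipeline helper are hypothetical names introduced here for illustration, and the parsing and extraction rules are deliberately trivial stand‐ins.</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Common record passed between pipeline steps (hypothetical layout)."""
    uri: str
    raw: bytes = b""
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


# A pipeline step takes the shared record and returns it enriched.
Step = Callable[[Document], Document]


def parse(doc: Document) -> Document:
    # Stand-in for format transformation (e.g. PDF to text).
    doc.text = doc.raw.decode("utf-8", errors="replace").strip()
    return doc


def extract_metadata(doc: Document) -> Document:
    # Toy rule (an assumption): treat the first non-empty line as the title.
    lines = [ln for ln in doc.text.splitlines() if ln.strip()]
    if lines:
        doc.metadata["title"] = lines[0].strip()
    return doc


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    # Steps only share the Document record, so an external tool can be
    # added by appending another callable to the list.
    for step in steps:
        doc = step(doc)
    return doc


doc = run_pipeline(
    Document(uri="file:///example.txt", raw=b"A sample title\nBody text."),
    [parse, extract_metadata],
)
print(doc.metadata["title"])
```

<p>Because every step has the same signature, an external extractor could in principle be inserted between any two steps without changing the others.</p>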
</sec>
<sec>
<title>ScienceTreks architecture overview</title>
<p>Our prototype A‐DL system – ScienceTreks – consists of five major modules implementing information extraction, plus internal support for the data structure.
<xref ref-type="fig" rid="F_2640320403001">Figure 1</xref>
shows a diagram of this architecture, indicating the flow of data in the sub‐system. Here, we will skip the presentation of the specific implementation of the internal data structure since it is out of the scope of the current paper. We only mention that it is connected to the implementation of a distributed file system from the Apache Nutch project (
<ext-link ext-link-type="uri" xlink:href="http://nutch.org">http://nutch.org</ext-link>
). In the following sections we describe each module.</p>
<sec>
<title>Crawler</title>
<p>The crawler module is essential to the overall system since it is the main source of initial raw data for the pipeline input. Large‐scale crawler design is a relevant research problem in itself as well as a technical and technological challenge (
<xref ref-type="bibr" rid="b1">Brin and Page, 1998</xref>
;
<xref ref-type="bibr" rid="b5">Diligenti
<italic>et al.</italic>
, 2000</xref>
;
<xref ref-type="bibr" rid="b2">Cho and Garcia‐Molina, 2002</xref>
). Some of the issues studied include crawling schemes for better coverage quality (focused crawling, random walk, etc.), duplicates and near‐duplicates identification (as well as content versioning), parallel crawling (independent, dynamic assignment, etc.) and crawler trap identification (infinite loops, generated content, etc.).</p>
<p>Our crawler module is designed for a broad range of possible application domains so it supports several methods of document acquisition, in particular:
<list list-type="bullet">
<list-item>
<label></label>
<p>simple system bootstrapping from an existing document set, generating missing metadata if needed;</p>
</list-item>
<list-item>
<label></label>
<p>document retrieval from the internet using either a list of direct links to documents or links to pages with documents (one level in‐depth crawling);</p>
</list-item>
<list-item>
<label></label>
<p>focused internet crawling using an indicated list of domains; and</p>
</list-item>
<list-item>
<label></label>
<p>internet‐wide crawling.</p>
</list-item>
</list>
All discovered archived documents are uncompressed, so any broken archives are discarded during the acquisition phase. Overall, crawler functionality includes compliance with the standards (HTTP, FTP, cookies, robots.txt, etc.), correct session handling, crawler trap recovery (infinite loops, etc.), distributed crawling support, fault tolerance and other minor technical features (management, monitoring, etc.).</p>
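<p>As an illustration of the “links to pages with documents” acquisition method (one level in‐depth crawling), the following sketch collects document links from a single already downloaded HTML page. It is an assumption‐laden example rather than the crawler described here: the class name DocumentLinkCollector is hypothetical, only the PDF and PostScript extensions are considered, and fetching, robots.txt handling and archive decompression are left out.</p>

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class DocumentLinkCollector(HTMLParser):
    """Collects absolute links to PDF/PostScript files found on one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags can carry document links in this sketch.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep links whose path ends in a supported document extension.
        if href and href.lower().endswith((".pdf", ".ps")):
            # Resolve relative links against the page address.
            self.links.append(urljoin(self.base_url, href))
```

<p>A caller would feed the page source to the collector and then read its links attribute; links are not followed any further, which corresponds to a crawl depth of one.</p>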
</sec>
<sec>
<title>Parser</title>
<p>The parser module functionality covers the transformation of documents to plain text and some text pre‐processing operations. Our parser currently supports the two most popular formats for publishing scientific documents – PDF and PostScript. While document‐to‐text transformation is a technical task, text pre‐processing includes some research problems like text flow recognition and collateral elements detection, such as table of contents, index, headers, footers, etc. (
<xref ref-type="bibr" rid="b24">Salton
<italic>et al.</italic>
, 1997</xref>
;
<xref ref-type="bibr" rid="b10">Ivanyukovich and Marchese, 2006a</xref>
). The problem of text flow recognition is historically connected with the PostScript document format (as well as printing process optimisation) – articles found on the internet can have normal or reverse page ordering. In the development of the module, we have evaluated and incorporated a number of approaches such as the numbers succession method, the hyphens concatenation method and text flow prediction using hidden Markov models (HMM)/dynamic Bayesian networks (DBNs). More details on these methods can be found in
<xref ref-type="bibr" rid="b11">Ivanyukovich and Marchese (2006b)</xref>
.</p>
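<p>The idea of recognising normal versus reverse page ordering can be sketched as follows. The text names a “numbers succession method” without describing it, so the heuristic below is only one plausible reading and is an assumption: it compares the page numbers detected on consecutive pages and reverses the page list when they clearly run backwards. The function names are hypothetical.</p>

```python
from typing import List, Optional


def detect_page_order(page_numbers: List[Optional[int]]) -> str:
    """Guess page ordering from page numbers found on consecutive pages.

    None marks a page where no number was detected.
    Returns "normal", "reverse" or "unknown".
    """
    known = [n for n in page_numbers if n is not None]
    pairs = list(zip(known, known[1:]))
    ascending = sum(1 for a, b in pairs if b > a)
    descending = sum(1 for a, b in pairs if a > b)
    if ascending > descending:
        return "normal"
    if descending > ascending:
        return "reverse"
    return "unknown"


def restore_reading_order(pages: List[str],
                          page_numbers: List[Optional[int]]) -> List[str]:
    # Flip the page list only when the numbering clearly runs backwards.
    if detect_page_order(page_numbers) == "reverse":
        return list(reversed(pages))
    return list(pages)
```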
</sec>
<sec>
<title>Metadata extractor</title>
<p>In this module cleaned text is transformed into structured segments (abstract, introduction, references section, etc.). The references section is split into individual references and afterwards individual metadata fields are extracted for each reference. These include, for the moment, a sub‐set of the Dublin Core metadata standard, i.e. authors, title, conference proceedings, publication year and some other fields within the document's text.</p>
<p>Automatic metadata extraction has been under investigation for a considerable time and numerous methods are available for the purpose; regular expressions, rule‐based automata, machine learning and NLP are among the most popular (
<xref ref-type="bibr" rid="b7">Han
<italic>et al.</italic>
, 2003</xref>
). Regular expressions and rule‐based automata do not require training, are easy to implement and are fast. However, they require a domain expert for creation and tuning, their complexity increases for a moderate‐to‐large number of features and they are usually difficult to adapt.</p>
<p>Machine learning techniques for information extraction include symbolic learning, inductive logic programming, grammar induction, support vector machines, HMM, DBNs and statistical methods (
<xref ref-type="bibr" rid="b20">Peshkin and Pfeffer, 2003</xref>
). In theory, machine learning techniques are robust and adaptive, but in practice they require a training set of the same order of magnitude as the set under investigation, which limits their application. Another challenge in the application of machine learning to metadata extraction is the absence of false positives during training – training can be done only on true positives. NLP methods can deliver the best results but are complex and language‐dependent, and do not perform particularly well in terms of speed. They are usually used in combination with other techniques.</p>
<p>Another research problem connected with metadata extraction is metadata normalisation and comparison. In the digital library domain this includes the normalisation of references and of authors, and references comparison under uncertainty (partially overlapping information in the references under comparison).</p>
<p>In the implementation of this module, we have followed a novel method for unsupervised metadata extraction based on a priori domain‐specific knowledge. It consists of two major steps:
<list list-type="order">
<list-item>
<label>1. </label>
<p>pattern‐based metadata extraction using a finite state machine (FSM); and</p>
</list-item>
<list-item>
<label>2. </label>
<p>statistical correction using a priori domain‐specific knowledge.</p>
</list-item>
</list>
For the first step we have analysed, tested and adapted an existing state‐of‐the‐art implementation of a specialised FSM‐based lexical grammar parser for fast text processing (
<xref ref-type="bibr" rid="b13">Kiyavitskaya
<italic>et al.</italic>
, 2006</xref>
;
<xref ref-type="bibr" rid="b4">Cordy, 2004</xref>
). For the second step, we have investigated and developed statistical methods that allow metadata correction and enrichment without the need to access external information sources. More details on the methods used can be found in
<xref ref-type="bibr" rid="b11">Ivanyukovich and Marchese (2006b)</xref>
.</p>
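<p>The two‐step idea (pattern‐based extraction followed by statistical correction using only the collection itself) can be illustrated with a toy sketch. It is not the FSM‐based lexical grammar parser or the statistical method used in ScienceTreks, whose details are given in the cited work: the reference layout “authors, year, title”, the two‐state machine and the frequency‐based correction rule are all simplifying assumptions, and the function names are hypothetical.</p>

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List

# A token such as "(2001)", "2001," or "2004b" is treated as a year marker.
YEAR = re.compile(r"^\(?((?:19|20)\d{2})[a-z]?\)?[.,]?$")


def extract_fields(reference: str) -> Dict[str, str]:
    """Step 1 (toy state machine): AUTHORS -> TITLE, switching on a year token."""
    state = "AUTHORS"
    authors: List[str] = []
    title: List[str] = []
    year = ""
    for token in reference.split():
        match = YEAR.match(token)
        if state == "AUTHORS" and match:
            year = match.group(1)
            state = "TITLE"
        elif state == "AUTHORS":
            authors.append(token)
        else:
            title.append(token)
    return {
        "authors": " ".join(authors).rstrip(","),
        "year": year,
        "title": " ".join(title).rstrip("."),
    }


def correct_by_frequency(values: List[str]) -> List[str]:
    """Step 2 (toy statistics): replace each variant with the most frequent
    spelling sharing its normalised key within the collection."""
    def key(value: str) -> str:
        return re.sub(r"[^a-z0-9]", "", value.lower())

    variants: Dict[str, Counter] = defaultdict(Counter)
    for value in values:
        variants[key(value)][value] += 1
    return [variants[key(value)].most_common(1)[0][0] for value in values]
```

<p>The second function uses no external source: the only evidence for a correction is how often each spelling occurs in the documents already present in the archive.</p>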
</sec>
<sec>
<title>Metadata processor</title>
<p>The next step in the information extraction pipeline is dedicated to the creation of relations between metadata sets, in particular the creation of a network of interlinked documents – a citation graph. This network includes a bi‐directional linking scheme consisting of forward links – from a document to its references (documents that it cites); and backward links – from a document to the documents that cite it. Unlike other digital library systems, we have omitted the identification of a document's title and authors in the previous metadata extraction step, because at that stage we could use only the techniques described in the previous section. However, at this stage we can use a document's structure as well as metadata already collected for more precise title and author recognition. At present, the approach is limited to the use of the internal set of metadata. However, it can be extended to the use of existing, external high‐quality metadata repositories (like Digital Bibliography & Library Project (DBLP) and publishers' collected data (IEEE, ACM, Elsevier, etc.)). This combined approach is expected to improve the quality of the resulting citation graph.</p>
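<p>The bi‐directional linking scheme can be sketched as a pair of adjacency maps. This is a minimal illustration under stated assumptions, not the ScienceTreks data structure: the class name CitationGraph and the use of plain string identifiers for documents are hypothetical.</p>

```python
from collections import defaultdict
from typing import Dict, Set


class CitationGraph:
    """Bi-directional citation index keyed by document identifier."""

    def __init__(self) -> None:
        # Forward links: document -> documents it cites.
        self.forward: Dict[str, Set[str]] = defaultdict(set)
        # Backward links: document -> documents that cite it.
        self.backward: Dict[str, Set[str]] = defaultdict(set)

    def add_citation(self, citing: str, cited: str) -> None:
        # Every citation is recorded in both directions.
        self.forward[citing].add(cited)
        self.backward[cited].add(citing)

    def citation_count(self, doc_id: str) -> int:
        # Number of citing documents; a simple basis for citation-based ranking.
        return len(self.backward.get(doc_id, set()))
```

<p>Keeping both maps lets either direction be looked up directly, at the cost of storing every link twice.</p>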
</sec>
<sec>
<title>Indexer and front‐end</title>
<p>These modules cover both typical search engine and digital library functionalities. For performance reasons we have included in the system support for index distribution over multiple PCs, fast record location mechanisms based on distributed file system facilities and cache mechanisms for both queries and documents.</p>
<p>According to our study the text‐to‐binary content ratio is approximately 10 per cent and the index‐to‐text ratio is approximately 30 per cent (depending on indexing techniques such as stemming, stop‐word elimination, etc.). This gives us an estimate of required memory consumption at the front‐end: 50GB of processed content yields about 5GB of text, so we expect an addition of around 1.5GB to the index. Index search speed is in inverse proportion to the index size. This fact adds another architectural constraint – the index size should be small for usability reasons. The exact index size depends on the possible speed of read operations. Our implementation enables index distribution over a network of distributed PCs where each node can always keep its part of the index in memory, thus optimising the speed of read operations.</p>
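<p>The estimate above follows directly from the two reported ratios, as the short calculation below shows. The function name and parameter names are hypothetical; the default values are the approximate ratios quoted in the text.</p>

```python
def estimated_index_size_gb(content_gb: float,
                            text_ratio: float = 0.10,
                            index_ratio: float = 0.30) -> float:
    """Estimate index growth from the two ratios reported in the text."""
    text_gb = content_gb * text_ratio   # plain text recovered from binary content
    return text_gb * index_ratio        # index built over that text


# 50 GB of content, about 5 GB of text, about 1.5 GB of index.
print(round(estimated_index_size_gb(50), 2))
```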
<p>The front‐end functionalities are simple and straightforward. Currently they support metadata and full‐text searching, metadata retrieval and binary content retrieval (cached versions of documents). Additional features are limited at present to citation‐based ranking functionalities.</p>
</sec>
</sec>
<sec>
<title>Evaluation methodology and results</title>
<p>The ScienceTreks project currently contains about 500,000 documents. At this order of magnitude, manual quality evaluation of the base collection is not feasible. This has already been reported in related work, and other methodologies (involving automated or semi‐automated methods) have been proposed based on the specifics of each concrete dataset (
<xref ref-type="bibr" rid="b23">Reuther and Walter, 2006</xref>
).</p>
<p>Comparison of the methods involved in information extraction in existing automated digital library systems (CiteSeer.IST, GoogleScholar and Live Search Academic, to name a few) is complex due to the fact that there is no “golden set” of metadata publicly available. By “golden set” we mean a set of metadata that is either completely verified manually or is strictly aligned with such a set. An ideal golden set should be based on an intersection of the collections of articles in different systems, so that it would be possible to measure the influence of different processing methods and a priori assumptions on the resulting metadata quality within each system. This dataset should be aligned with manually verified metadata – we need a one‐to‐one connection between the articles we have in the system and the metadata describing these articles.</p>
<p>In the domain of scientific publications the DBLP maintained at the University of Trier is likely a good candidate for such a golden metadata set. DBLP is a strongly human‐dependent collection of bibliographic records which have all been manually acquired and checked for quality (
<xref ref-type="bibr" rid="b17">Ley and Reuther, 2006</xref>
). Currently, DBLP contains metadata for more than 860,000 publication records published by more than 450,000 authors. The size and the high quality, which is respected throughout the scientific community, as well as the focus on computer science publications make DBLP an ideal starting point for a publicly available golden set for digital libraries in the computer science domain.</p>
<p>The method we have used for document identification (described in the previous section) allows us to utilise any external metadata sources directly. According to the definition of the “golden set” provided above, we require an intersection between the DBLP and ScienceTreks datasets. We have achieved this using the complete DBLP references collection for document identification within ScienceTreks. The approach guarantees this identification; however, it is still possible to have several articles from the same authors with titles that contain one another. It is hard to avoid this completely, but we have introduced a minimal title length constraint in our dataset to reduce this possibility. From the resulting collection of identified articles we have selected an initial test set of 45,000 documents as the golden set. Obviously the selected golden set is not complete in any way; that is, the method we have used for its construction does not presume either constraints on the references/citations quantity or particular community/publishing authority/publication time coverage. This gives us a roughly random collection of articles.</p>
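<p>The effect of a minimal title length constraint can be sketched as follows. This is only an illustration under our own assumptions, not the identification method used in ScienceTreks: the threshold value, the normalisation rule, the record layout (a dictionary with a <italic>title</italic> key) and the names <italic>MIN_TITLE_LENGTH</italic>, <italic>normalise</italic> and <italic>identify</italic> are all hypothetical.</p>
<preformat>
```python
import re

MIN_TITLE_LENGTH = 20  # hypothetical threshold, not a value taken from the paper


def normalise(text):
    """Lower-case the text and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def identify(document_text, records, min_title_length=MIN_TITLE_LENGTH):
    """Return the records whose normalised title occurs in the document text.

    Titles shorter than `min_title_length` are skipped, because a short title
    is more likely to be contained in a longer, unrelated title.
    """
    haystack = " " + normalise(document_text) + " "
    matches = []
    for record in records:
        title = normalise(record["title"])
        # Pad with spaces so that only whole-word occurrences count.
        if len(title) >= min_title_length and (" " + title + " ") in haystack:
            matches.append(record)
    return matches
```
</preformat>
<p>With such a filter, a short title that happens to occur inside the text of another article is not used for identification, at the cost of leaving articles with short titles out of the golden set.</p>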
<p>In our tests, we compare the metadata extracted with our methods (see previous sections) with the golden set metadata using a Levenshtein distance metric. This comparison gives us a distribution of identified metadata over edit distances, together with the average edit distance and the related variance and deviation. Further evaluation was done by varying the size of the golden set to assess the quality with respect to the growth of the dataset during the evolution of the system (metadata quality dynamics). For more details on the preparation and representativeness evaluation of the golden set, as well as the precise methodology of metadata comparison for titles and authors, refer to
<xref ref-type="bibr" rid="b12">Ivanyukovich
<italic>et al.</italic>
(2007)</xref>
.</p>
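<p>The comparison step can be sketched as follows. The code uses the standard dynamic‐programming formulation of the Levenshtein distance and summarises the distances over pairs of extracted and golden strings. The function names and the layout of the summary are assumptions made for illustration, and no normalisation of the strings is applied.</p>
<preformat>
```python
from collections import Counter
from statistics import mean, pvariance


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_summary(pairs):
    """Summarise edit distances over (extracted, golden) string pairs."""
    distances = [levenshtein(extracted, golden) for extracted, golden in pairs]
    return {
        "histogram": Counter(distances),  # distribution over edit distances
        "exact_match_share": distances.count(0) / len(distances),
        "mean": mean(distances),
        "variance": pvariance(distances),
    }
```
</preformat>
<p>An edit distance of zero corresponds to an exact match, so the share of zero distances is the exact‐match percentage discussed in the results.</p>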
<p>The evaluation of extracted titles revealed a high percentage of exact matches, around 37 per cent. Beyond that, a relatively shallow distribution is present in the range of Levenshtein edit distances 20‐100, with a relative maximum at around 45‐50; it accounts for the remaining partially‐recognised/unrecognised titles. When sequentially enlarging the document sets from 45,000 to 165,000, we observed that the overall shape of the distribution is unchanged, while the title recognition percentage rises linearly from 37 to 46 per cent.</p>
<p>The results for author identification quality within the initial set of 45,000 documents appeared to be better than for title quality – around 53 per cent of authors were identified absolutely correctly. Following from our simple Boolean comparison for a single author, the distribution of author recognition is bi‐modal with two sharp peaks at correct author recognition (value = 1, 53 per cent) and complete miss (value = 0, 44 per cent). The remaining (small) 3 per cent consists of articles for which only some of the authors were identified. Metadata quality dynamics in this case show limited variation – in enlarging the document set from 45,000 to 165,000, the normalised author recognition value rose only a few per cent, from 53 to 56 per cent.</p>
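<p>A normalised author recognition value of this kind can be sketched as the share of golden‐set authors that were found among the extracted authors. The per‐author comparison used below (exact match after trimming and lower‐casing) and the function name are our assumptions. The methodology actually used is described in Ivanyukovich <italic>et al.</italic> (2007).</p>
<preformat>
```python
def author_recognition(extracted_authors, golden_authors):
    """Share of golden-set authors found among the extracted authors.

    Each author is compared with a simple Boolean test; here that test is an
    exact match after trimming and lower-casing (an illustrative assumption).
    """
    if not golden_authors:
        raise ValueError("the golden record must list at least one author")
    extracted = {name.strip().lower() for name in extracted_authors}
    hits = sum(1 for name in golden_authors if name.strip().lower() in extracted)
    return hits / len(golden_authors)
```
</preformat>
<p>For a single‐author article the value can only be 0 or 1, which is consistent with the two sharp peaks of the distribution. Intermediate values arise only for articles with several authors of which some were missed.</p>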
<p>These preliminary results show that our approach is capable of achieving a significant recognition quality level (approximately 37 per cent for title and 53 per cent for authors) within even a limited document set (45,000), without use of human supervision or any external knowledge sources or training sets. Moreover, the recognition quality level for titles in our tests increased linearly with the size of the processed document set.</p>
</sec>
<sec>
<title>Conclusions and future work</title>
<p>In this paper, we have presented and discussed an information extraction pipeline that includes digital document acquisition, appropriate format transformation and quality information extraction and annotation. The proposed pipeline has been implemented in a working prototype of an A‐DL system – ScienceTreks – that:
<list list-type="bullet">
<list-item>
<label></label>
<p>Supports a broad range of methods for document acquisition, from local file system repository and generic internet crawling to focused internet crawling.</p>
</list-item>
<list-item>
<label></label>
<p>Does not rely on any external information sources but is solely based on the existing information in the document itself and in the overall set of documents currently present in a given digital archive.</p>
</list-item>
<list-item>
<label></label>
<p>Provides APIs to support easy integration of external systems and tools in the existing pipeline. It is thus open to extension and potential improvements in metadata extraction and processing by other methods and tools.</p>
</list-item>
<list-item>
<label></label>
<p>Is capable of achieving a significant recognition quality level (approximately 46 per cent for title and 56 per cent for authors), without use of human supervision or any external knowledge sources or training sets. Combined with existing external knowledge sources or other metadata extraction methods, the approach can further improve overall metadata quality and coverage.</p>
</list-item>
</list>
High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value‐added services within existing and future semantic‐enabled digital library systems.</p>
</sec>
<sec>
<fig position="float" id="F_2640320403001">
<label>
<bold>Figure 1
<x> </x>
</bold>
</label>
<caption>
<p>Main modules and dataflow in the A‐DL system</p>
</caption>
<graphic xlink:href="2640320403001.tif"></graphic>
</fig>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="b1">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Brin</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Page</surname>
,
<given-names>L.</given-names>
</string-name>
</person-group>
(
<year>1998</year>
), “
<article-title>
<italic>The anatomy of a large‐scale hypertextual web search engine</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 7th World Wide Web Conference, Computer Networks and ISDN Systems</italic>
</source>
, Vol.
<volume>30</volume>
Nos
<issue>1/7</issue>
, pp.
<fpage>107</fpage>
<x></x>
<lpage>17</lpage>
.</mixed-citation>
</ref>
<ref id="b2">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Cho</surname>
,
<given-names>J.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Garcia‐Molina</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
(
<year>2002</year>
), “
<article-title>
<italic>Parallel crawlers</italic>
</article-title>
”,
<source>
<italic>Proceedings of the WWW2002, Honolulu, Hawaii, 7‐11 May</italic>
</source>
, available at: www2002.org/CDROM/refereed/108/.</mixed-citation>
</ref>
<ref id="b3">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Conrad</surname>
,
<given-names>J.G.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Schriber</surname>
,
<given-names>C.P.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Managing
<italic>déjà vu</italic>
: collection building for the identification of nonidentical duplicate documents</italic>
</article-title>
”,
<source>
<italic>Journal of the American Society for Information Science and Technology</italic>
</source>
, Vol.
<volume>57</volume>
No.
<issue>7</issue>
, pp.
<fpage>921</fpage>
<x></x>
<lpage>32</lpage>
.</mixed-citation>
</ref>
<ref id="b4">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Cordy</surname>
,
<given-names>J.</given-names>
</string-name>
</person-group>
(
<year>2004</year>
), “
<article-title>
<italic>Txl – a language for programming language tools and applications</italic>
</article-title>
”,
<source>
<italic>Proceedings of 4th International Workshop on Language Descriptions, Tools and Applications. Electronic Notes in Theoretical Computer Science</italic>
</source>
, Vol.
<volume>110</volume>
, pp.
<fpage>3</fpage>
<x></x>
<lpage>31</lpage>
.</mixed-citation>
</ref>
<ref id="b5">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Diligenti</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Coetzee</surname>
,
<given-names>F.M.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Lawrence</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Giles</surname>
,
<given-names>C.L.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Gori</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
(
<year>2000</year>
), “
<article-title>
<italic>Focused crawling using context graphs</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 26th International Conference on Very Large Data Bases</italic>
</source>
,
<publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>
,
<publisher-loc>San Francisco, CA</publisher-loc>
, pp.
<fpage>527</fpage>
<x></x>
<lpage>34</lpage>
.</mixed-citation>
</ref>
<ref id="b6">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Giles</surname>
,
<given-names>C.L.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Bollacker</surname>
,
<given-names>K.D.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Lawrence</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
(
<year>1998</year>
), “
<article-title>
<italic>CiteSeer: an automatic citation indexing system</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 3rd ACM Conference on Digital Libraries</italic>
</source>
,
<publisher-name>ACM</publisher-name>
,
<publisher-loc>New York, NY</publisher-loc>
, pp.
<fpage>89</fpage>
<x></x>
<lpage>98</lpage>
.</mixed-citation>
</ref>
<ref id="b7">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Han</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Giles</surname>
,
<given-names>C.L.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Manavoglu</surname>
,
<given-names>E.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Zha</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Zhang</surname>
,
<given-names>Z.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Fox</surname>
,
<given-names>E.A.</given-names>
</string-name>
</person-group>
(
<year>2003</year>
), “
<article-title>
<italic>Automatic document metadata extraction using support vector machines</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 3rd ACM/IEEE‐CS joint Conference on Digital Libraries</italic>
</source>
,
<publisher-name>IEEE Computer Society</publisher-name>
,
<publisher-loc>Washington, DC</publisher-loc>
, pp.
<fpage>37</fpage>
<x></x>
<lpage>48</lpage>
.</mixed-citation>
</ref>
<ref id="b8">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Heath</surname>
,
<given-names>B.P.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>McArthur</surname>
,
<given-names>D.J.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>McClelland</surname>
,
<given-names>M.K.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Vetter</surname>
,
<given-names>R.J.</given-names>
</string-name>
</person-group>
(
<year>2005</year>
), “
<article-title>
<italic>Metadata lessons from the iLumina digital library</italic>
</article-title>
”,
<source>
<italic>Communications of the ACM</italic>
</source>
, Vol.
<volume>48</volume>
No.
<issue>7</issue>
, pp.
<fpage>68</fpage>
<x></x>
<lpage>74</lpage>
.</mixed-citation>
</ref>
<ref id="b9">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Ioannidis</surname>
,
<given-names>Y.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Maier</surname>
,
<given-names>D.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Abiteboul</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Buneman</surname>
,
<given-names>P.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Davidson</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Fox</surname>
,
<given-names>E.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Halevy</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Knoblock</surname>
,
<given-names>C.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Rabitti</surname>
,
<given-names>F.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Schek</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Weikum</surname>
,
<given-names>G.</given-names>
</string-name>
</person-group>
(
<year>2005</year>
), “
<article-title>
<italic>Digital library information‐technology infrastructures</italic>
</article-title>
”,
<source>
<italic>International Journal on Digital Libraries</italic>
</source>
, Vol.
<volume>5</volume>
No.
<issue>4</issue>
, pp.
<fpage>266</fpage>
<x></x>
<lpage>74</lpage>
.</mixed-citation>
</ref>
<ref id="b10">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Ivanyukovich</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Marchese</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
(
<year>2006a</year>
), “
<article-title>
<italic>Unsupervised free‐text processing and structuring in digital archives</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 1st International Conference on Multidisciplinary Information Sciences and Technologies, InScit2006, Merida, Spain</italic>
</source>
,
<italic>25‐28 October</italic>
, available at:
<ext-link ext-link-type="uri" xlink:href="http://www.science.unitn.it/~marchese/pdf/inscit2006_full_paper.pdf">www.science.unitn.it/~marchese/pdf/inscit2006_full_paper.pdf</ext-link>
.</mixed-citation>
</ref>
<ref id="b11">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Ivanyukovich</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Marchese</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
(
<year>2006b</year>
), “
<article-title>
<italic>Unsupervised metadata extraction in scientific digital libraries using a‐priori domain‐specific knowledge</italic>
</article-title>
”,
<source>
<italic>SWAP 2006 – Semantic Web Applications and Perspectives, Proceedings of the 3rd Italian Semantic Web Workshop, Pisa, Italy</italic>
</source>
,
<italic>18‐20 December</italic>
, CEUR Workshop Proceedings, 201, available at:
<ext-link ext-link-type="uri" xlink:href="http://ceur-ws.org/Vol-201/19.pdf">http://ceur-ws.org/Vol-201/19.pdf</ext-link>
.</mixed-citation>
</ref>
<ref id="b12">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Ivanyukovich</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Marchese</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Reuther</surname>
,
<given-names>P.</given-names>
</string-name>
</person-group>
(
<year>2007</year>
), “
<article-title>
<italic>Assessing quality dynamics in unsupervised metadata extraction for digital libraries</italic>
</article-title>
”,
<source>
<italic>Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science</italic>
</source>
, Vol.
<volume>4675</volume>
,
<publisher-name>Springer</publisher-name>
,
<publisher-loc>Berlin</publisher-loc>
, pp.
<fpage>454</fpage>
<x></x>
<lpage>7</lpage>
.</mixed-citation>
</ref>
<ref id="b13">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Kiyavitskaya</surname>
,
<given-names>N.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Zeni</surname>
,
<given-names>N.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Cordy</surname>
,
<given-names>J.R.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Mich</surname>
,
<given-names>L.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Mylopoulos</surname>
,
<given-names>J.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Semi‐automatic semantic annotations for next generation information systems</italic>
</article-title>
”,
<source>
<italic>Advanced Information Systems Engineering: 18th International Conference, CAiSE 2006, Luxembourg, Luxembourg, Proceedings, Lecture Notes in Computer Science</italic>
</source>
,
<publisher-name>Springer</publisher-name>
,
<publisher-loc>Berlin</publisher-loc>
, 5‐9 June.</mixed-citation>
</ref>
<ref id="b14">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Klink</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Reuther</surname>
,
<given-names>P.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Weber</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Walter</surname>
,
<given-names>B.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Ley</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Analysing social networks within bibliographical data</italic>
</article-title>
”,
<source>
<italic>Database and Expert Systems Applications, 17th International Conference, DEXA 2006, Kraków, Poland, Proceedings, Lecture Notes in Computer Science</italic>
</source>
, Vol.
<volume>4080</volume>
,
<publisher-name>Springer</publisher-name>
,
<publisher-loc>Berlin</publisher-loc>
, 4‐8 September, pp.
<fpage>234</fpage>
<x></x>
<lpage>43</lpage>
.</mixed-citation>
</ref>
<ref id="b15">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Kruk</surname>
,
<given-names>S.R.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Decker</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Zieborak</surname>
,
<given-names>L.</given-names>
</string-name>
</person-group>
(
<year>2005</year>
), “
<article-title>
<italic>JeromeDL – adding semantic web technologies to digital libraries</italic>
</article-title>
”,
<source>
<italic>Database and Expert Systems Applications, Lecture Notes in Computer Science</italic>
</source>
, Vol.
<volume>3588</volume>
, pp.
<fpage>716</fpage>
<x></x>
<lpage>25</lpage>
.</mixed-citation>
</ref>
<ref id="b16">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Lagoze</surname>
,
<given-names>C.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Krafft</surname>
,
<given-names>D.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Cornwell</surname>
,
<given-names>T.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Dushay</surname>
,
<given-names>N.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Eckstrom</surname>
,
<given-names>D.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Saylor</surname>
,
<given-names>J.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Metadata aggregation and ‘automated digital libraries’: a retrospective on the NSDL experience</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 6th ACM/IEEE‐CS Joint Conference on Digital Libraries</italic>
</source>
,
<publisher-name>ACM</publisher-name>
,
<publisher-loc>New York, NY</publisher-loc>
, pp.
<fpage>230</fpage>
<x></x>
<lpage>9</lpage>
.</mixed-citation>
</ref>
<ref id="b17">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Ley</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Reuther</surname>
,
<given-names>P.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Maintaining an online bibliographical database: the problem of data quality</italic>
</article-title>
”,
<source>
<italic>Extraction et Gestion des Connaissances (EGC'2006), Revue des Nouvelles Technologies de l'Information</italic>
</source>
, Vol.
<volume>RNTI‐E‐6</volume>
, pp.
<fpage>5</fpage>
<x></x>
<lpage>10</lpage>
.</mixed-citation>
</ref>
<ref id="b18">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Newman</surname>
,
<given-names>M.E.J.</given-names>
</string-name>
</person-group>
(
<year>2001</year>
), “
<article-title>
<italic>Scientific collaboration networks. I. Network construction and fundamental results</italic>
</article-title>
”,
<source>
<italic>Physical Review E</italic>
</source>
, Vol.
<volume>64</volume>
, 016131, available at: www-personal.umich.edu/~mejn/papers/016131.pdf.</mixed-citation>
</ref>
<ref id="b19">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Noyons</surname>
,
<given-names>E.C.M.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Moed</surname>
,
<given-names>H.F.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Luwel</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
(
<year>1999</year>
), “
<article-title>
<italic>Combining mapping and citation analysis for evaluative bibliometric purposes: a bibliometric study</italic>
</article-title>
”,
<source>
<italic>Journal of the American Society for Information Science</italic>
</source>
, Vol.
<volume>50</volume>
No.
<issue>2</issue>
, pp.
<fpage>115</fpage>
<x></x>
<lpage>31</lpage>
.</mixed-citation>
</ref>
<ref id="b20">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Peshkin</surname>
,
<given-names>L.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Pfeffer</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
(
<year>2003</year>
), “
<article-title>
<italic>Bayesian information extraction network</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico</italic>
</source>
, 9‐15 August, available at:
<ext-link ext-link-type="uri" xlink:href="http://www.eecs.harvard.edu/~pesha/Public/BIEN.pdf">www.eecs.harvard.edu/~pesha/Public/BIEN.pdf</ext-link>
.</mixed-citation>
</ref>
<ref id="b21">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Petinot</surname>
,
<given-names>Y.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Giles</surname>
,
<given-names>C.L.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Bhatnagar</surname>
,
<given-names>V.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Teregowda</surname>
,
<given-names>P.B.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Han</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
(
<year>2004a</year>
), “
<article-title>
<italic>Enabling interoperability for autonomous digital libraries: an API to CiteSeer services</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 4th ACM/IEEE‐CS Joint Conference on Digital Libraries</italic>
</source>
,
<publisher-name>ACM</publisher-name>
,
<publisher-loc>New York, NY</publisher-loc>
, pp.
<fpage>372</fpage>
<x></x>
<lpage>3</lpage>
.</mixed-citation>
</ref>
<ref id="b22">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Petinot</surname>
,
<given-names>Y.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Giles</surname>
,
<given-names>C.L.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Bhatnagar</surname>
,
<given-names>V.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Teregowda</surname>
,
<given-names>P.B.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Han</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Councill</surname>
,
<given-names>I.</given-names>
</string-name>
</person-group>
(
<year>2004b</year>
), “
<article-title>
<italic>CiteSeer‐API: towards seamless resource location and interlinking for digital libraries</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 13th ACM International Conference on Information and Knowledge Management</italic>
</source>
,
<publisher-name>ACM</publisher-name>
,
<publisher-loc>New York, NY</publisher-loc>
, pp.
<fpage>553</fpage>
<x></x>
<lpage>61</lpage>
.</mixed-citation>
</ref>
<ref id="b23">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Reuther</surname>
,
<given-names>P.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Walter</surname>
,
<given-names>B.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Survey on test collections and techniques for personal name matching</italic>
</article-title>
”,
<source>
<italic>International Journal of Metadata, Semantics and Ontologies</italic>
</source>
, Vol.
<volume>1</volume>
No.
<issue>2</issue>
, pp.
<fpage>89</fpage>
<x></x>
<lpage>99</lpage>
.</mixed-citation>
</ref>
<ref id="b24">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Salton</surname>
,
<given-names>G.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Singhal</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Mitra</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Buckley</surname>
,
<given-names>C.</given-names>
</string-name>
</person-group>
(
<year>1997</year>
), “
<article-title>
<italic>Automatic text structuring and summarization</italic>
</article-title>
”,
<source>
<italic>Information Processing & Management</italic>
</source>
, Vol.
<volume>33</volume>
No.
<issue>2</issue>
, pp.
<fpage>193</fpage>
<x></x>
<lpage>207</lpage>
.</mixed-citation>
</ref>
<ref id="b25">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Suleman</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Fox</surname>
,
<given-names>E.A.</given-names>
</string-name>
</person-group>
(
<year>2002</year>
), “
<article-title>
<italic>Designing protocols in support of digital library componentization</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science</italic>
</source>
, Vol.
<volume>2458</volume>
,
<publisher-name>Springer</publisher-name>
,
<publisher-loc>London</publisher-loc>
, pp.
<fpage>568</fpage>
<x></x>
<lpage>82</lpage>
.</mixed-citation>
</ref>
<ref id="b26">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Tryfonopoulos</surname>
,
<given-names>C.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Idreos</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Koubarakis</surname>
,
<given-names>M.</given-names>
</string-name>
</person-group>
(
<year>2005</year>
), “
<article-title>
<italic>LibraRing: an architecture for distributed digital libraries based on DHT</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), Vienna, Austria</italic>
</source>
, 18‐23 September, available at:
<ext-link ext-link-type="uri" xlink:href="http://www.mpi-inf.mpg.de/~trifon/papers/pdf/ecdl05-TIK.pdf">www.mpi-inf.mpg.de/~trifon/papers/pdf/ecdl05-TIK.pdf</ext-link>
.</mixed-citation>
</ref>
<ref id="b27">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>van Raan</surname>
,
<given-names>A.</given-names>
</string-name>
</person-group>
(
<year>1997</year>
), “
<article-title>
<italic>Scientometrics: state‐of‐the‐art</italic>
</article-title>
”,
<source>
<italic>Scientometrics</italic>
</source>
, Vol.
<volume>38</volume>
No.
<issue>1</issue>
, pp.
<fpage>205</fpage>
<x></x>
<lpage>18</lpage>
.</mixed-citation>
</ref>
<ref id="b28">
<mixed-citation>
<person-group person-group-type="author">
<string-name>
<surname>Yang</surname>
,
<given-names>H.</given-names>
</string-name>
</person-group>
,
<person-group person-group-type="author">
<string-name>
<surname>Callan</surname>
,
<given-names>J.</given-names>
</string-name>
</person-group>
and
<person-group person-group-type="author">
<string-name>
<surname>Shulman</surname>
,
<given-names>S.</given-names>
</string-name>
</person-group>
(
<year>2006</year>
), “
<article-title>
<italic>Next steps in near‐duplicate detection for eRulemaking</italic>
</article-title>
”,
<source>
<italic>Proceedings of the 2006 National Conference on Digital Government Research, ACM International Conference Proceeding Series</italic>
</source>
, Vol.
<volume>151</volume>
,
<publisher-name>ACM</publisher-name>
,
<publisher-loc>New York, NY</publisher-loc>
, pp.
<fpage>239</fpage>
<x></x>
<lpage>48</lpage>
.</mixed-citation>
</ref>
</ref-list>
<app-group>
<app id="APP1">
<title>Corresponding author</title>
<p>Maurizio Marchese can be contacted at: marchese@dit.unitn.it</p>
</app>
</app-group>
</back>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo lang="en">
<title>ScienceTreks: an autonomous digital library system</title>
</titleInfo>
<titleInfo type="alternative" lang="en" contentType="CDATA">
<title>ScienceTreks: an autonomous digital library system</title>
</titleInfo>
<name type="personal">
<namePart type="given">A.R.D.</namePart>
<namePart type="family">Prasad</namePart>
</name>
<name type="personal">
<namePart type="given">A.R.D.</namePart>
<namePart type="family">Prasad</namePart>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alexander</namePart>
<namePart type="family">Ivanyukovich</namePart>
<affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maurizio</namePart>
<namePart type="family">Marchese</namePart>
<affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fausto</namePart>
<namePart type="family">Giunchiglia</namePart>
<affiliation>Department of Information and Communication Technology, University of Trento, Trento, Italy</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article"></genre>
<originInfo>
<publisher>Emerald Group Publishing Limited</publisher>
<dateIssued encoding="w3cdtf">2008-08-08</dateIssued>
<copyrightDate encoding="w3cdtf">2008</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<physicalDescription>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract>Purpose: The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content. Design/methodology/approach: The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed. Findings: The proposed pipeline is implemented in a working prototype of an autonomous digital library (ADL) system called ScienceTreks that supports a broad range of methods for document acquisition; does not rely on any external information sources and is solely based on the existing information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (API) to support easy integration of external systems and tools in the existing pipeline. Practical implications: The proposed ADL system can be used in automating end-to-end information retrieval and processing, supporting the control and elimination of error-prone human intervention in the process. Originality/value: High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value-added services within existing and future semantic-enabled digital library systems.</abstract>
<subject>
<genre>keywords</genre>
<topic>Digital libraries</topic>
<topic>Information retrieval</topic>
<topic>Library systems</topic>
<topic>Automation</topic>
</subject>
<relatedItem type="host">
<titleInfo>
<title>Online Information Review</title>
</titleInfo>
<genre type="journal">journal</genre>
<subject>
<genre>Emerald Subject Group</genre>
<topic authority="SubjectCodesPrimary" authorityURI="cat-IKM">Information &amp; knowledge management</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-ICT">Information &amp; communications technology</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-INT">Internet</topic>
</subject>
<subject>
<genre>Emerald Subject Group</genre>
<topic authority="SubjectCodesPrimary" authorityURI="cat-LISC">Library &amp; information science</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-CBM">Collection building &amp; management</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-IBRT">Information behaviour &amp; retrieval</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-RMP">Records management &amp; preservation</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-BIB">Bibliometrics</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-DAT">Databases</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-DOCM">Document management</topic>
</subject>
<identifier type="ISSN">1468-4527</identifier>
<identifier type="PublisherID">oir</identifier>
<identifier type="DOI">10.1108/oir</identifier>
<part>
<date>2008</date>
<detail type="title">
<title>The Semantic Web and Web Design</title>
</detail>
<detail type="volume">
<caption>vol.</caption>
<number>32</number>
</detail>
<detail type="issue">
<caption>no.</caption>
<number>4</number>
</detail>
<extent unit="pages">
<start>488</start>
<end>499</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">8DB9DDC074E1353C99D633F07A936EE6057B2B99</identifier>
<identifier type="DOI">10.1108/14684520810897368</identifier>
<identifier type="filenameID">2640320403</identifier>
<identifier type="original-pdf">2640320403.pdf</identifier>
<identifier type="href">14684520810897368.pdf</identifier>
<accessCondition type="use and reproduction" contentType="copyright">© Emerald Group Publishing Limited</accessCondition>
<recordInfo>
<recordContentSource>EMERALD</recordContentSource>
</recordInfo>
</mods>
</metadata>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Rhénanie/explor/UnivTrevesV1/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001B89 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 001B89 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Rhénanie
   |area=    UnivTrevesV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:8DB9DDC074E1353C99D633F07A936EE6057B2B99
   |texte=   ScienceTreks an autonomous digital library system
}}


This area was generated with Dilib version V0.6.31.
Data generation: Sat Jul 22 16:29:01 2017. Site generation: Wed Feb 28 14:55:37 2024