Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A segmentation method for bibliographic references by contextual tagging of fields

Internal identifier: 007991 (Main/Merge); previous: 007990; next: 007992

Authors: Dominique Besagni; Abdel Belaïd [France]; Nelly Benet

Source:

RBID : CRIN:besagni03a

English descriptors

Abstract

In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structures are heterogeneous, the method works bottom-up, without an a priori model, gathering structural elements from basic tags into sub-fields and fields. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabelled tokens are integrated into one field or another either by applying PoS correction rules or by using a structure model generated from well-detected records. The prototype performs satisfactorily on different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of references are completely segmented, out of 2500 references.
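
The bottom-up idea described in the abstract can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the tag set (`YEAR`, `INITIAL`, `CAP`, `WORD`, `PUNCT`), the mapping from tags to fields, and the two attachment rules are all hypothetical, since the abstract does not give the paper's actual tags, correction rules, or structure model.

```python
import re
from itertools import groupby

# Hypothetical tag set and rules, for illustration only; they are not the
# tags or correction rules used in the paper.

def tokenize(reference):
    """Split a reference into initials ("D."), words/numbers, and punctuation."""
    return re.findall(r"[A-Z]\.|\w+|[^\w\s]", reference)

def tag_token(token):
    """Assign one coarse tag to a single token."""
    if re.fullmatch(r"(19|20)\d{2}", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if token in {",", ".", ":", ";"}:
        return "PUNCT"
    if token[0].isupper():
        return "CAP"
    return "WORD"

def label_tokens(tokens):
    """Map tags to field labels, then attach the tokens left unlabelled."""
    tags = [tag_token(t) for t in tokens]
    labels = [None] * len(tokens)
    # Pass 1: significant tags (and one contextual rule) seed the fields.
    for i, tag in enumerate(tags):
        if tag == "INITIAL":
            labels[i] = "authors"
        elif tag == "YEAR":
            labels[i] = "date"
        elif tag == "WORD":
            labels[i] = "title"
        elif tag == "CAP" and i > 0 and tags[i - 1] == "INITIAL":
            labels[i] = "authors"  # capitalised word right after an initial
    # Pass 2: punctuation joins the preceding field; any other unlabelled
    # token joins the next labelled field, or the preceding one if none.
    for i, tag in enumerate(tags):
        if labels[i] is not None:
            continue
        if tag == "PUNCT" and i > 0:
            labels[i] = labels[i - 1]
        else:
            following = next((lab for lab in labels[i + 1:] if lab), None)
            labels[i] = following or (labels[i - 1] if i > 0 else "title")
    return labels

def segment(reference):
    """Group consecutive tokens that share a label into (field, text) pairs."""
    tokens = tokenize(reference)
    labels = label_tokens(tokens)
    return [
        (label, " ".join(tok for _, tok in group))
        for label, group in groupby(zip(labels, tokens), key=lambda p: p[0])
    ]

if __name__ == "__main__":
    ref = "D. Besagni, A. Belaid. A segmentation method for references. 2003"
    for field, text in segment(ref):
        print(field, "->", text)
```

The sketch only shows the general shape of the approach: tags are assigned token by token, a few reliable tags seed the fields, and the remaining tokens are absorbed by their neighbours. A real system would need a far richer tag set and would have to cope with OCR errors.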

Links to Exploration step

CRIN:besagni03a

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" wicri:score="406">A segmentation method for bibliographic references by contextual tagging of fields</title>
</titleStmt>
<publicationStmt>
<idno type="RBID">CRIN:besagni03a</idno>
<date when="2003" year="2003">2003</date>
<idno type="wicri:Area/Crin/Corpus">003934</idno>
<idno type="wicri:Area/Crin/Curation">003934</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Curation">003934</idno>
<idno type="wicri:Area/Crin/Checkpoint">000B44</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Checkpoint">000B44</idno>
<idno type="wicri:Area/Main/Merge">007991</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A segmentation method for bibliographic references by contextual tagging of fields</title>
<author>
<name sortKey="Besagni, Dominique" sort="Besagni, Dominique" uniqKey="Besagni D" first="Dominique" last="Besagni">Dominique Besagni</name>
</author>
<author>
<name sortKey="Belaid, Abdel" sort="Belaid, Abdel" uniqKey="Belaid A" first="Abdel" last="Belaid">Abdel Belaïd</name>
<affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
<author>
<name sortKey="Benet, Nelly" sort="Benet, Nelly" uniqKey="Benet N" first="Nelly" last="Benet">Nelly Benet</name>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>citation</term>
<term>part of speech tagging</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en" wicri:score="1934">In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structures are heterogeneous, the method works bottom-up, without an a priori model, gathering structural elements from basic tags into sub-fields and fields. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabelled tokens are integrated into one field or another either by applying PoS correction rules or by using a structure model generated from well-detected records. The prototype performs satisfactorily on different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of references are completely segmented, out of 2500 references.</div>
</front>
</TEI>
</record>
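
As an illustration of how such a record might be read outside Dilib, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The record as displayed uses the `wicri:` prefix without declaring it, so a strict parser reports an unbound prefix. The sketch binds the prefix to a placeholder URI, `urn:example:wicri`, which is an assumption and not the real Wicri namespace. The `SAMPLE` string is an abridged copy of the record above, and `parse_record` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record shown above, kept short for the example.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" wicri:score="406">A segmentation method for bibliographic references by contextual tagging of fields</title>
</titleStmt>
<publicationStmt>
<idno type="RBID">CRIN:besagni03a</idno>
<date when="2003" year="2003">2003</date>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A segmentation method for bibliographic references by contextual tagging of fields</title>
<author><name>Dominique Besagni</name></author>
<author><name>Abdel Belaïd</name></author>
<author><name>Nelly Benet</name></author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>citation</term>
<term>part of speech tagging</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""

def parse_record(xml_text):
    """Extract title, authors, year, and keywords from one record."""
    # Assumption: bind the undeclared "wicri:" prefix to a placeholder URI
    # so that the parser accepts attributes such as wicri:score.
    patched = xml_text.replace(
        "<record>", '<record xmlns:wicri="urn:example:wicri">', 1
    )
    root = ET.fromstring(patched)
    analytic = root.find(".//analytic")
    return {
        "title": analytic.findtext("title"),
        "authors": [name.text for name in analytic.iter("name")],
        "year": root.find(".//publicationStmt/date").get("when"),
        "keywords": [term.text for term in root.iter("term")],
    }

if __name__ == "__main__":
    for key, value in parse_record(SAMPLE).items():
        print(key, ":", value)
```

If the stored file does declare the `wicri` namespace, the patching step is unnecessary and `ET.fromstring` can be called on the text directly.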

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 007991 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 007991 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     CRIN:besagni03a
   |texte=   A segmentation method for bibliographic references by contextual tagging of fields
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022