Hippolyte Bernheim exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Supervised Data Extraction

Internal identifier: 000470 (Main/Exploration); previous: 000469; next: 000471

Authors: N. Georgiev [France]; J. M. Labat [France]; Jean-Luc Minel [France]; L. Nicolas [France]

Source :

RBID : Hal:halshs-00121758

French descriptors

English descriptors

Abstract

The process of extracting data from internet sources has attracted the
interest of the scientific community in recent years. However, there
are still no well-established standards because of the heterogeneous nature of
the information on the global network. Nevertheless, the sources do have one thing in
common: all the data is available in HTML format for compatibility reasons.
This article presents our methodology and the prototype system we have created
to extract data from HTML pages. We use XPath as the data extraction language
and have developed a methodology for visual wrapper generation. Our
approach takes advantage of the implicit correlation between the data and the
surrounding structure. Some evaluation tests are also given in order to justify our
methods.
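
To make the idea of XPath-based extraction concrete, here is a minimal sketch. It is not the authors' prototype: the page content, the `WRAPPER` dictionary, and the `extract` function are hypothetical names chosen for illustration, and the code relies on the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed markup. A "wrapper" is modelled here simply as one path that locates each record plus one relative path per field, which reflects the idea that the data correlates with its surrounding structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page. Real-world HTML is often not
# well-formed and would need a tolerant parser before XPath can be applied.
PAGE = """
<html>
  <body>
    <table id="products">
      <tr class="item"><td class="name">Lamp</td><td class="price">12.50</td></tr>
      <tr class="item"><td class="name">Chair</td><td class="price">40.00</td></tr>
    </table>
  </body>
</html>
"""

# Assumed wrapper shape: a path selecting each record node, and one
# path per field evaluated relative to that record node.
WRAPPER = {
    "record": ".//table[@id='products']/tr[@class='item']",
    "fields": {"name": "td[@class='name']", "price": "td[@class='price']"},
}


def extract(page, wrapper):
    """Apply the wrapper's paths to the page and return one dict per record."""
    root = ET.fromstring(page)
    rows = []
    for node in root.findall(wrapper["record"]):
        rows.append(
            {field: node.findtext(path) for field, path in wrapper["fields"].items()}
        )
    return rows


records = extract(PAGE, WRAPPER)
```

In this sketch the paths are written by hand; a visual wrapper generator, as described in the abstract, would presumably derive such paths from elements a user selects on the rendered page.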


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Supervised Data Extraction</title>
<author>
<name sortKey="Georgiev, N" sort="Georgiev, N" uniqKey="Georgiev N" first="N." last="Georgiev">N. Georgiev</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-16654" status="OLD">
<idno type="RNSR">199814046F</idno>
<orgName>Centre de Recherche en Informatique de Paris 5</orgName>
<orgName type="acronym">CRIP5 - EA 2517</orgName>
<date type="start">2002-01-01</date>
<date type="end">2009-12-31</date>
<desc>
<address>
<addrLine>45, rue des Saints Pères 75270 Paris Cedex 06</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation name="EA 2517" active="#struct-301664" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA 2517" active="#struct-301664" type="direct">
<org type="institution" xml:id="struct-301664" status="VALID">
<idno type="IdRef">026404788</idno>
<idno type="ISNI">0000 0001 2188 0914 </idno>
<orgName>Université Paris Descartes - Paris 5</orgName>
<orgName type="acronym">UPD5</orgName>
<desc>
<address>
<addrLine>12, rue de l'École de Médecine - 75270 Paris cedex 06</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisdescartes.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Labat, J M" sort="Labat, J M" uniqKey="Labat J" first="J. M." last="Labat">J. M. Labat</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-16654" status="OLD">
<idno type="RNSR">199814046F</idno>
<orgName>Centre de Recherche en Informatique de Paris 5</orgName>
<orgName type="acronym">CRIP5 - EA 2517</orgName>
<date type="start">2002-01-01</date>
<date type="end">2009-12-31</date>
<desc>
<address>
<addrLine>45, rue des Saints Pères 75270 Paris Cedex 06</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation name="EA 2517" active="#struct-301664" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA 2517" active="#struct-301664" type="direct">
<org type="institution" xml:id="struct-301664" status="VALID">
<idno type="IdRef">026404788</idno>
<idno type="ISNI">0000 0001 2188 0914 </idno>
<orgName>Université Paris Descartes - Paris 5</orgName>
<orgName type="acronym">UPD5</orgName>
<desc>
<address>
<addrLine>12, rue de l'École de Médecine - 75270 Paris cedex 06</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisdescartes.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Minel, Jean Luc" sort="Minel, Jean Luc" uniqKey="Minel J" first="Jean-Luc" last="Minel">Jean-Luc Minel</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-1057" status="VALID">
<idno type="RNSR">200112501N</idno>
<orgName>Modèles, Dynamiques, Corpus</orgName>
<orgName type="acronym">MoDyCo</orgName>
<date type="start">2001</date>
<desc>
<address>
<addrLine>Université Paris 10 Bâtiment A - Bureau 402 A 200, avenue de la République 92001 Nanterre Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.modyco.fr</ref>
</desc>
<listRelation>
<relation name="UMR7114" active="#struct-116205" type="direct"></relation>
<relation name="UMR7114" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="UMR7114" active="#struct-116205" type="direct">
<org type="institution" xml:id="struct-116205" status="VALID">
<idno type="IdRef">026403587</idno>
<orgName>Université Paris Nanterre</orgName>
<orgName type="acronym">UPN</orgName>
<date type="start">1970</date>
<desc>
<address>
<addrLine>200 avenue de la République 92001 Nanterre cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisnanterre.fr</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR7114" active="#struct-441569" type="direct">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Nicolas, L" sort="Nicolas, L" uniqKey="Nicolas L" first="L." last="Nicolas">L. Nicolas</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-16654" status="OLD">
<idno type="RNSR">199814046F</idno>
<orgName>Centre de Recherche en Informatique de Paris 5</orgName>
<orgName type="acronym">CRIP5 - EA 2517</orgName>
<date type="start">2002-01-01</date>
<date type="end">2009-12-31</date>
<desc>
<address>
<addrLine>45, rue des Saints Pères 75270 Paris Cedex 06</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation name="EA 2517" active="#struct-301664" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA 2517" active="#struct-301664" type="direct">
<org type="institution" xml:id="struct-301664" status="VALID">
<idno type="IdRef">026404788</idno>
<idno type="ISNI">0000 0001 2188 0914 </idno>
<orgName>Université Paris Descartes - Paris 5</orgName>
<orgName type="acronym">UPD5</orgName>
<desc>
<address>
<addrLine>12, rue de l'École de Médecine - 75270 Paris cedex 06</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisdescartes.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:halshs-00121758</idno>
<idno type="halId">halshs-00121758</idno>
<idno type="halUri">https://halshs.archives-ouvertes.fr/halshs-00121758</idno>
<idno type="url">https://halshs.archives-ouvertes.fr/halshs-00121758</idno>
<date when="2005">2005</date>
<idno type="wicri:Area/Hal/Corpus">000140</idno>
<idno type="wicri:Area/Hal/Curation">000140</idno>
<idno type="wicri:Area/Hal/Checkpoint">000211</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">000211</idno>
<idno type="wicri:Area/Main/Merge">000474</idno>
<idno type="wicri:Area/Main/Curation">000470</idno>
<idno type="wicri:Area/Main/Exploration">000470</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Supervised Data Extraction</title>
<author>
<name sortKey="Georgiev, N" sort="Georgiev, N" uniqKey="Georgiev N" first="N." last="Georgiev">N. Georgiev</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-16654" status="OLD">
<idno type="RNSR">199814046F</idno>
<orgName>Centre de Recherche en Informatique de Paris 5</orgName>
<orgName type="acronym">CRIP5 - EA 2517</orgName>
<date type="start">2002-01-01</date>
<date type="end">2009-12-31</date>
<desc>
<address>
<addrLine>45, rue des Saints Pères 75270 Paris Cedex 06</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation name="EA 2517" active="#struct-301664" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA 2517" active="#struct-301664" type="direct">
<org type="institution" xml:id="struct-301664" status="VALID">
<idno type="IdRef">026404788</idno>
<idno type="ISNI">0000 0001 2188 0914 </idno>
<orgName>Université Paris Descartes - Paris 5</orgName>
<orgName type="acronym">UPD5</orgName>
<desc>
<address>
<addrLine>12, rue de l'École de Médecine - 75270 Paris cedex 06</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisdescartes.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Labat, J M" sort="Labat, J M" uniqKey="Labat J" first="J. M." last="Labat">J. M. Labat</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-16654" status="OLD">
<idno type="RNSR">199814046F</idno>
<orgName>Centre de Recherche en Informatique de Paris 5</orgName>
<orgName type="acronym">CRIP5 - EA 2517</orgName>
<date type="start">2002-01-01</date>
<date type="end">2009-12-31</date>
<desc>
<address>
<addrLine>45, rue des Saints Pères 75270 Paris Cedex 06</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation name="EA 2517" active="#struct-301664" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA 2517" active="#struct-301664" type="direct">
<org type="institution" xml:id="struct-301664" status="VALID">
<idno type="IdRef">026404788</idno>
<idno type="ISNI">0000 0001 2188 0914 </idno>
<orgName>Université Paris Descartes - Paris 5</orgName>
<orgName type="acronym">UPD5</orgName>
<desc>
<address>
<addrLine>12, rue de l'École de Médecine - 75270 Paris cedex 06</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisdescartes.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Minel, Jean Luc" sort="Minel, Jean Luc" uniqKey="Minel J" first="Jean-Luc" last="Minel">Jean-Luc Minel</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-1057" status="VALID">
<idno type="RNSR">200112501N</idno>
<orgName>Modèles, Dynamiques, Corpus</orgName>
<orgName type="acronym">MoDyCo</orgName>
<date type="start">2001</date>
<desc>
<address>
<addrLine>Université Paris 10 Bâtiment A - Bureau 402 A 200, avenue de la République 92001 Nanterre Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.modyco.fr</ref>
</desc>
<listRelation>
<relation name="UMR7114" active="#struct-116205" type="direct"></relation>
<relation name="UMR7114" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="UMR7114" active="#struct-116205" type="direct">
<org type="institution" xml:id="struct-116205" status="VALID">
<idno type="IdRef">026403587</idno>
<orgName>Université Paris Nanterre</orgName>
<orgName type="acronym">UPN</orgName>
<date type="start">1970</date>
<desc>
<address>
<addrLine>200 avenue de la République 92001 Nanterre cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisnanterre.fr</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR7114" active="#struct-441569" type="direct">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Nicolas, L" sort="Nicolas, L" uniqKey="Nicolas L" first="L." last="Nicolas">L. Nicolas</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-16654" status="OLD">
<idno type="RNSR">199814046F</idno>
<orgName>Centre de Recherche en Informatique de Paris 5</orgName>
<orgName type="acronym">CRIP5 - EA 2517</orgName>
<date type="start">2002-01-01</date>
<date type="end">2009-12-31</date>
<desc>
<address>
<addrLine>45, rue des Saints Pères 75270 Paris Cedex 06</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation name="EA 2517" active="#struct-301664" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA 2517" active="#struct-301664" type="direct">
<org type="institution" xml:id="struct-301664" status="VALID">
<idno type="IdRef">026404788</idno>
<idno type="ISNI">0000 0001 2188 0914 </idno>
<orgName>Université Paris Descartes - Paris 5</orgName>
<orgName type="acronym">UPD5</orgName>
<desc>
<address>
<addrLine>12, rue de l'École de Médecine - 75270 Paris cedex 06</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.parisdescartes.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Data Mining</term>
<term>Information Extraction</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>Extraction d'information</term>
<term>Patron d'extraction</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The process of extracting data from internet sources has attracted the
interest of the scientific community in recent years. However, there
are still no well-established standards because of the heterogeneous nature of
the information on the global network. Nevertheless, the sources do have one thing in
common: all the data is available in HTML format for compatibility reasons.
This article presents our methodology and the prototype system we have created
to extract data from HTML pages. We use XPath as the data extraction language
and have developed a methodology for visual wrapper generation. Our
approach takes advantage of the implicit correlation between the data and the
surrounding structure. Some evaluation tests are also given in order to justify our
methods.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Georgiev, N" sort="Georgiev, N" uniqKey="Georgiev N" first="N." last="Georgiev">N. Georgiev</name>
</noRegion>
<name sortKey="Labat, J M" sort="Labat, J M" uniqKey="Labat J" first="J. M." last="Labat">J. M. Labat</name>
<name sortKey="Minel, Jean Luc" sort="Minel, Jean Luc" uniqKey="Minel J" first="Jean-Luc" last="Minel">Jean-Luc Minel</name>
<name sortKey="Nicolas, L" sort="Nicolas, L" uniqKey="Nicolas L" first="L." last="Nicolas">L. Nicolas</name>
</country>
</tree>
</affiliations>
</record>
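
As an illustration of how such a record might be read programmatically, the following sketch pulls the title, author names, and keywords out of a simplified fragment of the record above. The fragment, the `summarize` function, and the shape of its return value are assumptions made for this example. Note that the full record uses the `wicri:` and `hal:` prefixes without declaring them, so a strict XML parser would likely reject it until those prefixes are declared or stripped; the fragment below leaves them out.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Simplified fragment of the record: affiliations and the undeclared
# wicri:/hal: prefixed elements and attributes are omitted on purpose.
FRAGMENT = """
<record>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title xml:lang="en">Supervised Data Extraction</title>
          <author><name first="N." last="Georgiev">N. Georgiev</name></author>
          <author><name first="Jean-Luc" last="Minel">Jean-Luc Minel</name></author>
        </titleStmt>
      </fileDesc>
      <profileDesc>
        <textClass>
          <keywords scheme="mix" xml:lang="en">
            <term>Data Mining</term>
            <term>Information Extraction</term>
          </keywords>
          <keywords scheme="mix" xml:lang="fr">
            <term>Extraction d'information</term>
            <term>Patron d'extraction</term>
          </keywords>
        </textClass>
      </profileDesc>
    </teiHeader>
  </TEI>
</record>
"""


def summarize(xml_text):
    """Return the title, author names, and keywords grouped by language."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//titleStmt/title")
    authors = [name.text for name in root.findall(".//titleStmt/author/name")]
    keywords = {}
    for group in root.findall(".//keywords"):
        lang = group.get(XML_NS + "lang")
        keywords[lang] = [term.text for term in group.findall("term")]
    return {"title": title, "authors": authors, "keywords": keywords}


summary = summarize(FRAGMENT)
```

Restricting the author path to `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.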

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Psychologie/explor/BernheimV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000470 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000470 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Psychologie
   |area=    BernheimV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:halshs-00121758
   |texte=   Supervised Data Extraction
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Mar 5 17:33:33 2018. Site generation: Thu Apr 29 15:49:51 2021