Strategic information system and agriculture (exploration server)

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Learning a concept‐based document similarity measure

Internal identifier: 000716 (Istex/Curation); previous: 000715; next: 000717

Authors: Lan Huang [New Zealand]; David Milne [New Zealand]; Eibe Frank [New Zealand]; Ian H. Witten [New Zealand]

Source:

RBID : ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4

Abstract

Document similarity measures are crucial components of many text‐analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine‐learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
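
The idea in the abstract, scoring a document pair at both the lexical and the semantic level and learning from human ratings how to weight the two, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy `CONCEPTS` lexicon, the made-up `PAIRS` and `RATINGS`, and the one-parameter grid search in `learn_alpha`, which stands in for the machine-learning techniques that the abstract does not specify. It is not the authors' method.

```python
import math
from collections import Counter

# Hypothetical toy word-to-concept lexicon (illustration only).
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "cat": "animal", "dog": "animal", "kitten": "animal",
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexical_sim(d1, d2):
    """Surface overlap: cosine over raw word counts."""
    return cosine(Counter(d1.lower().split()), Counter(d2.lower().split()))

def concept_sim(d1, d2):
    """Semantic overlap: cosine over counts of the concepts the words map to."""
    c1 = Counter(CONCEPTS[w] for w in d1.lower().split() if w in CONCEPTS)
    c2 = Counter(CONCEPTS[w] for w in d2.lower().split() if w in CONCEPTS)
    return cosine(c1, c2)

def learn_alpha(pairs, ratings, steps=100):
    """Grid-search the mixing weight that minimises squared error to the ratings."""
    feats = [(lexical_sim(a, b), concept_sim(a, b)) for a, b in pairs]

    def sq_error(alpha):
        return sum((alpha * lex + (1 - alpha) * sem - y) ** 2
                   for (lex, sem), y in zip(feats, ratings))

    return min((i / steps for i in range(steps + 1)), key=sq_error)

def combined_sim(d1, d2, alpha):
    """Weighted combination of the lexical and concept-level scores."""
    return alpha * lexical_sim(d1, d2) + (1 - alpha) * concept_sim(d1, d2)

# Made-up document pairs with made-up human similarity ratings in [0, 1].
PAIRS = [
    ("the car stopped", "the automobile stopped"),
    ("the cat slept", "the truck stopped"),
    ("a dog barked", "a kitten purred"),
]
RATINGS = [0.9, 0.2, 0.6]

ALPHA = learn_alpha(PAIRS, RATINGS)
```

In this toy setting, "car" and "automobile" share no words, so `lexical_sim` gives 0 while `concept_sim` gives 1; the learned `ALPHA` decides how much each signal counts in `combined_sim`.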

Url:
DOI: 10.1002/asi.22689

Links to Exploration step

ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Learning a concept‐based document similarity measure</title>
<author>
<name sortKey="Huang, Lan" sort="Huang, Lan" uniqKey="Huang L" first="Lan" last="Huang">Lan Huang</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: lh92@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author>
<name sortKey="Milne, David" sort="Milne, David" uniqKey="Milne D" first="David" last="Milne">David Milne</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: dnk2@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author>
<name sortKey="Frank, Eibe" sort="Frank, Eibe" uniqKey="Frank E" first="Eibe" last="Frank">Eibe Frank</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: eibe@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author>
<name sortKey="Witten, Ian H" sort="Witten, Ian H" uniqKey="Witten I" first="Ian H." last="Witten">Ian H. Witten</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: ihw@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4</idno>
<date when="2012" year="2012">2012</date>
<idno type="doi">10.1002/asi.22689</idno>
<idno type="url">https://api.istex.fr/document/64E99E686DBD7D98165109C6DB70FBB2082596C4/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000752</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000752</idno>
<idno type="wicri:Area/Istex/Curation">000716</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Learning a concept‐based document similarity measure</title>
<author>
<name sortKey="Huang, Lan" sort="Huang, Lan" uniqKey="Huang L" first="Lan" last="Huang">Lan Huang</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: lh92@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author>
<name sortKey="Milne, David" sort="Milne, David" uniqKey="Milne D" first="David" last="Milne">David Milne</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: dnk2@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author>
<name sortKey="Frank, Eibe" sort="Frank, Eibe" uniqKey="Frank E" first="Eibe" last="Frank">Eibe Frank</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: eibe@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author>
<name sortKey="Witten, Ian H" sort="Witten, Ian H" uniqKey="Witten I" first="Ian H." last="Witten">Ian H. Witten</name>
<affiliation wicri:level="1">
<mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: ihw@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Journal of the American Society for Information Science and Technology</title>
<title level="j" type="abbrev">J Am Soc Inf Sci Tec</title>
<idno type="ISSN">1532-2882</idno>
<idno type="eISSN">1532-2890</idno>
<imprint>
<publisher>Blackwell Publishing Ltd</publisher>
<date type="published" when="2012-08">2012-08</date>
<biblScope unit="volume">63</biblScope>
<biblScope unit="issue">8</biblScope>
<biblScope unit="page" from="1593">1593</biblScope>
<biblScope unit="page" to="1608">1608</biblScope>
</imprint>
<idno type="ISSN">1532-2882</idno>
</series>
<idno type="istex">64E99E686DBD7D98165109C6DB70FBB2082596C4</idno>
<idno type="DOI">10.1002/asi.22689</idno>
<idno type="ArticleID">ASI22689</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">1532-2882</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract">Document similarity measures are crucial components of many text‐analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine‐learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.</div>
</front>
</TEI>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Agronomie/explor/SisAgriV1/Data/Istex/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000716 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Curation/biblio.hfd -nk 000716 | SxmlIndent | more
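
Outside Dilib, the XML record shown on this page can also be read with a general-purpose parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the record. One caveat: the record uses the `wicri:` and `mods:` prefixes without declaring them, which a namespace-aware parser will normally reject, so the sketch injects declarations on the root element first. The `urn:example:...` URIs and the helper name `parse_record` are placeholders assumed for this example, not the official namespaces or any Dilib API.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record above (most attributes and elements trimmed).
RECORD = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Learning a concept‐based document similarity measure</title>
<author>
<name sortKey="Huang, Lan">Lan Huang</name>
<affiliation wicri:level="1">
<mods:affiliation>E-mail: lh92@cs.waikato.ac.nz</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Milne, David">David Milne</name></author>
<author><name sortKey="Frank, Eibe">Eibe Frank</name></author>
<author><name sortKey="Witten, Ian H">Ian H. Witten</name></author>
</titleStmt>
<publicationStmt>
<idno type="doi">10.1002/asi.22689</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URIs (assumptions): they only make the prefixes
# resolvable so that the parser accepts the record.
NS_DECLS = ' xmlns:wicri="urn:example:wicri" xmlns:mods="urn:example:mods"'

def parse_record(text):
    """Declare the missing prefixes on the root element, then parse."""
    patched = text.replace("<record>", "<record" + NS_DECLS + ">", 1)
    return ET.fromstring(patched)

root = parse_record(RECORD)

# Unprefixed elements stay in no namespace, so plain tag names work in paths.
title = root.findtext(".//titleStmt/title")
authors = [name.text for name in root.findall(".//titleStmt/author/name")]
doi = root.findtext(".//idno[@type='doi']")

print(title)
print("; ".join(authors))
print(doi)
```

The same paths should apply to the full record, since the abridged copy keeps its element nesting; only the text passed to `parse_record` changes.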

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Agronomie
   |area=    SisAgriV1
   |flux=    Istex
   |étape=   Curation
   |type=    RBID
   |clé=     ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4
   |texte=   Learning a concept‐based document similarity measure
}}

This area was generated with Dilib version V0.6.28.
Data generation: Wed Mar 29 00:06:34 2017. Site generation: Tue Mar 12 12:44:16 2024