Learning a concept‐based document similarity measure
Internal identifier: 000716 (Istex/Curation); previous: 000715; next: 000717
Authors: Lan Huang [New Zealand]; David Milne [New Zealand]; Eibe Frank [New Zealand]; Ian H. Witten [New Zealand]
Source:
- Journal of the American Society for Information Science and Technology [ 1532-2882 ] ; 2012-08.
Abstract
Document similarity measures are crucial components of many text‐analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine‐learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
DOI: 10.1002/asi.22689
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: to go to this record in the Curation step: 000752
Links to Exploration step
ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Learning a concept‐based document similarity measure</title>
<author><name sortKey="Huang, Lan" sort="Huang, Lan" uniqKey="Huang L" first="Lan" last="Huang">Lan Huang</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: lh92@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author><name sortKey="Milne, David" sort="Milne, David" uniqKey="Milne D" first="David" last="Milne">David Milne</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: dnk2@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author><name sortKey="Frank, Eibe" sort="Frank, Eibe" uniqKey="Frank E" first="Eibe" last="Frank">Eibe Frank</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: eibe@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author><name sortKey="Witten, Ian H" sort="Witten, Ian H" uniqKey="Witten I" first="Ian H." last="Witten">Ian H. Witten</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: ihw@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4</idno>
<date when="2012" year="2012">2012</date>
<idno type="doi">10.1002/asi.22689</idno>
<idno type="url">https://api.istex.fr/document/64E99E686DBD7D98165109C6DB70FBB2082596C4/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000752</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000752</idno>
<idno type="wicri:Area/Istex/Curation">000716</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Learning a concept‐based document similarity measure</title>
<author><name sortKey="Huang, Lan" sort="Huang, Lan" uniqKey="Huang L" first="Lan" last="Huang">Lan Huang</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: lh92@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author><name sortKey="Milne, David" sort="Milne, David" uniqKey="Milne D" first="David" last="Milne">David Milne</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: dnk2@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author><name sortKey="Frank, Eibe" sort="Frank, Eibe" uniqKey="Frank E" first="Eibe" last="Frank">Eibe Frank</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: eibe@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
<author><name sortKey="Witten, Ian H" sort="Witten, Ian H" uniqKey="Witten I" first="Ian H." last="Witten">Ian H. Witten</name>
<affiliation wicri:level="1"><mods:affiliation>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand</mods:affiliation>
<country xml:lang="fr">Nouvelle-Zélande</country>
<wicri:regionArea>Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: ihw@cs.waikato.ac.nz</mods:affiliation>
<country wicri:rule="url">Nouvelle-Zélande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Journal of the American Society for Information Science and Technology</title>
<title level="j" type="abbrev">J Am Soc Inf Sci Tec</title>
<idno type="ISSN">1532-2882</idno>
<idno type="eISSN">1532-2890</idno>
<imprint><publisher>Blackwell Publishing Ltd</publisher>
<date type="published" when="2012-08">2012-08</date>
<biblScope unit="volume">63</biblScope>
<biblScope unit="issue">8</biblScope>
<biblScope unit="page" from="1593">1593</biblScope>
<biblScope unit="page" to="1608">1608</biblScope>
</imprint>
<idno type="ISSN">1532-2882</idno>
</series>
<idno type="istex">64E99E686DBD7D98165109C6DB70FBB2082596C4</idno>
<idno type="DOI">10.1002/asi.22689</idno>
<idno type="ArticleID">ASI22689</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">1532-2882</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract">Document similarity measures are crucial components of many text‐analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine‐learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Agronomie/explor/SisAgriV1/Data/Istex/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000716 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Istex/Curation/biblio.hfd -nk 000716 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Agronomie |area= SisAgriV1 |flux= Istex |étape= Curation |type= RBID |clé= ISTEX:64E99E686DBD7D98165109C6DB70FBB2082596C4 |texte= Learning a concept‐based document similarity measure }}
This area was generated with Dilib version V0.6.28.