N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus
Internal identifier: 000D71 (Main/Merge); previous: 000D70; next: 000D72
Authors: Artur Šili [Croatia]; Jean-Hugues Chauchat [France]; Bojana Dalbelo Baši [Croatia]; Annie Morin [France]
Source:
- Lecture Notes in Computer Science [ 0302-9743 ] ; 2007.
Abstract
In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
DOI: 10.1007/978-3-540-77002-2_56
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001D02
- to stream Istex, to step Curation: 001B82
- to stream Istex, to step Checkpoint: 000773
Links to Exploration step
ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author><name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</author>
<author><name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
</author>
<author><name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
</author>
<author><name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-77002-2_56</idno>
<idno type="url">https://api.istex.fr/document/E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001D02</idno>
<idno type="wicri:Area/Istex/Curation">001B82</idno>
<idno type="wicri:Area/Istex/Checkpoint">000773</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Sili A:n:grams:and</idno>
<idno type="wicri:Area/Main/Merge">000D71</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author><name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
<affiliation wicri:level="1"><country xml:lang="fr">Croatie</country>
<wicri:regionArea>University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb</wicri:regionArea>
<wicri:noRegion>10000 Zagreb</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Croatie</country>
</affiliation>
</author>
<author><name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Université de Lyon 2, Faculté de Sciences Economique et de Gestion, Laboratoire Eric, 5 avenue Pierre Mendès France, 69676 Bron Cedex</wicri:regionArea>
<placeName><region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Bron</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
<affiliation wicri:level="1"><country xml:lang="fr">Croatie</country>
<wicri:regionArea>University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb</wicri:regionArea>
<wicri:noRegion>10000 Zagreb</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Croatie</country>
</affiliation>
</author>
<author><name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
<affiliation wicri:level="4"><country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rennes 1, IRISA, 35042 Rennes Cedex</wicri:regionArea>
<placeName><region type="region" nuts="2">Région Bretagne</region>
<settlement type="city">Rennes</settlement>
</placeName>
<orgName type="university">Université de Rennes 1</orgName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<idno type="DOI">10.1007/978-3-540-77002-2_56</idno>
<idno type="ChapterID">56</idno>
<idno type="ChapterID">Chap56</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D71 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000D71 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB |texte= N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus }}
This area was generated with Dilib version V0.6.32.