You Don’t Have to Think Twice if You Carefully Tokenize
Internal identifier: 000363 (Istex/Curation); previous: 000362; next: 000364
Authors: Stefan Klatt [Germany]; Bernd Bohnet [Germany]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2005.
Abstract
Most currently used tokenizers only segment a text into tokens and combine them into sentences. But this is not the way we think a tokenizer should work. We believe that a tokenizer should support the subsequent analysis components as well as it can. We present a tokenizer with a strong focus on transparency. First, the tokenizer's decisions are encoded in such a way that the original text can be reconstructed. This supports the identification of typical errors and, as a consequence, the faster creation of better tokenizer versions. Second, all detected information that might be important for subsequent analysis components is made transparent by XML tags and special information codes for each token. Third, doubtful decisions are also marked by XML tags. This is helpful for off-line applications such as corpus building, where it seems more appropriate to spend a few minutes checking doubtful decisions manually than to work with incorrect data for years.
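The three properties named in the abstract (reconstructable output, per-token information, and marked doubtful decisions) can be illustrated with a minimal Python sketch. This is not the authors' tokenizer: the `Token` fields, the `ABBREVIATIONS` list, the `<tok>`/`<doubt>` tag names, and the rule that a period after an abbreviation is "doubtful" are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape

# Hypothetical abbreviation list, used only for this illustration.
ABBREVIATIONS = {"Dr", "Prof", "etc"}


@dataclass
class Token:
    text: str               # surface form of the token
    lead: str               # whitespace that preceded the token in the input
    doubtful: bool = False  # True if the segmentation decision is uncertain


def tokenize(text):
    """Return (tokens, tail); tail is any whitespace left after the last token."""
    tokens = []
    end = 0
    # Each match: optional whitespace, then a word or one punctuation character.
    for m in re.finditer(r"(\s*)(\w+|[^\w\s])", text):
        lead, surface = m.group(1), m.group(2)
        # Assumed rule: a period glued to a known abbreviation may or may not
        # end a sentence, so the decision is flagged instead of silently made.
        doubtful = (
            surface == "."
            and lead == ""
            and bool(tokens)
            and tokens[-1].text in ABBREVIATIONS
        )
        tokens.append(Token(surface, lead, doubtful))
        end = m.end()
    return tokens, text[end:]


def reconstruct(tokens, tail=""):
    """Rebuild the original text from the tokens and the stored whitespace."""
    return "".join(t.lead + t.text for t in tokens) + tail


def to_xml(tokens):
    """Serialize tokens, wrapping doubtful ones in a distinct (assumed) tag."""
    parts = []
    for t in tokens:
        tag = "doubt" if t.doubtful else "tok"
        parts.append(f"<{tag}>{escape(t.text)}</{tag}>")
    return "".join(parts)
```

Calling `tokenize(text)` and passing the result to `reconstruct` returns the input unchanged, which is what makes tokenization errors easy to trace back to the source text. The `<doubt>` elements in the output of `to_xml` give a reviewer a short list of decisions to check by hand.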
Url: https://api.istex.fr/document/6800DF6D171E421B4E2D105EF08CFA86A02E8475/fulltext/pdf
DOI: 10.1007/978-3-540-30211-7_32
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: To go to this record in the Curation step: 000363
Links to Exploration step
ISTEX:6800DF6D171E421B4E2D105EF08CFA86A02E8475
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">You Don’t Have to Think Twice if You Carefully Tokenize</title>
<author><name sortKey="Klatt, Stefan" sort="Klatt, Stefan" uniqKey="Klatt S" first="Stefan" last="Klatt">Stefan Klatt</name>
<affiliation wicri:level="3"><mods:affiliation>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569, Stuttgart</mods:affiliation>
<country>Allemagne</country>
<placeName><settlement type="city">Stuttgart</settlement>
<region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Stuttgart</region>
</placeName>
<wicri:orgArea>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569</wicri:orgArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: klatt@iis.uni-stuttgart.de</mods:affiliation>
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
<author><name sortKey="Bohnet, Bernd" sort="Bohnet, Bernd" uniqKey="Bohnet B" first="Bernd" last="Bohnet">Bernd Bohnet</name>
<affiliation wicri:level="3"><mods:affiliation>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569, Stuttgart</mods:affiliation>
<country>Allemagne</country>
<placeName><settlement type="city">Stuttgart</settlement>
<region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Stuttgart</region>
</placeName>
<wicri:orgArea>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569</wicri:orgArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: bohnet@iis.uni-stuttgart.de</mods:affiliation>
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:6800DF6D171E421B4E2D105EF08CFA86A02E8475</idno>
<date when="2005" year="2005">2005</date>
<idno type="doi">10.1007/978-3-540-30211-7_32</idno>
<idno type="url">https://api.istex.fr/document/6800DF6D171E421B4E2D105EF08CFA86A02E8475/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000363</idno>
<idno type="wicri:Area/Istex/Curation">000363</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">You Don’t Have to Think Twice if You Carefully Tokenize</title>
<author><name sortKey="Klatt, Stefan" sort="Klatt, Stefan" uniqKey="Klatt S" first="Stefan" last="Klatt">Stefan Klatt</name>
<affiliation wicri:level="3"><mods:affiliation>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569, Stuttgart</mods:affiliation>
<country>Allemagne</country>
<placeName><settlement type="city">Stuttgart</settlement>
<region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Stuttgart</region>
</placeName>
<wicri:orgArea>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569</wicri:orgArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: klatt@iis.uni-stuttgart.de</mods:affiliation>
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
<author><name sortKey="Bohnet, Bernd" sort="Bohnet, Bernd" uniqKey="Bohnet B" first="Bernd" last="Bohnet">Bernd Bohnet</name>
<affiliation wicri:level="3"><mods:affiliation>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569, Stuttgart</mods:affiliation>
<country>Allemagne</country>
<placeName><settlement type="city">Stuttgart</settlement>
<region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Stuttgart</region>
</placeName>
<wicri:orgArea>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569</wicri:orgArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: bohnet@iis.uni-stuttgart.de</mods:affiliation>
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2005</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">6800DF6D171E421B4E2D105EF08CFA86A02E8475</idno>
<idno type="DOI">10.1007/978-3-540-30211-7_32</idno>
<idno type="ChapterID">32</idno>
<idno type="ChapterID">Chap32</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Most currently used tokenizers only segment a text into tokens and combine them into sentences. But this is not the way we think a tokenizer should work. We believe that a tokenizer should support the subsequent analysis components as well as it can. We present a tokenizer with a strong focus on transparency. First, the tokenizer's decisions are encoded in such a way that the original text can be reconstructed. This supports the identification of typical errors and, as a consequence, the faster creation of better tokenizer versions. Second, all detected information that might be important for subsequent analysis components is made transparent by XML tags and special information codes for each token. Third, doubtful decisions are also marked by XML tags. This is helpful for off-line applications such as corpus building, where it seems more appropriate to spend a few minutes checking doubtful decisions manually than to work with incorrect data for years.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Istex/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000363 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Istex/Curation/biblio.hfd -nk 000363 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Ticri |area= TeiVM2 |flux= Istex |étape= Curation |type= RBID |clé= ISTEX:6800DF6D171E421B4E2D105EF08CFA86A02E8475 |texte= You Don’t Have to Think Twice if You Carefully Tokenize }}
This area was generated with Dilib version V0.6.31.