Exploration server for computer-science research in Lorraine

Caution: this site is under development!
Caution: this site is generated automatically from raw corpora.
The information is therefore not validated.

Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine

Internal identifier: 00AC21 (Main/Merge); previous: 00AC20; next: 00AC22


Authors: Imed Zitouni; Jean-François Mari; Kamel Smaïli; Jean-Paul Haton [France]

Source:

RBID : CRIN:zitouni99a

English descriptors

Abstract

In natural language, some sequences of words are very frequent. A classical language model, such as an n-gram, does not adequately account for such sequences because it underestimates their probabilities. A better approach consists in modeling word sequences as if they were individual dictionary elements: sequences are treated as additional entries of the word lexicon, on which the language models are computed. In this paper, we present two methods for automatically determining frequent phrases in unlabeled corpora of written sentences. These methods are based on information-theoretic criteria that ensure high statistical consistency. Our models reach a local optimum since they minimize the perplexity. One procedure relies only on the n-gram language model to extract word sequences. The second is based on a class n-gram model trained on 233 classes derived from the eight grammatical classes of French. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words extracted from the "Le Monde" newspaper. Our models reduce perplexity by more than 20% compared with n-gram (n ≤ 3) and multigram models. In terms of recognition rate, our models outperform both n-gram and multigram models.
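The phrase-extraction idea summarized above — repeatedly merging frequent adjacent word pairs into single lexicon entries whenever an information-theoretic score favors the merge — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it uses pointwise mutual information as the merge criterion in place of the perplexity-based criterion described in the abstract, and all function names, thresholds, and the `_` joining convention are hypothetical.

```python
import math
from collections import Counter

def merge_pair(sent, pair):
    """Replace every occurrence of the adjacent word pair in a sentence
    with a single merged token (hypothetical '_' joining convention)."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
            out.append(sent[i] + "_" + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

def extract_sequences(corpus, min_count=2, threshold=1.0, max_passes=10):
    """Greedily grow the lexicon: on each pass, score every adjacent word
    pair by pointwise mutual information (a stand-in for the paper's
    perplexity criterion) and merge the best-scoring pair, if any."""
    for _ in range(max_passes):
        unigrams = Counter(w for sent in corpus for w in sent)
        total = sum(unigrams.values())
        bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
        n_bigrams = sum(bigrams.values())
        best, best_score = None, threshold
        for (a, b), c in bigrams.items():
            if c < min_count:
                continue  # too rare to be statistically reliable
            pmi = math.log((c / n_bigrams) /
                           ((unigrams[a] / total) * (unigrams[b] / total)))
            if pmi > best_score:
                best, best_score = (a, b), pmi
        if best is None:
            break  # no pair clears the threshold: local optimum reached
        corpus = [merge_pair(sent, best) for sent in corpus]
    return corpus
```

On a toy corpus where "new york" recurs, the pair is merged into one lexicon entry while rare pairs are left alone; the real system would instead accept a merge only if it lowers the perplexity of the n-gram (or class n-gram) model.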

Links to previous steps (curation, corpus...)


Links to Exploration step

CRIN:zitouni99a

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" wicri:score="453">Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine</title>
</titleStmt>
<publicationStmt>
<idno type="RBID">CRIN:zitouni99a</idno>
<date when="1999" year="1999">1999</date>
<idno type="wicri:Area/Crin/Corpus">002570</idno>
<idno type="wicri:Area/Crin/Curation">002570</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Curation">002570</idno>
<idno type="wicri:Area/Crin/Checkpoint">001E15</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Checkpoint">001E15</idno>
<idno type="wicri:Area/Main/Merge">00AC21</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine</title>
<author>
<name sortKey="Zitouni, Imed" sort="Zitouni, Imed" uniqKey="Zitouni I" first="Imed" last="Zitouni">Imed Zitouni</name>
</author>
<author>
<name sortKey="Mari, Jean Francois" sort="Mari, Jean Francois" uniqKey="Mari J" first="Jean-François" last="Mari">Jean-François Mari</name>
</author>
<author>
<name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaïli">Kamel Smaïli</name>
</author>
<author>
<name sortKey="Haton, Jean Paul" sort="Haton, Jean Paul" uniqKey="Haton J" first="Jean-Paul" last="Haton">Jean-Paul Haton</name>
<affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>language model</term>
<term>n-gram</term>
<term>sequence</term>
<term>sequence model</term>
<term>speech</term>
<term>speech recognition</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en" wicri:score="2501">In natural language, some sequences of words are very frequent. A classical language model, such as an n-gram, does not adequately account for such sequences because it underestimates their probabilities. A better approach consists in modeling word sequences as if they were individual dictionary elements: sequences are treated as additional entries of the word lexicon, on which the language models are computed. In this paper, we present two methods for automatically determining frequent phrases in unlabeled corpora of written sentences. These methods are based on information-theoretic criteria that ensure high statistical consistency. Our models reach a local optimum since they minimize the perplexity. One procedure relies only on the n-gram language model to extract word sequences. The second is based on a class n-gram model trained on 233 classes derived from the eight grammatical classes of French. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words extracted from the "Le Monde" newspaper. Our models reduce perplexity by more than 20% compared with n-gram (n ≤ 3) and multigram models. In terms of recognition rate, our models outperform both n-gram and multigram models.</div>
</front>
</TEI>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 00AC21 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 00AC21 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     CRIN:zitouni99a
   |texte=   Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022