Serveur d'exploration sur la recherche en informatique en Lorraine

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Automatic discovery of topics and acoustic morphemes from speech

Identifieur interne : 000252 ( PascalFrancis/Checkpoint ); précédent : 000251; suivant : 000253

Automatic discovery of topics and acoustic morphemes from speech

Auteurs : Christophe Cerisara [France]

Source :

RBID : Francis:09-0158748

Descripteurs français

English descriptors

Abstract

This work deals with automatic lexical acquisition and topic discovery from a speech stream. The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach. The resulting semantic lexicon is then used to automatically cluster the incoming speech stream into topics. The main advantages of this algorithm are its very low computational requirements and its independence to pre-defined linguistic resources, which makes it easy to port to new languages and to adapt to new tasks. It is evaluated both qualitatively and quantitatively on two corpora and on two tasks related to topic clustering. The results of these evaluations are encouraging and outline future directions of research for the proposed algorithm, such as building automatic orthographic labels of the lexical items.


Affiliations:


Links toward previous steps (curation, corpus...)


Links to Exploration step

Francis:09-0158748

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Automatic discovery of topics and acoustic morphemes from speech</title>
<author>
<name sortKey="Cerisara, Christophe" sort="Cerisara, Christophe" uniqKey="Cerisara C" first="Christophe" last="Cerisara">Christophe Cerisara</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>LORIA UMR 7503, Campus Scientifique</s1>
<s2>54506 Vandoeuvre-les-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">09-0158748</idno>
<date when="2009">2009</date>
<idno type="stanalyst">FRANCIS 09-0158748 INIST</idno>
<idno type="RBID">Francis:09-0158748</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000292</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000748</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000252</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000252</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Automatic discovery of topics and acoustic morphemes from speech</title>
<author>
<name sortKey="Cerisara, Christophe" sort="Cerisara, Christophe" uniqKey="Cerisara C" first="Christophe" last="Cerisara">Christophe Cerisara</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>LORIA UMR 7503, Campus Scientifique</s1>
<s2>54506 Vandoeuvre-les-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Computer speech & language : (Print)</title>
<title level="j" type="abbreviated">Comput. speech lang. : (Print)</title>
<idno type="ISSN">0885-2308</idno>
<imprint>
<date when="2009">2009</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Computer speech & language : (Print)</title>
<title level="j" type="abbreviated">Comput. speech lang. : (Print)</title>
<idno type="ISSN">0885-2308</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Knowledge acquisition</term>
<term>Speech processing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Traitement automatique de la parole</term>
<term>Acquisition de connaissances</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This work deals with automatic lexical acquisition and topic discovery from a speech stream. The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach. The resulting semantic lexicon is then used to automatically cluster the incoming speech stream into topics. The main advantages of this algorithm are its very low computational requirements and its independence to pre-defined linguistic resources, which makes it easy to port to new languages and to adapt to new tasks. It is evaluated both qualitatively and quantitatively on two corpora and on two tasks related to topic clustering. The results of these evaluations are encouraging and outline future directions of research for the proposed algorithm, such as building automatic orthographic labels of the lexical items.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0885-2308</s0>
</fA01>
<fA03 i2="1">
<s0>Comput. speech lang. : (Print)</s0>
</fA03>
<fA05>
<s2>23</s2>
</fA05>
<fA06>
<s2>2</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG">
<s1>Automatic discovery of topics and acoustic morphemes from speech</s1>
</fA08>
<fA11 i1="01" i2="1">
<s1>CERISARA (Christophe)</s1>
</fA11>
<fA14 i1="01">
<s1>LORIA UMR 7503, Campus Scientifique</s1>
<s2>54506 Vandoeuvre-les-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA20>
<s1>220-239</s1>
</fA20>
<fA21>
<s1>2009</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21332</s2>
<s5>354000187365130050</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2009 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>1 p.1/4</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>09-0158748</s0>
</fA47>
<fA60>
<s1>P</s1>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Computer speech & language : (Print)</s0>
</fA64>
<fA66 i1="01">
<s0>GBR</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>This work deals with automatic lexical acquisition and topic discovery from a speech stream. The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach. The resulting semantic lexicon is then used to automatically cluster the incoming speech stream into topics. The main advantages of this algorithm are its very low computational requirements and its independence to pre-defined linguistic resources, which makes it easy to port to new languages and to adapt to new tasks. It is evaluated both qualitatively and quantitatively on two corpora and on two tasks related to topic clustering. The results of these evaluations are encouraging and outline future directions of research for the proposed algorithm, such as building automatic orthographic labels of the lexical items.</s0>
</fC01>
<fC02 i1="01" i2="L">
<s0>52478</s0>
<s1>XV</s1>
</fC02>
<fC02 i1="02" i2="L">
<s0>524</s0>
</fC02>
<fC03 i1="01" i2="L" l="FRE">
<s0>Traitement automatique de la parole</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="L" l="ENG">
<s0>Speech processing</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="L" l="FRE">
<s0>Acquisition de connaissances</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="02" i2="L" l="ENG">
<s0>Knowledge acquisition</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC07 i1="01" i2="L" l="FRE">
<s0>Linguistique informatique</s0>
<s2>NI</s2>
<s5>02</s5>
</fC07>
<fC07 i1="01" i2="L" l="ENG">
<s0>Computational linguistics</s0>
<s2>NI</s2>
<s5>02</s5>
</fC07>
<fN21>
<s1>117</s1>
</fN21>
</pA>
</standard>
</inist>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement>
<li>Vandœuvre-lès-Nancy</li>
</settlement>
</list>
<tree>
<country name="France">
<region name="Grand Est">
<name sortKey="Cerisara, Christophe" sort="Cerisara, Christophe" uniqKey="Cerisara C" first="Christophe" last="Cerisara">Christophe Cerisara</name>
</region>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/PascalFrancis/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000252 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Checkpoint/biblio.hfd -nk 000252 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    PascalFrancis
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Francis:09-0158748
   |texte=   Automatic discovery of topics and acoustic morphemes from speech
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022