Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context
Internal identifier: 000685 (PascalFrancis/Corpus); previous: 000684; next: 000686
Authors: A. Kacem; A. Belaïd; M. Ben Ahmed
Source:
- International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2001.
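French descriptors
- Pascal (Inist): Méthode heuristique; Segmentation; Caractère imprimé; Reconnaissance caractère; Logique floue; Logique mathématique; Formule mathématique
English descriptors
- KwdEn: Character recognition; Fuzzy logic; Heuristic method; Mathematical formula; Mathematical logic; Printed character; Segmentation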
Abstract
This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.
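As a rough illustration of the primary labeling step described in the abstract, the following Python sketch labels a connected component by its fuzzy membership in a few symbol models. It is not the authors' implementation: the features (`aspect`, `density`), the triangular membership functions, the model parameters, the `threshold` value, and the use of the minimum as fuzzy AND are all assumptions chosen for illustration; the abstract only states that symbol models are built at a learning step using fuzzy logic.

```python
from dataclasses import dataclass
from typing import Tuple


def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Triangular fuzzy membership: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


@dataclass
class SymbolModel:
    label: str
    aspect: Tuple[float, float, float]   # (left, peak, right) for width/height
    density: Tuple[float, float, float]  # (left, peak, right) for ink density

    def membership(self, aspect: float, density: float) -> float:
        # Fuzzy AND taken as the minimum of the per-feature memberships.
        return min(triangular(aspect, *self.aspect),
                   triangular(density, *self.density))


# Hypothetical models; in the paper these would come from a learning step.
MODELS = [
    SymbolModel("fraction_bar", (3.0, 8.0, 20.0), (0.6, 0.9, 1.01)),
    SymbolModel("summation", (0.5, 0.8, 1.2), (0.2, 0.35, 0.5)),
]


def primary_label(aspect: float, density: float, threshold: float = 0.5):
    """Return (label, score); fall back to 'text' when no model fits well."""
    best = max(MODELS, key=lambda m: m.membership(aspect, density))
    score = best.membership(aspect, density)
    return (best.label, score) if score >= threshold else ("text", score)
```

A component that falls below the threshold for every model is left as plain text, which is where a secondary, context-based pass such as the one the abstract mentions could revisit the decision.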
Record in standard format (ISO 2709)
See the documentation on the Inist Standard format.
Inist format (server)
NO : | PASCAL 02-0104694 INIST |
---|---|
ET : | Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context |
AU : | KACEM (A.); BELAÏD (A.); BEN AHMED (M.) |
AF : | ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali/2040 Radès/Tunisie (1 aut.); LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239/54506 Vandoeuvre Nancy/France (2 aut.); RIADI-Tunisia/Tunisie (3 aut.) |
DT : | Serial publication; Analytic level |
SO : | International journal on document analysis and recognition : (Print); ISSN 1433-2833; Germany; Da. 2001; Vol. 4; No. 2; Pp. 97-108; Bibl. 20 ref. |
LA : | English |
EA : | This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%. |
CC : | 001D02C03 |
FD : | Méthode heuristique; Segmentation; Caractère imprimé; Reconnaissance caractère; Logique floue; Logique mathématique; Formule mathématique |
ED : | Heuristic method; Segmentation; Printed character; Character recognition; Fuzzy logic; Mathematical logic; Mathematical formula |
SD : | Método heurístico; Segmentación; Carácter impreso; Reconocimiento carácter; Lógica difusa; Lógica matemática; Fórmula matemática |
LO : | INIST-26790.354000099533410030 |
ID : | 02-0104694 |
Links to Exploration step
Pascal:02-0104694
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</title>
<author><name sortKey="Kacem, A" sort="Kacem, A" uniqKey="Kacem A" first="A." last="Kacem">A. Kacem</name>
<affiliation><inist:fA14 i1="01"><s1>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali</s1>
<s2>2040 Radès</s2>
<s3>TUN</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="02"><s1>LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239</s1>
<s2>54506 Vandoeuvre Nancy</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Ben Ahmed, M" sort="Ben Ahmed, M" uniqKey="Ben Ahmed M" first="M." last="Ben Ahmed">M. Ben Ahmed</name>
<affiliation><inist:fA14 i1="03"><s1>RIADI-Tunisia</s1>
<s3>TUN</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">02-0104694</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 02-0104694 INIST</idno>
<idno type="RBID">Pascal:02-0104694</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000685</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</title>
<author><name sortKey="Kacem, A" sort="Kacem, A" uniqKey="Kacem A" first="A." last="Kacem">A. Kacem</name>
<affiliation><inist:fA14 i1="01"><s1>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali</s1>
<s2>2040 Radès</s2>
<s3>TUN</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="02"><s1>LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239</s1>
<s2>54506 Vandoeuvre Nancy</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Ben Ahmed, M" sort="Ben Ahmed, M" uniqKey="Ben Ahmed M" first="M." last="Ben Ahmed">M. Ben Ahmed</name>
<affiliation><inist:fA14 i1="03"><s1>RIADI-Tunisia</s1>
<s3>TUN</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Fuzzy logic</term>
<term>Heuristic method</term>
<term>Mathematical formula</term>
<term>Mathematical logic</term>
<term>Printed character</term>
<term>Segmentation</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Méthode heuristique</term>
<term>Segmentation</term>
<term>Caractère imprimé</term>
<term>Reconnaissance caractère</term>
<term>Logique floue</term>
<term>Logique mathématique</term>
<term>Formule mathématique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>1433-2833</s0>
</fA01>
<fA03 i2="1"><s0>Int. j. doc. anal. recognit. : (Print)</s0>
</fA03>
<fA05><s2>4</s2>
</fA05>
<fA06><s2>2</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG"><s1>Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</s1>
</fA08>
<fA11 i1="01" i2="1"><s1>KACEM (A.)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>BELAÏD (A.)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>BEN AHMED (M.)</s1>
</fA11>
<fA14 i1="01"><s1>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali</s1>
<s2>2040 Radès</s2>
<s3>TUN</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02"><s1>LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239</s1>
<s2>54506 Vandoeuvre Nancy</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</fA14>
<fA14 i1="03"><s1>RIADI-Tunisia</s1>
<s3>TUN</s3>
<sZ>3 aut.</sZ>
</fA14>
<fA20><s1>97-108</s1>
</fA20>
<fA21><s1>2001</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA43 i1="01"><s1>INIST</s1>
<s2>26790</s2>
<s5>354000099533410030</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2002 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>20 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>02-0104694</s0>
</fA47>
<fA60><s1>P</s1>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA64 i1="01" i2="1"><s0>International journal on document analysis and recognition : (Print)</s0>
</fA64>
<fA66 i1="01"><s0>DEU</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001D02C03</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Méthode heuristique</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Heuristic method</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Método heurístico</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Segmentation</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Segmentation</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Segmentación</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Caractère imprimé</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Printed character</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Carácter impreso</s0>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Logique floue</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Fuzzy logic</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Lógica difusa</s0>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Logique mathématique</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Mathematical logic</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Lógica matemática</s0>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Formule mathématique</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Mathematical formula</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Fórmula matemática</s0>
<s5>07</s5>
</fC03>
<fN21><s1>056</s1>
</fN21>
<fN82><s1>PSI</s1>
</fN82>
</pA>
</standard>
<server><NO>PASCAL 02-0104694 INIST</NO>
<ET>Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</ET>
<AU>KACEM (A.); BELAÏD (A.); BEN AHMED (M.)</AU>
<AF>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali/2040 Radès/Tunisie (1 aut.); LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239/54506 Vandoeuvre Nancy/France (2 aut.); RIADI-Tunisia/Tunisie (3 aut.)</AF>
<DT>Publication en série; Niveau analytique</DT>
<SO>International journal on document analysis and recognition : (Print); ISSN 1433-2833; Allemagne; Da. 2001; Vol. 4; No. 2; Pp. 97-108; Bibl. 20 ref.</SO>
<LA>Anglais</LA>
<EA>This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.</EA>
<CC>001D02C03</CC>
<FD>Méthode heuristique; Segmentation; Caractère imprimé; Reconnaissance caractère; Logique floue; Logique mathématique; Formule mathématique</FD>
<ED>Heuristic method; Segmentation; Printed character; Character recognition; Fuzzy logic; Mathematical logic; Mathematical formula</ED>
<SD>Método heurístico; Segmentación; Carácter impreso; Reconocimiento carácter; Lógica difusa; Lógica matemática; Fórmula matemática</SD>
<LO>INIST-26790.354000099533410030</LO>
<ID>02-0104694</ID>
</server>
</inist>
</record>
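The keyword lists in a record like this one can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a short excerpt modeled on the `<textClass>` element of the record; the excerpt (`EXCERPT`) and the helper name `keywords_by_language` are illustrative assumptions. Note that the full record uses an `inist:` prefix with no visible namespace declaration, so a namespace-aware parser would probably reject it unless the declaration is supplied or the prefix is stripped first.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared "xml:" prefix to this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Illustrative excerpt shaped like the record's <textClass> element.
EXCERPT = """<textClass>
<keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Fuzzy logic</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Logique floue</term>
</keywords>
</textClass>"""


def keywords_by_language(xml_text):
    """Group <term> values by the xml:lang of their <keywords> parent."""
    root = ET.fromstring(xml_text)
    result = {}
    for kw in root.iter("keywords"):
        lang = kw.get(XML_NS + "lang", "und")  # "und" = undetermined
        terms = [t.text.strip() for t in kw.iter("term") if t.text]
        result.setdefault(lang, []).extend(terms)
    return result


if __name__ == "__main__":
    print(keywords_by_language(EXCERPT))
```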
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000685 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000685 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= PascalFrancis |étape= Corpus |type= RBID |clé= Pascal:02-0104694 |texte= Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context }}
This area was generated with Dilib version V0.6.32.