Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012
Internal identifier: 000120 (Hal/Corpus); previous: 000119; next: 000121
Authors: Andrew Thean; Jean-Marc Deltorn; Patrice Lopez; Laurent Romary
Source: CLEF 2012, Sep 2012, Roma, Italy
Abstract
The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented:
* Text-graphic segmentation based on connected-component clustering;
* Line segment bridging with an adaptive, oriented filter;
* Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry;
* Text object recognition using a noisy channel model to enhance the results of a commercial OCR package.
Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score = 1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification.
The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.
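The abstract does not say exactly how node classification and edge detection were combined into one average F-score. The sketch below shows one plausible reading, under stated assumptions: nodes are compared as (identifier, shape) pairs, edges are compared as unordered node pairs so that directivity is ignored, and the two F-scores are averaged with equal weight. The function names and the data layout are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Iterable, Set, Tuple


def f_score(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """Balanced F-score (F1) between a predicted set and a ground-truth set."""
    if not predicted and not truth:
        return 1.0  # assumption: nothing to find and nothing reported counts as perfect
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def undirected(edges: Iterable[Tuple[str, str]]) -> Set[frozenset]:
    """Drop edge direction by storing each edge as an unordered pair."""
    return {frozenset(edge) for edge in edges}


def drawing_score(pred_nodes, true_nodes, pred_edges, true_edges) -> float:
    """Equal-weight average of node and edge F-scores (assumed combination rule)."""
    node_f = f_score(set(pred_nodes), set(true_nodes))
    edge_f = f_score(undirected(pred_edges), undirected(true_edges))
    return (node_f + edge_f) / 2


# Hypothetical drawing: one node shape is misclassified, one edge is reversed.
true_nodes = {("n1", "rectangle"), ("n2", "diamond"), ("n3", "rectangle")}
pred_nodes = {("n1", "rectangle"), ("n2", "oval"), ("n3", "rectangle")}
true_edges = [("n1", "n2"), ("n2", "n3")]
pred_edges = [("n2", "n1"), ("n2", "n3")]

score = drawing_score(pred_nodes, true_nodes, pred_edges, true_edges)
print(round(score, 3))
```

In this example the reversed edge is not penalised because direction is discarded, while the misclassified node lowers the node F-score to 2/3.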
Url: https://hal.inria.fr/hal-00728779
Links to Exploration step
Hal:hal-00728779
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</title>
<author><name sortKey="Thean, Andrew" sort="Thean, Andrew" uniqKey="Thean A" first="Andrew" last="Thean">Andrew Thean</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
<author><name sortKey="Deltorn, Jean Marc" sort="Deltorn, Jean Marc" uniqKey="Deltorn J" first="Jean-Marc" last="Deltorn">Jean-Marc Deltorn</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
<author><name sortKey="Lopez, Patrice" sort="Lopez, Patrice" uniqKey="Lopez P" first="Patrice" last="Lopez">Patrice Lopez</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
<author><name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00728779</idno>
<idno type="halId">hal-00728779</idno>
<idno type="halUri">https://hal.inria.fr/hal-00728779</idno>
<idno type="url">https://hal.inria.fr/hal-00728779</idno>
<date when="2012-09-17">2012-09-17</date>
<idno type="wicri:Area/Hal/Corpus">000120</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</title>
<author><name sortKey="Thean, Andrew" sort="Thean, Andrew" uniqKey="Thean A" first="Andrew" last="Thean">Andrew Thean</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
<author><name sortKey="Deltorn, Jean Marc" sort="Deltorn, Jean Marc" uniqKey="Deltorn J" first="Jean-Marc" last="Deltorn">Jean-Marc Deltorn</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
<author><name sortKey="Lopez, Patrice" sort="Lopez, Patrice" uniqKey="Lopez P" first="Patrice" last="Lopez">Patrice Lopez</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
<author><name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
<affiliation><hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID"><orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc><address><addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation><relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-139189" type="direct"><org type="institution" xml:id="struct-139189" status="VALID"><orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc><address><addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented: * Text-graphic segmentation based on connected-component clustering; * Line segment bridging with an adaptive, oriented filter; * Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry; * Text object recognition using a noisy channel model to enhance the results of a commercial OCR package. Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score = 1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification. The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.</div>
</front>
</TEI>
<hal api="V3"><titleStmt><title xml:lang="en">Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</title>
<author role="aut"><persName><forename type="first">Andrew</forename>
<surname>Thean</surname>
</persName>
<email></email>
<idno type="halauthor">758449</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Jean-Marc</forename>
<surname>Deltorn</surname>
</persName>
<email></email>
<idno type="halauthor">758450</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Patrice</forename>
<surname>Lopez</surname>
</persName>
<email></email>
<idno type="idhal">patricelopez</idno>
<idno type="halauthor">144548</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Laurent</forename>
<surname>Romary</surname>
</persName>
<email>laurent.romary@inria.fr</email>
<idno type="idhal">laurentromary</idno>
<idno type="halauthor">49567</idno>
<idno type="arXiv">http://arxiv.org/a/Romary_L</idno>
<idno type="IdRef">http://www.idref.fr/060702494</idno>
<idno type="ORCID">http://orcid.org/0000-0002-0756-0508</idno>
<idno type="VIAF">http://viaf.org/viaf/VIAF282014122</idno>
<idno type="ISNI">http://isni.org/isni/0000 0003 8879 5444</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<editor role="depositor"><persName><forename>Laurent</forename>
<surname>Romary</surname>
</persName>
<email>laurent.romary@inria.fr</email>
</editor>
</titleStmt>
<editionStmt><edition n="v1" type="current"><date type="whenSubmitted">2012-09-06 15:55:11</date>
<date type="whenModified">2012-09-07 09:39:57</date>
<date type="whenReleased">2012-09-07 09:39:57</date>
<date type="whenProduced">2012-09-17</date>
<date type="whenEndEmbargoed">2012-09-06</date>
<ref type="file" target="https://hal.inria.fr/hal-00728779/document"><date notBefore="2012-09-06"></date>
</ref>
<ref type="file" subtype="author" n="1" target="https://hal.inria.fr/hal-00728779/file/clef-ip-2012-flow.pdf"><date notBefore="2012-09-06"></date>
</ref>
</edition>
<respStmt><resp>contributor</resp>
<name key="105529"><persName><forename>Laurent</forename>
<surname>Romary</surname>
</persName>
<email>laurent.romary@inria.fr</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt><distributor>CCSD</distributor>
<idno type="halId">hal-00728779</idno>
<idno type="halUri">https://hal.inria.fr/hal-00728779</idno>
<idno type="halBibtex">thean:hal-00728779</idno>
<idno type="halRefHtml">CLEF 2012, Sep 2012, Roma, Italy. 2012</idno>
<idno type="halRef">CLEF 2012, Sep 2012, Roma, Italy. 2012</idno>
</publicationStmt>
<seriesStmt><idno type="stamp" n="INRIA">INRIA - Institut National de Recherche en Informatique et en Automatique</idno>
<idno type="stamp" n="INRIA-SACLAY">INRIA Saclay - Ile de France</idno>
</seriesStmt>
<notesStmt><note type="audience" n="2">International</note>
<note type="invited" n="0">No</note>
<note type="popular" n="0">No</note>
<note type="peer" n="1">Yes</note>
<note type="proceedings" n="1">Yes</note>
</notesStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</title>
<author role="aut"><persName><forename type="first">Andrew</forename>
<surname>Thean</surname>
</persName>
<idno type="halAuthorId">758449</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Jean-Marc</forename>
<surname>Deltorn</surname>
</persName>
<idno type="halAuthorId">758450</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Patrice</forename>
<surname>Lopez</surname>
</persName>
<idno type="idHal">patricelopez</idno>
<idno type="halAuthorId">144548</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Laurent</forename>
<surname>Romary</surname>
</persName>
<email>laurent.romary@inria.fr</email>
<idno type="idHal">laurentromary</idno>
<idno type="halAuthorId">49567</idno>
<idno type="arXiv">http://arxiv.org/a/Romary_L</idno>
<idno type="IdRef">http://www.idref.fr/060702494</idno>
<idno type="ORCID">http://orcid.org/0000-0002-0756-0508</idno>
<idno type="VIAF">http://viaf.org/viaf/VIAF282014122</idno>
<idno type="ISNI">http://isni.org/isni/0000 0003 8879 5444</idno>
<affiliation ref="#struct-95237"></affiliation>
<affiliation ref="#struct-118511"></affiliation>
</author>
</analytic>
<monogr><meeting><title>CLEF 2012</title>
<date type="start">2012-09-17</date>
<date type="end">2012-09-20</date>
<settlement>Roma</settlement>
<country key="IT">Italy</country>
</meeting>
<imprint><date type="datePub">2012</date>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
<profileDesc><langUsage><language ident="en">English</language>
</langUsage>
<textClass><classCode scheme="classification">H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries</classCode>
<classCode scheme="halDomain" n="info.info-cl">Computer Science [cs]/Computation and Language [cs.CL]</classCode>
<classCode scheme="halTypology" n="COMM">Conference papers</classCode>
</textClass>
<abstract xml:lang="en">The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented: * Text-graphic segmentation based on connected-component clustering; * Line segment bridging with an adaptive, oriented filter; * Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry; * Text object recognition using a noisy channel model to enhance the results of a commercial OCR package. Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score = 1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification. The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.</abstract>
</profileDesc>
</hal>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Hal/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000120 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Hal/Corpus/biblio.hfd -nk 000120 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Hal |étape= Corpus |type= RBID |clé= Hal:hal-00728779 |texte= Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012 }}
This area was generated with Dilib version V0.6.32.