Layout-aware text extraction from full-text PDF of scientific articles
Internal identifier: 000260 (Main/Merge)
Authors: Cartic Ramakrishnan [United States]; Abhishek Patnia [United States]; Eduard Hovy [United States]; Gully Apc Burns [United States]
Source:
- Source Code for Biology and Medicine [1751-0473]; 2012.
Abstract
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
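Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.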
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
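The three-stage process named in the abstract (block detection, rule-based classification, stitching) can be illustrated with a toy sketch. This is not the LA-PDFText implementation: the `Word` and `Block` types, the single-column assumption, the `max_gap` threshold, and the three rules are all hypothetical choices made only to show how the stages fit together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Word:
    text: str
    x: float          # left edge in page coordinates (assumed)
    y: float          # line position, growing down the page (assumed)
    font_size: float


@dataclass
class Block:
    words: List[Word]
    label: str = "unclassified"

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def top(self) -> float:
        return min(w.y for w in self.words)

    @property
    def font_size(self) -> float:
        return max(w.font_size for w in self.words)


def detect_blocks(words: List[Word], max_gap: float = 14.0) -> List[Block]:
    """Stage 1: group words into contiguous blocks by vertical proximity.

    Assumes a single-column page; max_gap is an illustrative threshold.
    """
    blocks: List[Block] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if blocks and word.y - blocks[-1].words[-1].y <= max_gap:
            blocks[-1].words.append(word)
        else:
            blocks.append(Block(words=[word]))
    return blocks


# Stage 2 rules as (label, predicate) pairs; the first matching rule wins.
# These rules are made up for the example.
RULES: List[Tuple[str, Callable[[Block], bool]]] = [
    ("abstract", lambda b: b.text.lower().startswith("abstract")),
    ("references", lambda b: b.text.lower().startswith("references")),
    ("title", lambda b: b.font_size >= 16),
]


def classify_blocks(blocks: List[Block]) -> List[Block]:
    """Stage 2: assign each block a label, defaulting to 'body'."""
    for block in blocks:
        block.label = next((label for label, rule in RULES if rule(block)), "body")
    return blocks


def stitch(blocks: List[Block]) -> Dict[str, str]:
    """Stage 3: join block text per label, in top-to-bottom order."""
    sections: Dict[str, List[str]] = {}
    for block in sorted(blocks, key=lambda b: b.top):
        sections.setdefault(block.label, []).append(block.text)
    return {label: " ".join(parts) for label, parts in sections.items()}


# Invented sample page: a large-font title, an abstract, and a reference list.
SAMPLE_WORDS = [
    Word("Layout-aware", 50, 40, 18), Word("extraction", 170, 40, 18),
    Word("Abstract", 50, 80, 10), Word("PDF", 110, 80, 10),
    Word("is", 140, 80, 10), Word("common.", 160, 80, 10),
    Word("Text", 50, 92, 10), Word("mining", 80, 92, 10),
    Word("References", 50, 140, 10),
    Word("1.", 50, 152, 10), Word("Forgy", 70, 152, 10),
]

sections = stitch(classify_blocks(detect_blocks(SAMPLE_WORDS)))
```

Keeping the rules in a plain list makes the classification stage easy to extend without touching the detection or stitching code; a real system would need multi-column handling and far richer layout features than this sketch uses.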
Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3441580
DOI: 10.1186/1751-0473-7-7
PubMed: 22640904
PubMed Central: 3441580
Links to previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 000081
- to stream Pmc, to step Curation: 000081
- to stream Pmc, to step Checkpoint: 000101
- to stream Ncbi, to step Merge: 000131
- to stream Ncbi, to step Curation: 000131
- to stream Ncbi, to step Checkpoint: 000131
Links to Exploration step
PMC:3441580
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Layout-aware text extraction from full-text PDF of scientific articles</title>
<author><name sortKey="Ramakrishnan, Cartic" sort="Ramakrishnan, Cartic" uniqKey="Ramakrishnan C" first="Cartic" last="Ramakrishnan">Cartic Ramakrishnan</name>
<affiliation wicri:level="4"><nlm:aff id="I1">Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Patnia, Abhishek" sort="Patnia, Abhishek" uniqKey="Patnia A" first="Abhishek" last="Patnia">Abhishek Patnia</name>
<affiliation wicri:level="4"><nlm:aff id="I2">Computer Science Department, University of Southern California, 941 Bloom Walker, Los Angeles, CA, 90089-0781, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science Department, University of Southern California, 941 Bloom Walker, Los Angeles, CA, 90089-0781</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Hovy, Eduard" sort="Hovy, Eduard" uniqKey="Hovy E" first="Eduard" last="Hovy">Eduard Hovy</name>
<affiliation wicri:level="4"><nlm:aff id="I1">Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Burns, Gully Apc" sort="Burns, Gully Apc" uniqKey="Burns G" first="Gully Apc" last="Burns">Gully Apc Burns</name>
<affiliation wicri:level="4"><nlm:aff id="I1">Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">22640904</idno>
<idno type="pmc">3441580</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3441580</idno>
<idno type="RBID">PMC:3441580</idno>
<idno type="doi">10.1186/1751-0473-7-7</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000081</idno>
<idno type="wicri:Area/Pmc/Curation">000081</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000101</idno>
<idno type="wicri:Area/Ncbi/Merge">000131</idno>
<idno type="wicri:Area/Ncbi/Curation">000131</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000131</idno>
<idno type="wicri:Area/Main/Merge">000260</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Layout-aware text extraction from full-text PDF of scientific articles</title>
<author><name sortKey="Ramakrishnan, Cartic" sort="Ramakrishnan, Cartic" uniqKey="Ramakrishnan C" first="Cartic" last="Ramakrishnan">Cartic Ramakrishnan</name>
<affiliation wicri:level="4"><nlm:aff id="I1">Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Patnia, Abhishek" sort="Patnia, Abhishek" uniqKey="Patnia A" first="Abhishek" last="Patnia">Abhishek Patnia</name>
<affiliation wicri:level="4"><nlm:aff id="I2">Computer Science Department, University of Southern California, 941 Bloom Walker, Los Angeles, CA, 90089-0781, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science Department, University of Southern California, 941 Bloom Walker, Los Angeles, CA, 90089-0781</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Hovy, Eduard" sort="Hovy, Eduard" uniqKey="Hovy E" first="Eduard" last="Hovy">Eduard Hovy</name>
<affiliation wicri:level="4"><nlm:aff id="I1">Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Burns, Gully Apc" sort="Burns, Gully Apc" uniqKey="Burns G" first="Gully Apc" last="Burns">Gully Apc Burns</name>
<affiliation wicri:level="4"><nlm:aff id="I1">Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695</wicri:regionArea>
<orgName type="university">Université de Californie du Sud</orgName>
<placeName><settlement type="city">Los Angeles</settlement>
<region type="state">Californie</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j">Source Code for Biology and Medicine</title>
<idno type="eISSN">1751-0473</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.</p>
</sec>
<sec><title>Results</title>
<p>Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) <bold>Detecting contiguous text blocks</bold>
using spatial layout processing to locate and identify blocks of contiguous text, (2) <bold>Classifying text blocks into rhetorical categories</bold>
using a rule-based method and (3) <bold>Stitching classified text blocks together in the correct order</bold>
resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision<sup>1</sup>
= 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, <sup>2</sup>
commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.</p>
</sec>
<sec><title>Conclusions</title>
<p>LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at <ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/lapdftext/">http://code.google.com/p/lapdftext/</ext-link>
.</p>
</sec>
</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000260 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000260 | SxmlIndent | more
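Once a record has been retrieved with the commands above, its identifiers can be read in a script. The following is a minimal sketch, assuming only the `publicationStmt` fragment of the record has been saved to a string; the function name `read_identifiers` and the constant `PUBLICATION_STMT` are hypothetical. Element names prefixed with `wicri:` and `nlm:` carry no namespace declaration in the record as shown here, so a namespace-aware parser may reject the full record unless those prefixes are declared; the fragment below avoids the issue because it contains no prefixed element names.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Excerpt of the record's publicationStmt, copied from the XML above.
PUBLICATION_STMT = """
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">22640904</idno>
<idno type="pmc">3441580</idno>
<idno type="doi">10.1186/1751-0473-7-7</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Main/Merge">000260</idno>
</publicationStmt>
"""


def read_identifiers(fragment: str) -> Dict[str, str]:
    """Map each idno element's type attribute to its text content."""
    root = ET.fromstring(fragment.strip())
    return {
        idno.get("type"): (idno.text or "").strip()
        for idno in root.iter("idno")
    }


identifiers = read_identifiers(PUBLICATION_STMT)
```

Looking up `identifiers["doi"]` or `identifiers["wicri:Area/Main/Merge"]` then gives the DOI and the internal Main/Merge key without any string matching on the raw XML.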
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= PMC:3441580 |texte= Layout-aware text extraction from full-text PDF of scientific articles }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Merge/RBID.i -Sk "pubmed:22640904" \
  | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Merge/biblio.hfd \
  | NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.