Layout-aware text extraction from full-text PDF of scientific articles
Internal identifier: 000081 (Pmc/Corpus)
Authors: Cartic Ramakrishnan; Abhishek Patnia; Eduard Hovy; Gully APC Burns
Source:
- Source Code for Biology and Medicine [ 1751-0473 ] ; 2012.
Abstract
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1)
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at
Url:
DOI: 10.1186/1751-0473-7-7
PubMed: 22640904
PubMed Central: 3441580
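The abstract describes extracting text blocks and classifying them into logical units with section-specific rules. The sketch below illustrates that general idea only; it is not LA-PDFText's code, rule set, or API. The `TextBlock` fields, the font-size threshold, the heading list, and the function names are all assumptions made for illustration, and reading order is approximated by sorting on page and vertical position, which only suits a single-column layout.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text block, as a layout-analysis step might produce it."""
    text: str
    font_size: float  # dominant font size in points
    page: int         # 1-based page number
    y_top: float      # distance from the top of the page, in points


# Assumed list of section headings; a real rule set would be much richer.
SECTION_HEADINGS = {"introduction", "methods", "results", "discussion", "references"}


def classify_block(block: TextBlock, body_font_size: float) -> str:
    """Label a block using simple hand-written rules (illustrative only)."""
    lowered = block.text.strip().lower()
    # Rule 1: much larger type on the first page is treated as the title.
    if block.page == 1 and block.font_size >= 1.5 * body_font_size:
        return "title"
    # Rule 2: a block that opens with the word "abstract".
    if lowered.startswith("abstract"):
        return "abstract"
    # Rule 3: a block consisting solely of a known section name.
    if lowered in SECTION_HEADINGS:
        return "heading"
    return "body"


def extract_sections(blocks, body_font_size):
    """Order blocks top to bottom, page by page, and attach a label to each."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.y_top))
    return [(classify_block(b, body_font_size), b.text.strip()) for b in ordered]


if __name__ == "__main__":
    sample = [
        TextBlock("We describe an open source system.", 10.0, 1, 300.0),
        TextBlock("Layout-aware text extraction", 18.0, 1, 50.0),
        TextBlock("Introduction", 12.0, 1, 250.0),
        TextBlock("Abstract The PDF is widely used.", 10.0, 1, 120.0),
    ]
    for label, text in extract_sections(sample, body_font_size=10.0):
        print(f"{label}: {text}")
```

Keeping the rules in one small function makes each one easy to inspect and replace, which suits a baseline intended for further experiments.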
Links to Exploration step
PMC:3441580
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000081 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000081 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Pmc |étape= Corpus |type= RBID |clé= PMC:3441580 |texte= Layout-aware text extraction from full-text PDF of scientific articles }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i -Sk "pubmed:22640904" \
  | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd \
  | NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.