PDFMiner

Nytro · December 11, 2011

[h=1]PDFMiner[/h]

Python PDF parser and analyzer

[*] What's It?
[*] Download
[*] Where to Ask
[*] How to Install
[*] CJK languages support
[*] Command Line Tools
[*] Changes
[*] TODO
[*] Related Projects
[*] Terms and Conditions

[h=2]What's It?[/h]

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on extracting and analyzing text data. PDFMiner reports the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). Its extensible PDF parser can also be used for purposes other than text analysis.

[h=3]Features[/h]

[*] Written entirely in Python (for version 2.4 or newer).
[*] Parses, analyzes, and converts PDF documents.
[*] PDF-1.7 specification support (well, almost).
[*] CJK languages and vertical writing scripts support.
[*] Various font types (Type1, TrueType, Type3, and CID) support.
[*] Basic encryption (RC4) support.
[*] PDF to HTML conversion (with a sample converter web app).
[*] Outline (TOC) extraction.
[*] Tagged contents extraction.
[*] Reconstructs the original layout by grouping text chunks.
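
To give a rough idea of what "grouping text chunks" can mean, here is a minimal sketch that merges chunks into lines when their bounding boxes overlap vertically. This is not PDFMiner's actual layout algorithm or API: the `Chunk` class, the `group_into_lines` function, and the `overlap` threshold are all hypothetical names made up for illustration, and the code assumes a modern Python 3 rather than the 2.4 mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical text chunk with a bounding box (not a PDFMiner class)."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def group_into_lines(chunks: List[Chunk], overlap: float = 0.5) -> List[str]:
    """Group chunks whose vertical extents overlap into lines of text."""
    lines: List[List[Chunk]] = []
    # PDF coordinates grow upward, so sort by top edge, highest first.
    for chunk in sorted(chunks, key=lambda c: -c.y1):
        for line in lines:
            ref = line[0]
            # Height of the vertical band shared with the line's first chunk.
            shared = min(ref.y1, chunk.y1) - max(ref.y0, chunk.y0)
            shortest = min(ref.y1 - ref.y0, chunk.y1 - chunk.y0)
            if shortest > 0 and shared / shortest >= overlap:
                line.append(chunk)
                break
        else:
            # No existing line overlaps enough: start a new one.
            lines.append([chunk])
    # Within each line, read left to right.
    return [" ".join(c.text for c in sorted(line, key=lambda c: c.x0))
            for line in lines]


chunks = [
    Chunk("Miner", 60, 700, 95, 712),
    Chunk("PDF", 30, 700, 55, 712),
    Chunk("parser", 30, 680, 70, 692),
]
result = group_into_lines(chunks)
print(result)
```

The first two chunks share the same vertical band and end up on one line, ordered by their left edge; the third sits lower on the page and starts a new line. A real layout analyzer would also have to deal with columns, vertical writing, and varying font sizes, which this sketch ignores.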

PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.

Online Demo: (pdf -> html conversion webapp)

http://pdf2html.tabesugi.net:8080/

[h=3]Download[/h]

Source distribution:

http://pypi.python.org/pypi/pdfminer/

GitHub:

https://github.com/euske/pdfminer/

[h=3]Where to Ask[/h]

Questions and comments:

http://groups.google.com/group/pdfminer-users/

Details:

http://www.unixuser.org/~euske/python/pdfminer/index.html
