Nytro

PDFMiner

[h=1]PDFMiner[/h]

Python PDF parser and analyzer


[h=2]What's It?[/h]

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner lets you obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

[h=3]Features[/h]

  • Written entirely in Python (version 2.4 or newer).
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.
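
To give a rough idea of what "grouping text chunks" in the last feature can mean, here is a minimal sketch in plain Python 3. It is not PDFMiner's actual algorithm and does not use its API: the `Chunk` class, the `group_into_lines` function and the `y_tolerance` parameter are all hypothetical names made up for illustration. It assumes each chunk already carries a bounding box in PDF coordinates (y grows upward) and merges chunks whose baselines are close enough into one line of text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """A piece of text with its bounding box (hypothetical, not a PDFMiner class)."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (PDF coordinates: y grows upward)
    x1: float  # right edge
    y1: float  # top edge


def group_into_lines(chunks: List[Chunk], y_tolerance: float = 2.0) -> List[str]:
    """Group chunks into text lines, top to bottom, left to right.

    Two chunks land on the same line when their bottom edges differ by
    at most y_tolerance from the first chunk placed on that line.
    """
    lines: List[List[Chunk]] = []
    # Visit chunks from the top of the page downward.
    for chunk in sorted(chunks, key=lambda c: (-c.y0, c.x0)):
        if lines and abs(lines[-1][0].y0 - chunk.y0) <= y_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    # Within each line, order chunks by their left edge and join them.
    return [
        " ".join(c.text for c in sorted(line, key=lambda c: c.x0))
        for line in lines
    ]


chunks = [
    Chunk("world", 60, 700.5, 90, 712),
    Chunk("Hello", 10, 700, 50, 712),
    Chunk("Second", 10, 680, 50, 692),
    Chunk("line", 60, 680, 80, 692),
]
print(group_into_lines(chunks))
```

A real layout analyzer also has to deal with columns, varying font sizes and vertical writing, so treat this only as an illustration of the basic idea.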

PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.

Online Demo: (pdf -> html conversion webapp)

http://pdf2html.tabesugi.net:8080/

[h=3]Download[/h]

Source distribution:

http://pypi.python.org/pypi/pdfminer/

GitHub:

https://github.com/euske/pdfminer/

[h=3]Where to Ask[/h]

Questions and comments:

http://groups.google.com/group/pdfminer-users/

Details:

http://www.unixuser.org/~euske/python/pdfminer/index.html
