Nytro
Posted December 11, 2011

[h=1]PDFMiner[/h]
Python PDF parser and analyzer

[h=2]What's It?[/h]
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

[h=3]Features[/h]
Written entirely in Python (version 2.4 or newer).
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support (well, almost).
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.

Online demo (PDF -> HTML conversion web app): http://pdf2html.tabesugi.net:8080/

[h=3]Download[/h]
Source distribution: http://pypi.python.org/pypi/pdfminer/
GitHub: https://github.com/euske/pdfminer/

[h=3]Where to Ask[/h]
Questions and comments: http://groups.google.com/group/pdfminer-users/

Details: http://www.unixuser.org/~euske/python/pdfminer/index.html
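
To give a rough idea of what "reconstruct the original layout by grouping text chunks" means, here is a minimal standard-library sketch. It is not PDFMiner's actual algorithm or API: the `Chunk` class, the `group_into_lines` function, and the 0.5 overlap threshold are all hypothetical names and values chosen for illustration. It assumes each text chunk comes with a bounding box in PDF coordinates (y grows upward), which is the kind of location information described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical text chunk: a string plus its bounding box in PDF points."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (PDF coordinates grow upward)
    x1: float  # right edge
    y1: float  # top edge


def vertical_overlap(a: Chunk, b: Chunk) -> float:
    """Return the height shared by two chunks (0 if they do not overlap)."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def group_into_lines(chunks: List[Chunk], min_ratio: float = 0.5) -> List[str]:
    """Group chunks into text lines, top to bottom, left to right.

    A chunk joins an existing line when it overlaps that line's first chunk
    vertically by at least `min_ratio` of the smaller chunk's height.
    """
    lines: List[List[Chunk]] = []
    # Visit chunks from the top of the page downward.
    for chunk in sorted(chunks, key=lambda c: -c.y1):
        for line in lines:
            ref = line[0]
            smaller = min(chunk.y1 - chunk.y0, ref.y1 - ref.y0)
            if smaller > 0 and vertical_overlap(chunk, ref) >= min_ratio * smaller:
                line.append(chunk)
                break
        else:
            # No existing line overlaps enough: start a new one.
            lines.append([chunk])
    # Within each line, order chunks by their left edge.
    return [" ".join(c.text for c in sorted(line, key=lambda c: c.x0))
            for line in lines]


# Made-up chunks in scrambled order, as they might appear in a content stream.
chunks = [
    Chunk("world", 60, 700, 95, 712),
    Chunk("Second", 10, 680, 50, 692),
    Chunk("Hello", 10, 701, 45, 713),
    Chunk("line", 55, 680, 80, 692),
]
print(group_into_lines(chunks))
```

The point of the sketch is only that text in a PDF is stored as positioned fragments, so reading order has to be recovered from geometry. A real layout analyzer also has to deal with columns, varying font sizes, and vertical writing, none of which this handles.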