PDFMiner first parses the content of a PDF file and converts it to a more malleable format such as HTML.
From there, text and data are extracted and analyzed, then separated according to predefined rules and either presented to the user or passed on to more powerful data analysis tools.
If you do not need text analysis, PDFMiner can easily be configured to only extract or only convert PDF data.
Its functions work independently of one another, which allows a wider range of uses.
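
The parse, interpret and convert steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's official recipe: the helper name pdf_to_text is hypothetical, and the import paths assume the module layout of the current release (pdfminer.pdfparser, pdfminer.pdfdocument, pdfminer.pdfpage, pdfminer.pdfinterp, pdfminer.converter, pdfminer.layout).

```python
import io


def pdf_to_text(path, password=""):
    """Return the text of the PDF at `path` as a single string.

    Hypothetical helper: it only wires together PDFMiner's own classes.
    """
    # Imported lazily so this sketch loads even where PDFMiner is missing.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output = io.BytesIO()  # the converter writes encoded text here
    resources = PDFResourceManager()
    # LAParams() switches on layout analysis with default settings.
    device = TextConverter(resources, output, codec="utf-8", laparams=LAParams())
    interpreter = PDFPageInterpreter(resources, device)

    with open(path, "rb") as fp:
        document = PDFDocument(PDFParser(fp), password)
        # Feed every page through the interpreter; the device collects text.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)

    device.close()
    return output.getvalue().decode("utf-8")
```

Swapping TextConverter for HTMLConverter from the same module should give HTML output instead, which matches the "convert only" use mentioned above.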
Features:
- Written entirely in Python, with no C or C++ code
- Parse PDFs
- Analyze PDFs
- Convert PDFs to other formats
- Table of contents (TOC) extraction
- Extraction of tagged content only
- Support for a large number of text PDF features
- Support for a large number of font types inside PDFs
- Basic encryption (RC4) support
What is new in this release:
- The PDFDocument.initialize() method has been removed and is no longer needed. The password is now passed as an argument to the PDFDocument constructor.
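
As a rough sketch of what this change means for calling code, the snippet below opens a document with the password supplied at construction time. The helper name open_pdf is hypothetical, and the import paths are an assumption based on this release's module layout.

```python
def open_pdf(fp, password=""):
    """Open a (possibly encrypted) PDF from the binary file object `fp`.

    Hypothetical helper; an empty password is used for unencrypted files.
    """
    # Imported lazily so this sketch loads even where PDFMiner is missing.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser

    parser = PDFParser(fp)
    # Earlier releases needed a separate document.initialize(password) call
    # after construction; the password now goes straight to the constructor.
    return PDFDocument(parser, password)
```

A call might look like open_pdf(open("report.pdf", "rb"), "secret"), where the file name and password are placeholders.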
What is new in version 20110515:
- API changes.
- The LTPolygon class was renamed to LTCurve.
What is new in version 20110227:
- Bug fixes and layout analysis improvements.
What is new in version 20101226:
- A couple of bug fixes and minor improvements.
What is new in version 20101017:
- A couple of bug fixes and a minor improvement.
What is new in version 20100424:
- Bug fixes and small improvements to TOC extraction.
Requirements:
- Python 2.4 or later, up to but not including Python 3
Limitations:
- PDFMiner can be 20 times slower than C/C++-based software.