Apache Tika

Software Details:
Version: 1.9
Upload Date: 20 Jul 15
Distribution Type: Freeware

Apache Tika is a low-level toolkit for detecting file types and extracting text and metadata from files, which makes the content inside those files searchable.

Being a simple library, Tika doesn't do much on its own, but it can be integrated into more powerful tools such as search engines, digital asset management systems, or CMSs to provide a fully functional in-file search system.

The library can read just a file's header for quick type detection and overall file information, or it can parse the file's body to extract various types of data, whether the file is in a text or binary format.
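
To make the header-only case concrete, here is a minimal sketch of how a file type can be guessed from its first few bytes. It is an illustration of the general "magic bytes" idea only, not Tika's detector: the class name `HeaderSniffer` and its methods are hypothetical, and the signature table is deliberately tiny.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a tiny magic-byte sniffer, NOT Tika's own detection code.
class HeaderSniffer {
    // Well-known leading signatures for a handful of formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] GZIP = {0x1F, (byte) 0x8B};

    // Returns true when data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a media type from the header bytes alone; the body is never read.
    static String detect(byte[] header) {
        if (startsWith(header, PDF)) return "application/pdf";
        if (startsWith(header, PNG)) return "image/png";
        if (startsWith(header, ZIP)) return "application/zip";
        if (startsWith(header, GZIP)) return "application/gzip";
        return "application/octet-stream"; // unknown binary fallback
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

Extracting the text or metadata itself is the "deep" case and requires a format-specific parser to read the body, which is why header-based detection is the cheaper of the two operations.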

A wide range of file types is supported, and Tika can also be used from other programming languages thanks to a series of third-party bindings and wrappers.

What is new in this release:

  • This release includes bug fixes and new features, including a new Tesseract OCR Parser, a new GDAL Parser, more supported formats, and overall improvements in Tika stability.

What is new in version 1.8:

  • This release includes bug fixes and new features, including a new Tesseract OCR Parser, a new GDAL Parser, more supported formats, and overall improvements in Tika stability.

What is new in version 1.7:

  • This release includes bug fixes and new features, including a new Tesseract OCR Parser, a new GDAL Parser, more supported formats, and overall improvements in Tika stability.

What is new in version 1.6:

  • This release includes bug fixes and new features, including a new Translation API, more supported formats, and overall improvements in Tika stability.

What is new in version 1.5:

  • Fixed bug in handling of embedded file processing in PDFs.
  • Added SourceCodeParser to support Java, Groovy, and C++ files.
  • Updated Tika Server to support multipart/form-data payloads.
  • Updated Tika Server to CXF 2.7.8.
  • Updated Tika Server to accept requests over wildcard addresses.
  • Added option to use alternate NonSequentialPDFParser.
  • Content from PDF AcroForms is now extracted.
  • Fixed invalid asterisks from master slide in PPT.
  • Added test cases to confirm handling of auto-date in PPT and PPTX.

What is new in version 1.4:

  • Removed a test HTML file that contained poorly chosen GPL text.
  • Improvements to tika-server to allow it to produce text/html and text/xml content.
  • Improved the Compressor Parser to handle gzipped files that require the decompressConcatenated option to be set to true.
  • Fixed a typographic error that was preventing detection of awk files.
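
For context on the gzip item above: a gzip file may consist of several compressed members written back to back, and a reader that stops after the first member silently drops the rest. The sketch below shows what such a file looks like and what a complete read should return. It uses only the JDK's `java.util.zip` classes, whose `GZIPInputStream` continues into following members; it does not use Tika or its Compressor Parser, and the class name `ConcatenatedGzipDemo` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: builds a two-member gzip stream and reads all of it.
class ConcatenatedGzipDemo {
    // Compresses text into one complete gzip member.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(out);
            gz.write(text.getBytes(StandardCharsets.UTF_8));
            gz.close(); // writes the member's trailer
            return out.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Joins two byte arrays, as `cat a.gz b.gz > both.gz` would.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    // Decompresses every member in the stream, not just the first one.
    static String gunzipAll(byte[] data) {
        try {
            GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            in.close();
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] both = concat(gzip("first "), gzip("second"));
        System.out.println(gunzipAll(both));
    }
}
```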

What is new in version 1.2:

  • Apache Tika 1.2 contains a number of improvements and bug fixes.

What is new in version 1.0:

  • Apache Tika 1.0 contains a number of improvements and bug fixes.

What is new in version 0.9:

  • This release includes several important bug fixes and new features.

What is new in version 0.8:

  • Language identification is now dynamically configurable, managed via a config file loaded from the classpath.
  • Tika now supports parsing feeds by wrapping the underlying Rome library.
  • A quick-start guide for Tika parsing was contributed.
  • An approach for plumbing through XHTML attributes was added.
  • Media type hierarchy information is now taken into account when selecting the best parser for a given input document.
  • Support for parsing common scientific data formats including netCDF and HDF4/5 was added.
  • Unit tests for Windows have been fixed, allowing TestParsers to complete.
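
The media type hierarchy item above can be pictured as a lookup that walks from a specific type up through its supertypes until it finds a registered parser. The following is a minimal sketch of that idea under stated assumptions: the class `ParserSelector`, the parser names, and the sample hierarchy are all hypothetical and are not taken from Tika's code or configuration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: supertype-aware parser lookup, NOT Tika's implementation.
class ParserSelector {
    private final Map<String, String> supertypes = new HashMap<String, String>();
    private final Map<String, String> parsers = new HashMap<String, String>();

    void addSupertype(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    void registerParser(String type, String parserName) {
        parsers.put(type, parserName);
    }

    // Returns the parser for the most specific type in the chain, or null.
    String select(String type) {
        Set<String> visited = new HashSet<String>(); // guards against cycles
        String current = type;
        while (current != null && visited.add(current)) {
            String parser = parsers.get(current);
            if (parser != null) {
                return parser;
            }
            current = supertypes.get(current); // climb one level
        }
        return null;
    }

    // Builds a small, made-up hierarchy for demonstration purposes.
    static ParserSelector sample() {
        ParserSelector selector = new ParserSelector();
        selector.addSupertype("image/svg+xml", "application/xml");
        selector.addSupertype("application/xml", "text/plain");
        selector.addSupertype("text/csv", "text/plain");
        selector.registerParser("application/xml", "XmlParser");
        selector.registerParser("text/plain", "TextParser");
        return selector;
    }

    public static void main(String[] args) {
        System.out.println(sample().select("image/svg+xml"));
    }
}
```

With this arrangement a document whose exact type has no dedicated parser still gets the closest applicable one, rather than falling straight through to a generic fallback.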

What is new in version 0.7:

  • MP3 file parsing was improved, including Channel and SampleRate extraction and ID3v2 support. MIME detection for the MIDI audio format was also improved.
  • Tika no longer relies on X11 for its RTF parsing functionality.
  • A thread-safety bug in the AutoDetectParser was discovered and addressed.
  • Upgraded to PDFBox 1.0.0. The new PDFBox version improves PDF parsing performance and fixes a number of text extraction issues.

Requirements:

  • Java 6 or higher
