Tuesday, July 7, 2015

Using Tika in Elasticsearch

Tika offers a rich feature set and simplifies the tasks involved in content extraction: MIME type detection, language detection, and metadata handling.

Elasticsearch is a full-text search engine built on Apache Lucene. It doesn't provide content extraction itself, but it is a very capable tool for indexing and storing JSON documents for searching. Both Tika and Elasticsearch are written in Java, which makes Tika a natural fit for supplying content to index.

This example simply uses Tika as a data ingestion tool: it extracts text from a digital document and feeds it to Elasticsearch, where it is indexed and stored as a JSON document.
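The flow described above can be sketched roughly as follows. This is a minimal illustration, not the post's actual code. The class name `TikaIngestSketch`, the `TextExtractor` interface, the index name `documents`, the document id, and the `localhost:9200` address are all assumptions made for the example. So that the sketch runs with only the JDK (11 or later), the extractor is a plain-text stand-in; with Tika on the classpath, the commented line using the `Tika` facade's `parseToString` would take its place. The `_doc` endpoint is the form used by recent Elasticsearch versions; older releases expect a type name in that position.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaIngestSketch {

    /** Hypothetical plug-in point for the content extraction step. */
    public interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    /** Escapes a string so it can be embedded in a JSON string literal. */
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    // Remaining control characters become 4-digit hex escapes.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps the extracted text in a small JSON document (field names are assumptions). */
    public static String toJsonDocument(String filename, String content) {
        return "{\"filename\":\"" + escapeJson(filename)
             + "\",\"content\":\"" + escapeJson(content) + "\"}";
    }

    /** Builds a PUT request against the Elasticsearch REST API for one document. */
    public static HttpRequest buildIndexRequest(String baseUrl, String index,
                                                String id, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);

        // Stand-in extractor that only handles plain-text files.
        // With Tika available, this would instead be something like:
        //   TextExtractor extractor = p -> new org.apache.tika.Tika().parseToString(p.toFile());
        TextExtractor extractor = Files::readString;

        String json = toJsonDocument(file.getFileName().toString(), extractor.extract(file));

        // Assumes a local Elasticsearch node listening on port 9200.
        HttpRequest request = buildIndexRequest("http://localhost:9200", "documents", "1", json);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Keeping extraction behind a small interface means the indexing code never depends on how the text was obtained, so the stand-in can be swapped for Tika without touching the rest.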