Apache Tika: Difference between revisions

Tika
Developer(s)	Apache Software Foundation
Stable release	1.12 / February 16, 2016
Repository	gitbox.apache.org/repos/asf/tika.git
Written in	Java
Operating system	Cross-platform
Type	Search and index API
License	Apache License 2.0
Website	tika.apache.org

Revision as of 05:49, 16 April 2016

Apache Tika is a content detection and analysis framework stewarded by the Apache Software Foundation.^[1] The project originated as part of the Apache Nutch codebase and was later separated to make it more extensible and usable by content management systems, web crawlers, and information retrieval systems. Tika was started in 2007 by Jérôme Charron, Chris Mattmann and Jukka Zitting.^[2] In 2011, Chris Mattmann and Jukka Zitting published the book "Tika in Action" with Manning.

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types. In addition, Tika provides content extraction, metadata extraction and language identification capabilities.
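
As an illustration of the file-type identification described above, the following is a minimal sketch of content-based detection by leading "magic" bytes, written with only the Java standard library. It is not Tika's implementation or API: the class name `MagicSniffer`, its `detect` method and the four-entry signature table are hypothetical, and Tika's own detection covers far more types and can draw on other evidence as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class MagicSniffer {
    // MIME type -> byte signature expected at the very start of the content.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    // Returns the first MIME type whose signature matches the start of the
    // given bytes, or the generic binary type when nothing matches.
    static String detect(byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with every byte of magic, in order.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `MagicSniffer.detect` with the first few bytes of a file returns a MIME type string. Content that matches none of the listed signatures falls back to `application/octet-stream`.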

Tika is used by financial institutions including the Fair Isaac Corporation (FICO),^[3] by NASA and academic researchers,^[4] and by major content management systems including Drupal^[5] and Alfresco^[6] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016,^[7] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that expose an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References

^ "Apache Tika". Retrieved 2016-04-15.
^ "Tika Proposal". Retrieved 2016-04-15.
^ "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO®". FICO® | Decisions. Retrieved 2016-04-15.
^ "Studying polar data with the help of Apache Tika". Opensource.com. Retrieved 2016-04-15.
^ "Text Extract for Drupal using Tika | Drupal.org". www.drupal.org. Retrieved 2016-04-15.
^ "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". wiki.alfresco.com. Retrieved 2016-04-15.
^ Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". Forbes. Retrieved 2016-04-15.

@@ Line 31: / Line 31: @@
 ==References==
 {{Reflist}}
+{{Apache}}
+[[Category:Apache Software Foundation|PDFBox]]
+[[Category:Java platform]]
+[[Category:Free software programmed in Java (programming language)]]
+[[Category:Java (programming language) libraries]]
+[[Category:Software using the Apache license]]
 {{Uncategorized|date=April 2016}}

The Apache Software Foundation
Top-level projects	Accumulo ActiveMQ Airavata Airflow Allura Ambari Ant Aries Arrow Apache HTTP Server APR Avro Axis Axis2 Beam Bloodhound Brooklyn Calcite Camel CarbonData Cassandra Cayenne CloudStack Cocoon Cordova CouchDB cTAKES CXF Derby Directory Drill Druid Empire-db Felix Flex Flink Flume FreeMarker Geronimo Groovy Guacamole Gump Hadoop HBase Helix Hive Iceberg Ignite Impala Jackrabbit James Jena JMeter Kafka Kudu Kylin Lucene Mahout Maven MINA mod_perl MyFaces Mynewt NiFi NetBeans Nutch NuttX OFBiz Oozie OpenEJB OpenJPA OpenNLP OpenOffice ORC PDFBox Parquet Phoenix POI Pig Pinot Pivot Qpid Roller RocketMQ Samza Shiro SINGA Sling Solr Spark Storm SpamAssassin Struts 1 Struts 2 Subversion Superset SystemDS Tapestry Thrift Tika TinkerPop Tomcat Trafodion Traffic Server UIMA Velocity Wicket Xalan Xerces XMLBeans Yetus ZooKeeper
Commons	BCEL BSF Daemon Jelly Logging
Incubator	Taverna
Other projects	Batik FOP Ivy Log4j
Attic	Apex AxKit Beehive Bluesky iBATIS Click Continuum Deltacloud Etch Giraph Hama Harmony Jakarta Marmotta MXNet ODE River Shale Slide Sqoop Stanbol Tuscany Wave XML
Licenses	Apache License