Understanding information content with Apache Tika
This collection covers the projects I am interested in. Hadoop is covered in several separate posts.
A detailed list of all projects is available at: http://projects.apache.org/indexes/quick.html
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
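To make the "detect, then hand off to an existing parser" idea concrete, here is a minimal sketch in plain Python. It is not Tika's API: the magic-byte table, the function names (`detect`, `extract`, `parse_html`, `parse_json`) and the returned metadata keys are all hypothetical, chosen only to illustrate the pattern with standard-library parsers. Real detection and parsing in Tika cover far more formats.

```python
import json
from html.parser import HTMLParser

# Hypothetical magic-byte table (an assumption for illustration only).
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]


def detect(data: bytes) -> str:
    """Guess a media type from the leading bytes of a document."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    head = data.lstrip()[:64].lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    if head.startswith(b"{") or head.startswith(b"["):
        return "application/json"
    return "application/octet-stream"


class _TitleAndText(HTMLParser):
    """Collects the <title> as metadata and all other text as content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


def parse_html(data: bytes):
    parser = _TitleAndText()
    parser.feed(data.decode("utf-8", errors="replace"))
    parser.close()
    return {"title": parser.title.strip()}, " ".join(parser.text)


def parse_json(data: bytes):
    obj = json.loads(data)
    return {"top_level_type": type(obj).__name__}, json.dumps(obj, indent=2)


# Each detected type is delegated to an existing parser library.
PARSERS = {"text/html": parse_html, "application/json": parse_json}


def extract(data: bytes):
    """Return (metadata, text); unknown types yield empty text."""
    mime = detect(data)
    parser = PARSERS.get(mime)
    metadata, text = parser(data) if parser else ({}, "")
    metadata["content_type"] = mime
    return metadata, text
```

Calling `extract` on the bytes of an HTML page returns the page title and detected content type as metadata, plus the visible text. A type with no registered parser still gets its content type reported, with empty text.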
Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.