SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
RDBMSs have dominated data management for several decades, but new needs are seriously challenging the "old man":
- big or huge data production in some applications, such as the web or biology, where the data scale is petabytes per day instead of gigabytes
- high-throughput (real-time response) and computation-intensive applications
- cloud or distributed storage
- Google BigTable (Paper)
- Amazon SimpleDB: a highly available, scalable, and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests, and Amazon SimpleDB does the rest.
- Apache Cassandra (former Facebook): The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
Understanding information content with Apache Tika

This collection includes projects I am interested in. For the Hadoop project, I have several separate posts.
A detailed list of all projects is available at: http://projects.apache.org/indexes/quick.html
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes these subprojects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- HDFS: A distributed file system that provides high throughput access to application data.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce: A software framework for distributed processing of large data sets on compute clusters.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.
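The MapReduce model named above can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the three phases (map, shuffle, reduce) using word count; real Hadoop jobs are written against the Hadoop API and run distributed across a cluster, and the data here is invented.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle(map_phase(lines)))
# result == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```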
Other related projects:
From the database point of view, the key to analyzing data is the query. Complex queries across multiple datasets need joins (like table joins in a database). Hive, HBase, and Pig all provide such capabilities, but Hive is more SQL-friendly while Pig is more script-oriented. Let's look at some "old" discussions:
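As a rough illustration of what such a join means in MapReduce terms, here is a toy single-process sketch of a reduce-side join: tag tuples by source, group by key, then pair across sources. The datasets and names are invented; Pig's JOIN and Hive's joins compile to far more sophisticated distributed plans.

```python
from collections import defaultdict

# Two toy "datasets": user records and page-visit records, keyed by user id
users = [(1, "alice"), (2, "bob")]
visits = [(1, "/home"), (1, "/docs"), (2, "/home")]

def reduce_side_join(left, right):
    # Group both inputs by key, then emit the cross product per key,
    # mimicking what happens in the reduce phase of a MapReduce join
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return [(key, l, r)
            for key, (lefts, rights) in sorted(groups.items())
            for l in lefts for r in rights]

joined = reduce_side_join(users, visits)
# joined == [(1, "alice", "/home"), (1, "alice", "/docs"), (2, "bob", "/home")]
```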
From Tom White’s 2008 blog:
- Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
- Jaql, from IBM and soon to be open sourced, is a declarative query language for JSON data.
- Hive, from Facebook and soon to become a Hadoop contrib module, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
The following sections are excerpted from the book "Hadoop: The Definitive Guide".
Comparing Pig Latin with Databases
Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence of such operators as GROUP BY and DESCRIBE reinforces this impression. However, there are several differences between the two languages, and between Pig and RDBMSs in general.
The most significant difference is that Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. In other words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation. By contrast, SQL statements are a set of constraints that taken together define the output. In many ways, programming in Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a declarative statement into a system of steps.
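The contrast can be sketched in ordinary Python: each statement below mirrors one Pig Latin step (LOAD, FILTER, GROUP, FOREACH ... GENERATE), whereas SQL would express the whole computation as a single declarative statement such as SELECT user, COUNT(*) FROM visits WHERE url <> '/robots.txt' GROUP BY user. The data and field names here are invented for illustration.

```python
from collections import defaultdict

# Input relation: (user, url) tuples, as a LOAD step would produce
records = [("alice", "/home"), ("bob", "/robots.txt"),
           ("alice", "/docs"), ("bob", "/home")]

# Step 1 -- FILTER: keep only the rows of interest
filtered = [(user, url) for user, url in records if url != "/robots.txt"]

# Step 2 -- GROUP BY user
grouped = defaultdict(list)
for user, url in filtered:
    grouped[user].append(url)

# Step 3 -- FOREACH ... GENERATE: one output tuple per group
counts = {user: len(urls) for user, urls in grouped.items()}
# counts == {"alice": 2, "bob": 1}
```

Each intermediate relation (filtered, grouped) is named and inspectable, which is exactly the step-by-step style of a Pig Latin script.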
RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it's optional. Essentially, it will operate on any source of tuples (although the source should support being read in parallel, by being in multiple files, for example), where a UDF is used to read the tuples from their raw representation. The most common representation is a text file with tab-separated fields, and Pig provides a built-in load function for this format. Unlike with a traditional database, there is no data import process to load the data into the RDBMS. The data is loaded from the filesystem (usually HDFS) as the first step in the processing.
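The "schema is optional" idea can be sketched as a tiny loader for tab-separated tuples: with no schema every field stays a raw string, and a schema, when supplied, is just applied at read time. The function and parameter names are invented; Pig's actual built-in loader (PigStorage) is far more capable.

```python
def load_tsv(lines, schema=None):
    # Split each line on tabs; apply the optional schema's converters,
    # otherwise leave every field as a raw string -- schemas are
    # optional, as in Pig
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if schema:
            fields = [convert(f) for convert, f in zip(schema, fields)]
        yield tuple(fields)

raw = ["alice\t3", "bob\t7"]
untyped = list(load_tsv(raw))            # [("alice", "3"), ("bob", "7")]
typed = list(load_tsv(raw, [str, int]))  # [("alice", 3), ("bob", 7)]
```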
Pig’s support for complex, nested data structures differentiates it from SQL, which operates on flatter data structures. Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language, together with Pig’s nested data structures, makes Pig Latin more customizable than most SQL dialects.
There are several features to support online, low-latency queries that RDBMSs have that are absent in Pig, such as transactions and indexes. As mentioned earlier, Pig does not support random reads or queries on the order of tens of milliseconds. Nor does it support random writes, to update small portions of data; all writes are bulk, streaming writes, just like MapReduce.
Hive is a subproject of Hadoop that sits between Pig and conventional RDBMSs. Like Pig, Hive is designed to use HDFS for storage, but otherwise there are some significant differences. Its query language, Hive QL, is based on SQL, and anyone who is familiar with SQL would have little trouble writing queries in Hive QL. Like RDBMSs, Hive mandates that all data be stored in tables, with a schema under its management; however, it can associate a schema with preexisting data in HDFS, so the load step is optional. Hive does not support low-latency queries, a characteristic it shares with Pig.
- Deriving new business insights with Big Data: “When data exists in this quantity, one of the processing limitations is that it takes a significant amount of time to move the data. Apache Hadoop has emerged to address these concerns with its unique approach of moving the work to the data and not the other way around.”
- Distributed computing with Linux and Hadoop