Apache Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes these subprojects:

  • Hadoop Common: The common utilities that support the other Hadoop subprojects.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • HDFS: A distributed file system that provides high throughput access to application data.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

Other related projects:

  • Nutch

From a database point of view, querying is the key to analyzing data, and a complex query across multiple datasets needs a join (like joining tables in a database). Hive and Pig both provide such capabilities, but Hive is more SQL-friendly while Pig is more script-oriented (a minimal sketch of the same join in both appears after the quotes below). Let’s look at some “old” discussions:

From Tom White’s 2008 blog:

  • Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
  • Jaql, from IBM and soon to be open sourced, is a declarative query language for JSON data.
  • Hive, from Facebook and soon to become a Hadoop contrib module, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
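
To make the Hive/Pig contrast above concrete, here is a minimal sketch of the same join written both ways. The file names and fields (users.txt, pages.txt) are hypothetical; Pig's JOIN operator and Hive's SQL-style join syntax are the real constructs being compared.

    -- Pig Latin: the join as an explicit, scripted data flow
    users  = LOAD 'users.txt' AS (name:chararray, age:int);
    pages  = LOAD 'pages.txt' AS (user:chararray, url:chararray);
    joined = JOIN users BY name, pages BY user;
    DUMP joined;

    -- Hive QL: the same join expressed as declarative SQL
    SELECT u.name, u.age, p.url
    FROM users u JOIN pages p ON (u.name = p.user);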

The following sections are excerpted from the book “Hadoop: The Definitive Guide”.

Comparing Pig Latin with Databases

Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence of such operators as GROUP BY and DESCRIBE reinforces this impression. However, there are several differences between the two languages, and between Pig and RDBMSs in general.

The most significant difference is that Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. In other words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation. By contrast, SQL statements are a set of constraints that, taken together, define the output. In many ways, programming in Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a declarative statement into a system of steps.
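
As a small illustration of this difference, consider totaling sales by store. In the Pig Latin sketch below, each named relation is one explicit transformation step; in SQL, the whole computation is a single declarative statement that the query planner decomposes. The input file and field names are made up for the example.

    -- Pig Latin: one transformation per step
    records = LOAD 'sales.txt' AS (store:chararray, amount:int);
    grouped = GROUP records BY store;
    totals  = FOREACH grouped GENERATE group AS store, SUM(records.amount) AS total;
    sorted  = ORDER totals BY total DESC;
    STORE sorted INTO 'totals_by_store';

    -- SQL: one declarative statement defining the same output
    SELECT store, SUM(amount) AS total
    FROM sales
    GROUP BY store
    ORDER BY total DESC;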

RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional. Essentially, it will operate on any source of tuples (although the source should support being read in parallel, by being in multiple files, for example), where a UDF is used to read the tuples from their raw representation. The most common representation is a text file with tab-separated fields, and Pig provides a built-in load function for this format. Unlike with a traditional database, there is no data import process to load the data into the RDBMS. The data is loaded from the filesystem (usually HDFS) as the first step in the processing.
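
A quick sketch of what this optional schema looks like in practice. The input path and field names are hypothetical; PigStorage, Pig's built-in loader for delimited text (tab by default), is the real built-in load function mentioned above.

    -- With no schema, fields are untyped and addressed positionally as $0, $1, ...
    raw   = LOAD 'input/records.txt' USING PigStorage('\t');

    -- The same file with a schema declared at load time
    typed = LOAD 'input/records.txt' AS (year:chararray, temperature:int, quality:int);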

Pig’s support for complex, nested data structures differentiates it from SQL, which operates on flatter data structures. Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language, together with its nested data structures, makes Pig Latin more customizable than most SQL dialects.

There are several features to support online, low-latency queries that RDBMSs have that are absent in Pig, such as transactions and indexes. As mentioned earlier, Pig does not support random reads or queries in the order of tens of milliseconds. Nor does it support random writes to update small portions of data; all writes are bulk, streaming writes, just like MapReduce.

Hive is a subproject of Hadoop that sits between Pig and conventional RDBMSs. Like Pig, Hive is designed to use HDFS for storage, but otherwise there are some significant differences. Its query language, Hive QL, is based on SQL, and anyone who is familiar with SQL would have little trouble writing queries in Hive QL. Like RDBMSs, Hive mandates that all data be stored in tables, with a schema under its management; however, it can associate a schema with preexisting data in HDFS, so the load step is optional. Hive does not support low-latency queries, a characteristic it shares with Pig.
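
A brief sketch of the "optional load step": Hive's CREATE EXTERNAL TABLE (a real Hive QL construct) associates a schema with files that already sit in HDFS, after which they can be queried in place. The table name, columns, and path below are hypothetical.

    -- Attach a schema to preexisting tab-separated files in HDFS
    CREATE EXTERNAL TABLE pageviews (userid STRING, url STRING, ts BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/pageviews';

    -- Query the files in place with familiar SQL
    SELECT userid, COUNT(1) FROM pageviews GROUP BY userid;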

—————————————————————————————-

HBase is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.

Although there are countless strategies and implementations for database storage and retrieval, most solutions—especially those of the relational variety—are not built with very large scale and distribution in mind. Many vendors offer replication and partitioning solutions to grow the database beyond the confines of a single node, but these add-ons are generally an afterthought and are complicated to install and maintain. They also come at some severe compromise to the RDBMS feature set. Joins, complex queries, triggers, views, and foreign-key constraints become prohibitively expensive to run on a scaled RDBMS, or do not work at all.

HBase vs. RDBMS
HBase and other column-oriented databases are often compared to more traditional and popular relational databases or RDBMSs. Although they differ dramatically in their implementations and in what they set out to accomplish, the fact that they are potential solutions to the same problems means that despite their enormous differences, the comparison is a fair one to make.

As described previously, HBase is a distributed, column-oriented data storage system. It picks up where Hadoop left off by providing random reads and writes on top of HDFS. It has been designed from the ground up with a focus on scale in every direction: tall in numbers of rows (billions), wide in numbers of columns (millions), and horizontally partitioned and replicated across thousands of commodity nodes automatically. The table schemas mirror the physical storage, creating a system for efficient data structure serialization, storage, and retrieval. The burden is on the application developer to make use of this storage and retrieval in the right way.
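
For a feel of how the schema mirrors physical storage, here is a minimal HBase shell session modeled on the canonical "webtable" example; the table, row key, and column-family names are illustrative. Columns are grouped into column families ('contents', 'anchor'), and each cell is addressed by row key and column.

    hbase> create 'webtable', 'contents', 'anchor'
    hbase> put 'webtable', 'com.example.www', 'contents:html', '<html>...</html>'
    hbase> put 'webtable', 'com.example.www', 'anchor:com.other.www', 'Example Site'
    hbase> get 'webtable', 'com.example.www'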
Strictly speaking, an RDBMS is a database that follows Codd’s 12 rules. Typical RDBMSs are fixed-schema, row-oriented databases with ACID properties and a sophisticated SQL query engine. The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through the SQL language. You can easily create secondary indexes, perform complex inner and outer joins, and count, sum, sort, group, and page your data across a number of tables, rows, and columns.
For a majority of small- to medium-volume applications, there is no substitute for the ease of use, flexibility, maturity, and powerful feature set of available open source RDBMS solutions like MySQL and PostgreSQL. However, if you need to scale up in terms of dataset size, read/write concurrency, or both, you’ll soon find that the conveniences of an RDBMS come at an enormous performance penalty and make distribution inherently difficult. The scaling of an RDBMS usually involves breaking Codd’s rules, loosening ACID restrictions, forgetting conventional DBA wisdom, and, along the way, losing most of the desirable properties that made relational databases so convenient in the first place.
