HBase in a nutshell

It can be confusing to read “HBase: The Definitive Guide”, which states that HBase is column-oriented storage. Logically, though, it is a nested hash map. HBase is the open source implementation of Google's Bigtable, whose data model is described as:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

sparse, sorted, multidimensional Map

It is a set of nested key-value pairs. Take a sample table:

hbase(main):051:0> scan 'testtable'
ROW                          COLUMN+CELL
user1                       column=colfam1:name, timestamp=1331592953239, value=value-1
user2                       column=colfam1:age, timestamp=1331592973284, value=value-2
user2                       column=colfam1:gender, timestamp=1331593007379, value=value-3

Written as a nested hash map (a hash of hashes, if you use Perl), the table looks like this:

my %testtable = (
    user1 => {
        colfam1 => {
            name => 'value-1',
        },
    },
    user2 => {
        colfam1 => {
            age    => 'value-2',
            gender => 'value-3',
        },
    },
);

Here, the column family and the column name are simply different levels of the map.
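The same idea can be sketched in Python, used here only for illustration. The dictionary is a hand-built copy of the scan output above, not something read from HBase, and `get_cell` is a hypothetical helper rather than part of any HBase client.

```python
# Nested dict mirroring the scan of 'testtable':
# row key -> column family -> column qualifier -> value.
table = {
    "user1": {"colfam1": {"name": "value-1"}},
    "user2": {"colfam1": {"age": "value-2", "gender": "value-3"}},
}


def get_cell(table, row, family, qualifier):
    """Walk the three map levels; return None if any level is missing."""
    return table.get(row, {}).get(family, {}).get(qualifier)


print(get_cell(table, "user2", "colfam1", "age"))  # value-2
print(get_cell(table, "user1", "colfam1", "age"))  # None (user1 has no 'age' cell)
```

A missing cell is just a missing key, which is what makes the map sparse: nothing is stored for columns a row does not have.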

“Columns in HBase are grouped into column families. All column members of a column family have a common prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.”[3]
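As a small illustration of the naming rule in the quote, here is a Python sketch that splits a column name into family and qualifier. `split_column` is a hypothetical helper, not an HBase API. Splitting at the first colon is an assumption that follows from the quote: the family is printable and ends at the delimiter, while the qualifier may contain arbitrary bytes, including further colons.

```python
def split_column(column: bytes):
    """Split b'family:qualifier' at the FIRST colon (assumed rule, see text)."""
    family, sep, qualifier = column.partition(b":")
    if not sep:
        raise ValueError("expected a name of the form family:qualifier")
    return family, qualifier


print(split_column(b"courses:history"))  # (b'courses', b'history')
print(split_column(b"courses:a:b"))      # (b'courses', b'a:b')
```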

Sorted order is a distinctive property of HBase.

All data model operations in HBase return data in sorted order: first by row, then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted in reverse, so newest records are returned first).[3]
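That ordering can be sketched in Python as a sort key. The first three cells are taken from the sample scan; the fourth is a hypothetical newer version of user2's age, added only to show the reverse timestamp order. HBase compares raw bytes, so plain string comparison here is an approximation that matches only for ASCII keys like these.

```python
# Each cell: (row, family, qualifier, timestamp, value)
cells = [
    ("user2", "colfam1", "gender", 1331593007379, "value-3"),
    ("user1", "colfam1", "name", 1331592953239, "value-1"),
    ("user2", "colfam1", "age", 1331592973284, "value-2"),
    ("user2", "colfam1", "age", 1331593100000, "value-2b"),  # hypothetical newer version
]


def sort_key(cell):
    row, family, qualifier, timestamp, _value = cell
    # Negating the timestamp puts the newest version of a cell first.
    return (row, family, qualifier, -timestamp)


ordered = sorted(cells, key=sort_key)
for cell in ordered:
    print(cell)
```

The result lists user1's name first, then both versions of user2's age with the newer one ahead of the older one, then user2's gender.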

A row can have any number of columns in each column family. Thinking of it as a map instead of a table makes this very clear.
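A toy in-memory model makes the point concrete. `SparseTable` is a hypothetical class written for this post, not the HBase client API: column families are fixed when the table is created, while qualifiers can be added per row at any time.

```python
class SparseTable:
    """Toy model: row -> family -> qualifier -> value (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)  # declared up front, like a schema
        self.rows = {}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted; missing map levels are created on demand.
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def columns(self, row, family):
        """Return the qualifiers stored for this row and family, sorted."""
        return sorted(self.rows.get(row, {}).get(family, {}))


t = SparseTable(["colfam1"])
t.put("user1", "colfam1", "name", "value-1")
t.put("user2", "colfam1", "age", "value-2")
t.put("user2", "colfam1", "gender", "value-3")
print(t.columns("user1", "colfam1"))  # ['name']
print(t.columns("user2", "colfam1"))  # ['age', 'gender']
```

The two rows hold different sets of columns, and neither stores a placeholder for the columns it lacks.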

persistent and distributed

All data is saved to a file system rather than kept only in memory. The file system can be either Hadoop’s Distributed File System (HDFS) or Amazon’s Simple Storage Service (S3). Both are distributed stores, which helps avoid a single point of failure and supports good query performance and very large storage sizes.

CAP Theorem

There are three primary concerns you must balance when choosing a data management system:

  • Consistency means that each client always has the same view of the data.
  • Availability means that all clients can always read and write.
  • Partition tolerance means that the system works well across physical network partitions.

According to the CAP Theorem, you can only pick two.

References:

  1. http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
  2. http://hadoop-hbase.blogspot.com/2011/12/introduction-to-hbase.html
  3. http://hbase.apache.org/book/
  4. http://blog.nahurst.com/visual-guide-to-nosql-systems