HTablePool example in Java

As the name suggests, the pool lets different HTable instances share resources such as the ZooKeeper connection.
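In its simplest form, usage looks like the sketch below (the class name is mine, and the pool size of 10 simply matches the test that follows; in 0.92, closing a pooled table returns it to the pool):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;

public class PoolSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // At most 10 cached references per table name.
        HTablePool pool = new HTablePool(conf, 10);
        HTableInterface table = pool.getTable("blogposts");
        try {
            // use the table as usual: get, put, scan ...
        } finally {
            table.close(); // returns the instance to the pool rather than destroying it
        }
    }
}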
Suppose you have the following table:

hbase(main):002:0> scan 'blogposts'
ROW                                       COLUMN+CELL
post1                                    column=image:bodyimage, timestamp=1333409506149, value=image2.jpg
post1                                    column=image:header, timestamp=1333409504678, value=image1.jpg
post1                                    column=post:author, timestamp=1333409504583, value=The Author
post1                                    column=post:body, timestamp=1333409504642, value=This is a blog post
post1                                    column=post:title, timestamp=1333409504496, value=Hello World
1 row(s) in 7.1920 seconds

Java Example Code:

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

import org.junit.*;
import static org.junit.Assert.*;

public class HTablePoolTest {

    protected static String TEST_TABLE_NAME = "blogposts";
    protected static String ROW1_STR = "post1";
    protected static String COLFAM1_STR = "image";
    protected static String QUAL1_STR = "bodyimage";

    private final static byte[] ROW1 = Bytes.toBytes(ROW1_STR);
    private final static byte[] COLFAM1 = Bytes.toBytes(COLFAM1_STR);
    private final static byte[] QUAL1 = Bytes.toBytes(QUAL1_STR);

    private final static int MAX = 10;
    private static HTablePool pool;

    @BeforeClass
    public static void runBeforeClass() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        pool = new HTablePool(conf, MAX);

        // Warm up the pool: check out MAX references to the table, then
        // return them all by closing them.
        HTableInterface[] tables = new HTableInterface[MAX];
        for (int n = 0; n < MAX; n++) {
            tables[n] = pool.getTable(TEST_TABLE_NAME);
        }
        for (HTableInterface table : tables) {
            table.close();
        }
    }

    @Test
    public void testHTablePool() throws IOException, InterruptedException,
            ExecutionException {

        Callable<Result> callable = new Callable<Result>() {
            public Result call() throws Exception {
                return get();
            }
        };

        FutureTask<Result> task1 = new FutureTask<Result>(callable);
        FutureTask<Result> task2 = new FutureTask<Result>(callable);

        // Two threads read the same row concurrently through the shared pool.
        Thread thread1 = new Thread(task1, "THREAD-1");
        thread1.start();
        Thread thread2 = new Thread(task2, "THREAD-2");
        thread2.start();

        Result result1 = task1.get();
        System.out.println("Thread1: "
                + Bytes.toString(result1.getValue(COLFAM1, QUAL1)));
        assertEquals("image2.jpg",
                Bytes.toString(result1.getValue(COLFAM1, QUAL1)));

        Result result2 = task2.get();
        System.out.println("Thread2: "
                + Bytes.toString(result2.getValue(COLFAM1, QUAL1)));
        assertEquals("image2.jpg",
                Bytes.toString(result2.getValue(COLFAM1, QUAL1)));
    }

    private Result get() {
        HTableInterface table = pool.getTable(TEST_TABLE_NAME);
        Get get = new Get(ROW1);
        try {
            return table.get(get);
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } finally {
            try {
                // Closing a pooled table returns it to the pool.
                table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}


Java Example Code using HBase Data Model Operations

Please refer to the updated version: https://autofei.wordpress.com/2017/05/23/updated-java-example-code-using-hbase-data-model-operations/

The code is based on HBase version 0.92.1

The four primary data model operations are Get, Put, Scan, and Delete. Operations are applied via HTable instances.

First you need to install HBase. For testing, you can install it on a single machine by following this post.

Create a Java project inside Eclipse; the following libraries need to go into the ‘lib’ subdirectory:

hadoop@ubuntu:~/workspace/HBase$ tree lib
lib
├── commons-configuration-1.8.jar
├── commons-lang-2.6.jar
├── commons-logging-1.1.1.jar
├── hadoop-core-1.0.0.jar
├── hbase-0.92.1.jar
├── log4j-1.2.16.jar
├── slf4j-api-1.5.8.jar
├── slf4j-log4j12-1.5.8.jar
└── zookeeper-3.4.3.jar

Library locations

  • copy hbase-0.92.1.jar from the HBase installation directory
  • copy the remaining jar files from the “lib” subdirectory of the HBase installation directory

Then you need to copy your HBase configuration file hbase-site.xml from the “conf” subdirectory of the HBase installation directory into the Java project directory.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/hbase</value>
  </property>
</configuration>
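Note that HBaseConfiguration.create() only picks up hbase-site.xml if it is on the classpath. If yours is not, you can merge it in explicitly with Hadoop's Configuration.addResource (a sketch; the relative path is an assumption about where you run the program from):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ConfigSketch {
    public static void main(String[] args) {
        Configuration config = HBaseConfiguration.create();
        // Merge the project-local hbase-site.xml on top of the defaults.
        config.addResource(new Path("hbase-site.xml"));
        System.out.println(config.get("hbase.rootdir"));
    }
}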

The whole directory looks like:

hadoop@ubuntu:~/workspace/HBase$ tree
.
├── bin
│   └── HBaseConnector.class
├── hbase-site.xml
├── lib
│   ├── commons-configuration-1.8.jar
│   ├── commons-lang-2.6.jar
│   ├── commons-logging-1.1.1.jar
│   ├── hadoop-core-1.0.0.jar
│   ├── hbase-0.92.1.jar
│   ├── log4j-1.2.16.jar
│   ├── slf4j-api-1.5.8.jar
│   ├── slf4j-log4j12-1.5.8.jar
│   └── zookeeper-3.4.3.jar
└── src
└── HBaseConnector.java

Open a terminal:

  • start HBase: bin/start-hbase.sh
  • start the HBase shell: bin/hbase shell
  • create a table: create 'myLittleHBaseTable', 'myLittleFamily'

Now you can run the code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseConnector {
    public static void main(String[] args) throws IOException {
        // You need a configuration object to tell the client where to connect.
        // When you create an HBaseConfiguration, it reads in whatever you've
        // set in your hbase-site.xml and hbase-default.xml, as long as these
        // can be found on the CLASSPATH.
        Configuration config = HBaseConfiguration.create();

        // This instantiates an HTable object that connects you to the
        // "myLittleHBaseTable" table.
        HTable table = new HTable(config, "myLittleHBaseTable");

        // To add to a row, use Put. A Put constructor takes the name of the
        // row you want to insert into as a byte array. In HBase, the Bytes
        // class has utilities for converting all kinds of Java types to byte
        // arrays. Below, we convert the String "myLittleRow" into a byte
        // array to use as a row key for our update. Once you have a Put
        // instance, you can adorn it by setting the names of columns you want
        // to update on the row, the timestamp to use in your update, etc.
        // If no timestamp is given, the server applies the current time to
        // the edits.
        Put p = new Put(Bytes.toBytes("myLittleRow"));

        // To set the value you'd like to update in the row 'myLittleRow',
        // specify the column family, column qualifier, and value of the table
        // cell you'd like to update. The column family must already exist
        // in your table schema. The qualifier can be anything. All must be
        // specified as byte arrays, as HBase is all about byte arrays. Let's
        // pretend the table 'myLittleHBaseTable' was created with a family
        // 'myLittleFamily'.
        p.add(Bytes.toBytes("myLittleFamily"), Bytes.toBytes("someQualifier"),
                Bytes.toBytes("Some Value"));

        // Once you've adorned your Put instance with all the updates you want
        // to make, commit it as follows. (The HTable#put method takes the Put
        // instance you've been building and pushes the changes you made to
        // HBase.)
        table.put(p);

        // Now, retrieve the data we just wrote. The values that come back are
        // Result instances. Generally, a Result is an object that packages up
        // the HBase return into the form you find most palatable.
        Get g = new Get(Bytes.toBytes("myLittleRow"));
        Result r = table.get(g);
        byte[] value = r.getValue(Bytes.toBytes("myLittleFamily"),
                Bytes.toBytes("someQualifier"));
        // If we convert the value bytes, we should get back 'Some Value', the
        // value we inserted at this location.
        String valueStr = Bytes.toString(value);
        System.out.println("GET: " + valueStr);

        // Sometimes you won't know the row you're looking for. In this case,
        // you use a Scanner. This gives you a cursor-like interface to the
        // contents of the table. To set up a Scanner, create a Scan just as
        // you created a Put and a Get above, then adorn it with column names,
        // etc.
        Scan s = new Scan();
        s.addColumn(Bytes.toBytes("myLittleFamily"),
                Bytes.toBytes("someQualifier"));
        ResultScanner scanner = table.getScanner(s);
        try {
            // Scanners return Result instances. One way to iterate is a while
            // loop like so:
            for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                // Print out the row we found and the columns we were looking
                // for.
                System.out.println("Found row: " + rr);
            }

            // The other approach is to use a foreach loop. Scanners are
            // iterable!
            // for (Result rr : scanner) {
            //     System.out.println("Found row: " + rr);
            // }
        } finally {
            // Make sure you close your scanners when you are done! That's
            // why it is inside a try/finally clause.
            scanner.close();
        }

        // Release the table's resources when finished.
        table.close();
    }
}
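The example above exercises Put, Get, and Scan; the fourth operation, Delete, follows the same pattern. Here is a minimal sketch against the same table (with no further qualification it deletes the whole 'myLittleRow' row written above; deleteColumn(family, qualifier) would remove a single cell instead):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSketch {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "myLittleHBaseTable");
        // A Delete, like a Put, is keyed by row.
        Delete d = new Delete(Bytes.toBytes("myLittleRow"));
        table.delete(d);
        table.close();
    }
}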

Another great Java example from [4]:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.ZooKeeperConnectionException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTest {

	private static Configuration conf = null;
	/**
	 * Initialization
	 */
	static {
		conf = HBaseConfiguration.create();
	}

	/**
	 * Create a table
	 */
	public static void createTable(String tableName, String[] familys)
			throws Exception {
		HBaseAdmin admin = new HBaseAdmin(conf);
		if (admin.tableExists(tableName)) {
			System.out.println("table already exists!");
		} else {
			HTableDescriptor tableDesc = new HTableDescriptor(tableName);
			for (int i = 0; i < familys.length; i++) {
				tableDesc.addFamily(new HColumnDescriptor(familys[i]));
			}
			admin.createTable(tableDesc);
			System.out.println("create table " + tableName + " ok.");
		}
	}

	/**
	 * Delete a table
	 */
	public static void deleteTable(String tableName) throws Exception {
		try {
			HBaseAdmin admin = new HBaseAdmin(conf);
			admin.disableTable(tableName);
			admin.deleteTable(tableName);
			System.out.println("delete table " + tableName + " ok.");
		} catch (MasterNotRunningException e) {
			e.printStackTrace();
		} catch (ZooKeeperConnectionException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Put (or insert) a row
	 */
	public static void addRecord(String tableName, String rowKey,
			String family, String qualifier, String value) throws Exception {
		try {
			HTable table = new HTable(conf, tableName);
			Put put = new Put(Bytes.toBytes(rowKey));
			put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes
					.toBytes(value));
			table.put(put);
			System.out.println("insert recored " + rowKey + " to table "
					+ tableName + " ok.");
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Delete a row
	 */
	public static void delRecord(String tableName, String rowKey)
			throws IOException {
		HTable table = new HTable(conf, tableName);
		List<Delete> list = new ArrayList<Delete>();
		Delete del = new Delete(Bytes.toBytes(rowKey));
		list.add(del);
		table.delete(list);
		System.out.println("deleted record " + rowKey + " ok.");
	}

	/**
	 * Get a row
	 */
	public static void getOneRecord (String tableName, String rowKey) throws IOException{
        HTable table = new HTable(conf, tableName);
        Get get = new Get(Bytes.toBytes(rowKey));
        Result rs = table.get(get);
        // Bytes.toString decodes consistently (UTF-8), unlike new String(...),
        // which depends on the platform's default charset.
        for (KeyValue kv : rs.raw()) {
            System.out.print(Bytes.toString(kv.getRow()) + " ");
            System.out.print(Bytes.toString(kv.getFamily()) + ":");
            System.out.print(Bytes.toString(kv.getQualifier()) + " ");
            System.out.print(kv.getTimestamp() + " ");
            System.out.println(Bytes.toString(kv.getValue()));
        }
    }
	/**
	 * Scan (or list) a table
	 */
	public static void getAllRecord (String tableName) {
        try{
             HTable table = new HTable(conf, tableName);
             Scan s = new Scan();
             ResultScanner ss = table.getScanner(s);
             for (Result r : ss) {
                 for (KeyValue kv : r.raw()) {
                     System.out.print(Bytes.toString(kv.getRow()) + " ");
                     System.out.print(Bytes.toString(kv.getFamily()) + ":");
                     System.out.print(Bytes.toString(kv.getQualifier()) + " ");
                     System.out.print(kv.getTimestamp() + " ");
                     System.out.println(Bytes.toString(kv.getValue()));
                 }
             }
             }
        } catch (IOException e){
            e.printStackTrace();
        }
    }

	public static void main(String[] agrs) {
		try {
			String tablename = "scores";
			String[] familys = { "grade", "course" };
			HBaseTest.createTable(tablename, familys);

			// add record zkb
			HBaseTest.addRecord(tablename, "zkb", "grade", "", "5");
			HBaseTest.addRecord(tablename, "zkb", "course", "", "90");
			HBaseTest.addRecord(tablename, "zkb", "course", "math", "97");
			HBaseTest.addRecord(tablename, "zkb", "course", "art", "87");
			// add record baoniu
			HBaseTest.addRecord(tablename, "baoniu", "grade", "", "4");
			HBaseTest.addRecord(tablename, "baoniu", "course", "math", "89");

			System.out.println("===========get one record========");
			HBaseTest.getOneRecord(tablename, "zkb");

			System.out.println("===========show all record========");
			HBaseTest.getAllRecord(tablename);

			System.out.println("===========del one record========");
			HBaseTest.delRecord(tablename, "baoniu");
			HBaseTest.getAllRecord(tablename);

			System.out.println("===========show all record========");
			HBaseTest.getAllRecord(tablename);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

Reference:

  1. http://hbase.apache.org/docs/current/api/index.html
  2. http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html
  3. http://hbase.apache.org/book/data_model_operations.html
  4. http://lirenjuan.iteye.com/blog/1470645

Installing HBase on a Single Ubuntu Box

In a production cluster, HBase needs Hadoop HDFS and ZooKeeper together, so you should install Hadoop and ZooKeeper first; I will add a separate post for this later. But for testing, a standalone setup is enough.

But it is very helpful to follow or read this post first to warm yourself up: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.

As a best practice, you should create a dedicated user hduser for all Hadoop-related work and install the latest Java (currently Oracle Java 7).

Steps:

  • download HBase from http://hbase.apache.org/; I downloaded “hbase-0.92.1.tar.gz”
  • open a terminal and type the following commands:
    cd /usr/local
    sudo tar -zxf /home/hadoop/Downloads/hbase-0.92.1.tar.gz (change it to your path)
    sudo chown -R hduser:hadoop hbase-0.92.1/
    sudo ln -s hbase-0.92.1 hbase
    sudo chown -R hduser:hadoop hbase
    
  • edit conf/hbase-env.sh, to set correct Java path, for example, in my system: export JAVA_HOME=/usr/lib/jvm/java-7-oracle
  • To avoid errors like “… Unable to find a viable location to assign region …”, change 127.0.1.1 to 127.0.0.1 in /etc/hosts
  • start HBase in a terminal:
    cd hbase
    su hduser
    hduser@ubuntu:/usr/local/hbase$ bin/start-hbase.sh
    starting master, logging to /usr/local/hbase/bin/../logs/hbase-hduser-master-ubuntu.out
    
    hduser@ubuntu:/usr/local/hbase$ bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.92.1, r1298924, Fri Mar  9 16:58:34 UTC 2012
    
    hbase(main):001:0>
    

But one problem with this configuration is that your HBase tables are saved to /tmp/hbase-${user.name}, which means you’ll lose all your data whenever your server reboots (most operating systems clear /tmp on restart). So you might want to edit conf/hbase-site.xml and set hbase.rootdir, the directory you want HBase to write to.

You can write to a local folder. Edit conf/hbase-site.xml and replace the file path with your location:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/hbase</value>
  </property>
</configuration>

Check HBase status at: http://localhost:60030/rs-status

You can also save tables into HDFS; this will be discussed in a later post.


Integrating Hive and HBase (keep updating…)

“Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.”

As a data warehouse, pulling data out of different databases is a basic requirement, as part of Extract, Transform, and Load (ETL).

“HBase is the Hadoop database.”

So it is very natural to have this idea: Hive can operate on HBase, as a storage target or a data source.

Hive storage is based on Hadoop’s underlying append-only filesystem (HDFS), which makes it very good for storing static data. HBase, in contrast, is good for dynamic data, with support for Create, Read, Update, and Delete (CRUD).

Use Case

HBase in a nutshell

It can be confusing if you read “HBase: The Definitive Guide”, which states that HBase is column-oriented storage. But logically, it is a nested HashMap, and it is the open-source implementation of Google BigTable. The “Data Model” is:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

sparse, sorted, multidimensional Map

It is a nested key-value structure. For a sample table:

hbase(main):051:0> scan 'testtable'
ROW                          COLUMN+CELL
user1                       column=colfam1:name, timestamp=1331592953239, value=value-1
user2                       column=colfam1:age, timestamp=1331592973284, value=value-2
user2                       column=colfam1:gender, timestamp=1331593007379, value=value-3

Turn this into a nested HashMap (written here in Perl-style hash notation):

{
  user1: {
    colfam1: {
      name: value-1
    }
  }
  user2: {
    colfam1: {
      age: value-2
      gender: value-3
    }
  }
}

Here, the column family and the column name are just different levels of the map.

“Columns in HBase are grouped into column families. All column members of a column family have a common prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.”[3]

Sorting is a distinctive property of HBase.

All data model operations in HBase return data in sorted order: first by row, then by column family, followed by column qualifier, and finally timestamp (sorted in reverse, so newest records are returned first).[3]

A row can have any number of columns in each column family. Thinking of it as a map instead of a table makes this very clear.
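To make the map analogy concrete, here is the ‘testtable’ above rebuilt as plain Java maps (a sketch only: real HBase keys and values are byte arrays, and every cell also carries a timestamp):

import java.util.HashMap;
import java.util.Map;

public class NestedMapSketch {
    public static void main(String[] args) {
        // row key -> column family -> qualifier -> value (timestamps omitted)
        Map<String, Map<String, Map<String, String>>> table =
                new HashMap<String, Map<String, Map<String, String>>>();

        Map<String, String> colfam1 = new HashMap<String, String>();
        colfam1.put("age", "value-2");
        colfam1.put("gender", "value-3");

        Map<String, Map<String, String>> user2 =
                new HashMap<String, Map<String, String>>();
        user2.put("colfam1", colfam1);
        table.put("user2", user2);

        // A "column" lookup is just two more map lookups:
        System.out.println(table.get("user2").get("colfam1").get("age")); // value-2
    }
}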

persistent and distributed

All data is saved to the file system, not kept in memory. The file system can be either Hadoop’s Distributed File System (HDFS) or Amazon’s Simple Storage Service (S3), which protects against a single point of failure while providing good query performance and very large storage capacity.

CAP Theorem

There are three primary concerns you must balance when choosing a data management system:

  • Consistency means that each client always has the same view of the data.
  • Availability means that all clients can always read and write.
  • Partition tolerance means that the system works well across physical network partitions.

According to the CAP Theorem, you can only pick two.

Reference:

  1. http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
  2. http://hadoop-hbase.blogspot.com/2011/12/introduction-to-hbase.html
  3. http://hbase.apache.org/book/
  4. http://blog.nahurst.com/visual-guide-to-nosql-systems

All about PostgreSQL

This is a temporary list of my old PostgreSQL writings; I will try to repost them (at least some) here soon.

Set up PostgreSQL under Ubuntu

Just install Ubuntu 10.04 LTS (“Lucid Lynx”); through the Package Manager you can search for “postgresql” and install version 8.4.5. Then search for “pgadmin”, the GUI tool for managing PostgreSQL instead of using the command line. To connect to the database, you need to set a password for the default user “postgres”. Follow these commands:

  1. sudo -u postgres psql postgres (connect to database)
  2. \password postgres (change password)

Open pgAdmin III

https://help.ubuntu.com/community/PostgreSQL