Install and run Mahout on a single Linux box

This post shows you how to install and run Mahout on a stand-alone Linux box.

Prerequisites for Building Mahout

  • Java JDK >=1.6
  • Maven
  • SVN

Steps:

  • svn co http://svn.apache.org/repos/asf/mahout/trunk
  • change to the checked-out directory
  • mvn install
  • change to the core directory
  • mvn compile
  • mvn install
  • change to the examples directory
  • mvn compile
  • mvn install

Download the test data from http://www.grouplens.org/node/73; please download the “MovieLens 1M” dataset.

Run test example

Note: replace the test data file path with yours. (A Java sketch of roughly what this evaluator run does appears after the log output below.)

  • mvn -e exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /home/hduser/trunk/examples/ml-1m/ratings.dat"
    + Error stacktraces are turned on.
    [INFO] Scanning for projects...
    [INFO] Searching repository for plugin with prefix: 'exec'.
    [INFO] ------------------------------------------------------------------------
    [INFO] Building Mahout Examples
    [INFO]    task-segment: [exec:java]
    [INFO] ------------------------------------------------------------------------
    [INFO] Preparing exec:java
    [INFO] No goals needed for project - skipping
    [INFO] [exec:java {execution: default-cli}]
    12/03/28 14:08:33 INFO file.FileDataModel: Creating FileDataModel for file /tmp/ratings.txt
    12/03/28 14:08:33 INFO file.FileDataModel: Reading file info...
    12/03/28 14:08:34 INFO file.FileDataModel: Processed 1000000 lines
    12/03/28 14:08:34 INFO file.FileDataModel: Read lines: 1000209
    12/03/28 14:08:35 INFO model.GenericDataModel: Processed 6040 users
    12/03/28 14:08:35 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of GroupLensDataModel
    12/03/28 14:08:35 INFO model.GenericDataModel: Processed 1753 users
    12/03/28 14:08:36 INFO slopeone.MemoryDiffStorage: Building average diffs...
    12/03/28 14:09:36 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 1719 users
    12/03/28 14:09:36 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 1719 tasks in 1 threads
    12/03/28 14:09:36 INFO eval.StatsCallable: Average time per recommendation: 343ms
    12/03/28 14:09:36 INFO eval.StatsCallable: Approximate memory used: 448MB / 798MB
    12/03/28 14:09:36 INFO eval.StatsCallable: Unable to recommend in 0 cases
    12/03/28 14:09:43 INFO eval.StatsCallable: Average time per recommendation: 7ms
    12/03/28 14:09:43 INFO eval.StatsCallable: Approximate memory used: 510MB / 798MB
    12/03/28 14:09:43 INFO eval.StatsCallable: Unable to recommend in 13 cases
    12/03/28 14:09:52 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.7149488038906546
    12/03/28 14:09:52 INFO grouplens.GroupLensRecommenderEvaluatorRunner: 0.7149488038906546
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESSFUL
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 1 minute 26 seconds
    [INFO] Finished at: Wed Mar 28 14:09:53 PDT 2012
    [INFO] Final Memory: 53M/761M
    [INFO] ------------------------------------------------------------------------
    

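For orientation, the run above splits each user's ratings into roughly a 90% training set and a 10% test set, builds a slope-one recommender on the training part, and reports the average absolute difference between predicted and held-out ratings (the 0.7149… printed at the end; lower is better). The following is a minimal sketch of that flow using the public Taste API. It is not the actual source of GroupLensRecommenderEvaluatorRunner, and it assumes a plain comma-separated ratings file, whereas the real runner uses GroupLensDataModel to handle the “::”-delimited ratings.dat (hence the temporary /tmp/ratings.txt in the log).

package com.autofei; // hypothetical package, only for this sketch

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class EvaluatorSketch {

    public static void main(String[] args) throws Exception {
        // assumption: a comma-separated copy of the MovieLens ratings
        DataModel model = new FileDataModel(new File("/path/to/ratings.csv"));

        // build a slope-one recommender for each training split
        RecommenderBuilder builder = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel trainingData) throws TasteException {
                return new SlopeOneRecommender(trainingData);
            }
        };

        // 0.9 = use 90% of each user's ratings for training, test on the rest;
        // 1.0 = include all users in the evaluation
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
        System.out.println("Evaluation result: " + score);
    }
}
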
Creating a simple recommender

Create a Maven project

mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.autofei -DartifactId=mahoutrec

This creates an empty project called mahoutrec with the package namespace com.autofei. Now change to the mahoutrec directory. You can try out the new project by running:

mvn compile
mvn exec:java -Dexec.mainClass="com.autofei.App"

Set the project dependencies
Edit pom.xml; remember to change the Mahout version to match yours (in my case it is 0.7-SNAPSHOT). An example file:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>0.7-SNAPSHOT</version>
    <relativePath>../pom.xml</relativePath>
  </parent>
  <groupId>com.autofei</groupId>
  <artifactId>mahoutrec</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>mahoutrec</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.7-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.7-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.7-SNAPSHOT</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>

Test data
Put this data into a file named dummy-bool.csv under the datasets directory (a quick sanity-check sketch follows the data):

#userId,itemId
1,3
1,4
2,44
2,46
3,3
3,5
3,6
4,3
4,5
4,11
4,44
5,1
5,2
5,4
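
Because each line contains only a user ID and an item ID (no rating value), FileDataModel loads this as boolean, preference-less data, and the leading “#” line is skipped as a comment. As an optional sanity check, a small sketch like the following can confirm the model loads as expected; the class name DataCheck is my own, not from the original post.

package com.autofei;

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class DataCheck {

    public static void main(String[] args) throws Exception {
        // load the boolean test data set
        DataModel model = new FileDataModel(new File("datasets/dummy-bool.csv"));
        System.out.println("users: " + model.getNumUsers()); // expect 5
        System.out.println("items: " + model.getNumItems()); // expect 9
    }
}

Run it with mvn compile followed by mvn exec:java -Dexec.mainClass="com.autofei.DataCheck".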

Create a Java file under src/main/java/com/autofei/ named UnresystBoolRecommend.java:

package com.autofei;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

import org.apache.commons.cli2.OptionException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class UnresystBoolRecommend {

    public static void main(String... args) throws FileNotFoundException, TasteException, IOException, OptionException {

        // create data source (model) - from the csv file
        File ratingsFile = new File("datasets/dummy-bool.csv");
        DataModel model = new FileDataModel(ratingsFile);

        // create a simple recommender on our data
        CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));

        // for all users
        for (LongPrimitiveIterator it = model.getUserIDs(); it.hasNext();) {
            long userId = it.nextLong();

            // get the recommendations for the user
            List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, 10);

            // if empty write something
            if (recommendations.size() == 0) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.println(": no recommendations");
            }

            // print the list of recommendations for each
            for (RecommendedItem recommendedItem : recommendations) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.print(": ");
                System.out.println(recommendedItem);
            }
        }
    }
}

Run the code

  • mvn compile
  • mvn exec:java -Dexec.mainClass="com.autofei.UnresystBoolRecommend"
    
    [INFO] Scanning for projects...
    [INFO] Searching repository for plugin with prefix: 'exec'.
    [INFO] ------------------------------------------------------------------------
    [INFO] Building mahoutrec
    [INFO]    task-segment: [exec:java]
    [INFO] ------------------------------------------------------------------------
    [INFO] Preparing exec:java
    [INFO] No goals needed for project - skipping
    [INFO] [exec:java {execution: default-cli}]
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    User 1: RecommendedItem[item:5, value:1.0]
    User 2: RecommendedItem[item:5, value:1.0]
    User 2: RecommendedItem[item:3, value:1.0]
    User 3: no recommendations
    User 4: no recommendations
    User 5: RecommendedItem[item:5, value:1.0]
    User 5: RecommendedItem[item:3, value:1.0]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESSFUL
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 3 seconds
    [INFO] Finished at: Wed Mar 28 16:18:31 PDT 2012
    [INFO] Final Memory: 14M/35M
    [INFO] ------------------------------------------------------------------------

From here, you can test other algorithms inside Mahout, for example the user-based variant sketched below.
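
A user-based recommender over the same boolean data could look like the following sketch. The class name UserBasedBoolRecommend, the choice of TanimotoCoefficientSimilarity (a similarity measure that suits preference-less data), and the neighborhood size of 2 are my own illustrative choices, not from the original post.

package com.autofei;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedBoolRecommend {

    public static void main(String... args) throws Exception {
        // same boolean data set as before
        DataModel model = new FileDataModel(new File("datasets/dummy-bool.csv"));

        // Tanimoto similarity works on boolean (preference-less) data
        UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        // neighborhood of 2 users, an arbitrary choice for this tiny data set
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        for (LongPrimitiveIterator it = model.getUserIDs(); it.hasNext();) {
            long userId = it.nextLong();
            List<RecommendedItem> recommendations = recommender.recommend(userId, 10);
            if (recommendations.isEmpty()) {
                System.out.println("User " + userId + ": no recommendations");
            }
            for (RecommendedItem item : recommendations) {
                System.out.println("User " + userId + ": " + item);
            }
        }
    }
}

Run it the same way: mvn compile, then mvn exec:java -Dexec.mainClass="com.autofei.UserBasedBoolRecommend".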


Installing HBase on a Single Ubuntu Box

In a production cluster, HBase needs Hadoop HDFS and ZooKeeper, so you should install Hadoop and ZooKeeper first; I will add a separate post about that later. For testing, though, a stand-alone installation is enough.

It is also very helpful to follow or read this post first to warm up: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.

As a best practice, you should create a dedicated user hduser for all Hadoop-related work and install the latest Java (currently Oracle Java 7).

Steps:

  • download HBase from http://hbase.apache.org/; I downloaded “hbase-0.92.1.tar.gz”
  • open a terminal and type the following commands:
    cd /usr/local
    sudo tar -zxf /home/hadoop/Downloads/hbase-0.92.1.tar.gz (change it to your path)
    sudo chown -R hduser:hadoop hbase-0.92.1/
    sudo ln -s hbase-0.92.1 hbase
    sudo chown -R hduser:hadoop hbase
    
  • edit conf/hbase-env.sh to set the correct Java path, for example, on my system: export JAVA_HOME=/usr/lib/jvm/java-7-oracle
  • to avoid errors like “… Unable to find a viable location to assign region …”, change 127.0.1.1 to 127.0.0.1 in /etc/hosts
  • start HBase in a terminal (a Java client sketch follows the shell session below):
    cd hbase
    su hduser
    hduser@ubuntu:/usr/local/hbase$ bin/start-hbase.sh
    starting master, logging to /usr/local/hbase/bin/../logs/hbase-hduser-master-ubuntu.out
    
    hduser@ubuntu:/usr/local/hbase$ bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.92.1, r1298924, Fri Mar  9 16:58:34 UTC 2012
    
    hbase(main):001:0>
    
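
Once the shell works, you can also reach the standalone instance from a Java client. The following is a minimal sketch against the 0.92-era client API; the table name testtable and column family colfam1 are just examples (the same names appear in the “HBase in a nutshell” section below), and it assumes the HBase jar and its dependencies (Hadoop, ZooKeeper, commons libraries) are on your classpath. It is a starting point, not a complete client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {

    public static void main(String[] args) throws Exception {
        // picks up hbase-site.xml / hbase-default.xml from the classpath
        Configuration conf = HBaseConfiguration.create();

        // create the table if it does not exist yet
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (!admin.tableExists("testtable")) {
            HTableDescriptor desc = new HTableDescriptor("testtable");
            desc.addFamily(new HColumnDescriptor("colfam1"));
            admin.createTable(desc);
        }

        // write one cell and read it back
        HTable table = new HTable(conf, "testtable");
        Put put = new Put(Bytes.toBytes("user1"));
        put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("name"), Bytes.toBytes("value-1"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("user1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("name"))));

        table.close();
    }
}
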

One problem with this configuration is that your HBase tables are saved to /tmp/hbase-${user.name}, which means you’ll lose all your data whenever your server reboots (most operating systems clear /tmp on restart). So you might want to edit conf/hbase-site.xml and set the directory you want HBase to write to, hbase.rootdir.

You can write to a local folder. Edit conf/hbase-site.xml and replace the file path with your own location:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/hbase</value>
  </property>
</configuration>

Check HBase status at: http://localhost:60030/rs-status

You can also save tables into HDFS; this will be discussed in a later post.

Reference:

  1. http://hbase.apache.org/book/quickstart.html

Integrating Hive and HBase (keep updating…)

“Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.”

As a data warehouse, pulling data out of different databases is a basic requirement, as part of Extract, Transform and Load (ETL).

“HBase is the Hadoop database.”

So it is very natural to have this idea: Hive can operate on HBase, either as a storage target or as a data source.

Hive storage is based on Hadoop’s underlying append-only filesystem (HDFS), which is well suited for static data. HBase, on the other hand, is good for dynamic data, with support for Create, Read, Update and Delete (CRUD) operations.

Use Case

Reference:

Prepackaged Hadoop Distribution List

Cloudera’s Distribution Including Apache Hadoop (CDH)

CDH3 Update 3 Packaging

To view the overall release notes for CDH3 Update 3 (CDH3U3), click here.

Project | Package Version | Tarball Version | Patch Level
Apache Hadoop 0.20 | hadoop-0.20.2+923.195 | hadoop-0.20.2-cdh3u3.tar.gz | 923.195
Apache Flume | flume-0.9.4+25.40 | flume-0.9.4-cdh3u3.tar.gz | 25.40
Apache HBase | hbase-0.90.4+49.137 | hbase-0.90.4-cdh3u3.tar.gz | 49.137
Apache Hive | hive-0.7.1+42.36 | hive-0.7.1-cdh3u3.tar.gz | 42.36
Apache Hue | hue-1.2.0.0+114.20 | hue-1.2.0-cdh3u3.tar.gz | 114.20
Apache Mahout | mahout-0.5+9.3 | mahout-0.5-cdh3u3.tar.gz | 9.3
Apache Oozie | oozie-2.3.2+27.12 | oozie-2.3.2-cdh3u3.tar.gz | 27.12
Apache Pig | pig-0.8.1+28.26 | pig-0.8.1-cdh3u3.tar.gz | 28.26
Apache Sqoop | sqoop-1.3.0+5.68 | sqoop-1.3.0-cdh3u3.tar.gz | 5.68
Apache Whirr | whirr-0.5.0+4.8 | whirr-0.5.0-cdh3u3.tar.gz | 4.8
Apache ZooKeeper | zookeeper-3.3.4+19 | zookeeper-3.3.4-cdh3u3.tar.gz | 19

CDH4 Beta 1 Packaging

To view the overall release notes for CDH4 Beta 1 (CDH4B1), click here.

Project | Package Version | Tarball Version | Patch Level
Apache Hadoop 0.23 | hadoop-0.23.0+161 | hadoop-0.23.0-cdh4b1.tar.gz | 161
Apache HBase | hbase-0.92.0+8 | hbase-0.92.0-cdh4b1.tar.gz | 8
Apache Hive | hive-0.8.0+20 | hive-0.8.0-cdh4b1.tar.gz | 20
Apache MRv1 | mr1-0.20.2+1163 | mr1-0.23.0-mr1-cdh4b1.tar.gz | 1163
Apache Pig | pig-0.9.2+12 | pig-0.9.2-cdh4b1.tar.gz | 12
Apache Sqoop | sqoop-1.4.0+13 | sqoop-1.4.0-cdh4b1.tar.gz | 13
Apache ZooKeeper | zookeeper-3.4.1+7 | zookeeper-3.4.1-cdh4b1.tar.gz | 7

Download: https://ccp.cloudera.com/display/SUPPORT/Downloads

Hortonworks Data Platform (HDP)

You need to register for the “Technology Preview Program” first. Link: http://hortonworks.com/technology/techpreview/

MapR M3 or M5

Versions: M3 Edition and M5 Edition. The original comparison table maps each of the features below to the edition(s) that include it; the feature list is:

  • Complete, Tested, Stable, Integrated
  • Hive, Pig, HBase, Sqoop, Mahout, Flume, and more
  • 3x Performance
  • Direct Access NFS™
  • Unlimited Scalability
  • Realtime Data Flows
  • MapR Heatmap™
  • Provisioning Control
  • Built-in Monitoring
  • Dependable, Reliable
  • Alerts and Alarms
  • Lockless Storage Services
  • Job Tracker HA
  • Distributed Namenode HA™
  • NFS Multinode HA
  • Mirroring
  • Snapshots
  • Data Placement Control
  • Support

Download: http://www.mapr.com/download (registration required)

HBase in a nutshell

It can be confusing if you read “HBase: The Definitive Guide”, which states that HBase is column-oriented storage. Logically, though, it is a nested HashMap. HBase is the open-source implementation of Google’s Bigtable, whose “Data Model” is described as:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

sparse, sorted, multidimensional Map

It is a nested key–value structure. For a sample table:

hbase(main):051:0> scan 'testtable'
ROW                          COLUMN+CELL
user1                       column=colfam1:name, timestamp=1331592953239, value=value-1
user2                       column=colfam1:age, timestamp=1331592973284, value=value-2
user2                       column=colfam1:gender, timestamp=1331593007379, value=value-3

Turn this into a nested HashMap (written here in Perl-style hash notation):

{
user1: {
  colfam1: {
    name: value-1
  }
}
user2: {
  colfam1: {
    age: value-2
    gender: value-3
   }
 }
}

Here, the column family and the column name are just different levels of the map.
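
To make the “nested, sorted map” picture concrete, here is a small plain-Java sketch (no HBase classes involved) that models the testtable rows above with TreeMaps. Real HBase additionally keys every cell by timestamp and stores raw byte arrays rather than strings; those details are left out here.

import java.util.TreeMap;

public class NestedMapModel {

    public static void main(String[] args) {
        // row key -> column family -> column qualifier -> value
        // TreeMap keeps every level sorted, like HBase's sorted key space.
        TreeMap<String, TreeMap<String, TreeMap<String, String>>> table =
                new TreeMap<String, TreeMap<String, TreeMap<String, String>>>();

        put(table, "user1", "colfam1", "name", "value-1");
        put(table, "user2", "colfam1", "age", "value-2");
        put(table, "user2", "colfam1", "gender", "value-3");

        // prints rows, families and qualifiers in sorted order:
        // {user1={colfam1={name=value-1}}, user2={colfam1={age=value-2, gender=value-3}}}
        System.out.println(table);
    }

    private static void put(TreeMap<String, TreeMap<String, TreeMap<String, String>>> table,
                            String row, String family, String qualifier, String value) {
        if (!table.containsKey(row)) {
            table.put(row, new TreeMap<String, TreeMap<String, String>>());
        }
        if (!table.get(row).containsKey(family)) {
            table.get(row).put(family, new TreeMap<String, String>());
        }
        table.get(row).get(family).put(qualifier, value);
    }
}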

“Columns in HBase are grouped into column families. All column members of a column family have a common prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.”[3]

Sorting is a distinctive property of HBase.

All data model operations in HBase return data in sorted order: first by row, then by column family, followed by column qualifier, and finally by timestamp (sorted in reverse, so newest records are returned first).[3]

A row can have any number of columns in each column family. Thinking of it as a map instead of a table makes this very clear.

persistent and distributed

All data is saved to the file system, not kept only in memory. The file system can be either Hadoop’s Distributed File System (HDFS) or Amazon’s Simple Storage Service (S3), which protects against a single point of failure and supports good query performance and very large storage sizes.

CAP Theorem

There are three primary concerns you must balance when choosing a data management system:

  • Consistency means that each client always has the same view of the data.
  • Availability means that all clients can always read and write.
  • Partition tolerance means that the system works well across physical network partitions.

According to the CAP Theorem, you can only pick two.

Reference:

  1. http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
  2. http://hadoop-hbase.blogspot.com/2011/12/introduction-to-hbase.html
  3. http://hbase.apache.org/book/
  4. http://blog.nahurst.com/visual-guide-to-nosql-systems