[Revisit] Eclipse Hadoop plug-in under Ubuntu Linux

I had a post at: http://webpages.uncc.edu/~fxu/Programming/Eclipse%20hadoop%20plug-in.htm

Set up Hadoop

First of all, you should set up Hadoop properly, especially the listening ports. Check [5] for more details. My Hadoop runs in single-node mode, so go to the section “Pseudo-Distributed Operation” and follow the steps there. My Hadoop lives under “/usr/local/hadoop”.

Change conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Change conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Change conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
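
With the three files in place, you can format HDFS and bring up the daemons. A minimal sketch using the standard 0.20 quickstart commands, assuming the install directory above:

cd /usr/local/hadoop
# format the namenode (first run only; this wipes any existing HDFS metadata)
bin/hadoop namenode -format
# start the HDFS and Map/Reduce daemons
bin/start-all.sh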

Install Java 1.6

The default Java on Ubuntu 8 is gij 1.5. You can install Java 1.6 from “Synaptic Package Manager”. After that, you should point the “java” command at the new version.

hadoop@ubuntu:~$ java -version
java version "1.5.0"
gij (GNU libgcj) version 4.3.2
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
hadoop@ubuntu:~$ ls -l /usr/bin/java
lrwxrwxrwx 1 root root 22 2010-01-21 00:07 /usr/bin/java -> /etc/alternatives/java
hadoop@ubuntu:~$ ls -l /usr/bin/javac
lrwxrwxrwx 1 root root 23 2010-01-21 00:07 /usr/bin/javac -> /etc/alternatives/javac
hadoop@ubuntu:~$ ls /usr/lib/jvm/ -l
total 8
drwxr-xr-x 7 root root 4096 2010-01-21 12:19 java-1.5.0-gcj-4.3-1.5.0.0
lrwxrwxrwx 1 root root 19 2010-01-21 00:06 java-6-sun -> java-6-sun-1.6.0.14
drwxr-xr-x 8 root root 4096 2010-01-21 00:06 java-6-sun-1.6.0.14
lrwxrwxrwx 1 root root 26 2010-01-21 12:18 java-gcj -> java-1.5.0-gcj-4.3-1.5.0.0

hadoop@ubuntu:~$ sudo rm /usr/bin/javac
hadoop@ubuntu:~$ sudo rm /usr/bin/java
hadoop@ubuntu:~$ sudo ln -s /usr/lib/jvm/java-6-sun/bin/javac /usr/bin/javac
hadoop@ubuntu:~$ sudo ln -s /usr/lib/jvm/java-6-sun/bin/java /usr/bin/java
hadoop@ubuntu:~$ java -version
java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode)
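
An alternative to replacing the symlinks by hand is Debian’s alternatives system, which manages those links for you (the interactive prompt lets you pick the Sun JVM):

sudo update-alternatives --config java
sudo update-alternatives --config javac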

Install Eclipse 3.5

You can install Eclipse 3.2 directly from “Synaptic Package Manager”, but that version does not match the plug-in requirement. Manually download 3.5 from [1] and check [2] for how to install it. In my case, I installed Eclipse 3.5 under “/home/hadoop/bin/packages/eclipse3.5”.

If you are using Ubuntu 9, you can install 3.5 directly from “Synaptic Package Manager”.

Install Eclipse plug-in

Please check the official document [3] to get a basic idea, then download the plug-in from [4]. I downloaded it to the desktop. Copy the file into the Eclipse plug-in directory from a terminal:

cp /home/hadoop/Desktop/hadoop-0.20.1-eclipse-plugin.jar /home/hadoop/bin/packages/eclipse3.5/plugins/

Set up Eclipse

Open Eclipse and go to the menu “Window -> Preferences”:

  • Set up “Hadoop Map/Reduce” so Eclipse knows where the Hadoop installation is.

  • Set up “Installed JREs” to use Java 1.6.

Now start Hadoop manually, unless it already starts automatically.

Go to “Window -> Open Perspective -> Other…” and choose “Map/Reduce”.

Right-click in the perspective and choose “New Hadoop Location…”. The port numbers come from the configuration files above.

You can also operate on the Hadoop file system directly through “DFS Locations”.

There are several updates for that post:

  • I am now using Ubuntu 10.04, Eclipse 3.5.2 and hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar.
  • If you installed Eclipse from the package manager, there is no directory “/home/hadoop/bin/packages/eclipse3.5/plugins/”. Instead, copy the downloaded jar file to “/usr/lib/eclipse/plugins/” with sudo.
  • Check the HDFS and Map/Reduce port numbers in “conf/core-site.xml” and “conf/mapred-site.xml”; in my setup they are 54310 and 54311 respectively (see the snippet after this list).
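
For reference, the relevant entries would look something like this (a sketch using the 54310/54311 values above; only the host and port parts matter here):

<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>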

Data analysis in Excel

Excel is a common tool that we use very frequently, but some powerful functions stay hidden until we mine them out. (I will keep this post updated.)

FREQUENCY function [1]

“This useful function can analyse a series of values and summarise them into a number of specified ranges.” In a word: frequency distributions. Once you have the distribution, you can draw a histogram [2] to show it.
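
A minimal sketch (the cell ranges are just an example): with raw values in A2:A20 and bin upper bounds in E2:E4, select F2:F5, type the formula below, and press Ctrl+Shift+Enter to enter it as an array formula. The extra fifth cell catches everything above the last bin.

=FREQUENCY(A2:A20, E2:E4)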

Reference:

  1. http://www.meadinkent.co.uk/xlfreq.htm
  2. http://www.treeplan.com/BetterHistogram_20041117_1555.htm

Install PostgreSQL 9.0 from source code at Ubuntu

System: Ubuntu 10.04 and PostgreSQL 9.0.3

Download the source code from: http://www.postgresql.org/ftp/source/

Unpack the source and go into the resulting directory. Everything that follows happens in a terminal, so open one:

hadoop@fxu-t60:~/Downloads/postgresql-9.0.3$ ./configure
....
....
configure: error: readline library not found
If you have readline already installed, see config.log for details on the failure.  It is possible the compiler isn't looking in the proper directory.
Use --without-readline to disable readline support.

Install the readline development package to fix this:

sudo apt-get install libreadline6-dev

Next problem you might see:

hadoop@fxu-t60:~/Downloads/postgresql-9.0.3$ ./configure
....
....
configure: error: zlib library not found
If you have zlib already installed, see config.log for details on the
failure.  It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.

Install the zlib development package to fix this:

sudo apt-get install zlib1g-dev

Now everything is ready to go:

./configure
sudo make
sudo make install

Then create the data directory that will hold all the data files, and do some setup:

sudo mkdir /usr/local/pgsql/data
sudo adduser postgres
sudo chown postgres /usr/local/pgsql/data
su - postgres

Then initialize the database:

postgres@fxu-t60:~$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale en_US.UTF-8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".

fixing permissions on existing directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.

Success. You can now start the database server using:

 /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
or
 /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start

Then create log directory and a sample database:

postgres@fxu-t60:~$ cd /usr/local/pgsql/data/
postgres@fxu-t60:/usr/local/pgsql/data$ mkdir log
postgres@fxu-t60:/usr/local/pgsql/data$ /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data >/usr/local/pgsql/data/log/logfile 2>&1 &
[1] 16002
postgres@fxu-t60:/usr/local/pgsql/data$ cd ..
postgres@fxu-t60:/usr/local/pgsql$ bin/createdb mydb
postgres@fxu-t60:/usr/local/pgsql$ bin/psql mydb
psql (9.0.3)
Type "help" for help.

mydb=#
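
From that prompt, a quick sanity check (the table is just a throwaway example):

mydb=# CREATE TABLE smoke_test (id int, note text);
CREATE TABLE
mydb=# INSERT INTO smoke_test VALUES (1, 'hello');
INSERT 0 1
mydb=# SELECT * FROM smoke_test;
 id | note
----+-------
  1 | hello
(1 row)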

(Optional) Change the postgres user’s password: sudo passwd postgres

(Optional) To start the server automatically at every boot, add the following line to your /etc/rc.local file:

su -c '/usr/local/pgsql/bin/pg_ctl start -l /usr/local/pgsql/data/log/logfile -D /usr/local/pgsql/data' postgres

All about PostgreSQL

This is a temporary list of my old PostgreSQL writings; I will try to repost them (at least some of them) here soon.

[Revisit] Install Hadoop and Hive on Ubuntu as a single-node Hadoop cluster

I installed the old 0.17 version at least one year ago, but things change fast: the new 0.20 version introduces a new API (I like the new one, very clean and easy to understand). Ubuntu has also reached its 10.04 LTS (Lucid Lynx) release. I have kept using the Cloudera VM for years (it saves a lot of time if you want a good first look before jumping in, and for demonstrations); it handles all the “dirty work” like installation and configuration for us. But this time I will build everything from scratch on my new SSD drive.

I do not want to simply copy the steps from other blogs here, so please refer to the first reference for detailed instructions (it is really great!). I will just add some tips and troubleshooting notes here:


fxu@fxu-t60:~$ sudo update-java-alternatives -s java-6-sun
update-alternatives: error: no alternatives for mozilla-javaplugin.so.
update-alternatives: error: no alternatives for xulrunner-1.9-javaplugin.so.
update-alternatives: error: no alternatives for mozilla-javaplugin.so.
update-alternatives: error: no alternatives for xulrunner-1.9-javaplugin.so.

You can fix this by installing the Firefox Java plug-in:

sudo apt-get install sun-java6-plugin

If you hit another error, please refer to reference 3.

You also need to set the JAVA_HOME and PATH variables. Edit your $HOME/.bash_profile (per user) or /etc/profile (system-wide). To open your .bash_profile:

$ vi $HOME/.bash_profile

Append the following lines (then reload with “source $HOME/.bash_profile”):

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$JAVA_HOME/bin

My Ubuntu did not have an SSH server installed yet; install one with the following command:

sudo apt-get install openssh-server
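
Hadoop also needs passwordless SSH to localhost. A minimal setup, following the tutorial in reference 1 (the empty passphrase keeps it non-interactive):

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost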

Type “exit” to quit the ssh login session.

But you cannot run the hadoop command directly from a terminal unless you spell out /usr/local/hadoop/bin/hadoop. To solve this, and also to give the hadoop user admin privileges:

sudo adduser hadoop admin

export PATH=$PATH:/usr/local/hadoop/bin

Installing Hadoop itself is straightforward:

hadoop@fxu-t60:/usr/local$ sudo tar xzf /home/hadoop/Desktop/Software/hadoop-0.18.3.tar.gz
sudo chown -R hadoop:hadoop hadoop-0.18.3/
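
One step that is easy to forget (this is the standard single-node setup, not anything specific to this box): point conf/hadoop-env.sh at the JDK, otherwise the start scripts cannot find Java.

# in /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun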

How do you start Hadoop after rebooting the system? Assuming Hadoop is installed at /usr/local/hadoop, log in as user hadoop and type:

cd /usr/local/hadoop/bin/
./start-all.sh
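
To verify that the daemons actually came up, jps (shipped with the JDK) should list them all:

jps
# expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (PIDs will differ)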

./hadoop fs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2011-02-17 12:08 /user/hadoop/wordcount-output
-rw-r--r--   1 hadoop supergroup         44 2011-02-17 12:07 /user/hadoop/wordcountTest.txt

I installed Hive at “/usr/local/hive” from a stable release, not from the source code, and I run it as user “hadoop”, not “root”. By default Hive uses a warehouse directory called “/user/hive/warehouse”; you can change it by editing “/usr/local/hive/conf/hive-default.xml”, but I keep the default. Add the Hadoop path before you run it:

export PATH=$PATH:/usr/local/hadoop/bin

When I tried to run Hive, I got the following error:

hadoop@fxu-t60:/usr/local/hive$ bin/hive
Invalid maximum heap size: -Xmx4096m
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine.
hadoop@fxu-t60:/usr/local/hive$ bin/hive --service hiveserver
Invalid maximum heap size: -Xmx4096m
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine.

To solve this, edit “hive/bin/ext/util/execHiveCmd.sh” and change HADOOP_HEAPSIZE=4096 to a size appropriate for your machine. (Also note that as of Hive 0.6 there is no longer any need to create the related directory in HDFS before a table can be created.)

hadoop@fxu-t60:/usr/local/hive$ vi bin/ext/util/execHiveCmd.sh
hadoop@fxu-t60:/usr/local/hive$ bin/hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201102270944_934721366.txt
hive> create table pokes (foo INT, bar STRING);
OK
Time taken: 7.305 seconds
hive>
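
To take it one step further, the Getting Started guide (reference 4) loads its sample data into the pokes table; a sketch, assuming you run Hive from the release directory so the relative path resolves:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> SELECT * FROM pokes LIMIT 5;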

Reference:

  1. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  2. http://www.hackido.com/2010/05/install-hadoop-and-hive-on-ubuntu-lucid.html
  3. http://ubuntuforums.org/showthread.php?t=831235
  4. http://wiki.apache.org/hadoop/Hive/GettingStarted

How to insert source code in Google Docs or WordPress

For Google Docs-like documents, go to this website: http://colorer.sourceforge.net/php/, input what you want, and select the proper language. The output is colored in a nice style; simply copy it into your document.

For WordPress, just wrap your code in [sourcecode] tags.

To accomplish this, wrap the code in [sourcecode language="java"] … [/sourcecode] directly in the “Visual” edit interface (writing HTML is not necessary). Most common languages are supported, such as java, php and python, and the tag’s main attribute is language. For more: http://en.support.wordpress.com/code/posting-source-code/

A sample:

package org.myorg;

public class Test {

    public static void main(String[] args) throws Exception {
        String arg = "Hello World!";
        System.out.println(arg);
    }
}