java.io.FileNotFoundException using Distributed Cache with Eclipse plug-in

Sometimes it is very useful to distribute a file across nodes for a task. A classic case is a JOIN against a small metadata file. The file can be local or on HDFS.

If you use this Java snippet: DistributedCache.addCacheFile(new URI("/model/conf/txn_header"), conf); we assume the file is on HDFS, but it will actually be looked up on the local file system and throw java.io.FileNotFoundException when the job is launched from the Eclipse plug-in (there is no problem with the Hadoop command line). To solve this, add the Hadoop configuration files to the job configuration:

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
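
Putting the pieces together, here is a minimal driver sketch. It assumes the configuration files live under /usr/local/hadoop/conf and the metadata file sits at /model/conf/txn_header, as in the snippets above; the class name TxnJoinDriver and job name are illustrative only.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class TxnJoinDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // When launching from the Eclipse plug-in, the client does not pick up
        // the cluster configuration automatically, so load it explicitly.
        // Without these two lines the URI below resolves against the local
        // file system and triggers java.io.FileNotFoundException.
        conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));

        // Register the small metadata file on HDFS with the distributed cache
        // before the Job object copies the configuration.
        DistributedCache.addCacheFile(new URI("/model/conf/txn_header"), conf);

        Job job = new Job(conf, "txn join");
        // ... set mapper/reducer, input/output paths, then submit ...
        // job.waitForCompletion(true);
    }
}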

Reference: http://blog.rajeevsharma.in/2009/06/using-hdfs-in-java-0200.html

java.lang.ClassNotFoundException in a Map-Reduce job

The problem: I had exported all third-party libraries into the lib directory inside the jar file, but I still received java.lang.ClassNotFoundException. As we know, all libraries need to be distributed to all task nodes. In general, there are several ways to provide the jar libraries:

  1. package them into the jar file
  2. use the "-libjars" command line option of "hadoop jar"
  3. load the libraries onto the nodes manually

The first method works well for a manually installed Hadoop system, but fails on CDH4. My guess is that CDH has its own wrapper for Hadoop. So the simple solution is the second method:

hadoop jar Model.jar hadoop.mr.ElementDriver -libjars lib/your_library.jar /model/input /model/output_element

Use "," to separate the jars if you have more than one library.
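
One caveat: -libjars is a generic option handled by Hadoop's GenericOptionsParser, so it is only applied when the driver runs through ToolRunner. Below is a minimal sketch of what a driver such as hadoop.mr.ElementDriver might look like under that assumption (the job name, mapper/reducer wiring, and new-API usage are illustrative, not taken from the original code).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ElementDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects the generic options (-libjars, -D, -files, ...);
        // args holds only the leftovers, i.e. the input and output paths.
        Job job = new Job(getConf(), "element");
        job.setJarByClass(ElementDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper, reducer, and output key/value classes ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before calling run(),
        // so the jars passed via -libjars reach the task nodes.
        System.exit(ToolRunner.run(new Configuration(), new ElementDriver(), args));
    }
}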

Reference:

  1. http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/