The problem is: we have an existing java project with the back end storage as MySQL. The data is too big to process in memory now. The solution is to use Hadoop.
How to in a nutshell:
- load data from MySQL into Hive using Sqoop
- perform JOIN on different tables using Hive
- write Java code to process join result using Hadoop
- analyze the output using other tools like R
Partitioner decides which mapper output goes to which reduer based on mapper output key. In general, different key is in different group (Iterator at the reducer side). But sometimes, we want different key is in the same group. This is the time for Output Value Grouping Comparator, which is used to group mapper output. For easy understanding, think this is the group by condition in SQL. I will give a detail example for time serial analysis later. Output Key Comparator is used during sort stage for the mapper output key.
The above looks pretty straight forward. But there is one thing to remember: if you use setOutputValueGroupingComparator, all the key in the same group at reducer side will be same now even they are not the same at the mapper output.
You can download the example from: https://www.assembla.com/spaces/autofei_public/documents
- record.txt is the input (three fields, year, an random number, place)
- MaxTemperatureUsingSecondarySort.java is the main hadoop code
- IntPair.java is the mapper output key object
- output.txt is the output
You will notice that number for the same year is the same now, the max one.
Note: the code is modified from book “Hadoop The Tefinitive Guide”