setPartitionerClass, setOutputKeyComparatorClass and setOutputValueGroupingComparator

The Partitioner decides which mapper output goes to which reducer, based on the mapper output key. In general, each distinct key forms its own group (one Iterator on the reducer side). But sometimes we want different keys to fall into the same group. This is what the Output Value Grouping Comparator is for: it is used to group the mapper output. For easy understanding, think of it as the GROUP BY condition in SQL. I will give a detailed example for time series analysis later. The Output Key Comparator is used during the sort stage to order the mapper output keys.

The above looks pretty straightforward. But there is one thing to remember: if you use setOutputValueGroupingComparator, all the keys in the same group appear as one and the same key on the reducer side, even if they were not the same in the mapper output.
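To make the three roles concrete, here is a minimal sketch in plain Java that only simulates partition, sort and group. It does not use any Hadoop classes, and every name in it (SecondarySortSketch, Key, partition, SORT, GROUP, reduce) is made up for illustration. It assumes a composite key of (year, number), partitioning and grouping on the year only, and sorting the number in descending order within a year.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of partition -> sort -> group; no Hadoop classes are used.
class SecondarySortSketch {

    // Hypothetical composite mapper output key: (year, number).
    static final class Key {
        final int year;
        final int number;
        Key(int year, int number) { this.year = year; this.number = number; }
        @Override public String toString() { return year + "\t" + number; }
    }

    // Role of the Partitioner: pick a reducer from the year only,
    // so all keys of one year reach the same reducer.
    static int partition(Key key, int numReducers) {
        return Math.abs(key.year * 127) % numReducers;
    }

    // Role of the output key comparator: year ascending, then number descending.
    static final Comparator<Key> SORT = (a, b) -> {
        int byYear = Integer.compare(a.year, b.year);
        return byYear != 0 ? byYear : Integer.compare(b.number, a.number);
    };

    // Role of the output value grouping comparator: equal year means same group.
    static final Comparator<Key> GROUP = (a, b) -> Integer.compare(a.year, b.year);

    // Simulates one reducer: sort its keys, then emit the first key of each group.
    static List<String> reduce(List<Key> keysForOneReducer) {
        List<Key> sorted = new ArrayList<>(keysForOneReducer);
        sorted.sort(SORT);
        List<String> out = new ArrayList<>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                out.add(k.toString()); // first key of a new group carries the max number
            }
            previous = k;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key(1950, 22));
        keys.add(new Key(1949, 78));
        keys.add(new Key(1950, 35));
        keys.add(new Key(1949, 111));
        for (String line : reduce(keys)) {
            System.out.println(line);
        }
    }
}
```

In this sketch each year produces a single line, and the number on it is the largest one for that year, because the sort puts it first and the grouping hides the other keys of the same year.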

You can download the example from:

  • record.txt is the input (three fields: year, a random number, place)
  • is the main hadoop code
  • is the mapper output key object
  • output.txt is the output

You will notice that the number for the same year is now the same in the output: it is the maximum one for that year.

Note: the code is modified from the book “Hadoop: The Definitive Guide”.

