Find top K records using MapReduce streaming

The problem: write a MapReduce job that outputs the records with the top K values, rather than only the maximum.

The Python script, named "AttributeTopK.py", takes two parameters: the index of the field to read from each line of the input CSV file, and the number K of top values to keep.

#!/usr/bin/python

import sys

index = int(sys.argv[1])  # field index to extract from each CSV line
topK = int(sys.argv[2])   # number of top values to keep

top = []

for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():
        top.append(int(fields[index]))

# After all input is consumed, emit the K largest distinct values,
# one per line, in ascending order.
top = sorted(set(top))[-topK:]
for v in top:
    print(v)

Run the code

hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=1 -input /home/hadoop/apat -output /home/hadoop/temp2 -mapper 'AttributeTopK.py 8 3' -reducer 'AttributeTopK.py 0 3' -file ~/temp/patents/AttributeTopK.py
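The same script can serve as both mapper and reducer because the mapper emits bare values, which the reducer then reads back as field 0. A minimal local simulation of that two-pass pipeline, using made-up sample rows rather than the patent data, might look like this:

```python
def attribute_top_k(lines, index, k):
    """Collect the K largest distinct integer values found at field `index`."""
    top = set()
    for line in lines:
        fields = line.strip().split(",")
        if fields[index].isdigit():
            top.add(int(fields[index]))
    return [str(v) for v in sorted(top)[-k:]]

# Hypothetical input split: CSV rows with the value of interest in field 1.
rows = ["a,10", "b,42", "c,7", "d,42", "e,99", "f,3"]

# "Mapper" pass: top 3 values of field 1 from this split.
mapped = attribute_top_k(rows, 1, 3)

# "Reducer" pass: each mapped line is a bare value, i.e. field 0.
reduced = attribute_top_k(mapped, 0, 3)

print(reduced)  # ['10', '42', '99']
```

With a single reducer (mapred.reduce.tasks=1), the reducer sees the local top K from every mapper and keeps the global top K.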

There are several variants of the top K problem, depending on the scenario:

  1. find the top K on one machine with enough memory
  2. find the top K on one machine without enough memory
  3. find the top K on one machine, with data drawn from several distributed sources
  4. find the top K in MapReduce

The Python script above handles scenarios 1 and 4.
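For scenario 2 the script's approach breaks down, since it buffers every value before sorting. A bounded-memory sketch using a min-heap of size K (standard-library heapq; note this variant keeps duplicates, unlike the script above, which deduplicates):

```python
import heapq

def bounded_top_k(values, k):
    """Keep at most K items in memory: the smallest of the
    current top K always sits at heap[0]."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # pop the smallest, push v
    return sorted(heap)

print(bounded_top_k([5, 1, 9, 3, 7, 8, 2], 3))  # [7, 8, 9]
```

The standard library also offers heapq.nlargest(k, values), which does the same thing in one call.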
