Find top K records using MapReduce streaming

The problem: write a MapReduce streaming job that outputs the records with the top K values, rather than only the maximum.

The Python script, named “”, takes two command-line parameters: the field index in the input CSV file, and K, the number of top values to keep.


import sys

index = int(sys.argv[1])   # field index to compare on
topK = int(sys.argv[2])    # number of top values to keep
top = []

for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():
        top.append(int(fields[index]))
        top = sorted(set(top))   # drop duplicates, sort ascending
        top = top[-topK:]        # keep only the K largest

for v in top:
    print(v)
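The core logic — keep a running list, de-duplicate, and truncate to the K largest — can be exercised outside Hadoop. A minimal sketch (the function name top_k is my own, not from the original script):

```python
def top_k(values, k):
    """Return the k largest distinct integer values, sorted ascending."""
    top = []
    for v in values:
        if str(v).isdigit():
            top.append(int(v))
            top = sorted(set(top))[-k:]  # dedupe, sort, keep k largest
    return top

# Example: the three largest distinct values
print(top_k([5, 1, 9, 9, 3, 7], 3))  # -> [5, 7, 9]
```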

Run the code

hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=1 -input /home/hadoop/apat -output /home/hadoop/temp2 -mapper ' 8 3' -reducer ' 0 3' -file ~/temp/patents/

The same script serves as both mapper and reducer: each mapper emits its local top 3 of field 8, and the single reducer (mapred.reduce.tasks=1) reads those one-column values as field 0 and merges them into the global top 3.

There are several similar top K questions under different scenarios:

  1. find the top K on one machine with enough memory
  2. find the top K on one machine without enough memory
  3. find the top K on one machine, with data coming from several distributed sources
  4. find the top K in MapReduce

The Python code above solves questions 1 and 4.
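For question 2, where the data does not fit in memory but K values do, a bounded min-heap keeps memory at O(K). A sketch using Python's heapq (the function name top_k_stream is my own; unlike the script above, this version does not de-duplicate):

```python
import heapq

def top_k_stream(values, k):
    """Keep only the k largest values seen so far, using O(k) memory."""
    heap = []  # min-heap holding at most k elements
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # pop the smallest, push v
    return sorted(heap)

# Example: stream 100 values, keep only the 3 largest in memory
print(top_k_stream(range(100), 3))  # -> [97, 98, 99]
```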

