The problem: write a MapReduce job that outputs the records with the top K values of an attribute, rather than only the maximum.
The Python script, named "AttributeTopK.py", takes two parameters: the field index into the input CSV record, and the number K of top values to output.
#!/usr/bin/python
import sys

index = int(sys.argv[1])  # field index into the CSV record
topK = int(sys.argv[2])   # number of top values to keep

top = []
for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():  # skip non-numeric fields (e.g. headers)
        top.append(int(fields[index]))

# Deduplicate, sort ascending, and keep only the K largest values.
top = sorted(set(top))[-topK:]
for v in top:
    print(v)
Run the code with Hadoop Streaming:
hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=1 -input /home/hadoop/apat -output /home/hadoop/temp2 -mapper 'AttributeTopK.py 8 3' -reducer 'AttributeTopK.py 0 3' -file ~/temp/patents/AttributeTopK.py
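Note that the same script serves as both mapper and reducer: each mapper emits its split's top K as bare values (which is why the reducer reads field 0), and the union of those per-split candidates must contain the global top K. A minimal local simulation of this two-stage logic, with hypothetical sample data standing in for the patent file:

```python
def top_k(values, k):
    """Deduplicate and return the k largest values, ascending."""
    return sorted(set(values))[-k:]

# Two hypothetical input splits, as values of one CSV field might
# appear to two different mappers.
split_a = [5, 17, 9, 17, 3]
split_b = [12, 2, 21, 9]

k = 3
mapper_out = top_k(split_a, k) + top_k(split_b, k)  # per-split candidates
result = top_k(mapper_out, k)                       # the single reducer
print(result)  # → [12, 17, 21]
```

Because every global top-K value is necessarily in the top K of its own split, no candidate is lost between the map and reduce stages.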
Actually, there are several similar top-K problems under different scenarios:
- solve top K on one machine with enough memory
- solve top K on one machine without enough memory
- solve top K on one machine, with data coming from several distributed sources
- solve top K in MapReduce
The Python script above solves scenarios 1 and 4.
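For scenario 2, where the values do not all fit in memory, a standard approach (not part of the original script) is a size-K min-heap, which scans the stream in O(n log K) time using only O(K) space. A self-contained sketch:

```python
import heapq

def stream_top_k(values, k):
    """Return the k largest distinct values from an iterable, ascending."""
    heap = []     # min-heap holding the current top-k candidates
    seen = set()  # distinct values currently in the heap
    for v in values:
        if v in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, v)
            seen.add(v)
        elif v > heap[0]:
            seen.discard(heapq.heappop(heap))  # evict the smallest candidate
            heapq.heappush(heap, v)
            seen.add(v)
    return sorted(heap)

print(stream_top_k([5, 17, 9, 17, 3, 12, 2, 21], 3))  # → [12, 17, 21]
```

The same heap-merge idea also covers scenario 3: each distributed source sends its local top K, and the coordinator runs this filter over the combined candidates, exactly as the reducer does in the MapReduce job.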