The goal is to extract IP addresses from the input files. The script reads each line from standard input and uses a regular expression to find IP addresses; it prints the address only when the line contains exactly one match.
#!/usr/bin/python
import re
import sys

# Use "(?:" to suppress capturing parentheses.
# Use "\." to match a literal dot (suppresses the regex meaning of ".").
# "{3}": look for 3 iterations (3 iterations of a dot followed by a number).
my_regex = r'[0-9]+(?:\.[0-9]+){3}'

for line in sys.stdin:
    IPs = re.findall(my_regex, line)
    # Emit the address only when the line contains exactly one match.
    if len(IPs) == 1:
        print(IPs[0])
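To see how the pattern behaves before submitting the job, it can be tried on a few sample lines locally. The sketch below reuses the same regular expression; the helper names `extract_single_ip` and `is_valid_ipv4` and the sample lines are made up for illustration and are not part of the original script. It also shows that the pattern accepts any run of digits per group, so a string such as 999.999.999.999 matches; the range check is one possible way to filter such values, assuming stricter validation is wanted.

```python
import re

# Same pattern as the mapper script.
IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')


def extract_single_ip(line):
    """Return the match if the line contains exactly one, else None.

    Hypothetical helper mirroring the mapper's 'len(IPs) == 1' rule.
    """
    matches = IP_PATTERN.findall(line)
    return matches[0] if len(matches) == 1 else None


def is_valid_ipv4(candidate):
    """Hypothetical extra check: every dotted group must be in 0..255."""
    return all(int(octet) <= 255 for octet in candidate.split('.'))


if __name__ == '__main__':
    # Illustrative sample lines, not taken from the real input data.
    samples = [
        'request from 64.233.160.0 accepted',
        'forwarded 10.0.0.1 to 10.0.0.2',
        'no address on this line',
        'bogus 999.999.999.999 value',
    ]
    for sample in samples:
        ip = extract_single_ip(sample)
        valid = ip is not None and is_valid_ipv4(ip)
        print(repr(sample), '->', ip, 'valid' if valid else 'skipped')
```

Because the group is non-capturing, `findall` returns the whole matched address rather than only the last dotted group.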
Run the code:
hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=1 -input /home/hadoop/google -output /home/hadoop/google-ip -mapper 'googleIP.py' -file ~/Desktop/data/TLD/googleIP.py