Web traffic measurement using MapReduce streaming

Question: “Take a web server log file and write a Streaming program with the Aggregate package to find the hourly traffic to that site.”

I will use the Aggregate package with both Python and Perl streaming mappers. The input log file looks like this:

fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
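
With the Aggregate package the mapper only has to emit lines of the form LongValueSum:<key> followed by a tab and a numeric value; the built-in aggregate reducer then sums the values for each key. Here the key is the hour taken from the timestamp, so for the sample line above the mapper would emit (shown for illustration):

LongValueSum:00	1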
The Python mapper, WebTraffic.py:

#!/usr/bin/python

import sys
import re

# Match the year:hour:minute:second part of the timestamp, e.g. 2000:00:00:12
my_regex = r'[0-9]{4}(?:\:[0-9]{2}){3}'

for line in sys.stdin:
    logtime = re.findall(my_regex, line)
    if len(logtime) == 1:
        fields = logtime[0].split(":")
        # fields[1] is the hour; emit one count for this request
        print "LongValueSum:" + fields[1] + "\t" + "1"

Run the code:

hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -input /home/hadoop/weblog -output /home/hadoop/weblog-output -mapper 'WebTraffic.py' -file ~/Desktop/HadoopInAction/WebTraffic.py -reducer aggregate
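
Once the job finishes, the hourly totals can be read straight from the output directory, for example:

hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop fs -cat /home/hadoop/weblog-output/part-*

Each result line is an hour of the day followed by the number of requests seen in that hour; the actual counts of course depend on the log file.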
The same mapper in Perl, WebTraffic.pl:

#!/usr/bin/perl

while ($line = <STDIN>) {
    # Match the year:hour:minute:second part of the timestamp
    if ($line =~ m/(\d{4}(?:\:\d{2}){3})/) {
        @fields = split(/:/, $1);
        # $fields[1] is the hour; emit one count for this request
        print "LongValueSum:$fields[1]\t1\n";
    }
}
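
A quick sanity check of the Perl mapper on the sample line (run locally, assuming perl is on the path) looks like this:

hadoop@fxu-t60:/usr/local/hadoop$ echo 'fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"' | perl ~/Desktop/HadoopInAction/WebTraffic.pl

which should print LongValueSum:00 followed by a tab and 1.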

Run the code:

hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -input /home/hadoop/weblog -output /home/hadoop/weblog-output-pl -mapper 'WebTraffic.pl' -file ~/Desktop/HadoopInAction/WebTraffic.pl -reducer aggregate