Inner product of two sparse vectors using MapReduce streaming

Given two vectors, X = [x1, x2, …] and Y = [y1, y2, …], their inner product is Z = x1 * y1 + x2 * y2 + … .
Two vectors are usually given in a sparse representation:
Vector One

1,0.4
3,0.6
7,0.6
4,0.7
10,0.9
12,0.8
20,0.1

Vector Two

1,0.4
3,0.1
6,0.6
4,0.7
10,0.9
12,0.2
21,0.1

Mapper streaming code in Perl:

#!/usr/bin/perl

while($line=<STDIN>){
@fields = split(/,/, $line);
if($fields[0] && $fields[1]){
print "$fields[0]\t$fields[1]";
}
}

Reducer streaming code in Perl:

#!/usr/bin/perl

$lastKey="";
$product=1;
$count=0;

while($line=<STDIN>){
@fields=split(/\t/, $line);
$key = $fields[0];
$value = $fields[1];
if($lastKey ne "" && $key ne $lastKey){
if($count==2){
print "$lastKey\t$product\n";
}
$product=$value;
$lastKey=$key;
$count=1;
}
else{
$product=$product*$value;
$lastKey=$key;
$count++;
}
}
#the last key
if($count==2){
print "$lastKey\t$product\n";
}

Run the code:

hadoop@fxu-t60:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar  -D mapred.reducer.tasks=1 -input /home/hadoop/vectors -output /home/hadoop/vectors-output -mapper "VectorProduct.pl" -file ~/Desktop/HadoopInAction/VectorProduct.pl -reducer "VectorProductReducer.pl" -file ~/Desktop/HadoopInAction/VectorProductReducer.pl

Output

1    0.16
10    0.81
12    0.16
3    0.06
4    0.49

One more post-processing code is needed to do the sum of all values. Or use this reducer code to get the sum directly:

#!/usr/bin/perl

$lastKey="";
$product=1;
$count=0;
$sum=0;

while($line=<STDIN>){
@fields=split(/\t/, $line);
$key = $fields[0];
$value = $fields[1];
if($lastKey ne "" && $key ne $lastKey){
if($count==2){
$sum=$sum+$value;
}
$product=$value;
$lastKey=$key;
$count=1;
}
else{
$product=$product*$value;
$lastKey=$key;
$count++;
}
}
#the last key
if($count==2){
$sum=$sum+$value;
}

print $sum;

Here is only showing how to do product of two vectors. If there are more than two, giving each vector an ID might solve the problem with little modification. This is actually the numerator to compute the  Cosine similarity.

Advertisements

One thought on “Inner product of two sparse vectors using MapReduce streaming

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s