# Inner product of two sparse vectors using MapReduce streaming

Given two vectors, X = [x1, x2, …] and Y = [y1, y2, …], their inner product is Z = x1 * y1 + x2 * y2 + … .
Two vectors are usually given in a sparse representation:
Vector One

```1,0.4
3,0.6
7,0.6
4,0.7
10,0.9
12,0.8
20,0.1```

Vector Two

```1,0.4
3,0.1
6,0.6
4,0.7
10,0.9
12,0.2
21,0.1```

Mapper streaming code in Perl:

```#!/usr/bin/perl

while(\$line=<STDIN>){
@fields = split(/,/, \$line);
if(\$fields && \$fields){
print "\$fields\t\$fields";
}
}

```

Reducer streaming code in Perl:

```#!/usr/bin/perl

\$lastKey="";
\$product=1;
\$count=0;

while(\$line=<STDIN>){
@fields=split(/\t/, \$line);
\$key = \$fields;
\$value = \$fields;
if(\$lastKey ne "" && \$key ne \$lastKey){
if(\$count==2){
print "\$lastKey\t\$product\n";
}
\$product=\$value;
\$lastKey=\$key;
\$count=1;
}
else{
\$product=\$product*\$value;
\$lastKey=\$key;
\$count++;
}
}
#the last key
if(\$count==2){
print "\$lastKey\t\$product\n";
}```

Run the code:

`hadoop@fxu-t60:/usr/local/hadoop\$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar  -D mapred.reducer.tasks=1 -input /home/hadoop/vectors -output /home/hadoop/vectors-output -mapper "VectorProduct.pl" -file ~/Desktop/HadoopInAction/VectorProduct.pl -reducer "VectorProductReducer.pl" -file ~/Desktop/HadoopInAction/VectorProductReducer.pl`

Output

```1    0.16
10    0.81
12    0.16
3    0.06
4    0.49
```

One more post-processing code is needed to do the sum of all values. Or use this reducer code to get the sum directly:

```#!/usr/bin/perl

\$lastKey="";
\$product=1;
\$count=0;
\$sum=0;

while(\$line=<STDIN>){
@fields=split(/\t/, \$line);
\$key = \$fields;
\$value = \$fields;
if(\$lastKey ne "" && \$key ne \$lastKey){
if(\$count==2){
\$sum=\$sum+\$value;
}
\$product=\$value;
\$lastKey=\$key;
\$count=1;
}
else{
\$product=\$product*\$value;
\$lastKey=\$key;
\$count++;
}
}
#the last key
if(\$count==2){
\$sum=\$sum+\$value;
}

print \$sum;
```

Here is only showing how to do product of two vectors. If there are more than two, giving each vector an ID might solve the problem with little modification. This is actually the numerator to compute the  Cosine similarity.

1. ghafoor says: