Saturday, August 5, 2017

Apache Spark ML Sparse Vector vs Dense Vector

Sparse Vector vs Dense Vector


import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});


Sparse vectors are when you have a lot of values in the vector as zero. While a dense vector is when most of the values in the vector are non zero.

If you have to create a sparse vector from the dense vector you specified, use the following syntax:

Vector sparseVector = Vectors.sparse(4, new int[] {1, 3}, new double[] {3.0, 4.0});
The below link is pretty useful for getting a good understanding of the overall concept.

What is a Sparse Vector

A vector is a one-dimensional array of elements. So in a programming language, an implementation of a vector is as a one-dimensional array. A vector is said to be sparse when many elements of a have zero values. And when we write programs it will not be a good idea from storage perspective to store all these zero values in the array.

So the best way of representation of a sparse vector will be by just specifying the location and value.

Example: 3 1.2 2800 6.3 6000 10.0 50000 5.7

This denotes at position:

3 the value is 1.2
2800 holds value 6.3
6000 holds 10.0
50000 holds value 5.7
When you use the sparse vector in a programming language you will also need to specify a size. In the above example the size of the sparse vector is 4.

Representation of Sparse Vector in Spark

The Vector class of org.apache.spark.mllib.linalg has multiple methods to create your own dense and sparse Vectors.

The most simple way of creating one is by using:

sparse(int size, int[] indices, double[] values)

This method creates a sparse vector where the first argument is the size, second the indices where a value exists and the last one is the values on these indices.

Rest of the elements of this vector have values zero.

Example:

Let’s say we have to create the following vector {0.0, 2.0, 0.0, 0.0, 4.0}. By using the sparse vector API of Spark this can be created as stated below:

Vector sparseVector = Vectors.sparse(5, new int[] {1, 4}, new double[] {2.0, 4.0});

If the same vector can also be created using the dense vector API

Vector denseVector = Vectors.dense(0.0, 2.0, 0.0, 0.0, 4.0);

0 comments:

Post a Comment