Implementing DBSCAN from Distance Matrix in Rust

We will implement the DBSCAN clustering algorithm in Rust. As its input, the algorithm will take a distance matrix rather than a set of points or feature vectors. This will make the implemented algorithm useful in situations when the dataset is not formed by points or when features cannot be easily extracted. The complete source code, including comments and tests, is available on GitHub.

Introduction

DBSCAN is a popular data clustering algorithm. It allows you to group a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Each such group is called a cluster, and the task of forming the groups is called clustering. The objects can be either points in a multi-dimensional space or practically any objects that can be compared with one another. For example, you may have thousands of unknown files that you would like to have grouped by their similarity. This may speed up their analysis because one can analyze multiple similar files at once rather than inspecting them one by one, in isolation.

To cluster files, you have to be able to compare them. You could compare them by size, but this is not reliable: two files of the same size can have completely different contents. A more useful way of comparing files is through the use of fuzzy hashing, which I wrote about in my previous blog post. The idea is that for each file, we compute a fuzzy hash that represents its content. Then, there is a function that accepts two such hashes and returns their distance. The distance may be a number between 0 (identical files) and 100 (unrelated files).

Now, what you basically have is a distance matrix. For every pair of input files, the matrix tells you how similar the files are. Next, we need a way of grouping similar files together. This is what a clustering algorithm is for. One of the popular clustering algorithms is DBSCAN, which we will implement in the present blog post.

Essentially, the DBSCAN algorithm works as follows. You need to give it two parameters: eps and min_points. The first of them, eps, is the maximum distance between two objects for them to be in the same group. It prescribes how dense you want the clusters to be (lower eps → higher density). The second one, min_points, is the minimal number of objects in a group. Its value effectively specifies the minimal size of clusters you want to have. The algorithm visits all input objects (their collection is called a dataset). For each object, the algorithm detects which objects are close to it (at a distance of at most eps). If there are enough of them (greater than or equal to min_points), the algorithm starts to form a cluster. It then visits all objects that are close to the original object, and for each of them, it tries to extend the cluster by finding the objects close to that one. In a way, the construction of clusters resembles the breadth-first search (BFS) algorithm.

The DBSCAN algorithm has many advantages. It is simple, does not require one to specify the number of clusters in the dataset a priori (as opposed to k-means), and can be used to find arbitrarily shaped clusters. Of course, there are also some disadvantages. For example, the algorithm cannot cluster datasets with large differences in densities, since the eps parameter is fixed and cannot be chosen appropriately for all clusters.

The input dataset can be represented in a variety of ways. When you have points, you can represent them as their coordinates (for example, in the Cartesian coordinate system). When each object can be described by a sequence of numbers, you can use feature vectors. Then, each object will be effectively represented as a vector in an n-dimensional space, where n is the number of features. This is a generalization of the previously mentioned point coordinates. Last, but certainly not least, you can use a distance matrix, described above. This is useful in situations when the dataset is not formed by points or when features cannot be easily extracted, like in our example with clustering of files based on their fuzzy hashes.
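
For illustration, a distance matrix for three files (with made-up distances) might look like this, where 0 means identical and 100 means unrelated:

        f1   f2   f3
  f1     0    5   80
  f2     5    0   75
  f3    80   75    0

Here, f1 and f2 are very similar to each other, while f3 is unrelated to both of them.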

An example of a result from DBSCAN clustering over a set of points in space can be seen below. It was taken from Wikipedia.

[Figure: DBSCAN clustering of a set of two-dimensional points, from Wikipedia.]

As you can see, the algorithm found two clusters of points in a two-dimensional space.

Let me conclude the introduction by stating the motivation behind implementing the DBSCAN algorithm from a distance matrix in Rust. The currently existing implementations of DBSCAN in Rust (rusty_machine::learning::dbscan and cogset::Dbscan) assume the objects in the input dataset to be represented as points (or, more generally, feature vectors). However, as we have seen, there may be situations when we need to cluster data that cannot be naturally represented as points or feature vectors. This is where distance matrices become handy.

All right, let's begin.

Representing a Distance Matrix

Before we start with the implementation of DBSCAN, we need a way of representing a distance matrix. We will assume the distance to be a metric, i.e. to satisfy the usual properties you would expect from a distance. Most importantly, the matrix will be symmetric (the distance from A to B is the same as the distance from B to A). This allows us to use less memory to store it because we need to save only about half of the matrix.

The implementation of the matrix will be based on this great article. Refer to it if you want to know more details about the formulae used later in the present post.

To represent the matrix, we need to store its size and data:

#[derive(Debug)]
pub struct SymmetricMatrix<T> {
    size: usize,
    data: Vec<T>,
}

The matrix is generic to allow values of different types to be stored inside it. For example, if your distances are only from 0 to 255, you can use u8 instead of u32, which will require less memory. The #[derive(Debug)] line enables us to print the matrix via println!("{:?}", matrix), which is useful during debugging because we do not have to write a custom formatter (it will be created automatically by the compiler). To store the data, we use a vector, which guarantees contiguous layout in memory.

Next, we need to provide an implementation of the matrix:

impl<T> SymmetricMatrix<T>
where T: Default + Copy
{
    // Methods...
}

The where T: part specifies requirements for the generic type T. We require the type to have a default value, accessible via T::default(), and to be copyable. Numeric types satisfy these requirements. For example, the default value of i32 is 0, and integers can be copied.

To create a matrix, we provide a new() method:

pub fn new(size: usize) -> Self {
    SymmetricMatrix {
        size: size,
        data: vec![T::default(); (size + 1) * size / 2],
    }
}

It allocates the space that is needed to store the data and initializes all values to the default value (e.g. 0 when T is an integral type). Since our matrix is symmetric, we need to store only about half of it; the other half would be identical.
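
For instance, a matrix for 1000 objects allocates only 500,500 elements instead of 1,000,000:

let matrix: SymmetricMatrix<u8> = SymmetricMatrix::new(1000);
// data now holds (1000 + 1) * 1000 / 2 = 500_500 default-initialized values.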

Since it is useful to enable users to obtain the size of the matrix (its number of rows and columns), we provide a size() method:

pub fn size(&self) -> usize {
    self.size
}

Finally, we need a way of accessing the data in the matrix. For this, we provide two methods, get() and set():

pub fn get(&self, row: usize, col: usize) -> T {
    let index = self.index_for(row, col);
    self.data[index]
}

pub fn set(&mut self, row: usize, col: usize, value: T) {
    let index = self.index_for(row, col);
    self.data[index] = value;
}

In the two methods above, we need to compute a proper index into our one-dimensional data storage and use it to access the matrix. The implementation of index_for() will be given shortly. Note that instead of providing two methods, get() and set(), we could have implemented the Index and IndexMut traits. Then, instead of e.g.

matrix.get(i, j)

we would write

matrix[(i, j)]

Currently, the parentheses around the indexes are mandatory because Rust provides no syntactic sugar that “desugars” foo[a, b] into foo[(a, b)]. Hence, the API with get() and set() seems more readable to me.
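
For completeness, here is a minimal sketch of what such an Index implementation could look like (it is not part of the actual implementation; it reuses the index_for() method defined below):

use std::ops::Index;

impl<T> Index<(usize, usize)> for SymmetricMatrix<T>
where T: Default + Copy
{
    type Output = T;

    fn index(&self, (row, col): (usize, usize)) -> &T {
        // Delegate to the same index computation that get() and set() use.
        &self.data[self.index_for(row, col)]
    }
}

Note that Index has to return a reference, whereas our get() returns the value directly, which is more convenient for small Copy types.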

We end the implementation with the definition of index_for():

fn index_for(&self, row: usize, col: usize) -> usize {
    if col > row {
        col * (col + 1) / 2 + row
    } else {
        row * (row + 1) / 2 + col
    }
}
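
To see the formulae in action, consider a 4×4 matrix. Both orderings of the indexes map to the same slot of the underlying vector:

// index_for(3, 1): col <= row, so 3 * (3 + 1) / 2 + 1 == 7
// index_for(1, 3): col > row,  so 3 * (3 + 1) / 2 + 1 == 7

Hence, get(1, 3) and get(3, 1) access the same element, which is exactly what we want from a symmetric matrix.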

Again, please refer to this article for more details behind the used formulae.

Implementing the DBSCAN Algorithm

Having our implementation of a distance matrix, we can continue to implement the DBSCAN algorithm itself. The implementation will be based on the pseudocode on Wikipedia.

To represent DBSCAN, we will need several attributes:

#[derive(Debug)]
pub struct DBSCAN<T> {
    eps: T,
    min_points: usize,
    clusters: Vec<Option<usize>>,
    visited: Vec<bool>,
    current_cluster: usize,
}

Let me explain them, one by one. To unify the terminology, we will call all objects in the input dataset points, even if they are, actually, not points per se (as we have seen, they may be arbitrary entities, like files).

  • eps is the maximum distance between two points for them to be in the same neighborhood. It prescribes how dense you want the clusters to be (lower eps → higher density).
  • min_points is the minimal number of points in a neighborhood for a point to be considered as a core point. Its value effectively specifies the minimal size of clusters you want to have.
  • clusters is a mapping of points into their respective clusters. When a point P belongs to cluster X, clusters[P] will be set to Some(X). Otherwise, when a point Q does not belong to any cluster, clusters[Q] will be None. Each point is represented by its index in the input distance matrix, which is effectively an integer. See the example after this list.
  • visited is an indicator whether a point has been visited by the algorithm.
  • current_cluster is a counter denoting the number of the currently created cluster. After a cluster is created, it is incremented.
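
As an illustration, consider five points clustered with min_points set to 2. One possible final state of clusters is the following (the concrete values are, of course, made up):

// Points 0 and 1 form cluster 0, point 2 belongs to no cluster (noise),
// and points 3 and 4 form cluster 1.
clusters == vec![Some(0), Some(0), None, Some(1), Some(1)]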

Let's start with the implementation:

impl<T> DBSCAN<T>
where T: Default + Copy + PartialOrd
{
    // Methods...
}

The requirements on T are the two requirements from SymmetricMatrix plus PartialOrd. We need PartialOrd for the algorithm to be able to compare the distance between two points with eps. Requiring only PartialOrd (instead of the stricter Ord) also allows floating-point distances since f32 and f64 implement PartialOrd but not Ord.

To create a DBSCAN instance, we provide a new() method:

pub fn new(eps: T, min_points: usize) -> Self {
    DBSCAN {
        eps: eps,
        min_points: min_points,
        clusters: Vec::new(),
        visited: Vec::new(),
        current_cluster: 0,
    }
}

It simply initializes the values in the structure to their default values. The user only has to provide eps and min_points.

To create clusters from a given distance matrix, we provide a perform_clustering() method:

pub fn perform_clustering(&mut self,
                          matrix: &SymmetricMatrix<T>) -> &Vec<Option<usize>> {
    self.clusters = vec![None; matrix.size()];
    self.visited = vec![false; matrix.size()];
    self.current_cluster = 0;

    // Heart of the algorithm...

    self.clusters.as_ref()
}

It initializes the internal attributes, performs the algorithm, and returns a reference to the created clusters.

Let's implement the heart of the algorithm. You can follow the pseudocode on Wikipedia.

for point in 0..matrix.size() {
    if self.visited[point] {
        continue;
    }

    self.visited[point] = true;
    let neighbors = self.region_query(matrix, point);
    if neighbors.len() >= self.min_points {
        self.expand_cluster(matrix, point, neighbors);
        self.current_cluster += 1;
    }
}

We iterate over all points in the input dataset. This effectively boils down to iterating over all row (or column) indexes. We want to process only points which have not yet been visited, so we skip visited points. After marking the point as visited, we compute all points which are in its neighborhood. We use a method of the same name as in the pseudocode: region_query(). Its implementation will be given shortly. Then, if the neighborhood is large enough (based on the user-supplied value of min_points), we create a cluster based on this point by calling expand_cluster(). It will be described later.

The region_query() method is straightforward:

fn region_query(&self,
                matrix: &SymmetricMatrix<T>,
                point: usize) -> VecDeque<usize> {
    let mut neighbors = VecDeque::new();
    for other_point in 0..matrix.size() {
        let dist = matrix.get(point, other_point);
        if dist <= self.eps {
            neighbors.push_back(other_point);
        }
    }
    neighbors
}

We initialize a queue into which we will store the neighbors. VecDeque is Rust’s double-ended queue, living in std::collections, so the file needs a use std::collections::VecDeque; declaration at its top. The reason behind using a queue will be apparent once we see the body of expand_cluster(). We iterate over all points in the input dataset and enqueue those whose distance from the original point is at most eps. The found neighbors are then returned. Note that the point itself ends up in its own neighborhood: since we never set the diagonal of the matrix, it keeps the default value (e.g. 0), which is at most eps (assuming a non-negative eps). This matches the definition of DBSCAN, in which every point belongs to its own eps-neighborhood.

Finally, we need to implement expand_cluster(), which creates a cluster from the given point and its neighborhood:

fn expand_cluster(&mut self,
                  matrix: &SymmetricMatrix<T>,
                  point: usize,
                  mut neighbors: VecDeque<usize>) {
    self.clusters[point] = Some(self.current_cluster);

    while let Some(other_point) = neighbors.pop_front() {
        if !self.visited[other_point] {
            self.visited[other_point] = true;
            let mut other_neighbors = self.region_query(matrix, other_point);
            if other_neighbors.len() >= self.min_points {
                neighbors.append(&mut other_neighbors);
            }
        }
        if self.clusters[other_point].is_none() {
            self.clusters[other_point] = Some(self.current_cluster);
        }
    }
}

We start by putting the original point into the current cluster (for the first call of expand_cluster(), self.current_cluster is 0, for the second call it is 1, etc.). Then, we process points from the neighborhood queue until it is empty. We only care about points which we have not seen yet, so we skip visited points. For each unvisited neighbor, we obtain its own neighbors, and if the neighbor turns out to be a core point (its neighborhood is big enough), we append its neighbors to the queue. Finally, for each neighbor that does not yet belong to any cluster, we put it into the currently constructed cluster. In a way, the construction of the cluster resembles the breadth-first search (BFS) algorithm.
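
To put everything together, here is a small usage sketch with made-up distances. Five objects form two groups: objects 0, 1, and 2 are close to one another, objects 3 and 4 are close to each other, and the two groups are far apart:

let mut matrix = SymmetricMatrix::new(5);
for i in 0..5 {
    for j in (i + 1)..5 {
        // Objects 0-2 belong to one group, objects 3-4 to the other.
        let same_group = (i < 3) == (j < 3);
        // Set every pair explicitly; an unset entry would keep the
        // default value 0, i.e. the two objects would look identical.
        matrix.set(i, j, if same_group { 1 } else { 50 });
    }
}

let mut dbscan = DBSCAN::new(10, 2);
let clusters = dbscan.perform_clustering(&matrix);
assert_eq!(*clusters, vec![Some(0), Some(0), Some(0), Some(1), Some(1)]);

Objects 0-2 end up in cluster 0 and objects 3-4 in cluster 1. If we raised min_points to 3, objects 3 and 4 would no longer form a cluster and would be reported as noise (None).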

Complete Source Code

The complete source code, including comments and tests, is available on GitHub.
