Cluster Analysis: From K-Means to DBSCAN and HDBSCAN

Identifying clusters in data can empower your decision-making process for your business. Leveraging clustering algorithms to analyze patterns in the data helps identify segments or clusters. These different segments can then be characterized based on their similarities or differences. Your business can use cluster analysis to identify distinct groups of customers, sales transactions, or even fraudulent behavior.

When applied to geospatial data, such as latitude and longitude, clustering extends its reach: it identifies not only groups with similar profiles, but also groups whose members lie within a specified location or region. These geographical patterns are useful for solving many location-based problems such as demand forecasting, personalized branch marketing, route optimization, and fraud prevention.

Machine Learning and Geolocation Data

Three of the most widely used categories of machine learning algorithms for clustering geospatial data are described below:

  • Partition-Based – This algorithm divides the data into k subsets by generating virtual/actual centroids and measuring their distance to the rest of the datapoints. This category includes the well-known K-means algorithm and its variants (e.g., K-modes, K-medoids, and Fuzzy C-means).
  • Hierarchical – This algorithm projects the similarity between datapoints on a hierarchical structure to achieve the clustering purpose. This hierarchical structure could be agglomerative (bottom-up) or divisive (top-down) depending on the approach. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and DIANA (Divisive Analysis) are included in this category.
  • Density-Based – This algorithm uses the concentration of points, i.e., the number of neighbors within a given distance, to separate regions of higher density from lower-density areas. In this category, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most widely used algorithm. (A minimal sketch of all three families follows this list.)
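
To make the three families concrete, here is a minimal Python sketch showing how one algorithm from each category might be invoked with scikit-learn. The sample data and every parameter value are illustrative assumptions, not tuned settings.

    # Minimal sketch: one algorithm per clustering family (illustrative parameters).
    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 2))  # placeholder data: 300 points in 2-D

    # Partition-based: pre-specify k; centroids partition the data.
    kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

    # Hierarchical: merge points bottom-up (agglomerative) until k clusters remain.
    hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

    # Density-based: grow clusters from dense neighborhoods; -1 marks noise.
    dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)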

So how do you choose the one that best fits your business needs? This blog will highlight and compare the K-means, DBSCAN, and HDBSCAN (a hybrid of Density-based and agglomerative hierarchical clustering) algorithms when applied to geospatial data.

Geospatial Clustering Algorithm Comparisons

Table 1 below describes the pros and cons of each type of clustering algorithm.

Table 1: Comparison between K-Means, DBSCAN and Agglomerative Hierarchical Clustering

Cluster Shape

One of the biggest differences between K-means and DBSCAN clustering is that K-means assumes the shape of the clusters to be spherical, while the clusters resulting from DBSCAN are not limited to a spherical shape and can be irregular/arbitrary. Agglomerative hierarchical clustering sits in the middle: its results depend heavily on the similarity/distance metric and the linkage method used, and are therefore highly variable.

Clusters found by K-Means Analysis
Clusters found by DBSCAN Analysis
Clusters found by Agglomerative Clustering Analysis

Images source: Comparing Python Clustering Algorithms — hdbscan 0.8.1 documentation

DBSCAN requires two parameters, epsilon and min_points, which give it this flexibility in cluster shape. Epsilon is the maximum distance between two points for them to be considered neighbors; min_points is the minimum number of neighbors within epsilon that a point needs in order to anchor a true cluster.
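
As a hedged illustration, the sketch below shows how these two parameters might be set in scikit-learn for latitude/longitude data (scikit-learn calls min_points min_samples). The coordinates, the 1.5 km radius, and min_samples=10 are assumptions made for the example.

    import numpy as np
    from sklearn.cluster import DBSCAN

    coords = np.array([[41.90, -87.63], [41.91, -87.64], [41.95, -87.70]])  # placeholder lat/long pairs

    kms_per_radian = 6371.0088          # mean Earth radius in km
    epsilon = 1.5 / kms_per_radian      # neighbors must lie within ~1.5 km

    db = DBSCAN(eps=epsilon, min_samples=10, metric="haversine")
    labels = db.fit_predict(np.radians(coords))  # the haversine metric expects radians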

Although flexible on the shape of clusters, DBSCAN has its limits. When clusters differ widely in density, or density varies a lot across the dataset, a single epsilon value cannot fit every cluster, and DBSCAN may have difficulty differentiating between them.

To address this shortcoming, a hybrid clustering algorithm called Hierarchical DBSCAN (HDBSCAN) was introduced. HDBSCAN extends DBSCAN by defining a mutual reachability distance and blending in hierarchical concepts. This approach helps differentiate datapoints with varying density by projecting the density into a hierarchical tree structure. In addition, epsilon is optional in this algorithm, leaving min_points as the only required parameter. As a result, speed and scalability are improved compared to DBSCAN.
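
A minimal sketch with the open-source hdbscan package follows; note that the package exposes the minimum cluster membership as min_cluster_size, and the values below are illustrative assumptions rather than tuned settings.

    import numpy as np
    import hdbscan

    coords = np.radians(np.array([[41.90, -87.63], [41.91, -87.64], [41.95, -87.70]]))  # placeholder points

    # min_cluster_size is the main required knob; an epsilon-style distance floor
    # is optional and can be supplied via cluster_selection_epsilon if desired.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="haversine")
    labels = clusterer.fit_predict(coords)  # -1 marks outliers, as in DBSCAN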

Outlier Handling

Another main difference between these three algorithms is how they deal with outliers. K-means leverages the distance between each datapoint and the centroids. Since outliers are typically far away from the centroids, they pull the centroid positions significantly before the model converges. Thus, the algorithm is extremely sensitive to outliers.
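
A tiny worked example makes this sensitivity visible: one far-away point drags the centroid of an otherwise tight group. The toy coordinates are made up purely for the demonstration.

    import numpy as np
    from sklearn.cluster import KMeans

    tight_group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
    outlier = np.array([[10.0, 10.0]])

    km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(np.vstack([tight_group, outlier]))
    print(km.cluster_centers_)  # roughly (2.04, 2.04): dragged far from the tight group near (0.05, 0.05)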

In contrast, DBSCAN does not necessarily include every single datapoint in the clusters. Instead, it identifies outliers based on the parameters given and excludes them from the clusters, labeling them as noise. For this reason, DBSCAN may produce better clustering results when dealing with noisy datasets.
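
Because scikit-learn marks noise with the label -1, the excluded points are easy to separate, as this small sketch (with placeholder coordinates and parameters) shows.

    import numpy as np
    from sklearn.cluster import DBSCAN

    coords = np.radians([[41.90, -87.63], [41.91, -87.64], [41.95, -87.70]])  # placeholder points
    labels = DBSCAN(eps=0.0005, min_samples=2, metric="haversine").fit_predict(coords)  # eps of ~3.2 km

    noise_mask = labels == -1
    clustered = coords[~noise_mask]  # points assigned to a cluster
    outliers = coords[noise_mask]    # points DBSCAN flagged as noise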

As for hierarchical clustering, it can detect outliers to some extent while performing pair-wise similarity/distance calculations. Since outliers are typically less similar to (or farther in distance from) the other datapoints, they are the last ones merged into the clusters and can be removed manually afterwards.
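
One hedged way to do that removal is to cut the merge tree and drop the smallest resulting clusters; the SciPy sketch below illustrates the idea, with the cut height and size threshold chosen arbitrarily for the example.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).normal(size=(50, 2))    # placeholder data
    Z = linkage(X, method="ward")                        # bottom-up (agglomerative) merge tree
    flat = fcluster(Z, t=5.0, criterion="distance")      # cut the tree at height 5.0

    # Points that land in very small, late-merging clusters can be treated as
    # outliers and removed manually.
    sizes = np.bincount(flat)
    keep = sizes[flat] > 2   # illustrative threshold: drop clusters of 2 or fewer points
    X_clean = X[keep]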

Real-world Application

For this example, we are trying to identify the hot zones for deceptive practice crimes this year in the northern area of Chicago (Data source: City of Chicago Crime Data | City of Chicago | Data Portal | https://data.cityofchicago.org/Public-Safety/City-of-Chicago-Crime-Data/v9q9-3dm2).

In this dataset, deceptive practice crime refers to actions such as fraud, identity theft, embezzlement, and forgery. All three clustering algorithms (K-Means, DBSCAN, HDBSCAN) were applied, and the results are shown below.
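
The sketch below shows how such a comparison might be wired up in Python. The CSV file name and the Primary Type/Latitude/Longitude column names are assumptions about how the portal data was exported, and every tuning value is illustrative.

    import numpy as np
    import pandas as pd
    import hdbscan
    from sklearn.cluster import KMeans, DBSCAN

    crimes = pd.read_csv("chicago_crimes.csv")  # hypothetical export from the data portal
    deceptive = crimes[crimes["Primary Type"] == "DECEPTIVE PRACTICE"].dropna(
        subset=["Latitude", "Longitude"]
    )
    coords = np.radians(deceptive[["Latitude", "Longitude"]].to_numpy())

    kms_per_radian = 6371.0088
    eps = 0.5 / kms_per_radian  # ~0.5 km neighborhood; purely illustrative

    kmeans_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(coords)
    dbscan_labels = DBSCAN(eps=eps, min_samples=15, metric="haversine").fit_predict(coords)
    hdbscan_labels = hdbscan.HDBSCAN(min_cluster_size=15, metric="haversine").fit_predict(coords)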

Figure 1: Clustering result from DBSCAN

As shown in Figure 1, DBSCAN identified one major cluster in the middle (yellow dots) with some smaller surrounding clusters, while HDBSCAN (Figure 2) generated more mini clusters using the same parameters.

Figure 2: Clustering result from HDBSCAN

In HDBSCAN, most of the outliers (the black dots) are located at the border of two or more clusters, whereas in DBSCAN the outliers are the points farthest from any cluster based on the distance parameter (epsilon).

Figure 3: Clustering result from K-Means

Meanwhile, K-means defines several distinct clusters with all datapoints included, since the algorithm does not identify outliers itself (see Figure 3).

Results from HDBSCAN or K-means would be more useful if the purpose of the analysis is targeted crime prevention, depending on whether outliers are of concern. On the other hand, the results from DBSCAN would prove more valuable if the goal is to separate areas with a higher rate of deceptive practice crimes from lower-rate ones, say, for general public awareness. Thus, the best-fit clustering algorithm really depends on your business need.

Conclusion

All three machine learning algorithms have their pros and cons, so understanding their strengths will help you identify the right one for your business. In general, for large datasets, K-means and HDBSCAN are good options to start with. Implement K-means when you assume the clusters to be globular, or switch to HDBSCAN if outliers are your main concern. On the other hand, if your dataset is small and you are looking for condensed patterns, then DBSCAN is our first recommendation.

There are numerous types of clustering algorithms; finding the one that best fits the problem you are solving is key. Our team at Scalesology is here to help. Contact us and let’s work together to scale your business with the right data analytics methodology.