Agglomerative Clustering Example — Step by Step
In our step-by-step example, we are going to use a fictional dataset with 5 customers:
Let’s imagine that we run a shop with these 5 customers and we want to group them based on their similarities. We have two variables to consider: the customer’s age and their annual income.
The first step of our agglomerative clustering consists of computing the pairwise distances between all our data points. Let’s do just that, representing each data point by its coordinates in an [x, y] format:
- Distance between [60, 30] and [60, 55]: 25.0
- Distance between [60, 30] and [30, 75]: 54.08
- Distance between [60, 30] and [41, 100]: 72.53
- Distance between [60, 30] and [38, 55]: 33.30
- Distance between [60, 55] and [30, 75]: 36.06
- Distance between [60, 55] and [41, 100]: 48.85
- Distance between [60, 55] and [38, 55]: 22.0
- Distance between [30, 75] and [41, 100]: 27.31
- Distance between [30, 75] and [38, 55]: 21.54
- Distance between [41, 100] and [38, 55]: 45.10
Although we can use any distance metric we want, we’ll use the Euclidean distance due to its simplicity. From the pairwise distances we’ve calculated above, which one is the smallest?
The distance between the middle-aged customers who make less than 90k dollars a year: the customers at coordinates [30, 75] and [38, 55]!
Reviewing the formula for the Euclidean distance between two arbitrary points p1 = (x1, y1) and p2 = (x2, y2): d(p1, p2) = sqrt((x1 - x2)^2 + (y1 - y2)^2).
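To make the calculation concrete, here is a minimal sketch (assuming the toy coordinates listed above) that reproduces the pairwise distances with plain numpy:

import numpy as np
from itertools import combinations

# Toy customers as [age, income] coordinates from the example above
customers = np.array([[60, 30], [60, 55], [30, 75], [41, 100], [38, 55]])

# Euclidean distance between two arbitrary points p1 and p2
def euclidean(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))

# Print every pair once, in the same order as the list above
for p1, p2 in combinations(customers, 2):
    print(f"Distance between {p1.tolist()} and {p2.tolist()}: {euclidean(p1, p2):.2f}")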
Let’s visualize our smallest distance on the 2-D plot by connecting the two customers that are closest to each other:
The next step of hierarchical clustering is to consider these two customers as our first cluster!
Next, we are going to calculate the distances between the data points again. But this time, the two customers we’ve grouped into a single cluster will be treated as a single data point. For instance, consider the red point below that positions itself in the middle of the two data points:
In summary, for the next iterations of our hierarchical solution, we won’t consider the coordinates of the original data points (the emojis) but the red point, the average (centroid) of those data points. Averaging the points is the simplification we’ll follow here to mimic the average linkage method; strictly speaking, average linkage averages the pairwise distances between the points of the two clusters, which yields very similar values in this example.
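As a quick sketch of this idea (using the same toy coordinates as above), the reference point of the new cluster is simply the mean of its members, and the next round of distances is computed against that point:

import numpy as np

# First cluster: the two closest customers found above
cluster = np.array([[30, 75], [38, 55]])
centroid = cluster.mean(axis=0)          # -> [34., 65.]

# Remaining customers, measured against the cluster's centroid
others = np.array([[60, 30], [60, 55], [41, 100]])
for p in others:
    d = np.sqrt(np.sum((p - centroid) ** 2))
    print(f"Distance between {p.tolist()} and {centroid.tolist()}: {d:.2f}")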
Other methods we can use to calculate distances between aggregated data points are:
- Maximum (or complete linkage): considers the distance to the farthest data point of the cluster we are trying to merge with.
- Minimum (or single linkage): considers the distance to the closest data point of the cluster we are trying to merge with.
- Ward (or ward linkage): chooses the merge that minimizes the increase in within-cluster variance.
Let me take a short break from the step-by-step explanation to delve a bit deeper into the linkage methods, as they are crucial in this type of clustering. Here’s a visual example of the different linkage methods available in hierarchical clustering, for a fictional example of 3 clusters to merge:
In the sklearn implementation, we’ll be able to experiment with some of these linkage methods and see a significant difference in the clustering results.
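If you want to see how the linkage choice changes the merge order on the toy data before we get to sklearn, here is a minimal sketch using scipy.cluster.hierarchy.linkage (assuming the same five customers as above):

import numpy as np
from scipy.cluster.hierarchy import linkage

customers = np.array([[60, 30], [60, 55], [30, 75], [41, 100], [38, 55]])

# Each row of the linkage matrix describes one merge:
# [cluster index 1, cluster index 2, merge distance, size of the new cluster]
# Note: scipy also offers 'centroid', which matches the point-averaging trick
# used in the walkthrough above.
for method in ["average", "complete", "single", "ward"]:
    print(f"\n{method} linkage:")
    print(linkage(customers, method=method))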
Returning to our example, let’s now generate the distances between all our new data points; remember that the two customers we grouped are treated as a single cluster from now on:
- Distance between [60, 30] and [60, 55]: 25.0
- Distance between [60, 30] and [34, 65]: 43.60
- Distance between [60, 30] and [41, 100]: 72.53
- Distance between [60, 55] and [34, 65]: 27.86
- Distance between [60, 55] and [41, 100]: 48.85
- Distance between [34, 65] and [41, 100]: 35.69
Which distance is the shortest? It’s the one between the data points at coordinates [60, 30] and [60, 55]:
The next step is, naturally, to join these two customers into a single cluster:
With this new landscape of clusters, we calculate the pairwise distances again! Remember that we are always using the average of the data points in each cluster (due to the linkage method we chose) as the reference point for the distance calculation:
- Distance between [60, 42.5] and [34, 65]: 34.38
- Distance between [60, 42.5] and [41, 100]: 60.56
- Distance between [34, 65] and [41, 100]: 35.69
Interestingly, the next elements to aggregate are the two clusters themselves, whose reference points at coordinates [60, 42.5] and [34, 65] are now the closest pair:
Finally, we finish the algorithm by aggregating all data points into a single big cluster:
With this in mind, where exactly do we stop? It’s probably not a great idea to end with a single big cluster containing all data points, right?
To know where to stop, there are some heuristic rules we can use. But first, we need to get familiar with another way of visualizing the process we’ve just done, the dendrogram:
On the y-axis, we have the distances we’ve just calculated. On the x-axis, we have each data point. Climbing up from each data point, we reach a horizontal line; the y-axis value of that line is the distance at which the data points beneath it get connected.
Remember the first customers we connected into a single cluster? What we saw in the 2-D plot matches the dendrogram, as those are exactly the first customers connected by a horizontal line (climbing the dendrogram from below):
The horizontal lines represent the merging process we’ve just done! Naturally, the dendrogram ends in a big horizontal line that connects all data points.
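If you’d like to reproduce a dendrogram like this for the toy example, a minimal sketch with scipy (again assuming the same five customers) could look like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

customers = np.array([[60, 30], [60, 55], [30, 75], [41, 100], [38, 55]])

# Build the merge history and draw it; the y-axis shows the merge distances.
# 'centroid' mirrors the point-averaging used in the walkthrough above.
Z = linkage(customers, method="centroid")
dendrogram(Z, labels=[str(c) for c in customers.tolist()])
plt.title("Dendrogram of the toy customers")
plt.show()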
Now that we’re familiar with the dendrogram, we’re ready to check the sklearn implementation and use a real dataset to understand how we can select the appropriate number of clusters based on this cool clustering method!
Sklearn Implementation
For the sklearn implementation, I’m going to use the Wine Quality dataset (the red wine subset, available from the UCI Machine Learning Repository).
import pandas as pd

wine_data = pd.read_csv('winequality-red.csv', sep=';')
wine_data.head(10)
This dataset contains information about wines (particularly red wines) with different characteristics such as citric acid, chlorides or density. The last column of the dataset corresponds to the quality of the wine, a score assigned by a tasting panel.
As hierarchical clustering deals with distances, and we’re going to use the Euclidean distance, we need to standardize our data. We’ll start by applying a StandardScaler to our data:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
wine_data_scaled = sc.fit_transform(wine_data)
With our scaled dataset, we can fit our first hierarchical clustering solution! We can access hierarchical clustering by creating an AgglomerativeClustering object:
from sklearn.cluster import AgglomerativeClustering

average_method = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=0,
                                         linkage='average')
average_method.fit(wine_data_scaled)
Let me detail the arguments we are using inside the AgglomerativeClustering:
- n_clusters=None is used so that the full tree of merges is computed (which is what lets us produce the full dendrogram).
- distance_threshold = 0 must be set in the sklearn implementation for the full dendrogram to be produced.
- linkage = 'average' is a very important hyperparameter. Remember that, in the theoretical walkthrough, we described one way to account for the distances between newly formed clusters. average computes the distance between two clusters as the average of the pairwise distances between their points. In the sklearn implementation, we have three other methods that we also described: single, complete and ward.
After fitting the model, it’s time to plot our dendrogram. For this, I’m going to use the helper function provided in the sklearn documentation:
import numpy as np
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
If we plot our hierarchical clustering solution:
import matplotlib.pyplot as plt

plot_dendrogram(average_method, truncate_mode="level", p=20)
plt.title('Dendrogram of Hierarchical Clustering - Average Method')
plt.show()
The dendrogram is not great, as our observations seem to get a bit jammed together. Sometimes, the average, single and complete linkages may result in strange dendrograms, particularly when there are strong outliers in the data. The ward method may be more appropriate for this type of data, so let’s test that method:
ward_method = AgglomerativeClustering(n_clusters=None,
                                      distance_threshold=0,
                                      linkage='ward')
ward_method.fit(wine_data_scaled)

plot_dendrogram(ward_method, truncate_mode="level", p=20)
Much better! Notice that the clusters seem better defined according to the dendrogram. The ward method merges clusters by minimizing the within-cluster variance of the newly formed clusters (https://online.stat.psu.edu/stat505/lesson/14/14.7), as we described in the first part of the post. The objective is that, at every iteration, the pair of clusters chosen for aggregation is the one that least increases the variance (the spread of the data points around the centroid of the new cluster).
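To make this criterion a bit more concrete, here is a small sketch of the quantity Ward minimizes, the increase in within-cluster sum of squares caused by a merge, evaluated on the two toy clusters from the first part of the post (the values below are only illustrative):

import numpy as np

def ward_increase(a, b):
    # Increase in within-cluster sum of squares if clusters a and b are merged:
    # (n_a * n_b) / (n_a + n_b) * squared distance between the two centroids
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    n_a, n_b = len(a), len(b)
    return (n_a * n_b) / (n_a + n_b) * np.sum((mu_a - mu_b) ** 2)

cluster_1 = np.array([[30, 75], [38, 55]])   # first toy cluster
cluster_2 = np.array([[60, 30], [60, 55]])   # second toy cluster
print(ward_increase(cluster_1, cluster_2))   # 1182.25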
Again, changing methods can be achieved by changing the linkage parameter in the AgglomerativeClustering function!
As we’re happy with the look of the ward method dendrogram, we’ll use that solution for our cluster profiling:
Can you guess how many clusters we should choose?
According to the distances, a good candidate is to cut the dendrogram at this point, where every cluster seems to be relatively far from the others:
The number of vertical lines that our cut crosses is the number of final clusters in our solution. Choosing the number of clusters is not very “scientific”, and different clustering solutions may be reached depending on business interpretation. For example, in our case, cutting the dendrogram a bit higher and reducing the number of clusters in the final solution would also be a valid hypothesis.
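If you’d rather express the cut as a height on the dendrogram instead of a fixed number of clusters, one option is sklearn’s distance_threshold argument. A minimal sketch, assuming the wine_data_scaled array from above (the threshold value here is just an illustrative placeholder, not a value read off the dendrogram):

from sklearn.cluster import AgglomerativeClustering

# Cut the tree at a chosen height instead of fixing n_clusters;
# every merge above this distance is ignored.
cut_model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=20.0,  # hypothetical cut height
                                    linkage='ward')
labels = cut_model.fit_predict(wine_data_scaled)
print(cut_model.n_clusters_)  # number of clusters implied by the cut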
We’ll stick with the 7-cluster solution, so let’s fit our ward method with that n_clusters in mind:
ward_method_solution = AgglomerativeClustering(n_clusters=7,
                                               linkage='ward')
wine_data['cluster'] = ward_method_solution.fit_predict(wine_data_scaled)
As we want to interpret our clusters based on the original variables, we fit and predict on the scaled data (the distances are based on the scaled dataset) but add the resulting cluster labels to the original dataset.
Let’s compare our clusters using the means of each variable conditioned on the cluster variable:
wine_data.groupby(['cluster']).mean()
Interestingly, we can start to have some insights about the data. For example:
- Low quality wines seem to have a large value of total sulfur dioxide; notice the difference between the highest average quality cluster and the lowest quality cluster:
And if we compare the quality of the wines in these clusters:
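One way to make this comparison in code (assuming the cluster column created above) is a simple groupby on the original data:

# Average wine quality per cluster, sorted from best to worst
wine_data.groupby('cluster')['quality'].mean().sort_values(ascending=False)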
Clearly, on average, Cluster 2 contains higher quality wines.
Another cool analysis we can do is to compute a correlation matrix between the cluster means:
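A minimal sketch of that analysis (again assuming the cluster column created above) could be:

# Correlation between variables, computed on the per-cluster means
cluster_means = wine_data.groupby('cluster').mean()
print(cluster_means.corr().round(2))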
This gives us some good hints of potential things to explore (even for supervised learning). For example, on a multidimensional level, wines with higher sulphates and chlorides may get bundled together. Another conclusion is that wines with higher alcohol content tend to be associated with higher quality wines.