Each data point has a different importance
Cluster analysis (or clustering) is a data analysis technique that explores and groups a set of vectors (or data points) in such a way that vectors in the same cluster are more similar to one another than to those in other clusters. Clustering algorithms are widely used in numerous applications, e.g., data analysis, pattern recognition, and image processing.
This article reviews a new clustering algorithm based on the method of Projection onto Convex Sets (POCS), called the POCS-based clustering algorithm. The original paper was presented at IWIS 2022, and the source code has been released on GitHub.
A convex set is defined as a set of points in which the line segment connecting any two points x1 and x2 in the set is entirely contained in the set. By this definition, the empty set ∅, a singleton set, a line segment, a hyperplane, and a Euclidean ball are all convex sets. In particular, a single data point is a convex set, since it is a singleton set (a set with exactly one element). This observation opens a new path: the concept of POCS can be applied to clustering data points.
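To make this concrete, here are two projection operators in NumPy (my own illustrative snippets, not from the paper): projection onto a Euclidean ball, and projection onto a single data point, which, being a singleton set, simply returns the point itself.

```python
import numpy as np

def project_onto_ball(x, center, radius):
    """Euclidean projection of x onto the closed ball B(center, radius)."""
    d = x - center
    dist = np.linalg.norm(d)
    if dist <= radius:
        return x  # x already lies inside the (convex) ball
    return center + radius * d / dist

def project_onto_point(x, p):
    """A data point is a singleton convex set, so projection returns the point."""
    return p

x = np.array([3.0, 4.0])
print(project_onto_ball(x, np.array([0.0, 0.0]), 1.0))  # -> [0.6 0.8]
print(project_onto_point(x, np.array([1.0, 1.0])))      # -> [1. 1.]
```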
Let me briefly review the concept of POCS (without equations). The method of POCS can be roughly divided into two forms: alternating and parallel.
Alternating POCS
Starting from an arbitrary point in the data space, the alternating projections from this point onto two (or more) intersecting convex sets will converge to a point within the intersection of the sets. A graphical illustration is shown below.
When the convex sets do not intersect, the alternating projections instead converge to greedy limit cycles, which depend on the order of the projections.
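For intuition, the alternating scheme can be sketched in a few lines of NumPy (a toy example of my own, not from the paper): two lines through the origin are intersecting convex sets, and alternating projections from an arbitrary starting point converge to a point in their intersection, here the origin.

```python
import numpy as np

def project_onto_line(x, u):
    """Project x onto the line through the origin with unit direction u (a convex set)."""
    return np.dot(x, u) * u

u1 = np.array([1.0, 0.0])                 # the x-axis
u2 = np.array([1.0, 1.0]) / np.sqrt(2.0)  # the line y = x
x = np.array([5.0, 3.0])                  # arbitrary starting point

for _ in range(200):
    # Alternate projections between the two intersecting convex sets.
    x = project_onto_line(x, u1)
    x = project_onto_line(x, u2)

print(x)  # converges to the intersection of the two lines: the origin
```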
Parallel POCS
Unlike the alternating form, the parallel form of POCS projects the data point onto all the convex sets simultaneously, with each projection assigned an importance weight. When the convex sets intersect, the parallel projections, like the alternating ones, converge to a point in the intersection of the sets.
In the case of non-intersecting convex sets, the parallel projections converge to a point that minimizes the weighted sum of squared distances to the sets. The main idea of the POCS-based clustering algorithm comes from this property.
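This minimization behavior is easy to see in one dimension (a toy example of my own, not from the paper): with two disjoint intervals and equal weights, the weighted parallel projections converge to the midpoint between the sets, which minimizes the weighted sum of squared distances.

```python
import numpy as np

def project_onto_interval(x, lo, hi):
    """Projection onto a closed interval [lo, hi], a convex set on the real line."""
    return np.clip(x, lo, hi)

# Two disjoint convex sets, [0, 1] and [2, 3], with equal importance weights.
w1, w2 = 0.5, 0.5
x = 10.0
for _ in range(100):
    # Parallel POCS: project onto all sets simultaneously, then combine by weight.
    x = w1 * project_onto_interval(x, 0.0, 1.0) + w2 * project_onto_interval(x, 2.0, 3.0)

print(x)  # -> 1.5, the minimizer of the weighted sum of squared distances
```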
For more details on POCS, you can visit the original paper and/or some other recommended papers (with PDFs available):
Taking advantage of the convergence property of the parallel POCS method, the authors proposed a simple yet effective (to some extent) clustering algorithm. It operates in a similar spirit to the classical K-Means algorithm, but differs in how each data point is treated: K-Means gives every data point equal weight, whereas the POCS-based clustering algorithm weights each data point in proportion to its distance from the cluster prototype. That’s it!
The pseudocode of the algorithm is shown below:
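Since the pseudocode figure is not reproduced here, below is a minimal NumPy sketch of the idea as described above. This is my own illustrative reconstruction, not the authors' released implementation: the function name, the `init` parameter, and the exact weighting details are assumptions.

```python
import numpy as np

def pocs_based_clustering(X, k, n_iters=100, seed=0, init=None):
    """Illustrative sketch: a K-Means-style loop in which each prototype
    update is a weighted parallel projection onto the cluster's member
    points (singleton convex sets), with weights proportional to each
    member's distance from the current prototype."""
    rng = np.random.default_rng(seed)
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float).copy()
    for _ in range(n_iters):
        # Assignment step: nearest prototype, exactly as in K-Means.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: distance-proportional importance weights per cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            w = np.linalg.norm(members - centroids[j], axis=1)
            total = w.sum()
            if total > 0:
                centroids[j] = (w[:, None] * members).sum(axis=0) / total
    return centroids, labels

# Toy usage: two well-separated blobs with prototypes initialized near them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.1, (100, 2)),
               rng.normal([5, 5], 0.1, (100, 2))])
centroids, labels = pocs_based_clustering(X, 2, n_iters=50, init=[[0, 0], [5, 5]])
```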
The authors examined the performance of the POCS-based clustering algorithm on some public benchmark datasets from the website Clustering basic benchmark. The descriptions of these datasets are summarized in the table below.
In the paper, the authors compared the performance of the POCS-based clustering algorithm against other conventional clustering methods, including the K-Means and Fuzzy C-Means algorithms. The evaluations in terms of execution time and clustering error are summarized in the following tables.
The visual clustering results are also illustrated in the following figure.
For more details, you can check out the original paper here.
Let’s play with this algorithm on a very simple dataset. For convenience, you can install the released package of the algorithm with this command:
pip install pocs-based-clustering
First, let’s import several necessary packages and create a simple dataset with 5000 data points centered around 10 clusters:
# Import packages
import time
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from pocs_based_clustering.tools import clustering
# Generate a simple dataset
num_clusters = 10
X, y = make_blobs(n_samples=5000, centers=num_clusters, \
cluster_std=0.5, random_state=0)
plt.figure(figsize=(8,8))
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()
Now, perform clustering using the built-in function and display the result:
# POCS-based clustering algorithm
centroids, labels = clustering(X, num_clusters, 100)

# Display results
plt.figure(figsize=(8,8))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=100, c='red')
plt.show()
In this post, I have briefly reviewed a simple yet effective clustering technique based on the method of Projection onto Convex Sets (POCS): the POCS-based clustering algorithm. The algorithm exploits the convergence property of POCS for clustering tasks and achieves feasible improvements to some extent. Its effectiveness has been validated on several benchmark datasets. The original paper can be found on arXiv (preprint) or IEEE Xplore (published paper). The code has also been released on GitHub.
Thanks for reading!