Note of data science training EP 10: Cluster – collecting and clustering
How to find the lifestyle group of each of 100 customers, e.g. bookworms, sports fans, and shoppers.
One of the classic problems for data scientists is clustering, or grouping. For example, we have 100 customers and want to sort them into lifestyle groups such as bookworms, sports fans, and shoppers. How can we do that?
Clustering
For that problem, scikit-learn provides the module sklearn.cluster.
Preparing
This time, we generate a dataset with make_blobs from sklearn.datasets.
Try a simple scatter plot. There are actually 3 groups, aren’t there?
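As a rough sketch of this preparation step, the snippet below generates blobs and plots them. The sizes, the random seed, and the names dataframe and true_groups are my own assumptions, not necessarily the settings used in the training.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs

# Assumed settings: 300 dots in 2 dimensions, generated around 3 centers.
X, true_groups = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Keep the dots in a DataFrame so they are easy to inspect and plot.
dataframe = pd.DataFrame(X, columns=["x", "y"])

# A plain scatter plot, without colors, to eyeball the groups.
plt.scatter(dataframe["x"], dataframe["y"], s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

make_blobs also returns the true group of each dot (true_groups here), but clustering models do not get to see it.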
DBSCAN
DBSCAN stands for “Density-Based Spatial Clustering of Applications with Noise”. It works like this.
- Give x as a distance (a radius).
- Give y as a minimum number of dots. A dot that has at least y dots (itself included) within radius x is a core point.
- Start from a core point and put every dot within radius x into its group. Keep expanding the group from every core point that joins it.
- Finish when every dot either belongs to a group or is marked as noise.
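To make the steps above concrete, here is a minimal NumPy sketch of the idea. The function simple_dbscan is a hypothetical helper written for illustration only. It is not how scikit-learn implements DBSCAN, and it computes all pairwise distances, so it only suits small data.

```python
import numpy as np

def simple_dbscan(points, eps, min_samples):
    """Toy DBSCAN: returns one label per dot, with -1 meaning noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Distance between every pair of dots.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Neighbors of each dot within radius eps (the dot itself included).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A core point has at least min_samples neighbors.
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    group = 0
    for i in range(n):
        # Only an unassigned core point can start a new group.
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = group
        to_visit = [i]
        while to_visit:
            j = to_visit.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = group
                    # Only core points keep expanding the group.
                    if is_core[k]:
                        to_visit.append(k)
        group += 1
    return labels

# Two tight groups of 3 dots and one far-away dot.
dots = [[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5], [10, 10]]
labels = simple_dbscan(dots, eps=0.5, min_samples=3)
print(labels)
```

The far-away dot has no neighbors within the radius, so it keeps the label -1 and is treated as noise.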
Now we start by creating a DBSCAN object with 2 parameters:
- eps (epsilon) as the distance x.
- min_samples as the minimum number of dots, y.
After that, we call .fit_predict() and the result is stored in .labels_.
Here we use pd.unique() to check all groups in the model.
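A minimal sketch of these calls, assuming blob data like in the preparation step. The values of eps and min_samples and the variable names are my own picks for illustration.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

# eps is the distance x, min_samples is the number y.
clustering = DBSCAN(eps=0.5, min_samples=5)

# fit_predict() returns the group of each dot; the same values are kept in labels_.
groups = clustering.fit_predict(dataframe)

# All distinct groups found by the model; -1 marks dots treated as noise.
print(pd.unique(clustering.labels_))
```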
Change eps and min_samples and we can see how the result changes.
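One way to compare settings is to loop over a few pairs and count groups and noise dots. The pairs below are arbitrary values picked for illustration, and the data is assumed blob data.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

results = {}
for eps, min_samples in [(0.3, 5), (0.5, 5), (1.0, 10)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dataframe)
    n_groups = len(set(labels.tolist()) - {-1})  # -1 is noise, not a group
    n_noise = int((labels == -1).sum())
    results[(eps, min_samples)] = (n_groups, n_noise)
    print(f"eps={eps}, min_samples={min_samples}: {n_groups} groups, {n_noise} noise dots")
```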
K-means
K-means is the popular one because it is easy to use. It only requires the number of groups and it’s done.
First, we ask for 3 groups and we get 3 groups.
Use .cluster_centers_ to find the center of each group.
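A minimal sketch with KMeans, again on assumed blob data. n_init and random_state are set explicitly here only to keep the run repeatable; they are my additions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

# Ask for 3 groups.
clustering = KMeans(n_clusters=3, n_init=10, random_state=0)
groups = clustering.fit_predict(dataframe)

print(pd.unique(groups))            # the group labels in use
print(clustering.cluster_centers_)  # one (x, y) center per group
```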
Let’s try to find 5 groups.
Interesting.
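For the 5-group run, a sketch like this shows how many dots land in each group. The data and the seed are assumptions, as before.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

# Ask for 5 groups even though the data was generated around 3 centers.
clustering = KMeans(n_clusters=5, n_init=10, random_state=0)
groups = clustering.fit_predict(dataframe)

# Number of dots per group.
group_sizes = pd.Series(groups).value_counts().sort_index()
print(group_sizes)
```

K-means returns as many groups as we ask for, so some of the original blobs end up split.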
OPTICS
The last one is OPTICS, standing for “Ordering Points To Identify the Clustering Structure”.
It is similar to DBSCAN but does not require epsilon. It suits large datasets, at the cost of a longer run time.
Try changing min_samples.
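A minimal sketch with OPTICS on assumed blob data. The min_samples value is an arbitrary starting point, not a recommendation.

```python
import pandas as pd
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

# No eps needed here; only min_samples is set.
clustering = OPTICS(min_samples=10)
groups = clustering.fit_predict(dataframe)

# As with DBSCAN, -1 marks dots treated as noise.
print(pd.unique(clustering.labels_))
```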
Metrics measurement
Now it’s assessment time. There are 3 main scores for clustering models.
- Silhouette score
Compares the distances within a cluster with the distances to the nearest other cluster.
Best at 1 and worst at -1.
- Davies-Bouldin score
Compares the dispersion of each cluster with the distance between clusters.
Best at 0; the higher, the worse.
- Calinski-Harabasz score
Finds the ratio of between-cluster dispersion to within-cluster dispersion.
The higher, the better.
from sklearn import metrics
# Silhouette score
metrics.silhouette_score(dataframe, clustering.labels_)
# Davies-Bouldin score
metrics.davies_bouldin_score(dataframe, clustering.labels_)
# Calinski-Harabasz Score
metrics.calinski_harabasz_score(dataframe, clustering.labels_)
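The calls above assume dataframe and clustering already exist from the earlier steps. As a self-contained sketch, the snippet below builds both on assumed blob data with K-means and prints the three scores.

```python
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed data and model: 300 dots around 3 centers, grouped by K-means.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])
clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dataframe)

silhouette = metrics.silhouette_score(dataframe, clustering.labels_)
davies_bouldin = metrics.davies_bouldin_score(dataframe, clustering.labels_)
calinski_harabasz = metrics.calinski_harabasz_score(dataframe, clustering.labels_)

print(f"Silhouette: {silhouette:.3f}")
print(f"Davies-Bouldin: {davies_bouldin:.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz:.1f}")
```

All three scores need at least 2 distinct labels, so they raise an error if a model puts every dot in one group.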
Hope this is useful, as grouping problems are very common in many industries.
Let’s see what’s next.
See ya~