Note of data science training EP 10: Cluster – collecting and clustering

How to sort 100 customers into lifestyle groups, e.g. bookworms, sports fans, and shoppers.

One of the classic problems for data scientists is clustering, or grouping. For example, we have 100 customers and want to sort them into lifestyle groups, e.g. bookworms, sports fans, and shoppers. How can we do that?


Clustering

For this problem, we introduce the module sklearn.cluster.

import


Preparing

This time, we generate a dataset with make_blobs from scikit-learn’s datasets module.

make blob
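As a minimal sketch of this step, the snippet below generates a small dataset and wraps it in a DataFrame. The explicit centers, the random_state, and the column names "x" and "y" are assumptions made for illustration; the original notebook may have used different values.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Assumed settings: 100 dots around 3 hand-picked centers (not from the original post).
X, y = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

# Keep the features in a DataFrame so they are easy to inspect.
dataframe = pd.DataFrame(X, columns=["x", "y"])
print(dataframe.head())
print(dataframe.shape)
```

make_blobs also returns y, the blob each dot was drawn from. A clustering model never sees it, but it is handy for checking results later.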

Try a simple scatter plot. There are actually 3 groups, aren’t there?

plt
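A scatter plot like this one could be drawn roughly as follows. This is a sketch that assumes matplotlib and the same illustrative make_blobs settings; the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

fig, ax = plt.subplots()
points = ax.scatter(X[:, 0], X[:, 1])  # one dot per sample
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
ax.set_title("make_blobs sample")
```

In a notebook the figure shows up on its own; in a script, call plt.show() with an interactive backend or fig.savefig() to write it to a file.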


DBSCAN

DBSCAN stands for “Density-Based Spatial Clustering of Applications with Noise”. It works like this.

  1. Give x as a distance.
  2. Give y as a minimum number of dots. A dot is a core point when at least y dots, counting itself, lie within radius x of it.
  3. Pick an unvisited core point and start a group. Add every dot within radius x of it, then keep expanding from each newly added dot that is also a core point.
  4. Repeat until every dot has been visited. A dot that no core point can reach belongs to no group and is marked as noise.
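The core-point rule in the steps above can be checked by hand. The sketch below counts the neighbours of each dot with plain NumPy and compares the result with the core points scikit-learn reports. The dataset settings and the values x = 1.0 and y = 5 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)
eps, min_samples = 1.0, 5  # the distance x and the number y

# Distance between every pair of dots.
distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# A dot is a core point if at least min_samples dots (itself included) are within eps.
neighbour_counts = (distances <= eps).sum(axis=1)
is_core = neighbour_counts >= min_samples

# Compare with the core points found by scikit-learn.
model = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
core_from_sklearn = np.zeros(len(X), dtype=bool)
core_from_sklearn[model.core_sample_indices_] = True

print("core points by hand:", int(is_core.sum()))
print("same as sklearn:", bool(np.array_equal(is_core, core_from_sklearn)))
```

This only reproduces the core-point test, not the full group expansion, which DBSCAN handles internally.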

Now we start by creating a DBSCAN object with 2 parameters:

  • eps (epsilon) as the distance x.
  • min_samples as the minimum number of dots, y.

After that, we call .fit_predict(), which returns a label for each dot. The same labels are stored in .labels_, and noise dots get the label -1.

Here we use pd.unique() to list all the groups the model found.

DBSCAN
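A minimal sketch of this step might look like the following. The dataset settings and the values of eps and min_samples are assumptions, so the groups printed here need not match the ones in the original screenshot.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)
dataframe = pd.DataFrame(X, columns=["x", "y"])

clustering = DBSCAN(eps=1.0, min_samples=5)
labels = clustering.fit_predict(dataframe)  # one label per dot; -1 means noise

# List the distinct groups the model found.
print(pd.unique(labels))
```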

Change eps and min_samples and the resulting groups change.

update params
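To see the effect of the parameters, one option is to loop over a few settings and count groups and noise dots. The settings below are arbitrary picks for illustration; the very large eps is there only to show the extreme case where everything merges into one group.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

settings = [(0.5, 5), (1.0, 5), (50.0, 5)]  # (eps, min_samples) pairs to compare
results = {}
for eps, min_samples in settings:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    found = set(labels.tolist())
    n_clusters = len(found - {-1})         # -1 is noise, not a group
    n_noise = int((labels == -1).sum())
    results[(eps, min_samples)] = (n_clusters, n_noise)
    print(f"eps={eps}, min_samples={min_samples}: {n_clusters} groups, {n_noise} noise dots")
```

A small eps tends to leave more dots as noise, while a very large eps swallows every dot into a single group.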


K-means

K-means is a popular choice because it is easy to use. It only needs the number of groups.

First, we ask for 3 groups, and we get 3 groups.

k-means
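As a sketch, asking K-means for 3 groups could look like this. The dataset settings, n_init=10, and random_state=0 are assumptions chosen to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one label per dot, from 0 to n_clusters - 1

print(pd.unique(labels))
```

Unlike DBSCAN, K-means has no noise label: every dot is assigned to one of the groups.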

Use .cluster_centers_ to find the center of each group.

cluster center
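Reading the centers back might look like the sketch below. Because the dataset here is generated around assumed, hand-picked centers (true_centers is a name introduced for this example), the fitted centers can be compared with them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed centers and settings, for illustration only.
true_centers = np.array([[-5, -5], [0, 5], [5, -5]])
X, _ = make_blobs(
    n_samples=100,
    centers=true_centers,
    cluster_std=1.0,
    random_state=0,
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One row per group, one column per feature.
print(np.round(kmeans.cluster_centers_, 2))

# Distance from each assumed center to the nearest fitted center.
gaps = np.linalg.norm(
    true_centers[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=-1
).min(axis=1)
print(np.round(gaps, 2))
```

The order of the rows in .cluster_centers_ follows the label numbers, which need not match the order of the centers passed to make_blobs.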

Let’s try to find 5 groups.

k-means 5 groups

Interesting.

cluster center 5 groups
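The same sketch with 5 groups, again under assumed dataset settings: K-means returns exactly as many groups as requested, even when the data only has 3 obvious blobs, so some blobs get split.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# How many dots ended up in each of the 5 groups.
group_sizes = np.bincount(labels)
print(group_sizes)

# Five centers, one per group.
print(np.round(kmeans.cluster_centers_, 2))
```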


OPTICS

The last one is OPTICS, which stands for “Ordering Points To Identify the Clustering Structure”.

It is similar to DBSCAN but does not require a fixed epsilon. It suits large datasets, at the cost of a longer run time.

optics
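A minimal OPTICS sketch, assuming the same illustrative dataset and min_samples=5 (which is also the scikit-learn default):

```python
import pandas as pd
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

optics = OPTICS(min_samples=5)  # no eps needed
labels = optics.fit_predict(X)  # -1 marks noise, as in DBSCAN

print(pd.unique(labels))

# The reachability distance of each dot, which OPTICS uses to extract the groups.
print(optics.reachability_[:5])
```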

Try changing min_samples.

optics change param
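One way to compare a few values of min_samples is a short loop like this sketch. The values 5, 10, and 20 are arbitrary picks for illustration.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for min_samples in (5, 10, 20):
    labels = OPTICS(min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})  # ignore the noise label
    n_noise = int((labels == -1).sum())
    results[min_samples] = (n_clusters, n_noise)
    print(f"min_samples={min_samples}: {n_clusters} groups, {n_noise} noise dots")
```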


Metrics measurement

Now it’s assessment time. There are 3 main scores for clustering models.

  1. Silhouette score
    Compares how close each dot is to its own cluster with how close it is to the nearest other cluster.
    Best at 1 and worst at -1.
  2. Davies-Bouldin score
    Compares the dispersion within each cluster with the distance between clusters.
    Best at 0; the higher, the worse.
  3. Calinski-Harabasz score
    Is the ratio of between-cluster dispersion to within-cluster dispersion.
    The higher, the better.
from sklearn import metrics

# dataframe: the feature table used for fitting
# clustering: a fitted model, e.g. DBSCAN, KMeans, or OPTICS

# Silhouette score
metrics.silhouette_score(dataframe, clustering.labels_)

# Davies-Bouldin score
metrics.davies_bouldin_score(dataframe, clustering.labels_)

# Calinski-Harabasz score
metrics.calinski_harabasz_score(dataframe, clustering.labels_)

metrics
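For a run that works end to end, here is a sketch that fits K-means on an assumed make_blobs dataset and computes all three scores. The dataset settings and the choice of K-means are illustrative; any fitted model with .labels_ would do, as long as it found at least 2 groups.

```python
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed settings, for illustration only.
X, _ = make_blobs(
    n_samples=100,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)
dataframe = pd.DataFrame(X, columns=["x", "y"])
clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dataframe)

# Closer to 1 is better.
silhouette = metrics.silhouette_score(dataframe, clustering.labels_)
# Closer to 0 is better.
davies_bouldin = metrics.davies_bouldin_score(dataframe, clustering.labels_)
# Higher is better.
calinski_harabasz = metrics.calinski_harabasz_score(dataframe, clustering.labels_)

print(f"Silhouette:        {silhouette:.3f}")
print(f"Davies-Bouldin:    {davies_bouldin:.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz:.1f}")
```

With DBSCAN or OPTICS labels, keep in mind that these functions treat the noise label -1 as one more group.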


Hope this is useful, as grouping problems are common in many industries.

Let’s see what’s next.

See ya~


This post is licensed under CC BY 4.0 by the author.