
One of the classic problems for data scientists is clustering, or grouping. For example, we may need to sort 100 customers by lifestyle, e.g. bookworms, sports fans, and shoppers. How can we do that?

Clustering

For this kind of problem, scikit-learn provides the module sklearn.cluster.

Preparing

This time, we generate a sample dataset with make_blobs from sklearn.datasets.
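As a minimal sketch of this preparation step, the snippet below generates a small dataset. The sample count, the number of centers, the seed, and the column names "x" and "y" are assumptions for illustration, not the settings used in the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Assumed settings: 300 points around 3 centers, fixed seed for repeatable output.
points, true_groups = make_blobs(
    n_samples=300, centers=3, n_features=2, random_state=42
)

# Keep the coordinates in a DataFrame so later steps can refer to columns by name.
dataframe = pd.DataFrame(points, columns=["x", "y"])
print(dataframe.shape)
print(dataframe.head())
```

make_blobs also returns the group each point was drawn from (true_groups here), which is handy for checking a clustering result later.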

Plot a simple scatter graph and we can see there are actually 3 groups, can’t we?
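One possible way to draw that scatter graph is with matplotlib, sketched below. The data settings are the same assumed ones (300 points, 3 centers, seed 42), and the "Agg" backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Assumed dataset settings, not the original notebook's.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# One dot per sample: first feature on the x axis, second on the y axis.
fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], s=10)
ax.set_xlabel("x")
ax.set_ylabel("y")
# In an interactive session, call plt.show() here to see the figure.
```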

DBSCAN

DBSCAN stands for “Density-Based Spatial Clustering of Applications with Noise”. It works like this.

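  1. Give x as a distance (the neighbourhood radius).
  2. Give y as a minimum number of dots. A dot is a core point when at least y dots, itself included, lie within distance x of it.
  3. Start a group from a core point, add every dot within x of it, and keep expanding through any of those dots that are core points themselves.
  4. Finished when every dot has been visited. Dots that no core point can reach are marked as noise instead of joining a group.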
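To make the steps above concrete, here is a small from-scratch sketch in NumPy. It is a teaching aid and not scikit-learn's implementation; the function name simple_dbscan and the toy points are made up for illustration.

```python
import numpy as np

def simple_dbscan(points, eps, min_samples):
    """Return one label per point; -1 means noise. Illustrative sketch only."""
    n = len(points)
    # All pairwise distances: fine for small data, too costly for large data.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A core point has at least min_samples dots (itself included) within eps.
    is_core = np.array([len(nb) >= min_samples for nb in neighbours])

    labels = np.full(n, -1)
    group = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Start a new group from an unassigned core point and expand it.
        labels[i] = group
        queue = [i]
        while queue:
            current = queue.pop()
            for j in neighbours[current]:
                if labels[j] == -1:
                    labels[j] = group
                    # Only core points keep the expansion going.
                    if is_core[j]:
                        queue.append(j)
        group += 1
    return labels

# Two tight clumps of three dots each, plus one far-away dot.
toy = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]], dtype=float
)
labels = simple_dbscan(toy, eps=1.5, min_samples=3)
print(labels.tolist())  # → [0, 0, 0, 1, 1, 1, -1]
```

The far-away dot has no neighbours within eps, so it never joins a group and keeps the noise label -1.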

We start by creating a DBSCAN object with 2 parameters:

  • eps (epsilon) as the distance x
  • min_samples as the minimum number of dots y

After that, we call .fit_predict(), which returns a group label for every dot. The same labels are kept in .labels_, and the label -1 marks noise.

Here we use pd.unique() to check all groups in the model.
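Put together, these steps might look like the sketch below. The eps and min_samples values are assumed starting points (they happen to match scikit-learn's defaults), and the dataset settings are assumptions as well.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed dataset settings, not the original notebook's.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(points, columns=["x", "y"])

# eps is the distance x, min_samples is the number y.
clustering = DBSCAN(eps=0.5, min_samples=5)
predicted = clustering.fit_predict(dataframe)

# The returned labels are the same ones stored on the fitted model.
print(pd.unique(clustering.labels_))
```

If -1 appears in the printed array, some dots were treated as noise rather than assigned to a group.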

Change eps and min_samples and see how the result differs.
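A quick way to compare settings is to loop over a few values and count the groups and the noise dots for each. The value grids below are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed dataset settings, not the original notebook's.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)

summary = []
for eps in (0.3, 0.5, 1.0):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        # Label -1 is noise, so leave it out when counting groups.
        n_groups = len(set(labels.tolist()) - {-1})
        n_noise = int(np.sum(labels == -1))
        summary.append((eps, min_samples, n_groups, n_noise))
        print(f"eps={eps}, min_samples={min_samples}: "
              f"{n_groups} groups, {n_noise} noise dots")
```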

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

K-means

K-means is a popular choice because it is easy to use. It only needs the number of groups we want.

First, we ask for 3 groups and get 3 groups back.

Use .cluster_centers_ to find the center of each group.
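A minimal K-means sketch for 3 groups follows. The dataset settings, n_init=10, and the random_state value are assumptions chosen to keep the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed dataset settings, not the original notebook's.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the number of groups we ask for.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

print(sorted(set(labels.tolist())))  # the group ids that were assigned
print(kmeans.cluster_centers_)       # one row of coordinates per group center
```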

Let’s try to find 5 groups.
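Asking for 5 groups only changes n_clusters. The sketch below also counts how many dots land in each group; the dataset settings and seed are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed dataset settings: the data still has only 3 natural groups.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=42).fit(points)

# Number of dots assigned to each of the 5 groups.
sizes = np.bincount(kmeans5.labels_, minlength=5)
print(sizes)
print(kmeans5.cluster_centers_)
```

K-means always returns the number of groups requested, so some of the natural groups have to be split to reach 5.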

Interesting.

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

OPTICS

The last one is OPTICS, standing for “Ordering Points To Identify the Clustering Structure”. It is similar to DBSCAN but does not require a fixed epsilon. It suits large datasets, at the cost of a longer run time.

Try changing min_samples.
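The sketch below tries a few min_samples values with OPTICS and reports the groups found. The values and the dataset settings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed dataset settings, not the original notebook's.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)

results = {}
for min_samples in (5, 20, 50):
    optics = OPTICS(min_samples=min_samples)
    labels = optics.fit_predict(points)
    results[min_samples] = labels
    # As with DBSCAN, label -1 marks noise.
    n_groups = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    print(f"min_samples={min_samples}: {n_groups} groups, {n_noise} noise dots")
```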

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html

Metrics measurement

Now it’s assessment time. Here are 3 common scores for clustering models.

  1. Silhouette score
    Compares how close each dot is to its own cluster with how close it is to the nearest other cluster. Best at 1 and worst at -1.
  2. Davies-Bouldin score
    Compares the spread inside each cluster with the distance between cluster centers. Best at 0; the higher, the worse.
  3. Calinski-Harabasz score
    Takes the ratio of between-cluster dispersion to within-cluster dispersion. The higher, the better.

```python
from sklearn import metrics

# dataframe: the data points; clustering: a fitted model such as DBSCAN or KMeans
# Silhouette score
metrics.silhouette_score(dataframe, clustering.labels_)
# Davies-Bouldin score
metrics.davies_bouldin_score(dataframe, clustering.labels_)
# Calinski-Harabasz score
metrics.calinski_harabasz_score(dataframe, clustering.labels_)
```
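As a self-contained sketch, the three scores can be compared across several K-means settings. The dataset settings and the range of k are assumptions. Note that these scores need at least two distinct labels to be computed.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed dataset settings, not the original notebook's.
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = (
        metrics.silhouette_score(points, labels),         # higher is better
        metrics.davies_bouldin_score(points, labels),     # lower is better
        metrics.calinski_harabasz_score(points, labels),  # higher is better
    )
    print(k, scores[k])
```

Reading the three columns side by side gives a rough hint about which number of groups fits the data best.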

I hope this is useful, as grouping problems are common in many industries.

Let’s see what’s next.

See ya~
