Clustering with k-means

This page explains unsupervised clustering with generated points and with the Iris dataset, including centroids, cluster labels, elbow evaluation, and silhouette score.

What you should be able to do

Create visible groups of points.
Train KMeans and read centroids.
Predict cluster membership for existing and new points.
Evaluate cluster count and compactness.

Reusable patterns

n_clusters is the number of groups requested from k-means.
fit trains the model; predict assigns cluster labels.
Silhouette score measures how compact each cluster is and how well it is separated from the others; it ranges from -1 to 1, and higher is better.

Clustering

Clustering is an unsupervised learning task: the algorithm groups similar points without being given any labels.

Position-based clustering: k-means

Listing 1. Define data so the expected number of clusters is visible

# define data so the expected number of clusters is visible
import numpy as np
import matplotlib.pyplot as plt

data = np.random.uniform(0, 50, (100, 2)) # generates a matrix of random numbers between 0 and 50 with 100 rows and 2 columns

data[35:60] += 60 # takes rows 35 to 59 and moves them from the [0, 50] area to approximately [60, 110]
data[61:] += 130 # takes rows 61 to the end and moves them from the [0, 50] area to approximately [130, 180]

Note: the row with index 60 is not moved by either shift, because the first slice stops at row 59 and the second starts at row 61. It stays in the original [0, 50] area, so k-means assigns it to the first cluster rather than the middle one. This shows up later as a single out-of-sequence value in the cluster labels.

Three groups of points are created:

some points around values 0 to 50.
some points around values 60 to 110.
some points around values 130 to 180.
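The slicing can be checked directly. The following is a minimal sketch, assuming a fixed seed (`np.random.seed(0)`, which is not part of the original listing) so the check is repeatable. It prints the value range of each group of rows and confirms that row 60 was never shifted.

```python
import numpy as np

np.random.seed(0)  # assumption: seed added only to make this check repeatable

data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60   # shifts rows 35..59
data[61:] += 130    # shifts rows 61..99; row 60 is skipped

# value range of each group of rows
groups = [("rows 0-34", data[:35]), ("rows 35-59", data[35:60]), ("rows 61-99", data[61:])]
for name, rows in groups:
    print(name, "min:", rows.min().round(1), "max:", rows.max().round(1))

# row 60 was not touched by either shift, so it is still inside [0, 50]
row_60_unshifted = bool((data[60] >= 0).all() and (data[60] <= 50).all())
print("row 60 still in [0, 50]:", row_60_unshifted)
```

If all three groups should be cleanly separated, changing the second slice to start at row 60 would move that row as well.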

Listing 2. Display all points

# display all points
plt.scatter(data[:, 0], data[:, 1])
# data[:, 0] means take all rows and the first column (0). These are the x-coordinates of all points
# data[:, 1] means take all rows and the second column (1). These are the y-coordinates of all points

Expected text output or note

<matplotlib.collections.PathCollection at 0x7980bac85880>

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Using the k-means class from scikit-learn

Listing 3. Define the k-means model

# define the k-means model
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, random_state = 42, n_init = 1) # n_init is the number of runs with different initial centroids; the run with the lowest inertia is kept
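To see why more than one start can matter, the idea behind n_init can be imitated by hand. This is only a sketch of the idea, not scikit-learn's internal procedure. It assumes a fixed NumPy seed and uses init="random" (the listing above uses the default initialization) so that different starts are more likely to end in different places.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

# one single-start run per seed, each beginning from randomly chosen centroids
inertias = []
for seed in range(5):
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(data)
    inertias.append(model.inertia_)  # sum of squared distances to the nearest centroid

# keeping the run with the lowest inertia is the selection rule n_init applies
best_seed = int(np.argmin(inertias))
print("inertia per run:", np.round(inertias, 1))
print("best run (seed):", best_seed)
```

On data this well separated most runs tend to agree, which is why n_init = 1 is usually enough here; on harder data a larger n_init is a safer choice.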

Listing 4. Train the k-means model

# train the k-means model
kmeans.fit(data)

Expected text output or note

KMeans(n_clusters=3, n_init=1, random_state=42)

Listing 5. Get the centroid data

# get the centroid data
centroids = kmeans.cluster_centers_
centroids # 3 rows are 3 centroids and 2 columns are the x and y coordinates of each centroid

Expected text output or note

array([[ 84.30574109,  82.90962839],
       [153.38048927, 155.86865121],
       [ 23.07598232,  21.9415302 ]])
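Each centroid is the mean of the points assigned to its cluster. The sketch below recomputes the centroids by hand and compares them with cluster_centers_. It assumes a fixed NumPy seed and n_init = 10 (both differ from the listings above), so the exact coordinates will not match the output shown.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(data)

# recompute each centroid as the mean of the points that carry its label
manual_centroids = np.array(
    [data[kmeans.labels_ == k].mean(axis=0) for k in range(3)]
)

centroids_match = bool(np.allclose(manual_centroids, kmeans.cluster_centers_, atol=1e-6))
print(np.round(manual_centroids, 2))
print("matches cluster_centers_:", centroids_match)
```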

Listing 6. Predict groups based on centroids

# predict groups based on centroids
# predict assigns each point to the cluster of its nearest centroid
cluster_labels = kmeans.predict(data)
cluster_labels

Expected text output or note

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
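A long label array is easier to read as counts per cluster. The following sketch assumes a fixed NumPy seed and n_init = 10, so the label numbers may differ from the output above, but the group sizes follow from the slicing: 35 unshifted rows plus row 60, 25 rows in the middle group, and 39 rows in the last group.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(data)

# number of points per cluster label
counts = np.bincount(labels)
print("points per cluster:", counts)

# row 60 was never shifted, so it lands in the same cluster as row 0
print("row 60 shares a cluster with row 0:", labels[60] == labels[0])
```

The label numbers themselves (0, 1, 2) are arbitrary; only the grouping carries meaning.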

Listing 7. Plot clusters and centroids

# plot clusters and centroids
plt.scatter(data[:, 0], data[:, 1], c = cluster_labels, cmap = "cool") # c = cluster_labels means each point color depends on the cluster label
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "X", color = "r", s = 200) # s = 200 increases marker size
plt.show()

Expected text output or note

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Prediction for a new value

Listing 8. Predict the cluster for point [297, 103]

# predict the cluster for point [297, 103]
kmeans.predict([[297, 103]])

Expected text output or note

array([1], dtype=int32)
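The prediction for a new point is the index of the closest centroid, even when the point lies far outside every cluster, as [297, 103] does. A minimal sketch of that rule, assuming a fixed NumPy seed and n_init = 10 (so the label number may differ from the output above):

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(data)

new_point = np.array([[297.0, 103.0]])

# Euclidean distance from the new point to each of the three centroids
distances = np.linalg.norm(kmeans.cluster_centers_ - new_point, axis=1)
nearest = int(np.argmin(distances))

predicted = int(kmeans.predict(new_point)[0])
print("distances to centroids:", np.round(distances, 1))
print("nearest centroid:", nearest, "| predict:", predicted)
```

k-means has no notion of "belongs to no cluster", so every input receives a label.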

Evaluation methods

Listing 9. KElbowVisualizer from Yellowbrick

# KElbowVisualizer from Yellowbrick
from yellowbrick.cluster import KElbowVisualizer

visualizer = KElbowVisualizer(kmeans, k = (2, 11)) # cluster counts from 2 to 10 are tested
visualizer.fit(data)

visualizer.show()

Expected text output or note

<Figure size 800x550 with 2 Axes>

<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

[visual output omitted; run the code to display the image or chart]
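The numbers behind an elbow chart can also be produced without Yellowbrick. The sketch below assumes that the distortion score is the sum of squared distances from each point to its assigned centroid, as the Yellowbrick documentation describes it; scikit-learn exposes that quantity as inertia_. The fixed seed and n_init = 10 are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

# fit one model per cluster count and record its inertia
ks = list(range(2, 11))
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(data).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

The elbow is the cluster count after which adding more clusters lowers the inertia only slightly; for this data the large drop happens between 2 and 3 clusters.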

Silhouette score

Listing 10. Calculate the silhouette score

# calculate the silhouette score
from sklearn import metrics
score = metrics.silhouette_score(data, cluster_labels, metric = "euclidean") # data -> original data, cluster_labels -> cluster labels, euclidean -> uses Euclidean distance

print("Silhouette Score: %.3f" % score) # "%.3f" % score formats the number to three decimal places

Expected text output or note

Silhouette Score: 0.695
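The score is the average of a per-point value s = (b - a) / max(a, b), where a is the mean distance from a point to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The following sketch computes s by hand for one point and compares it with silhouette_samples from scikit-learn. The fixed seed, n_init = 10, and the choice of point 0 are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(data)

i = 0  # index of the point to inspect
dists = np.linalg.norm(data - data[i], axis=1)  # distance from point i to every point

own = labels == labels[i]
# a: mean distance to the other points of the same cluster (the point itself adds 0)
a = dists[own].sum() / (own.sum() - 1)
# b: mean distance to the points of the nearest other cluster
b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])

s_manual = (b - a) / max(a, b)
s_library = silhouette_samples(data, labels, metric="euclidean")[i]
print(round(s_manual, 3), round(s_library, 3))
```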

Displaying the silhouette chart

Listing 11. Display the silhouette chart for the chosen cluster count

from yellowbrick.cluster import SilhouetteVisualizer
kmeans = KMeans(n_clusters = 3, random_state = 42, n_init = 1)
visualizer = SilhouetteVisualizer(kmeans, colors = "yellowbrick")
visualizer.fit(data)
visualizer.show()

Expected text output or note

<Figure size 800x550 with 1 Axes>

<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 100 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

[visual output omitted; run the code to display the image or chart]
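The chart draws one bar per sample, grouped by cluster. A text-only summary of the same per-sample values can be sketched with silhouette_samples; the fixed seed and n_init = 10 are assumptions, so the numbers will differ slightly from the chart.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
data = np.random.uniform(0, 50, (100, 2))
data[35:60] += 60
data[61:] += 130

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(data)
sample_scores = silhouette_samples(data, labels)  # one silhouette value per point

# size and mean silhouette value of each cluster
per_cluster = {}
for k in np.unique(labels):
    in_cluster = labels == k
    per_cluster[int(k)] = float(sample_scores[in_cluster].mean())
    print("cluster", k, "size", in_cluster.sum(), "mean silhouette", round(per_cluster[int(k)], 3))

# the overall score is the mean over all samples
overall = float(sample_scores.mean())
print("overall:", round(overall, 3))
```

A cluster whose mean sits well below the overall score, or that contains negative values, is a sign that the chosen cluster count fits that part of the data poorly.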

Practice task: apply k-means to the Iris dataset.

Listing 12. Load the Iris dataset

# load the Iris dataset
from sklearn import datasets
sklearn_dataset = datasets.load_iris()
data = sklearn_dataset.data

Listing 13. Create and train k-means

# create and train k-means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, random_state = 42, n_init = 5)
kmeans.fit(data)

# cluster labels assigned to the training points during fit
cluster_labels = kmeans.labels_
# centroids
centroids = kmeans.cluster_centers_

# visualization: Iris has four features, so this plot shows only the first two (sepal length and sepal width)
import matplotlib.pyplot as plt
plt.figure(figsize = (8, 6))
plt.scatter(data[:, 0], data[:, 1], c = cluster_labels, cmap = "cool", s = 50)
plt.scatter(centroids[:, 0], centroids[:, 1], c = "red", marker = "X", s = 200)
plt.title("K-means clustering on the Iris dataset")
plt.show()

Expected text output or note

<Figure size 800x600 with 1 Axes>

[visual output omitted; run the code to display the image or chart]
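The evaluation steps from the generated data carry over to the practice task. As a possible extension, the sketch below computes the silhouette score on all four Iris features for a few cluster counts. The range 2 to 5 and n_init = 10 are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = datasets.load_iris().data  # 150 samples, 4 features

# silhouette score for each candidate cluster count
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(data)
    scores[k] = float(silhouette_score(data, labels, metric="euclidean"))
    print(k, round(scores[k], 3))
```

The count with the highest silhouette score need not equal the number of species in the dataset, because k-means sees only the distances between points, not the species labels.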
