Clustering with k-means
This page explains unsupervised clustering with generated points and with the Iris dataset, including centroids, cluster labels, elbow evaluation, and silhouette score.
What you should be able to do
- Create visible groups of points.
- Train KMeans and read centroids.
- Predict cluster membership for existing and new points.
- Evaluate cluster count and compactness.
Reusable patterns
- n_clusters is the number of groups requested from k-means.
- fit trains the model; predict assigns cluster labels.
- The silhouette score measures how compact and well separated the clusters are; it ranges from -1 to 1, and higher values indicate better-defined clusters.
Clustering
Clustering is an unsupervised learning task: it groups data points by similarity without using labels.
Position-based clustering: k-means
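As a rough illustration of what k-means does internally, the sketch below alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it. This is a minimal NumPy sketch of the standard iteration, not scikit-learn's implementation; the function name `simple_kmeans`, the fixed seed, and the toy points are assumptions made for the example.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means iteration; hypothetical helper, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # start from k distinct points chosen at random as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each centroid to the mean of its points; keep it in place if it has none
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the labels are stable
        centroids = new_centroids
    return centroids, labels

# two well-separated toy groups (made-up values)
toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
toy_centroids, toy_labels = simple_kmeans(toy, k=2)
print(toy_centroids)
print(toy_labels)
```

The label numbers themselves carry no meaning: which group is called 0 and which is called 1 depends on the initial centroids.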
Listing 1. Define data so the expected number of clusters is visible
# define data so the expected number of clusters is visible
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(0, 50, (100, 2)) # generates a matrix of random numbers between 0 and 50 with 100 rows and 2 columns
data[35:60] += 60 # takes rows 35 to 59 and moves them from the [0, 50] area to approximately [60, 110]
data[61:] += 130 # takes rows 61 to the end and moves them from the [0, 50] area to approximately [130, 180]
Note: the row with index 60 is not moved by either of the two shift commands, so it stays in the original area. This can cause one point to be closer to the first cluster than to the middle cluster. This is visible later in the cluster labels.
Three groups of points are created:
- some points around values 0 to 50.
- some points around values 60 to 110.
- some points around values 130 to 180.
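To see why exactly one row is left behind, it can help to check the slice boundaries directly. The sketch below repeats the shifts from Listing 1 with a fixed seed and counts the rows in each range. The seed and the names `rng` and `demo` are assumptions added here for repeatability; they are not part of the original listing.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, assumed only for this sketch
demo = rng.uniform(0, 50, (100, 2))
demo[35:60] += 60    # rows 35..59 (the stop index 60 is excluded)
demo[61:] += 130     # rows 61..99

# count the rows that fall in each value range
low = np.sum(np.all(demo < 50, axis=1))
middle = np.sum(np.all((demo >= 60) & (demo < 110), axis=1))
high = np.sum(np.all(demo >= 130, axis=1))
print(low, middle, high)   # 36 25 39: rows 0..34 plus row 60, rows 35..59, rows 61..99
```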
Listing 2. Display all points
# display all points
plt.scatter(data[:, 0], data[:, 1])
# data[:, 0] means take all rows and the first column (0). These are the x-coordinates of all points
# data[:, 1] means take all rows and the second column (1). These are the y-coordinates of all points
Expected text output or note
<matplotlib.collections.PathCollection at 0x7980bac85880>
<Figure size 640x480 with 1 Axes>
[visual output omitted; run the code to display the image or chart]
Using the k-means class from scikit-learn
Listing 3. Define the k-means model
# define the k-means model
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, random_state = 42, n_init = 1) # n_init controls how many times k-means is run with different initial centroids; the best run is kept
Listing 4. Train the k-means model
# train the k-means model
kmeans.fit(data)
Expected text output or note
KMeans(n_clusters=3, n_init=1, random_state=42)
Listing 5. Get the centroid data
# get the centroid data
centroids = kmeans.cluster_centers_
centroids # 3 rows are 3 centroids and 2 columns are the x and y coordinates of each centroid
Expected text output or note
array([[ 84.30574109, 82.90962839],
[153.38048927, 155.86865121],
[ 23.07598232, 21.9415302 ]])
Listing 6. Predict groups based on centroids
# predict groups based on centroids
# predict assigns each point to a cluster
cluster_labels = kmeans.predict(data)
cluster_labels
Expected text output or note
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
Listing 7. Plot clusters and centroids
# plot clusters and centroids
plt.scatter(data[:, 0], data[:, 1], c = cluster_labels, cmap = "cool") # c = cluster_labels means each point color depends on the cluster label
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "X", color = "r", s = 200) # s = 200 increases marker size
plt.show()
Expected text output or note
<Figure size 640x480 with 1 Axes>
[visual output omitted; run the code to display the image or chart]
Prediction for a new value
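For a fitted k-means model, assigning a point to a cluster amounts to finding its nearest centroid. The sketch below does this by hand with NumPy, using the centroid values printed after Listing 5 (copied here as literals; a fresh run of Listing 1 would give different values). The helper name `nearest_centroid` is hypothetical.

```python
import numpy as np

# centroid coordinates copied from the output of Listing 5
centroids_demo = np.array([[84.30574109, 82.90962839],
                           [153.38048927, 155.86865121],
                           [23.07598232, 21.9415302]])

def nearest_centroid(point, centers):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = np.linalg.norm(centers - np.asarray(point, dtype=float), axis=1)
    return int(distances.argmin())

print(nearest_centroid([297, 103], centroids_demo))   # 1
```

The point [297, 103] lies far outside all three groups, yet it still receives a label: k-means always assigns the nearest centroid and has no notion of an outlier.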
Listing 8. Predict the cluster for point [297, 103]
# predict the cluster for point [297, 103]
kmeans.predict([[297, 103]])
Expected text output or note
array([1], dtype=int32)
Evaluation methods
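The elbow method compares a distortion score across several values of k. For k-means this score is commonly the sum of squared distances from each point to its assigned centroid (scikit-learn exposes it as `inertia_` on a fitted `KMeans`). The sketch below computes that quantity by hand on made-up points; the function name `distortion` and the toy values are assumptions.

```python
import numpy as np

def distortion(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# made-up example: four points on a line, described by one center and by two centers
pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
one_center = np.array([[6.0, 0.0]])
one_label = np.zeros(4, dtype=int)
two_centers = np.array([[1.0, 0.0], [11.0, 0.0]])
two_labels = np.array([0, 0, 1, 1])

print(distortion(pts, one_center, one_label))    # 104.0
print(distortion(pts, two_centers, two_labels))  # 4.0
```

Going from one cluster to two cuts the score sharply here. Adding clusters beyond the real structure of the data usually gives much smaller gains, and that bend in the curve is the "elbow".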
Listing 9. KElbowVisualizer from Yellowbrick
# KElbowVisualizer from Yellowbrick
from yellowbrick.cluster import KElbowVisualizer
visualizer = KElbowVisualizer(kmeans, k = (2, 11)) # cluster counts from 2 to 10 are tested
visualizer.fit(data)
visualizer.show()
Expected text output or note
<Figure size 800x550 with 2 Axes>
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
[visual output omitted; run the code to display the image or chart]
Silhouette score
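For a single point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The silhouette score is the mean of these coefficients over all points. The sketch below is a hand-rolled NumPy version intended only to show the arithmetic; the function name `silhouette_by_hand` and the toy data are assumptions, and it assumes every cluster has at least two points.

```python
import numpy as np

def silhouette_by_hand(points, labels):
    """Mean silhouette coefficient; assumes each cluster has at least two points."""
    # pairwise Euclidean distances, shape (n, n)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    coefficients = []
    for i in range(len(points)):
        same = labels == labels[i]
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        coefficients.append((b - a) / max(a, b))
    return float(np.mean(coefficients))

# made-up example: two tight pairs that are far apart
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(round(silhouette_by_hand(pts, labels), 3))   # 0.9
```

A value close to 1 means points sit much nearer to their own cluster than to the next one; values near 0 indicate overlapping clusters.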
Listing 10. Calculate the silhouette score
# calculate the silhouette score
from sklearn import metrics
score = metrics.silhouette_score(data, cluster_labels, metric = "euclidean") # data -> original data, cluster_labels -> cluster labels, euclidean -> uses Euclidean distance
print("Silhouette Score: %.3f" % score) # "%.3f" % score formats the number to three decimal places
Expected text output or note
Silhouette Score: 0.695
Displaying the silhouette chart
Listing 11. Display the silhouette chart
from yellowbrick.cluster import SilhouetteVisualizer
kmeans = KMeans(n_clusters = 3, random_state = 42, n_init = 1)
visualizer = SilhouetteVisualizer(kmeans, colors = "yellowbrick")
visualizer.fit(data)
visualizer.show()
Expected text output or note
<Figure size 800x550 with 1 Axes>
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 100 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
[visual output omitted; run the code to display the image or chart]
Practice task. Apply k-means to the Iris dataset.
Listing 12. Load the Iris dataset
# load the Iris dataset
from sklearn import datasets
sklearn_dataset = datasets.load_iris()
data = sklearn_dataset.data
Listing 13. Create and train k-means
# create and train k-means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, random_state = 42, n_init = 5)
kmeans.fit(data)
# cluster labels assigned to the training points during fit
klasteri = kmeans.labels_
# centroids
centroids = kmeans.cluster_centers_
# visualization (only the first two of the four Iris features are plotted)
import matplotlib.pyplot as plt
plt.figure(figsize = (8, 6))
plt.scatter(data[:, 0], data[:, 1], c = klasteri, cmap = "cool", s = 50)
plt.scatter(centroids[:, 0], centroids[:, 1], c = "red", marker = "X", s = 200)
plt.title("K-means clustering of the Iris dataset")
plt.show()
Expected text output or note
<Figure size 800x600 with 1 Axes>
[visual output omitted; run the code to display the image or chart]