Lecture 04: K-Means and Probability – CS 189, Fall 2025

In this lecture notebook, we will explore k-means clustering and some of the basic probability calculations from lecture.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly import figure_factory as ff
colors = px.colors.qualitative.Plotly
px.defaults.width = 800
from ipywidgets import HBox
import numpy as np
pd.set_option('plotting.backend', 'plotly')
In [2]:
# make the images folder if it doesn't exist
import os
if not os.path.exists("images"):
    os.makedirs("images")

# Uncomment for HTML Export
import plotly.io as pio
pio.renderers.default = "notebook_connected"

The Bike Dataset

Here we will apply k-means clustering to the bike dataset to explore the distribution of speed and length of Prof. Gonzalez's bike ride segments.

In [3]:
bikes = pd.read_csv("speed_length_data.csv")
bikes.head()
Out[3]:
Speed Length
0 9.699485 16.345721
1 11.856724 23.159855
2 7.608359 25.434056
3 14.280750 25.867538
4 12.004195 7.105725
In [4]:
bikes.plot.scatter(x='Speed', y='Length', title='Speed vs Length of Bike Segments', 
                   height=800)

Scikit-Learn K-Means Clustering

We have data for 4 bikes. Let's try k-means clustering using scikit-learn with 4 clusters.

In [5]:
from sklearn.cluster import KMeans
# Create a KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
# Fit the model to the data
kmeans.fit(bikes[['Speed', 'Length']])
# Get the cluster labels
bikes['scikit k-means'] = kmeans.predict(bikes[['Speed', 'Length']]).astype(str)

We can visualize the resulting clusters along with their centroids.

In [6]:
fig = px.scatter(
    bikes, x='Speed', y='Length', color='scikit k-means',
    title='K-Means Clustering of Bike Segments',
    height=800)
fig.add_scatter(
    x=kmeans.cluster_centers_[:,0],
    y=kmeans.cluster_centers_[:,1],
    mode='markers',
    marker=dict(color='black', size=10),
    name='Centroids'
)
#fig.write_image("images/bike_kmeans.pdf", scale=2, height=800, width=700)





Return to Lecture





Implementing the K-Means Clustering Algorithm

Lloyd's algorithm for K-means clustering has three key elements: initialization, assignment, and update. We will implement a function for each.

Initialization

Here we use the basic Forgy method of randomly selecting k data points as the initial cluster centers.

In [7]:
def initialize_centers(x, k):
    """Randomly select k unique points from x to use as initial centers."""
    ind = np.random.choice(np.arange(x.shape[0]), k, replace=False)
    return x[ind]
In [8]:
k = 4
x = bikes[['Speed', 'Length']].to_numpy()
centers = initialize_centers(x, k)
centers
Out[8]:
array([[ 8.94348446, 24.05257377],
       [12.7764257 , 18.58972398],
       [13.18110706, 28.41900961],
       [10.5910518 , 15.32649042]])

Assignment

In this step, we assign each data point to the nearest cluster center.

In [9]:
def compute_assignments(x, centers):
    """Assign each point in x to the nearest center."""
    # Broadcasting: x[:, np.newaxis] has shape (n, 1, d) and centers has shape (k, d),
    # so the difference has shape (n, k, d); taking the norm over axis 2 gives an
    # (n, k) matrix of point-to-center distances.
    distances = np.linalg.norm(x[:, np.newaxis] - centers, axis=2)
    return np.argmin(distances, axis=1)
In [10]:
assignments = compute_assignments(x, centers)
assignments
Out[10]:
array([3, 0, 0, 2, 3, 3, 3, 0, 0, 0, 3, 0, 0, 3, 3, 3, 3, 2, 3, 1, 0, 3,
       1, 3, 3, 0, 3, 2, 2, 3, 0, 3, 3, 0, 0, 0, 1, 3, 0, 1, 3, 3, 3, 0,
       3, 0, 3, 2, 2, 3, 0, 0, 0, 0, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 1, 2,
       3, 0, 3, 0, 0, 3, 3, 0, 0, 0, 0, 3, 1, 3, 2, 0, 3, 3, 3, 3, 0, 0,
       0, 3, 3, 0, 0, 2, 0, 3, 2, 3, 3, 3, 3, 0, 1, 2, 0, 3, 3, 0, 3, 3,
       3, 3, 0, 0, 0, 0, 0, 3, 0, 2, 0, 0, 3, 3, 3, 3, 0, 0, 3, 2, 3, 3,
       3, 3, 0, 1, 2, 0, 3, 0, 0, 3, 3, 3, 3, 3, 0, 2, 3, 3, 0, 3, 3, 3,
       0, 3, 0, 0, 3, 0, 3, 0, 0, 2, 3, 0, 3, 3, 3, 2, 0, 3, 2, 3, 0, 3,
       0, 3, 0, 3, 0, 3, 0, 0, 1, 1, 0, 2, 2, 3, 3, 0, 0, 0, 0, 3, 0, 0,
       0, 0, 0, 3, 3, 0, 0, 3, 3, 0, 3, 3, 2, 0, 0, 1, 0, 3, 1, 0, 0, 0,
       0, 2, 3, 3, 3, 3, 2, 3, 0, 0, 0, 2, 2, 1, 3, 0, 1, 3, 1, 3, 0, 0,
       0, 1, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 3, 2, 1, 0, 2, 3, 0, 0, 3, 2,
       0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 1, 3, 3, 2, 3,
       3, 2, 3, 2, 3, 3, 0, 1, 3, 2, 0, 0, 0, 3, 3, 2, 2, 0, 0, 0, 2, 1,
       0, 3, 3, 3, 3, 0, 0, 0, 3, 3, 3, 3, 2, 0, 0, 3, 0, 0, 2, 0, 3, 1,
       3, 0, 3, 3, 3, 3, 0, 0, 3, 3, 1, 3, 3, 3, 3, 0, 0, 3, 1, 0, 2, 0,
       0, 3, 3, 0, 3, 2, 0, 2, 3, 0, 3, 2, 1, 3, 0, 0, 3, 3, 3, 3, 3, 0,
       3, 3, 0, 3, 0, 1, 3, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 3, 3, 0, 3, 0,
       0, 3, 0, 0, 3, 0, 2, 0, 3, 0, 3, 3, 0, 3, 3, 0, 0, 2, 0, 3, 3, 3,
       0, 0, 0, 0, 3, 3, 3, 2, 3, 0, 3, 3, 3, 0, 0, 3, 3, 0, 3, 0, 0, 3,
       3, 0, 1, 3, 0, 0, 0, 0, 3, 3, 0, 2, 1, 0, 3, 1, 3, 1, 3, 3, 3, 0,
       0, 2, 2, 1, 3, 0, 3, 0, 3, 0, 3, 0, 2, 0, 3, 2, 0, 2, 0, 0, 1, 0,
       2, 2, 3, 3, 0, 0, 3, 2, 3, 3, 3, 3, 2, 3, 0, 0, 0, 2, 3, 0, 0, 3,
       0, 0, 0, 0, 0, 3, 0, 0, 3, 3, 3, 0, 0, 3, 2, 3, 0, 0])

Update Centers

In this step, we update the cluster centers by computing the mean of all points assigned to each center.

In [11]:
def update_centers(x, assignments, k):
    """Update centers based on the current assignments."""
    return np.array([x[assignments == i].mean(axis=0) for i in range(k)])
In [12]:
centers = update_centers(x, assignments, k)
centers
Out[12]:
array([[ 9.94615559, 24.01761893],
       [11.40447103, 19.54414142],
       [14.72567631, 27.80903952],
       [10.80705086, 11.42630463]])

Lloyd's Algorithm

We put all these pieces together in a loop that repeats the assignment and update steps until the cluster assignments no longer change (or a maximum number of iterations is reached).

In [13]:
def k_means_clustering(x, k, max_iters=100):
    """Lloyd's algorithm: alternate assignment and update steps until
    the assignments stop changing (or max_iters is reached)."""
    centers = initialize_centers(x, k)
    assignments_old = -np.ones(x.shape[0])
    soln_path = [centers]  # record the centers at every step for plotting
    for _ in range(max_iters):
        assignments = compute_assignments(x, centers)
        centers = update_centers(x, assignments, k)
        soln_path.append(centers)
        # Converged once the assignments stop changing.
        if np.array_equal(assignments, assignments_old):
            break
        assignments_old = assignments
    return centers, assignments, soln_path
In [14]:
np.random.seed(43)
centers, assignments, soln_path = k_means_clustering(x, k)
len(soln_path)
Out[14]:
11
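
As a rough, illustrative sanity check, we can compare the centers found by our implementation to those scikit-learn found earlier. Cluster labels are arbitrary, so we sort both sets of centers by their first coordinate before comparing; k-means can converge to different local optima, so the two results should be similar but need not match exactly.

# Sort both sets of centers by their Speed coordinate so they can be compared
# side by side (the cluster labels themselves carry no meaning).
ours = centers[np.argsort(centers[:, 0])]
sklearns = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_[:, 0])]
print(np.round(ours, 2))
print(np.round(sklearns, 2))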

The solution path contains the initial centers plus the centers after each iteration (11 entries here). The following code visualizes the clustering process at each iteration.

In [15]:
### Construct an animation of the k-means algorithm.
### You do not need to understand the code below for the class.
### It is just for making the animation.

## Prepare a giant table with all the data and centers labeled with the iteration.
pts = []
for i, centers in enumerate(soln_path):
    df = bikes[['Speed', 'Length']].copy()
    df['Class'] = compute_assignments(x, centers).astype(str)
    df2 = pd.DataFrame(centers, columns=['Speed', 'Length'])
    df2['Class'] = 'Center'
    df_combined = pd.concat([df, df2], ignore_index=True)
    # I also need an index for each data point and center;
    # the index acts as a unique identifier for each point across animation frames.
    df_combined.reset_index(inplace=True)
    # The iteration number tracks the frame in the animation
    df_combined['Iteration'] = i
    pts.append(df_combined)
# I stack all the data into one big table.
frames = pd.concat(pts, ignore_index=True)

## Make the animation
fig = px.scatter(frames, x='Speed', y='Length', color='Class', 
                 animation_group='index',
                 animation_frame='Iteration', title='K-Means Clustering',
                 width=700, height=800)
## The default aspect ratio of the plot is misleading, so force equal scaling.
fig.update_layout(
    xaxis=dict(scaleanchor="y", scaleratio=1),
    yaxis=dict(scaleanchor="x", scaleratio=1)
)
# fig.write_image("images/bike_kmeans_animation_0.pdf", height=800, width=700)

## Touch up the center markers to make them more visible
fig.update_traces(marker=dict(size=12, symbol='x', color='black'), 
                  selector=dict(legendgroup='Center') )
for i, f in enumerate(fig.frames):
    for trace in f.data:
        if trace.name == 'Center':
            trace.update(marker=dict(size=12, symbol='x', color='black'))
    # go.Figure(f.data, f.layout).write_image(
    #     f"images/bike_kmeans_animation_{i+1}.pdf", height=800, width=700)
# fig.write_html("images/bike_kmeans_animation.html",include_plotlyjs='cdn', full_html=True)
fig





Return to Lecture





K-Means on Pixel Data

Let's load an image from a recent bike ride.

In [16]:
from PIL import Image
img = np.array(Image.open("bike2.jpeg"))
print(img.shape)
px.imshow(img)
(1280, 960, 3)

We can think of an image as a collection of pixels, each represented by a color value. In an RGB image, each pixel is represented by three values corresponding to the red, green, and blue color channels, so each pixel is a three-dimensional vector. We can plot these vectors in RGB space.

In [17]:
image_df = pd.DataFrame(img.reshape(-1,3), columns=['R', 'G', 'B'])
image_df['color'] = ("rgb(" + 
                     image_df['R'].astype(str) + "," + 
                     image_df['G'].astype(str) + "," + 
                     image_df['B'].astype(str) + ")")
fig = go.Figure()
small_image_df = image_df.sample(100000, random_state=42)
fig.add_scatter3d(x=small_image_df['R'], y=small_image_df['G'], z=small_image_df['B'],
                   mode='markers', marker=dict(color=small_image_df['color'], opacity=0.5, size=2))
fig.update_layout(scene=dict(xaxis_title='R', yaxis_title='G', zaxis_title='B'), 
                  width=800, height=800,)
# fig.write_html("images/bike_color_space.html",include_plotlyjs='cdn', full_html=True)
fig
In [18]:
from sklearn.cluster import KMeans
# Apply k-means clustering to the RGB columns
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
image_df['cluster'] = kmeans.fit_predict(image_df[['R', 'G', 'B']])
image_df['cluster'].value_counts()
Out[18]:
cluster
5    273889
6    256235
4    180678
1    164461
3    150634
2    124331
7     49735
0     28837
Name: count, dtype: int64
In [19]:
from plotly.subplots import make_subplots
# Replace each pixel's RGB value with its assigned cluster center and
# reshape back to the original image dimensions.
img_kmeans = (
    kmeans.cluster_centers_[image_df['cluster'].values]
    .reshape(img.shape)
)
# make two linked subplots
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Original Image", "K-Means Compressed Image"),
    specs=[[{"type": "xy"}, {"type": "xy"}]]
)
fig.add_trace(px.imshow(img).data[0], row=1, col=1)
fig.add_trace(px.imshow(img_kmeans).data[0], row=1, col=2)

# 1) Make both subplots share identical ranges
# (use half-pixel bounds to match how Image traces are rendered)
H, W = img.shape[:2]
xrange = [-0.5, W - 0.5]
yrange = [H - 0.5, -0.5]  # origin at top
for c in (1, 2):
    fig.update_xaxes(range=xrange, row=1, col=c)
    fig.update_yaxes(range=yrange, row=1, col=c)

# 2) Lock square pixels and prevent domain stretch
fig.update_yaxes(scaleanchor="x",  scaleratio=1, constrain="domain", row=1, col=1)
fig.update_yaxes(scaleanchor="x2", scaleratio=1, constrain="domain", row=1, col=2)

# Link panning/zooming between the two images (now safe)
fig.update_xaxes(matches="x")
fig.update_yaxes(matches="y")

# Cosmetic
# fig.write_html("images/bike_kmeans_compression.html",include_plotlyjs='cdn', full_html=True)
fig.update_layout(width=900, height=600, margin=dict(t=50, b=30, l=20, r=20))

Choosing the Number of Clusters

We can use the k-means objective to evaluate the quality of a clustering. The objective (called inertia) is the sum of squared distances from each point to its assigned cluster center; for a fixed k, a lower objective indicates a better clustering. Note that the objective generally keeps decreasing as k grows, so we cannot simply pick the k with the smallest objective.
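
To make the objective concrete, here is a minimal sketch that computes the inertia by hand for the 8-cluster image model fit above; it should agree with kmeans.inertia_ up to floating point error.

# Sum of squared distances from each pixel's RGB value to its assigned
# cluster center, i.e. the quantity scikit-learn reports as inertia_.
rgb = image_df[['R', 'G', 'B']].to_numpy(dtype=float)
assigned_centers = kmeans.cluster_centers_[image_df['cluster'].values]
manual_inertia = np.sum((rgb - assigned_centers) ** 2)
manual_inertia, kmeans.inertia_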

A standard method of choosing k is the "elbow method", where we plot the k-means objective against k and look for an "elbow" point where the rate of improvement slows down.

In [20]:
scores = pd.DataFrame(columns=['k'])
scores['k'] = [2, 4, 8, 16, 32, 64, 128, 256]
scores.set_index('k', inplace=True)

# Apply k-means clustering to the RGB columns
from sklearn.cluster import KMeans
for k in scores.index:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(image_df[['R', 'G', 'B']])
    # scores.loc[k, 'score'] = kmeans.score(image_df[['R', 'G', 'B']])
    scores.loc[k, 'score'] = kmeans.inertia_  # inertia_ is the (positive) sum of squared distances; kmeans.score() returns its negative
In [21]:
fig = px.line(
    scores, 
    title="K-Means Objective vs Number of Clusters",
    markers=True,
    labels={"index": "Number of Clusters (k)", 
            "value": "K-Means Objective"},
    width=700, height=400
)
fig.update_layout(xaxis_type="log", xaxis_exponentformat='power', showlegend=False)
fig.update_layout(margin=dict(t=50, b=30, l=20, r=20))
# fig.write_image("images/kmeans_score_vs_k.pdf", scale=2, height=400, width=700)
fig





Return to Lecture





Wake Word Detector

In lecture, we derived the probability that a wake word was actually spoken given that the detector reported a detection, using Bayes' theorem.

\begin{align*}
P(W = 1 \mid D=1) &= \frac{P(D=1 \mid W=1)\,P(W=1)}{P(D=1)}\\
&= \frac{P(D=1 \mid W=1)\,P(W=1)}{P(D=1 \mid W=1)\,P(W=1) + P(D=1 \mid W=0)\,P(W=0)}
\end{align*}

In the following function, we will implement this calculation.

In [22]:
def wake_word_detector(
        p_wake = 0.0001,               # P(wake) prior probability of wake word
        p_detect_g_wake = 0.99,        # P(detect | wake) likelihood of detection given wake
        p_detect_g_nowake = 0.001      # P(detect | no wake) likelihood of detection given no wake
):
    # P(wake | detect) = P(detect | wake) * P(wake) / P(detect)
    p_detect = p_wake * p_detect_g_wake + (1 - p_wake) * p_detect_g_nowake
    p_wake_g_detect = p_detect_g_wake * p_wake / p_detect
    return p_wake_g_detect

wake_word_detector()
Out[22]:
0.09009009009009011
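
Plugging the default values into Bayes' rule shows why the posterior is so small even though the detector is 99% accurate: the prior probability of a wake word is only 1 in 10,000, so false positives dominate.

$$ P(W=1 \mid D=1) = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.001 \times 0.9999} = \frac{0.000099}{0.0010989} \approx 0.090 $$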

We want to understand what happens if we vary the recall of the detector. The recall (also called sensitivity) is defined as $P(D=1|W=1)$, the probability that the detector detects a wake word when a wake word was actually spoken. We also vary the false positive rate, defined as $P(D=1|W=0)$, the probability that the detector detects a wake word when no wake word was spoken.

$$ P(W = 1 \mid D=1) = \frac{\text{(Recall)}\, P(W=1)}{\text{(Recall)}\, P(W=1) + \text{(False Positive Rate)}\, P(W=0)} $$

We can see from this equation that even as recall approaches 1, a high false positive rate keeps $P(W=1 \mid D=1)$ low: because the prior $P(W=1)$ is tiny, the false positive term $\text{(False Positive Rate)}\,P(W=0)$ can still dominate the denominator.
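
For example (an illustrative check using the function above), a detector with perfect recall but a false positive rate of $10^{-3}$ still yields a posterior of only about 9%:

# Perfect recall, but a 1-in-1,000 false positive rate: with the default
# prior P(W=1) = 0.0001, false alarms still dominate.
wake_word_detector(p_detect_g_wake=1.0, p_detect_g_nowake=0.001)  # ≈ 0.091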

In [23]:
p_detect_g_nowake = np.logspace(-6, -4, 100)
p_wake_g_detect = wake_word_detector(p_detect_g_nowake=p_detect_g_nowake, 
                                     p_detect_g_wake=1.0)
fig = px.line(
    x=p_detect_g_nowake,
    y=p_wake_g_detect,
    title="P(Wake | Detect) vs False Positive Rate for Perfect Recall",
    labels={
        "x": "P(Detect | No Wake) (False Positive Rate)",
        "y": "P(Wake | Detect)"
    },
    log_x=True
)
fig.update_layout( xaxis_exponentformat='power')
#fig.write_image("images/wake_word_detector_fpr.pdf", scale=2, height=500, width=700)
fig
In [24]:
p_detect_g_wake = np.logspace(-0.8, 0, 100)
p_wake_g_detect = wake_word_detector(p_detect_g_wake=p_detect_g_wake,
                                     p_detect_g_nowake=0.0001)
fig = px.line(
    x=p_detect_g_wake,
    y=p_wake_g_detect,
    title="P(Wake | Detect) vs Recall (Sensitivity) for FPR=0.001",
    labels={
        "x": "P(Detect | Wake) Recall",
        "y": "P(Wake | Detect)"
    },
    log_x=True
)
fig.update_layout( xaxis_exponentformat='power')
#fig.write_image("images/wake_word_detector_recall.pdf", scale=2, height=500, width=700)
fig





Return to Lecture