Lecture 03 – CS 189, Fall 2025

In this lecture, we will cover the basic machine learning lifecycle using a hands-on approach with scikit-learn. We will work through each stage of the machine learning lifecycle while also introducing standard machine learning tools and techniques. The machine learning lifecycle consists of four parts:

[Figure: diagram of the machine learning lifecycle]
In [1]:
import numpy as np 
import pandas as pd
import plotly.express as px
In [2]:
# Uncomment for HTML Export
import plotly.io as pio
pio.renderers.default = "notebook_connected"

The Learning Problem

Suppose we are launching a new fashion trading website where people can upload pictures of clothing they want to trade. We want to help posters identify the clothing in the images. Suppose we have some training data consisting of clothing pictures with labels describing the type of clothing (e.g., "dress", "shirt", "pants").

What data do we have?

  • Labeled training examples.

What do we want to predict?

  • The category label of the clothing in the images. We may want to predict other things as well.

How would we evaluate success?

  • We likely want to measure our prediction accuracy.
  • We may eventually want to improve accuracy on certain high-value classes.

Looking at the Data

A key step that is often overlooked in machine learning projects is understanding the data. This includes exploring the dataset, visualizing the data, and gaining insights into its structure and characteristics.

We will be using the Fashion-MNIST dataset, which is a (now) classic dataset of 28x28 grayscale images of articles of clothing.

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. https://github.com/zalandoresearch/fashion-mnist

This is an alternative to the even more classic MNIST digits dataset, which contains images of handwritten digits.

The following block of code will download the Fashion-MNIST dataset and load it into memory.

In [3]:
# Fetch the Data
import torchvision
data = torchvision.datasets.FashionMNIST(root='data', train=True, download=True)

# Preprocess the data into numpy arrays
images = data.data.numpy().astype(float)
targets = data.targets.numpy() # integer encoding of class labels
class_dict = {i:class_name for i,class_name in enumerate(data.classes)}
labels = np.array([class_dict[t] for t in targets]) # raw class labels
n = len(images)

print("Loaded FashionMNIST dataset with {} samples.".format(n))
print("Classes: {}".format(class_dict))
print("Image shape: {}".format(images[0].shape))
print("Image dtype: {}".format(images[0].dtype))
print("Image 0:\n", images[0])
Loaded FashionMNIST dataset with 60000 samples.
Classes: {0: 'T-shirt/top', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}
Image shape: (28, 28)
Image dtype: float64
Image 0:
 [[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   1.   0.
    0.  13.  73.   0.   0.   1.   4.   0.   0.   0.   0.   1.   1.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   3.   0.
   36. 136. 127.  62.  54.   0.   0.   0.   1.   3.   4.   0.   0.   3.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   6.   0.
  102. 204. 176. 134. 144. 123.  23.   0.   0.   0.   0.  12.  10.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  155. 236. 207. 178. 107. 156. 161. 109.  64.  23.  77. 130.  72.  15.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   1.   0.  69.
  207. 223. 218. 216. 216. 163. 127. 121. 122. 146. 141.  88. 172.  66.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   1.   1.   1.   0. 200.
  232. 232. 233. 229. 223. 223. 215. 213. 164. 127. 123. 196. 229.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0. 183.
  225. 216. 223. 228. 235. 227. 224. 222. 224. 221. 223. 245. 173.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0. 193.
  228. 218. 213. 198. 180. 212. 210. 211. 213. 223. 220. 243. 202.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   1.   3.   0.  12. 219.
  220. 212. 218. 192. 169. 227. 208. 218. 224. 212. 226. 197. 209.  52.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   6.   0.  99. 244.
  222. 220. 218. 203. 198. 221. 215. 213. 222. 220. 245. 119. 167.  56.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   0.   0.  55. 236.
  228. 230. 228. 240. 232. 213. 218. 223. 234. 217. 217. 209.  92.   0.]
 [  0.   0.   1.   4.   6.   7.   2.   0.   0.   0.   0.   0. 237. 226.
  217. 223. 222. 219. 222. 221. 216. 223. 229. 215. 218. 255.  77.   0.]
 [  0.   3.   0.   0.   0.   0.   0.   0.   0.  62. 145. 204. 228. 207.
  213. 221. 218. 208. 211. 218. 224. 223. 219. 215. 224. 244. 159.   0.]
 [  0.   0.   0.   0.  18.  44.  82. 107. 189. 228. 220. 222. 217. 226.
  200. 205. 211. 230. 224. 234. 176. 188. 250. 248. 233. 238. 215.   0.]
 [  0.  57. 187. 208. 224. 221. 224. 208. 204. 214. 208. 209. 200. 159.
  245. 193. 206. 223. 255. 255. 221. 234. 221. 211. 220. 232. 246.   0.]
 [  3. 202. 228. 224. 221. 211. 211. 214. 205. 205. 205. 220. 240.  80.
  150. 255. 229. 221. 188. 154. 191. 210. 204. 209. 222. 228. 225.   0.]
 [ 98. 233. 198. 210. 222. 229. 229. 234. 249. 220. 194. 215. 217. 241.
   65.  73. 106. 117. 168. 219. 221. 215. 217. 223. 223. 224. 229.  29.]
 [ 75. 204. 212. 204. 193. 205. 211. 225. 216. 185. 197. 206. 198. 213.
  240. 195. 227. 245. 239. 223. 218. 212. 209. 222. 220. 221. 230.  67.]
 [ 48. 203. 183. 194. 213. 197. 185. 190. 194. 192. 202. 214. 219. 221.
  220. 236. 225. 216. 199. 206. 186. 181. 177. 172. 181. 205. 206. 115.]
 [  0. 122. 219. 193. 179. 171. 183. 196. 204. 210. 213. 207. 211. 210.
  200. 196. 194. 191. 195. 191. 198. 192. 176. 156. 167. 177. 210.  92.]
 [  0.   0.  74. 189. 212. 191. 175. 172. 175. 181. 185. 188. 189. 188.
  193. 198. 204. 209. 210. 210. 211. 188. 188. 194. 192. 216. 170.   0.]
 [  2.   0.   0.   0.  66. 200. 222. 237. 239. 242. 246. 243. 244. 221.
  220. 193. 191. 179. 182. 182. 181. 176. 166. 168.  99.  58.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.  40.  61.  44.  72.  41.  35.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]]

Understanding The Raw Features (Images)

How much data do we have?

In [4]:
images.shape
Out[4]:
(60000, 28, 28)

The images are stored in a 60000 by 28 by 28 tensor. This means we have 60000 images, each of which is 28 pixels wide and 28 pixels tall. Each pixel is represented by a single value. What are those values?

In [5]:
counts, bins =  np.histogram(images, bins=255)
fig_pixels = px.bar(x=bins[1:], y=counts,  title="Pixel value distribution", 
       log_y=True, labels={"x":"Pixel value", "y":"Count"})
fig_pixels

It is important to learn how to visualize and work with data. Here we use Plotly Express to visualize a single image. Note that we use the 'gray_r' color map, a reversed grayscale map in which 0 renders as white and larger pixel values render darker.

In [6]:
px.imshow(images[0], color_continuous_scale='gray_r') 

The following snippet of code visualizes multiple images in a grid. You are not required to understand this code, but it is useful to know how to visualize images in Python.

In [7]:
def show_images(images, max_images=40, ncols=5, labels = None):
    """Visualize a subset of images from the dataset.
    Args:
        images (np.ndarray): Array of images to visualize [img,row,col].
        max_images (int): Maximum number of images to display.
        ncols (int): Number of columns in the grid.
        labels (np.ndarray, optional): Labels for the images, used for facet titles.
    Returns:
        plotly.graph_objects.Figure: A Plotly figure object containing the images.
    """
    n = min(images.shape[0], max_images) # number of images to show
    px_height = 220 # height of each image in pixels
    fig = px.imshow(images[:n, :, :], color_continuous_scale='gray_r', 
                    facet_col = 0, facet_col_wrap=ncols,
                    height = px_height * int(np.ceil(n/ncols)))
    fig.update_layout(coloraxis_showscale=False)
    if labels is not None:
        # Extract the facet number and replace with the label.
        fig.for_each_annotation(lambda a: a.update(text=labels[int(a.text.split("=")[-1])]))
    return fig
In [8]:
show_images(images, 20, labels=labels)

Let's look at a few examples of each class. Here we use Pandas to group the images by label and sample 2 from each class. You are not required to know Pandas (we won't test you on it), but it is a useful library for data manipulation and analysis, and we will use it often in this course.

In [9]:
idx = (
    pd.DataFrame({"labels": labels})
      .groupby("labels", as_index=False)
      .sample(2)
      .index
      .to_numpy())
show_images(images[idx,:,:], labels=labels[idx])

Understanding the Labels

Now let's examine the labels. Are they discrete? What is the distribution? Are there missing values or errors?

In the Fashion-MNIST dataset, each image is labeled with a class corresponding to a type of clothing. There are 10 classes in total.

However, it is also important to understand the distribution of labels in the dataset. This can help us identify potential issues such as class imbalance, where some classes have significantly more samples than others.

In [10]:
labels
Out[10]:
array(['Ankle boot', 'T-shirt/top', 'T-shirt/top', ..., 'Dress',
       'T-shirt/top', 'Sandal'], shape=(60000,), dtype='<U11')

The labels are strings (discrete).

What is the distribution of labels?

In [11]:
px.histogram(labels, title="Label distribution")   

Each type of clothing appears in equal proportion. We don't have any missing values, since every label is one of the 10 classes (no blank or "missing" label values).

Most real-world datasets aren't this balanced or clean. In fact, it's common to see a long-tail distribution, where a few classes are very common and many classes are rare.
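As a quick numeric check of the class balance (a small sketch, equivalent to the histogram above), we can count the labels directly; in the Fashion-MNIST training split, each class appears 6,000 times.

In [ ]:
# Count how many images carry each label.
classes, counts = np.unique(labels, return_counts=True)
print(dict(zip(classes, counts)))  # expect 6000 per class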

Reviewing the Learning Setting

Having examined the data we can see that we have a large collection of pairs of features and categorical labels (with 10 classes). This is a supervised learning problem, where the goal is to learn a mapping from the input features (images) to the output labels (categories). Because the labels are discrete this is a classification problem.

It is also worth noting that because the input features are images this is also a computer vision problem. This means that when we get to the model development stage, we will need to consider techniques that are specifically designed for multi-class classification and in particular computer vision.





Return to Slides





Train-Test-Validation Split

We will split the dataset into a training set, a validation set, and a test set. The training set will be used to train the model, while the validation set will be used to tune the model's hyperparameters. The test set will be used to evaluate the model's performance.

Technically, the Fashion-MNIST dataset has a separate test set, but we will demonstrate how to split data in general.

In [12]:
# use sklearn to construct a train test split
from sklearn.model_selection import train_test_split
In [13]:
# Construct the train - test split
images_tr, images_te, labels_tr, labels_te = train_test_split(
    images, labels, test_size=0.2, random_state=42)

# Construct the train - validation split
images_tr, images_val, labels_tr, labels_val = train_test_split(
    images_tr, labels_tr, test_size=0.2, random_state=42)

print("images_tr shape:", images_tr.shape)
print("images_val shape:", images_val.shape)
print("images_te shape:", images_te.shape)
images_tr shape: (38400, 28, 28)
images_val shape: (9600, 28, 28)
images_te shape: (12000, 28, 28)
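Fashion-MNIST's classes are balanced, so a plain random split works fine here. For imbalanced datasets, it is common to pass train_test_split's stratify argument so that each split preserves the label proportions. A minimal sketch (the _s variable names are just for illustration):

In [ ]:
# Hypothetical stratified variant of the split above (not used in this lecture):
images_tr_s, images_te_s, labels_tr_s, labels_te_s = train_test_split(
    images, labels, test_size=0.2, random_state=42, stratify=labels)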





Return to Slides





Model Design

We have already loaded the Fashion-MNIST dataset in the previous section. Now, we will preprocess the data to make it suitable for training a classification model.

Feature Engineering

Feature Engineering is the process of transforming the raw features into a representation that can be used effectively by machine learning techniques. This will often involve transforming data into vector representations.

Naive Image Featurization

For this example we are working with low-resolution grayscale image data. Here we adopt a very simple featurization approach -- flatten the image. We will convert the 28x28 pixel images into 784-dimensional vectors.

In [14]:
images_tr.shape
Out[14]:
(38400, 28, 28)
In [15]:
def flatten(images):
    return images.reshape(images.shape[0], -1)
In [16]:
X_tr = flatten(images_tr)
In [17]:
X_tr.shape
Out[17]:
(38400, 784)

Standardization

Recall that the pixel intensities are from 0 to 255:

In [18]:
fig_pixels

Let's standardize the pixel intensities to have zero mean and unit variance: each pixel value x is mapped to (x - μ)/σ, where μ and σ are that pixel's mean and standard deviation computed over the training set.

Here we use the sklearn StandardScaler

In [19]:
from sklearn.preprocessing import StandardScaler

# 1. Initialize a StandardScaler object
image_scaler = StandardScaler()

# 2. Fit the scaler
image_scaler.fit(flatten(images_tr))
Out[19]:
StandardScaler()

What do the mean and variance images tell us about the dataset?

In [20]:
display(px.imshow(image_scaler.mean_.reshape(28,28), 
                  color_continuous_scale='gray_r', title="Mean image"))
display(px.imshow(image_scaler.var_.reshape(28,28), 
                  color_continuous_scale='gray_r', title="Variance image"))

Let's create a generic featurization function that we can reuse for the other splits of the data. Notice that this function uses the image_scaler that we fit to the training data; fitting the scaler only on the training set prevents information about the validation and test data from leaking into our features.

In [21]:
def featurizer(images):
    flattened = flatten(images)
    return image_scaler.transform(flattened)

X_tr = featurizer(images_tr)

Our new images look similar to the original images but they have been standardized to have zero mean and unit variance. This should help improve the performance of our machine learning models.

In [22]:
show_images(X_tr.reshape(images_tr.shape), max_images=10, labels=labels_tr)
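We can also verify the standardization numerically. A small sketch (note that constant pixels, such as the always-black corners, have zero variance and are simply left at zero by StandardScaler):

In [ ]:
# Sanity check: per-pixel means are ~0, and a central (varying) pixel has std ~1.
print(np.allclose(X_tr.mean(axis=0), 0))  # True, up to floating point error
print(X_tr[:, 28*14 + 14].std())          # ~1.0 for the center pixel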

One-Hot Encoding

We don't need to one-hot encode the features in this dataset so we will briefly demonstrate on another dataset:

In [23]:
df = pd.DataFrame({"color": ["red", "green", "red", "blue", "blue", "yellow", ""]})
df
Out[23]:
color
0 red
1 green
2 red
3 blue
4 blue
5 yellow
6
In [24]:
from sklearn.preprocessing import OneHotEncoder
# 1. Initialize a OneHotEncoder object
ohe = OneHotEncoder()
# 2. Fit the encoder
ohe.fit(df[["color"]])
Out[24]:
OneHotEncoder()
In [25]:
ohe.categories_
Out[25]:
[array(['', 'blue', 'green', 'red', 'yellow'], dtype=object)]
In [26]:
ohe.transform(df[["color"]]).toarray()
Out[26]:
array([[0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.]])

Each row has a single 1 in the column corresponding to that row's category; note that the empty string was treated as a category of its own.

Bag of Words

We also don't need the bag-of-words representation for this dataset, but we will demonstrate it briefly using another dataset.

In [27]:
df['text'] = [
    "Red is a color.",
    "Green is for green food.",
    "Red reminds me of red food.",
    "Blue is my favorite color!",
    "Blue is for Cal!",
    "Yellow is also for Cal!",
    "I forgot to write something."
]
In [28]:
from sklearn.feature_extraction.text import CountVectorizer

# 1. Initialize a CountVectorizer object
vectorizer = CountVectorizer()

# 2. Fit the vectorizer
vectorizer.fit(df["text"])
Out[28]:
CountVectorizer()
In [29]:
pd.DataFrame(vectorizer.transform(df["text"]).toarray(), 
             columns=vectorizer.get_feature_names_out())
Out[29]:
also blue cal color favorite food for forgot green is me my of red reminds something to write yellow
0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 1 0 2 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 1 0 1 2 1 0 0 0 0
3 0 1 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0
4 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
5 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1
6 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0
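Once fit, the vectorizer reuses its learned vocabulary when transforming new text; words it has never seen are simply dropped. A quick sketch (the example sentence is made up):

In [ ]:
# "purple" is out of vocabulary, so only is/my/favorite/color are counted.
new_counts = vectorizer.transform(["Purple is my favorite color!"])
pd.DataFrame(new_counts.toarray(), columns=vectorizer.get_feature_names_out())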





Return to Slides





Modeling and Optimization

In this section, we will go through the modeling process. We will focus on developing a classification model.

Training (Fitting) a Classifier

We will start with the most basic classifier, the logistic regression model, to demonstrate the classification workflow.

Logistic regression is a linear model that is commonly used for binary and multi-class classification tasks. It is also a good starting point for understanding more complex deep learning models that will be covered later in the course.

Here we use sklearn to fit a logistic regression model to the training data. The LogisticRegression class from sklearn.linear_model is used to create an instance of the model.

The fit method is called on the model instance, passing in the training data and labels. This trains the model to learn the relationship between the input features (flattened images) and the target labels (clothing categories). In scikit-learn, the fit method is used to train any model on the provided data.

In [30]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X=X_tr, y=labels_tr)
/opt/homebrew/Caskroom/mambaforge/base/envs/py_3_11/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:470: ConvergenceWarning:

lbfgs failed to converge after 100 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[30]:
LogisticRegression()

Notice that we get a warning that:

lbfgs failed to converge after 100 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

This warning indicates that the optimization algorithm used by the logistic regression model did not converge to a solution within the default number of iterations. This can happen if the model is complex or if the data is not well-scaled.

We can change the optimization algorithm used by the logistic regression model. The default algorithm is lbfgs, which is a quasi-Newton method. Other options include newton-cg, sag, and saga. Each of these algorithms has its own strengths and weaknesses, and the choice of algorithm can affect the convergence speed and final performance of the model.

In this class, we will explore variations of stochastic gradient descent like saga. Let's try the saga algorithm instead of lbfgs to see if it converges faster. Here we also loosen the tol (convergence tolerance) parameter so that training finishes quickly.

In [31]:
lr_model = LogisticRegression(tol=0.05, solver='saga', random_state=42)
lr_model.fit(X=X_tr, y=labels_tr)
Out[31]:
LogisticRegression(random_state=42, solver='saga', tol=0.05)

Parameters

The parameters of a model are the internal variables that the model learns during the training process. For example, in logistic regression, the parameters are the weights assigned to each feature. These weights are adjusted during training to minimize the loss function, which measures how well the model's predictions match the actual labels.

In [32]:
print("model.coef_.shape:", lr_model.coef_.shape)
print("model.intercept_.shape:", lr_model.intercept_.shape)
print(lr_model.coef_)
print(lr_model.intercept_)
model.coef_.shape: (10, 784)
model.intercept_.shape: (10,)
[[ 9.90445492e-05 -6.36010321e-03 -3.85913757e-03 ...  1.49595978e-03
   4.32587736e-03  2.73064427e-03]
 [-1.89699677e-03  6.63981322e-03 -7.47402405e-03 ... -1.80362390e-02
  -1.84524048e-02 -1.52846533e-02]
 [-7.43264082e-04 -4.46484428e-03  1.60046019e-03 ...  5.50850058e-03
   1.38217724e-02  3.68772263e-03]
 ...
 [ 1.77951137e-04 -1.08447436e-03 -8.59128688e-04 ... -5.03797045e-03
  -1.78474348e-03 -2.49799352e-03]
 [-8.24675871e-03 -2.41366751e-03 -7.31344979e-03 ... -1.80925346e-02
  -1.21750176e-02 -1.40790500e-03]
 [-1.69500201e-04  7.44355526e-05 -5.08764217e-03 ... -5.67020645e-03
  -1.32036337e-03 -1.92847951e-03]]
[-0.03276315  0.01498156  0.02486191 -0.00368395  0.04477243 -0.06132416
  0.09267479 -0.03592529 -0.00330317 -0.04029096]

We can also visualize these coefficients. This can help us understand which pixels are most important for each class. We will learn more about this model in the future. (You don't have to understand these details now.)

In [33]:
coeffs = lr_model.coef_
show_images(coeffs.reshape(10, 28, 28), labels=lr_model.classes_)

Neural Networks

Neural networks are often the model of choice for image classification tasks. They can learn complex patterns in the data and often outperform simpler models like logistic regression. However, they require a well-chosen architecture, significant training data, and substantial computational resources.

Here we will try a simple neural network with two hidden layers.

In [34]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50), 
    max_iter=100, tol=1e-3, random_state=42)

mlp.fit(X=X_tr, y=labels_tr)
Out[34]:
MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=100, random_state=42,
              tol=0.001)
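As a quick sanity check, we can compute the network's accuracy on the validation split and later compare it against the logistic regression model. A small sketch (output not shown):

In [ ]:
# Hypothetical quick check: validation accuracy of the trained MLP.
print(np.mean(mlp.predict(featurizer(images_val)) == labels_val))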

Hyperparameters

The hyperparameters are the arguments that are set before the training process begins. These include the choice of optimization algorithm, the learning rate, and the number of iterations, among others. Hyperparameters are typically tuned using techniques like cross-validation to find the best combination for a given dataset.

Confusingly, these hyperparameters are often referred to as "parameters" in the context of machine learning libraries like sklearn. For example, the LogisticRegression class has hyperparameters like solver, C, and max_iter that can be adjusted to improve model performance.
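scikit-learn can automate this kind of tuning with tools like GridSearchCV, which cross-validates every combination in a parameter grid. A minimal sketch (the grid values are illustrative, and we fit on a small subset for speed):

In [ ]:
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the regularization strength C.
search = GridSearchCV(
    LogisticRegression(tol=1e-3, random_state=42),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=3)
search.fit(X_tr[:1000], labels_tr[:1000])
print(search.best_params_, search.best_score_)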

In [35]:
lr_model
Out[35]:
LogisticRegression(random_state=42, solver='saga', tol=0.05)

To evaluate the model we will use the validation dataset.

In [36]:
X_val = featurizer(images_val)

Let's try tuning the regularization parameter C. To make this process more illustrative, we will work with a smaller subset of the training data (n=1000). This will allow us to better demonstrate underfitting and overfitting and significantly speed up the training process.

In [37]:
n_small = 1000
X_tr_small = X_tr[:n_small,:]
labels_tr_small = labels_tr[:n_small]

In practice, we need to be careful when tuning regularization parameters on a sample of the data: smaller datasets typically need more regularization to prevent overfitting, so the best setting found here may differ from the best setting for the full training set.

In [38]:
from sklearn.metrics import log_loss

C_vals = np.logspace(-5, 1, 20)

logprob_tr = []
logprob_val = []
acc_tr = []
acc_val = []

for C in C_vals:
    print("starting training with C =", C)
    model = LogisticRegression(tol=1e-3, random_state=42, C=C)
    model.fit(X=X_tr_small, y=labels_tr_small)
    
    # compute the average log probability (negative log loss)
    logprob_tr.append(-log_loss(labels_tr_small, model.predict_proba(X_tr_small), labels=model.classes_))
    logprob_val.append(-log_loss(labels_val, model.predict_proba(X_val), labels=model.classes_))

    # compute the accuracy
    acc_tr.append(np.mean(model.predict(X_tr_small) == labels_tr_small))
    acc_val.append(np.mean(model.predict(X_val) == labels_val))
starting training with C = 1e-05
starting training with C = 2.06913808111479e-05
starting training with C = 4.281332398719396e-05
starting training with C = 8.858667904100833e-05
starting training with C = 0.00018329807108324357
starting training with C = 0.000379269019073225
starting training with C = 0.0007847599703514606
starting training with C = 0.001623776739188721
starting training with C = 0.003359818286283781
starting training with C = 0.0069519279617756054
starting training with C = 0.01438449888287663
starting training with C = 0.029763514416313162
starting training with C = 0.06158482110660261
starting training with C = 0.1274274985703132
starting training with C = 0.26366508987303555
starting training with C = 0.5455594781168515
starting training with C = 1.1288378916846884
starting training with C = 2.3357214690901213
starting training with C = 4.832930238571752
starting training with C = 10.0
In [39]:
df_logprob = pd.DataFrame({
    "C_val": C_vals, 
    "Train": logprob_tr, "Validation": logprob_val,
}).set_index("C_val") 

display(
    px.line(df_logprob, 
        labels={"value": "Avg. Log Prob.", "C_val": "Reg. Parameter C"},
        title="LR Classifier Log Prog vs Reg. Parameter",
        markers=True,
        log_x=True,
        width=800, height=500)
)

df_acc = pd.DataFrame({
    "C_val": C_vals, 
    "Train": acc_tr, "Validation": acc_val
}).set_index("C_val") 

display(
    px.line(df_acc, 
        labels={"value": "Accuracy", "C_val": "Reg. Parameter C"},
        title="LR Classifier Accuracy vs Reg. Parameter",
        markers=True,
        log_x=True,
        width=800, height=500
    )
)

Evaluating the Model

After training the model, we can use it to make predictions on new data. The predict method of the trained model is used to generate predictions based on the input features.

Let's return to our logistic regression model.

In [40]:
lr_model.predict(X_tr[:10,:])
Out[40]:
array(['Ankle boot', 'Coat', 'Ankle boot', 'T-shirt/top', 'Pullover',
       'Ankle boot', 'Dress', 'T-shirt/top', 'Coat', 'Sandal'],
      dtype='<U11')

Do you agree with the predictions? Let's visualize the predictions on a few training images.

In [41]:
show_images(images_tr[:10,:].reshape(10, 28, 28),
            labels = lr_model.predict(X_tr[:10,:]))

Now let's see what the correct labels are for these images.

In [42]:
k = 10
tmp_labels = labels_tr[:k] + " (pred=" + lr_model.predict(X_tr[:k,:]) + ")"
show_images(images_tr[:k,:].reshape(k, 28, 28), labels=tmp_labels)

Predicting Probabilities

Many models can also provide probabilities for each class using the predict_proba method. This is useful for understanding the model's confidence in its predictions. In this class, we will often use a probabilistic framing, where we interpret the output of the model as probabilities of each class.

In [43]:
lr_model.predict_proba(X_tr[:5,:])
Out[43]:
array([[9.51642010e-01, 1.25766964e-03, 8.94327054e-07, 3.11486837e-07,
        2.59469407e-07, 7.35306052e-03, 1.78581053e-07, 3.97435123e-02,
        5.22696365e-07, 1.58056694e-06],
       [1.05752334e-04, 2.47086595e-03, 8.32072100e-01, 3.44094506e-03,
        4.58904789e-02, 4.24209010e-05, 1.13951633e-01, 1.94167275e-05,
        1.55456636e-03, 4.51820449e-04],
       [9.30078717e-01, 1.40564685e-03, 1.91994009e-06, 6.84403412e-07,
        8.73522603e-07, 9.24863395e-03, 8.71969951e-07, 5.92557003e-02,
        1.75317539e-06, 5.19911487e-06],
       [2.05798117e-04, 2.21031435e-05, 6.61559787e-04, 2.37402967e-02,
        2.18046824e-02, 2.07214721e-05, 3.03973688e-02, 1.53740432e-04,
        9.18742226e-01, 4.25150290e-03],
       [2.23839377e-04, 9.60149725e-02, 1.08016253e-01, 2.85455925e-04,
        7.19252331e-01, 1.85483941e-03, 7.24570316e-02, 3.23939232e-04,
        1.53916187e-03, 3.21759981e-05]])
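For a linear model like this one, predict_proba is just a softmax over the linear scores computed from the learned parameters. A small sketch verifying this (assuming the multinomial/softmax formulation that scikit-learn uses for multi-class logistic regression):

In [ ]:
# Recover predict_proba from coef_ and intercept_ via softmax.
logits = X_tr[:5] @ lr_model.coef_.T + lr_model.intercept_   # (5, 10) class scores
logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(np.allclose(probs, lr_model.predict_proba(X_tr[:5])))  # expect True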

We can visualize these probabilities for the same images we predicted earlier.

In [44]:
k = 10
df = pd.DataFrame(lr_model.predict_proba(X_tr[:k,:]), columns=lr_model.classes_)
bars = px.bar(df, barmode='stack', orientation='v')
bars.update_layout(xaxis_tickmode='array', xaxis_tickvals=np.arange(k))
display(bars)
tmp_labels = labels_tr[:k] + " (pred=" + lr_model.predict(X_tr[:k,:]) + ") img: " + np.arange(k).astype(str)
show_images(images_tr[:k,:].reshape(k, 28, 28), labels=tmp_labels)

Accuracy Metrics and Test Performance

After training the model, we often want to evaluate the model. There are many ways to evaluate a model, and the best method depends on the task and the data. For classification tasks, we often use metrics like accuracy, precision, recall, and F1-score. Let's start with accuracy.

Accuracy is the simplest metric: it measures the proportion of correct predictions out of the total number of predictions.

Let's start by computing the accuracy of our model on the training set.

In [45]:
np.mean(lr_model.predict(X_tr) == labels_tr, axis=0)
Out[45]:
np.float64(0.8488541666666667)

One issue with evaluating on the training set is that the model may have overfit to the training data, meaning it performs well on the training set but poorly on unseen data. Intuitively, this is like practicing on a set of questions and then getting those same questions right on a test, but not being able to answer new questions that are similar but not identical.

To assess the model's performance on unseen data, we will evaluate it on the test set. Recall that the test set is a separate portion of the dataset that was not used during training.

In [46]:
X_te = featurizer(images_te)
In [47]:
np.mean(lr_model.predict(X_te) == labels_te, axis=0)
Out[47]:
np.float64(0.84075)
In [48]:
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(labels_tr, lr_model.predict(X_tr))
val_acc = accuracy_score(labels_val, lr_model.predict(X_val))
test_acc = accuracy_score(labels_te, lr_model.predict(X_te))

print("Train accuracy:", train_acc)
print("Validation accuracy:", val_acc)
print("Test accuracy:", test_acc)
Train accuracy: 0.8488541666666667
Validation accuracy: 0.849375
Test accuracy: 0.84075

The test accuracy is slightly lower than the training accuracy, which is expected. However, the difference is not too large, indicating that the model has not overfit significantly.

Is this accuracy good? What would a random guess yield?

A common way to evaluate a classification model is to compare its accuracy against a baseline. Perhaps the simplest baseline is random guessing, where we randomly assign a class to each image.

What accuracy does random guessing yield?

This depends on how frequently each class appears in the dataset. With 10 balanced classes, random guessing should be correct about 10% of the time.

In [49]:
np.random.seed(42)
print("Model Accuracy:", np.mean(lr_model.predict(X_val) == labels_val, axis=0))
print("Random Guess Accuracy:", 
      np.mean(np.random.choice(lr_model.classes_, size=len(labels_te)) == labels_te, axis=0))
Model Accuracy: 0.849375
Random Guess Accuracy: 0.10108333333333333

Does our model struggle with any particular class?

In [50]:
isWrong = lr_model.predict(X_val) != labels_val
# histogram of the incorrect predictions, broken down by true label
fig = px.histogram(labels_val[isWrong], histnorm='percent')
fig.update_layout(xaxis_title="Label", 
                  yaxis_title="Percentage of Incorrect Predictions")
fig.update_xaxes(categoryorder="total descending")

For classification tasks, we often want to look at more than just accuracy. We can use a confusion matrix to visualize the performance of the model across different classes. The confusion matrix shows the number of correct and incorrect predictions for each class.

In [51]:
from sklearn.metrics import confusion_matrix

fig = px.imshow(
    confusion_matrix(labels_val, lr_model.predict(X_val)), 
    color_continuous_scale='Blues'
    )
fig.update_layout(
        xaxis_title="Predicted Label",
        yaxis_title="True Label",
        coloraxis_showscale=False,
        xaxis=dict(tickmode='array', tickvals=np.arange(len(lr_model.classes_)), ticktext=lr_model.classes_),
        yaxis=dict(tickmode='array', tickvals=np.arange(len(lr_model.classes_)), ticktext=lr_model.classes_)
    )   
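Beyond the confusion matrix, scikit-learn's classification_report summarizes the per-class precision, recall, and F1-score mentioned earlier. A quick sketch on the validation set:

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(labels_val, lr_model.predict(X_val)))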

Last Thoughts

In the homework, you will have a chance to work with this data and use scikit-learn in more depth. We recommend reading the documentation and tutorials on the scikit-learn website as we go through the course. They are a great resource for understanding the library's functions and capabilities, as well as the underlying machine learning concepts.