\documentclass{article}
%% Import any packages here
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage[shortlabels]{enumitem}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{fancyhdr}
\usepackage[hidelinks]{hyperref}
%% Define any custom commands/operators/environments/etc here
\newcommand{\abs}[1]{\left|#1\right|}
\newcommand{\mat}[1]{\mathbf{#1}}
\renewcommand{\vec}[1]{\boldsymbol{\mathbf{#1}}}
\newcommand{\Question}[1]{\Large \section{ #1 } \normalsize}
\DeclareMathOperator{\cov}{\mathrm{Cov}}
\DeclareMathOperator{\rank}{\mathrm{rank}}
\DeclareMathOperator{\Ima}{\mathrm{Im}}
\DeclareMathOperator{\trac}{\mathrm{trace}}
\DeclareMathOperator{\R}{\mathbb{R}}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newenvironment{solution}{\color{blue} \smallskip \textbf{Solution:}}{}
\newtheorem*{theorem}{Theorem}
\paperwidth=8.5in
\paperheight=11in
\textheight=9in
\textwidth=7in
\hoffset=-0.25in
\voffset=-1in
\oddsidemargin=0in
\evensidemargin=0in
\parindent=0pt
\parskip=5pt
\itemsep=-1pt
\floatsep=9pt plus 2pt minus 3pt
\intextsep=9pt plus 2pt minus 3pt
\textfloatsep=9pt plus 2pt minus 3pt
\renewcommand{\baselinestretch}{1.1}
%% Page heading
\pagestyle{fancy}
\fancyhf{}
\setlength{\headheight}{30pt}
\lhead{\coursenumber, \semester}
\chead{\title}
\rhead{\name}
\lfoot{\class}
\rfoot{\thepage}
\def\coursenumber{CS 189/289A}
\def\class{Introduction to Machine Learning}
\def\semester{Fall 2024}
\def\name{Your Name} % Replace with your name
\def\title{HW 2}
\begin{document}
\fontsize{12}{15}\selectfont
\Question{Multivariate Gaussians: A review}
Multivariate Gaussian distributions crop up everywhere in machine learning, from priors on model parameters to assumptions on noise distributions.
Being able to manipulate multivariate Gaussians also becomes important for analyzing correlations in data and preprocessing it for better regression and classification.
We want to make sure to first cover the MVG fundamentals here.
Note that the probability density function of a non-degenerate (i.e. the covariance matrix is positive definite and, thus, invertible) multivariate Gaussian RV with mean vector, $\vec{\mu} \in \R^2$, and covariance matrix, $\Sigma \in \R^{2 \times 2}$, is:
\[f(\vec{z}) = \frac{1}{\sqrt{(2\pi)^2 \abs{\Sigma}}} \exp\left(-\frac{1}{2}(\vec{z} - \vec{\mu})^\top \Sigma^{-1} (\vec{z} - \vec{\mu})\right)\]
\begin{enumerate}[(a)]
\item Consider a two dimensional, zero mean random variable $Z = \begin{bmatrix} Z_1 & Z_2 \end{bmatrix}^\top \in \mathbb{R}^2$.
In order for the random variable to be jointly Gaussian, a necessary and sufficient condition which we call the \textit{first characterization} is that
\begin{itemize}
\item $Z_1$ and $Z_2$ are each marginally Gaussian, and
\item $Z_1 | Z_2 = z$ is Gaussian, and $Z_2 | Z_1 = z$ is Gaussian.
\end{itemize}
A \textit{second characterization} of a jointly Gaussian zero mean RV $Z \in \mathbb{R}^2$ is that it can be written as $Z = A X$, where $X \in \mathbb{R}^2$ is a collection of i.i.d. standard normal RVs and $A \in \mathbb{R}^{2 \times 2}$ is a matrix.
Let $X_1$ and $X_2$ be i.i.d. standard normal RVs. Let $U$ denote a binary random variable uniformly that is equal to $1$ with probability $\frac{1}{2}$ and $-1$ with probability $\frac{1}{2}$, independent of everything else.
For each of the below subproblems, complete the following \emph{two} steps:
(1) Using one of the characterizations given above, determine whether the RVs are jointly Gaussian. If using the second characterization, clearly specify the $A$ matrix.
(2) Calculate the covariance matrix of $Z$ (regardless of whether the RVs are jointly Gaussian or not).
\begin{enumerate}[(i.)]
\item $Z_1 = X_1$ and $Z_2 = X_2$.
\item $Z_1 = X_1$ and $Z_2 = X_1 + 2X_2$. If using the first characterization, assume that you already know $(Z_1 | Z_2 = z)$ is Gaussian. % NEW
\item $Z_1 = X_1$ and $Z_2 = -X_1$.
\item $Z_1 = X_1$ and $Z_2 = U X_1$.
\end{enumerate}
\begin{solution}
TODO
\end{solution}
\newpage
\item Show that two Gaussian random variables can be uncorrelated, but not independent (\textit{Hint: use one of the examples in part (a)}). On the other hand, show that two uncorrelated, jointly Gaussian RVs are independent.
\begin{solution}
TODO
\end{solution}
\newpage
\item With the setup in (a), let $Z = V X$, where $V \in \mathbb{R}^{2 \times 2}$, and $Z, X \in \mathbb{R}^2$. What is the covariance matrix $\Sigma_Z$? If $X$ is not a multivariate Gaussian but has the identity matrix $I\in\mathbb R^{2\times 2}$ as its covariance matrix, is your computed $\Sigma_Z$ still the covariance of $Z$?
\begin{solution}
TODO
\end{solution}
\newpage
\item Given a jointly Gaussian zero mean RV $Z = [Z_1\; Z_2]^\top\in \mathbb{R}^2$ with covariance matrix $\Sigma_Z = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12} & \Sigma_{22}\end{bmatrix}$, derive the conditional distribution of $(Z_1 | Z_2 = z)$.
Hint: The following identity may be useful
\[
\begin{bmatrix} a & b \\ b & c\end{bmatrix}^{-1} = \begin{bmatrix} 1 & 0 \\ -\frac{b}{c} & 1\end{bmatrix} \begin{bmatrix} \left(a - \frac{b^2}{c}\right)^{-1} & 0 \\ 0 & \frac{1}{c} \end{bmatrix}
\begin{bmatrix} 1 & -\frac{b}{c} \\ 0 & 1\end{bmatrix}
\]
\begin{solution}
TODO
\end{solution}
\end{enumerate}
\newpage
\Question{Projections and Linear Regression}
We are given $X \in \R^{n \times d}$ where $n > d$ and $\mathrm{rank}(X) = d$. We are also given a vector $y \in \R^n$. Define the orthogonal projection of $y$ onto $\mathrm{range}(X)$ as $P_{\text{range}(X)}(y)$.
\paragraph{\textit{Background on orthogonal projections}} For any finite-dimensional subspace $W$ (here, $\mathrm{range}(X)$) of a vector space $V$ (here, $\mathbb R^n$), any vector $v\in V$ can be decomposed as
\begin{align*}
v = w + u,\qquad w\in W,\quad u \in W^\perp,
\end{align*}
where $W^\perp$ is the orthogonal complement of $W$. Furthermore, this decomposition is unique: if $v=w'+u'$ where $w'\in W,\; u'\in W^\perp$, then $w'=w$ and $u'=u$. These two facts allow us to define $P_W$, the orthogonal projection operator onto $W$. Given a vector $v$ with decomposition $v=w+u$, we define
\begin{align*}
P_W(v) = w.
\end{align*}
It can also be shown using these two facts that $P_W$ is linear. For more information on orthogonal projections, see \url{https://gwthomas.github.io/docs/math4ml.pdf}.
\begin{enumerate}[(a)]
\item Prove that $\displaystyle \ P_{\text{range}(X)}(y) = \argmin_{w \in \mathrm{range}(X)} \|y - w\|_2^2$.
\begin{solution}
TODO
\end{solution}
\newpage
\item An orthogonal projection is a linear transformation. That is, $P_{\text{range}(X)}(y) = Py$ for some projection matrix $P$.
Specifically, given $1 \le d \le n$, a matrix $P \in \R^{n \times n}$ is said to be a rank-$d$ orthogonal projection matrix if
\begin{itemize}
\item $\rank(P) = d$
\item $P = P^{T}$
\item $P^{2} = P$.
\end{itemize}
Prove that $P$ is a rank-$d$ projection matrix if and only if there exists a $U \in \R^{n \times d}$ such that $P = UU^\top$ and $U^{\top}U = I$.
{\bf Hint} Use the eigendecomposition of $P$ to prove the forward direction.
\begin{solution}
TODO
\end{solution}
\newpage
\item
The Singular Value Decomposition theorem states that we can write any matrix $X$ as
\begin{align*}
X = \sum_{i=1}^{\min\{n,d\}} \sigma_{i} u_i v_i^{\top} = \sum_{i:\sigma_i > 0}\sigma_{i} u_i v_i^{\top}
\end{align*}
where $\sigma_{i} \ge 0$, and $\{u_i\}_{i=1}^n$ and $\{v_i\}_{i=1}^d$ are orthonormal bases for $\mathbb R^n$ and $\mathbb R^d$ respectively. Some of the singular values $\sigma_i$ may equal 0, indicating that the associated left and right singular vectors $u_i$ and $v_i$ do not contribute to the sum, but sometimes it is still convenient to include them in the SVD so we have complete orthonormal bases for $\mathbb R^n$ and $\mathbb R^d$ to work with.
Show that
\begin{enumerate}[(i)]
\item $\{u_i : \sigma_i > 0\}$ is an orthonormal basis for the columnspace of $X$
\item Similarly, $\{v_i : \sigma_i > 0\}$ is an orthonormal basis for the row space of of $X$ \\ \emph{Hint: consider $X^\top$.}
\end{enumerate}
\begin{solution}
TODO
\end{solution}
\newpage
\item Let $X \in \mathbb{R}^{n \times d}$ such that $\rank(X) = d$. Prove that $X(X^{T}X)^{-1}X^{T}$ is a rank-$d$ orthogonal projection matrix.
\emph{Hint}: Consider the SVD decomposition of $X$.
\begin{solution}
TODO
\end{solution}
\newpage
\item
Prove that $X(X^{T}X)^{-1}X^{T}$ is a projection onto range($X$).
\begin{solution}
TODO
\end{solution}
\newpage
\item
Show that $w^{*} = (X^{T}X)^{-1}X^{T}y$ is the solution to the optimization problem
\begin{align*}
\argmin_{w} \|y - Xw\|_{2}^{2}
\end{align*}
using only facts proved in this problem.
\begin{solution}
TODO
\end{solution}
\end{enumerate}
\newpage
\Question{Some MLEs}
For this question, assume you observe $n$ (data point, label) pairs $(x_i, y_i)_{i=1}^n$, with $x_i\in \mathbb R^d$ and $y_i\in\mathbb R$ for all $i=1,\ldots, n$. We denote $X$ as the data matrix containing all the data points and $y$ as the label vector containing all the labels:
\begin{align*}
X = \begin{bmatrix} x_1^\top \\\vdots \\ x_n^\top \end{bmatrix} \in \mathbb R^{n\times d}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n\end{bmatrix}\in\mathbb R^n.
\end{align*}
\begin{enumerate}[(a)]
\item %%%%%%%%%%%%%%%%%%%%%%%%%%% (A)
Ignoring $y$ for now, suppose we model the data points as coming from a $d$-dimensional Gaussian with diagonal covariance:
\begin{align*}
\forall i = 1, \ldots, n, \quad x_i \stackrel{i.i.d.}{\sim} N(\mu, \Sigma); \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \ldots & 0 \\
\vdots & \ddots & \vdots \\ 0 & \ldots & \sigma_d^2\end{bmatrix}.
\end{align*}
If we consider $\mu\in\mathbb R^d$ and $(\sigma_1^2,\ldots, \sigma_d^2)$, where each $\sigma_i^2 > 0$, to be unknown, the parameter space here is $2d$-dimensional. When we refer to $\Sigma$ as a parameter, we are referring to the $d$-tuple $(\sigma_1^2,\ldots, \sigma_d^2)$, but inside a linear algebraic expression, $\Sigma$ denotes the diagonal matrix $\mathrm{diag}(\sigma_1^2,\ldots,\sigma_d^2)$.
Solve the following problems:
\begin{enumerate}[(i)]
\item Prove that log-likelihood $\ell(\mu, \Sigma) = \log p(X \mid \mu, \Sigma)$ is equal to
\begin{align*}
-\frac{n}{2}\left(d \log (2\pi) - \sum_{j = 1}^{d} \log\left(\frac{1}{\sigma_{j}^{2}}\right)\right) - \frac{1}{2} \sum_{i = 1}^{n}(x_{i} - \mu)^{T}\Sigma^{-1}(x_{i} - \mu).
\end{align*}
\item Find the MLE of $\mu$ assuming $\Sigma$ is known.
\item Find the MLE of $\Sigma$ assuming $\mu$ is known. \\
\textit{Hint:} you can re-parameterize $\sigma_j^2$ by defining $v_j=\frac{1}{\sigma_j^2}$
\item Find the joint MLE of $(\mu, \Sigma)$ in terms of the maximum likelihood estimates computed above.
\end{enumerate}
\begin{solution}
TODO
\end{solution}
\newpage
\item %%%%%%%%%%%%%%%%%%%%%%%%%%% (B)
Suppose that we have a training set $\{(x_{i}, y_{i}) \mid i = 1 \ldots n\}$ of $n$ independent examples but in which the residual terms had different variances. That is, we assume
\begin{align*}
y_{i} \sim N(w^{T}x_{i}, \sigma_{i}^{2}).
\end{align*}
Show that the MLE estimate of $w$ can be found by solving the following optimization problem
\begin{align*}
w_{\text{MLE}} = \argmin_{w} \|A(Xw - y)\|_{2}^{2}.
\end{align*}
Clearly state what the matrix $A$ equals.
\begin{solution}
TODO
\end{solution}
\newpage
\item %%%%%%%%%%%%%%%%%%%%%%%%%%% (C)
Consider the Categorical($\theta_{1}, \theta_{2}, \ldots, \theta_{k})$ distribution. Recall, for categorical distributions, there are two constraints on $\theta_{k}$:
\begin{itemize}
\item $\theta_{k} \geq 0$ for all $k$
\item $\sum_{k = 1}^{K} \theta_{k} = 1$
\end{itemize}
The distribution describes a random process that selects one of the $K$ possible categories, with category $k$ being chosen with probability $\theta_{k}$.
Ignoring the data points $X$, suppose that for all $i$ from 1 to $n$, we sample $y_{i}$ from a categorical distribution:
\begin{align*}
y_i \stackrel{i.i.d.}{\sim} \mathrm{Categorical}(\theta_1,\ldots, \theta_K).
\end{align*}
Compute the MLE of $\theta = (\theta_1,\ldots, \theta_K)$. Use the fact that the KL divergence is nonnegative:
\begin{align*}
\mathrm{KL}(\pi\,\|\, \theta) = \sum_{\omega\in\Omega} \pi(\omega)\log\left(\frac{\pi(\omega)}{\theta(\omega)}\right)\geq 0.
\end{align*}
\begin{solution}
TODO
\end{solution}
\newpage
\item %%%%%%%%%%%%%%%%%%%%%%%%%%% (D)
Again consider $X$ fixed. This time, we suppose that each $y_i$ is binary-valued (0 or 1). We choose to model $y$ as
\begin{align*}
y_i \stackrel{ind.}{\sim} \mathrm{Ber}(s(x_i^\top w))\quad \forall i=1,\ldots, n,
\end{align*}
where $s(z) = \frac{1}{1+e^{-z}}$ is the \textit{sigmoid} function and $\mathrm{Ber}(p)$ denotes the Bernoulli distribution which takes value 1 with probability $p$ and 0 with probability $1-p$.
\begin{enumerate}[(i)]
\item Write down the log-likelihood $\ell(w) = \log p(y\,|\,w)$ and show that finding the MLE of $w$ is equivalent to minimizing the cross entropy between $\mathrm{Ber}(y_i)$ and $\mathrm{Ber}(s(x_i^\top w))$ for each $i$:
\begin{equation}
\label{eq:lrce}
\min_{w\in\mathbb R^d} \sum_{i=1}^n H(\mathrm{Ber}(y_i), \mathrm{Ber}(s(x_i^\top w))).
\end{equation}
\textit{Definition of cross entropy: given two discrete probability distributions $\pi:\Omega\to[0,1]$ and $\theta:\Omega\to[0,1]$ on some outcome space $\Omega$, we define the cross entropy between $\pi$ and $\theta$ as}
\begin{align*}
H(\pi, \theta) = \sum_{\omega\in\Omega} -\pi(\omega)\log\theta(\omega).
\end{align*}
\begin{solution}
TODO
\end{solution}
\item
Show that \eqref{eq:lrce} (and therefore finding the MLE) is equivalent to the following problem:
\begin{equation}
\label{eq:lrlog}
\min_{w\in\mathbb R^d} \sum_{i=1}^n \log(1+\exp(-z_i x_i^\top w))
\end{equation}
where $z_i = 1$ if $y_i=1$ and $z_i = -1$ if $y_i=0$.\\
Note: both \eqref{eq:lrce} and \eqref{eq:lrlog} are referred to as logistic regression.
\begin{solution}
TODO
\end{solution}
\item Let $J(w) = \log(1+\exp(-zx^\top w))$ where, again, $z = 1$ if $y = 1$ and $z = -1$ if $y = 0$ (we are only considering a single $(x, y)$ pair in this subpart). Prove the following:
\begin{enumerate}
\item $J$ is not strictly convex. \\
\emph{Hint: A necessary condition for a twice-differentiable function to be strictly convex is that its Hessian is positive definite.}
\item The gradient descent update rule for minimizing $J(w)$ with learning rate $\epsilon$ is
\begin{align*}
w' = w - \epsilon\left(\frac{1}{1 + e^{-x^{T}w}} - y\right)x
\end{align*}
\end{enumerate}
\begin{solution}
TODO
\end{solution}
\end{enumerate}
\end{enumerate}
\newpage
\Question{Geometry of Ridge Regression}
You recently learned ridge regression and how it differs from ordinary least squares. In this question we will explore the properties of ridge regression in more depth. Recall that the ridge regression problem is given by the following optimization problem:
\begin{align}
\min_{w} \|\vec{y} - \mat{X} \vec{w}\|_2^2 + \nu \|\vec{w}\|_2^2. \label{eq:ridge}
\end{align}
The solution to ridge regression is given by
\begin{align}
\vec{\hat{w}_{r}} = (\mat{X}^\top \mat{X} + \nu \mat{I})^{-1} \mat{X}^\top \vec{y}. \label{eq:ridge_sol}
\end{align}
\begin{enumerate}
\item One reason why we might want to have small weights $\vec{w}$ has to do with the sensitivity of the predictor to its input. Let $\vec{x}$ be a $d$-dimensional list of features corresponding to a new test point. Our predictor is $\vec{w}^\top \vec{x}$. What is an upper bound on how much our prediction could change if we added noise $\vec{\epsilon} \in \mathbb{R}^d$ to a test point's features $\vec{x}$, in terms of $\|\vec{w}\|_2$ and $\|\vec\epsilon\|_2$? \\
\emph{Hint: Use the Cauchy-Schwarz inequality.}
\begin{solution}
TODO
\end{solution}
\newpage
\item Note that in computing $\vec{\hat{w}_r}$, we are trying to invert the matrix $\mat{X}^\top \mat{X} + \nu \mat{I}$ instead of the matrix $\mat{X}^\top \mat{X}$. If $\mat{X}^\top \mat{X}$ has eigenvalues $\sigma_1^2, \ldots, \sigma_d^2$, what are the eigenvalues of $(\mat{X}^\top \mat{X} + \nu \mat{I})^{-1}$? Comment on why adding the regularizer term $\nu \mat{I}$ can improve the inversion operation numerically.
\begin{solution}
TODO
\end{solution}
\newpage
\item Let the number of parameters $d = 4$ and the number of datapoints $n = 6$, and let the eigenvalues of $\mat{X}^\top \mat{X}$ be given by $500$, $10$, $1$, and $0.001$. We must now choose between two regularization parameters $\nu_1 = 50$ and $\nu_2 = 0.1$. Which do you think is a better choice for this problem and why?
\begin{solution}
TODO
\end{solution}
\newpage
\item Another advantage of ridge regression can be seen for under-determined systems. Say we have the data drawn from a $d = 5$ parameter model, but only have $n = 4$ training samples of it, i.e. $\mat{X} \in \R^{4 \times 5}$. Now this is clearly an underdetermined system, since $n < d$. Show that ridge regression with $\nu > 0$ results in a unique solution, whereas ordinary least squares has an infinite number of solutions.
\emph{Hint:} To make this point, it may be helpful to consider $\vec{w} = \vec{w}_0 + \vec{w}^*$ where $\vec{w}_0$ is in the null space of $\mat X$ and $\vec{w}^*$ is a solution.
\begin{solution}
TODO
\end{solution}
\newpage
\item What will the solution to ridge regression \eqref{eq:ridge_sol} converge to if you take the limit $\nu \rightarrow 0$? Your answer should be a simple expression in terms of $\mat{U}, \mat{\Sigma}, \mat{V}, \vec{y}, \text{and } \nu$ where $\mat{X} = \mat{U}\mat{\Sigma}\mat{V}^{T}$ is the SVD of $\mat{X}$.
\begin{solution}
TODO
\end{solution}
\newpage
\item Tikhonov regularization is a general term for ridge regression, where the implicit constraint set takes the form of an ellipsoid instead of a ball. In other words, we solve the optimization problem
\begin{align*}
\vec{w} = \arg\min_{\vec{w}} \|\vec{y} - \mat{X}\vec{w}\|_2^2 + \nu\|\mat{\Gamma}\vec{w}\|_2^2
\end{align*}
for some full rank matrix $\mat{\Gamma} \in \R^{d \times d}$. Derive a closed form solution for $\vec{w}$.
\begin{solution}
TODO
\end{solution}
\end{enumerate}
\newpage
\Question{Robotic Learning of Controls from Demonstrations and Images}
Huey, a home robot, is learning to retrieve objects from a cupboard. The goal is to push obstacle objects out of the way to expose a goal object. Huey's robot trainer, Anne, provides demonstrations via tele-operation. When tele-operating the robot, Anne can look at the images captured by the robot and provide controls to Huey remotely.
During a demonstration, Huey records the RGB images of the scene for each of the $n$ timesteps, $x_1,...,x_{n}$, where $x_i \in \mathbb{R}^{30\times 30\times 3}$ and the controls for his body for each of the $n$ timesteps, $u_1,\ldots,u_{n}$, where $u_i \in \mathbb{R}^3$. The controls correspond to making small changes in the 3D pose (i.e. translation and rotation) of his body. Examples of the data are shown in the figure.
Under an assumption (sometimes called the Markovian assumption) that all that matters for the current control is the current image, Huey can try to learn a linear \emph{policy} $\pi$ (where $\pi \in \mathbb{R}^{2700\times 3}$) which linearly maps image states to controls (i.e. $\pi^\top x =u$). We will now explore how Huey can recover this policy using linear regression.
Note the dimensions in this problem! Previously, you saw linear regression in problems in which the learned weight $w^*$ was a vector and the predicted value $y$ was a scalar. Here, we are predicting 3D controls. This means that the learned policy is a matrix. In essence, we are performing $3$ regressions at the same time, one for each element of the predicted control $u$.
Please stick to {\bf numpy} (and {\bf numpy.linalg}) only for performing any computations in this assignment. We will ask that you edit the file \texttt{robotic\_ridge\_code.py} directly, instead of working in a Python notebook, and submit it to the Gradescope autograder after you are finished. Please don't rename the file, or change any of the function signatures!
\begin{enumerate}[(a)]
\item To get familiar with the structure of the data, \textbf{please visualize the 0th, 10th and 20th images in the training dataset. Also find their corresponding control vectors.} \\
Note: the training and testing images are currently stored as float32 numpy arrays, with pixel values in the range [0.0, 255.0]. You may have to convert to these images to the np.uint8 format to visualize them.
\begin{solution}
TODO
\end{solution}
\newpage
\item Load the $n$ training examples from \texttt{x\_train.p} and compose the matrix $X$, where $X \in \mathbb{R}^{n\times 2700}$. Note that you will need to flatten the images and reduce them to a single vector. The flattened image vector will be denoted by $\bar{x}$ (where $\bar{x} \in \mathbb{R}^{2700\times 1}$). Next, load the $n$ examples from \texttt{y\_train.p} and compose the matrix $U$, where $U \in \mathbb{R}^{n\times 3}$. Try to perform ordinary least squares by forming the matrix $(X^\top X)^{-1}X^\top$ for solving
\[\min_{\pi} \|X\pi-U \|_F\]
in order to learn the optimal \emph{policy} $\pi^* \in \mathbb{R}^{2700 \times 3}$. \textbf{Report what happens as you attempt to do this and explain why}.
\begin{solution}
TODO
\end{solution}
\newpage
\item Now try to perform ridge regression:
\[\min_{\pi} \: \|X\pi - U\|_F^2 + \lambda \|\pi\|_F^2\]
on the dataset for regularization values $\lambda = \lbrace 0.1, 1.0, 10, 100, 1000 \rbrace$. Measure the average squared Euclidean distance for the accuracy of the policy on the training data:
\[\frac{1}{n}\sum_{i = 0}^{n-1} \|\bar{x}_i^T \pi - u_i^\top \|^2_2\]
In the expression above, we are taking the $\ell_2$ norm of a row vector, which here we take to mean the $\ell_2$ norm of the column vector we get by transposing it.
\textbf{Report the training error results for each value of $\lambda$}.
\begin{solution}
TODO
\end{solution}
\newpage
\item Next, we are going to try standardizing the states. For each pixel value in each data point, $\bar{x}$, perform the following operation:
\[\bar{x} \mapsto \frac{\bar{x}}{255} \times 2 - 1\]
We know that the maximum pixel value is $255$, so this operation rescales the data to be in the range $[-1, 1]$.
\textbf{Repeat the previous part and report the average squared training error for each value of $\lambda$}. \\
\begin{solution}
TODO
\end{solution}
\newpage
\item Evaluate both \emph{policies} (i.e. with and without standardization) on the new validation data \texttt{x\_test.p} and \texttt{y\_test.p} for the different values of $\lambda$. \textbf{Report the average squared Euclidean loss and qualitatively explain how changing the values of $\lambda$ affects the performance in terms of bias and variance}.
\begin{solution}
TODO
\end{solution}
\newpage
\item To better understand how standardizing improved the loss function, we are going to evaluate the \emph{condition number} $\kappa$ of the optimization problem above, which is defined as
\[\kappa = \frac{\sigma_{\mbox{max}}(X^TX+\lambda I)}{\sigma_{\mbox{min}}(X^TX+\lambda I)}\]
or the ratio of the maximum singular value to the minimum singular value of the relevant matrix. Roughly speaking, the condition number of the optimization process measures how stable the solution will be when some error exists in the observations. More precisely, given a linear system $Ax=b$, the condition number of the matrix $A$ is the maximum ratio of the relative error in the solution $x$ to the relative error of $b$.
For the regularization value of $\lambda = 100$, {\bf report the condition number with the standardization technique applied and without}.
\begin{solution}
TODO
\end{solution}
\end{enumerate}
\newpage
\Question{Honor Code}
\begin{enumerate}
\item
\textbf{List all collaborators. If you worked alone, then you must explicitly state so.}
\begin{solution}
TODO
\end{solution}
\item
\textbf{Declare and sign the following statement}:
\textit{``I certify that all solutions in this document are entirely my own and that I have not looked at anyone else's solution. I have given credit to all external sources I consulted."}
\textit{Signature} : \hrulefill
While discussions are encouraged, \emph{everything} in your solution must be your (and only your) creation.
Furthermore, all external material (i.e., \emph{anything} outside lectures and assigned
readings, including figures and pictures) should be cited properly.
We wish to remind you that the consequences of academic misconduct are \emph{particularly severe}!
\begin{solution}
TODO
\end{solution}
\end{enumerate}
\end{document}