import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import seaborn as sns
# for creating animations
import matplotlib.animation
from IPython.display import HTML
from IPython.display import Image, Video
Introduction
In this notebook we will walk through several gradient-based optimization techniques from scratch.
We will start with a simple introduction to Gradient Descent, and work our way up to a vectorized implementation of Mini-Batch Stochastic Gradient Descent with Momentum, and even ADAM.
In so doing, we will fully explain all the required numpy and linear algebra along the way.
This notebook aims to walk through all of these steps in detail, and by the end, we will generate the following animation which shows how various optimization algorithms behave on a Multi-variate Polynomial Regression problem:
# styling additions
from IPython.display import HTML
# style = "<style>div.warn{background-color: #fcf2f2;border-color: #dFb5b4; border-left: 5px solid #dfb5b4; padding: 0.5em;}</style>"
= "<style>div.info{padding: 15px; border: 1px solid transparent; border-left: 5px solid #dfb5b4; border-color: transparent; margin-bottom: 10px; border-radius: 4px; background-color: #fcf8e3; border-color: #faebcc;}</style>"
style HTML(style)
Animations
This notebook can (re)generate animations of the optimization procedures, but that can take quite some time. In its place, you can use the pre-generated GIFs/MP4s by toggling this flag:
regenerate_anim = False
Simple Linear Regression
Lets first start by understanding a basic linear regression problem. Say we have the following unknown but true function:
def f(x):
    return 3*x + 1
Note: We only know it because we're using it to simulate our data. From this true function, we take noisy measurements:
N = 100
x = np.linspace(-2,2,N)
y = f(x) + np.random.randn(N) # our zero mean gaussian noise
plt.scatter(x,y);
Hypothesis Space
Now this is our dataset. Say we want to fit a line to this dataset, so our hypothesis takes the form:
h(x) = mx+b
This model has two parameters that we have to learn: m and b.
Although this is not typical in linear regression, we can also call them weights:
h(x) = w_0 + w_1x
Another thing we can do now, which will be useful to us later, is introduce a notational convenience: x_0 = 1. This lets us rewrite our model as:
h(x) = w_0 x_0 + w_1x_1
where our original variable x we’re now calling x_1.
Remember! This is just a “trick” of notation, because x_0 is defined to be 1, so we actually have:
\begin{align} h(x) &= w_0 x_0 + w_1x_1 \\ &= w_0 (1) + w_1x_1 \\ &= w_0 + w_1x_1 \\ &= b + mx \end{align}
This notational trick now lets us write out our model in succinct linear algebraic notation as a dot product using vectors:
\begin{align} h(\mathbf{x}) &= \mathbf{w}\cdot\mathbf{x} \\ &= w_0x_0 + w_1x_1 \\ &= \mathbf{w}^T \mathbf{x} \end{align}
Lets quickly remind ourselves about column vectors, row vectors and the dot product:
We usually assume all vectors are written as column vectors, meaning they look like this: \mathbf{a} = \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}
So when we transpose them, they become row vectors: \mathbf{b}^T = \begin{bmatrix} b_0 & b_1 \end{bmatrix}
This lets us write our dot product as a vector-vector multiplication of a row vector and a column vector: \begin{align} \mathbf{a} \cdot \mathbf{b} &= a_0b_0 + a_1b_1 \\ &= \mathbf{a}^T\mathbf{b} \\ &= \mathbf{b}^T\mathbf{a} \end{align}
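To make this concrete in numpy (an illustrative aside with throwaway vectors; the rest of the notebook does not use these names):

a = np.array([[1.0], [2.0]])             # a column vector: shape (2, 1)
b = np.array([[3.0], [4.0]])             # another column vector: shape (2, 1)

print(a.T.dot(b))                        # a row vector times a column vector -> [[11.]]
print(b.T.dot(a))                        # same value, since the dot product is symmetric
print(np.dot(a.ravel(), b.ravel()))      # the plain dot product of the flattened vectors -> 11.0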
So lets summarize: we want to learn a linear model for our dataset:
\hat{y}=h(x)=mx+b
meaning we want a slope: m, and an intercept: b.
We have restated this problem in more linear algebraic notation, as wanting to find a 2-dimensional weight vector: \mathbf{w} for our linear model:
h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}
So we can interpret the components of \mathbf{w} as:
\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} intercept \\ slope \end{bmatrix}
Now, lets go one step further with our notation! Note that as we’ve specified it above,
h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}
only specifies a single \hat{y}, for a single example of our dataset: \mathbf{x}.
So, to abuse notation even further (trust me, I don't want to, but this is how it's almost always presented, so we might as well get used to it!) we can make this explicit by saying:
\hat{y}_i = h(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i
We want to generate our estimates for each input example \mathbf{x}_i. We can picture what we want like this:
\begin{align} \hat{y}_1 &= h(\mathbf{x}_1) = \mathbf{w}^T\mathbf{x}_1 \\ \hat{y}_2 &= h(\mathbf{x}_2) = \mathbf{w}^T\mathbf{x}_2 \\ \ldots \\ \hat{y}_i &= h(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i \\ \ldots \\ \hat{y}_N &= h(\mathbf{x}_N) = \mathbf{w}^T\mathbf{x}_N \\ \end{align}
I've laid this out suggesting we can do it with linear algebra, and indeed we can! We want to form a vector of our estimates, \mathbf{\hat{y}}, which is of size (N \times 1). To do so, we need to calculate the following operation:
\begin{bmatrix} \mathbf{w}^T\mathbf{x}_1 \\ \mathbf{w}^T\mathbf{x}_2 \\ \vdots \\ \mathbf{w}^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}^T\mathbf{x}_N \\ \end{bmatrix}
This is exactly the result of a matrix-vector product of our dataset matrix X, with our weight vector \mathbf{w}!
X\mathbf{w} = \begin{bmatrix} \mathbf{w}^T\mathbf{x}_1 \\ \mathbf{w}^T\mathbf{x}_2 \\ \vdots \\ \mathbf{w}^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}^T\mathbf{x}_N \\ \end{bmatrix}
We can quickly verify that the sizes indeed work out: \underbrace{X}_{(N \times 2)} \underbrace{\mathbf{w}}_{(2 \times 1)} = \begin{bmatrix} \mathbf{w}^T\mathbf{x}_1 \\ \mathbf{w}^T\mathbf{x}_2 \\ \vdots \\ \mathbf{w}^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}^T\mathbf{x}_N \\ \end{bmatrix}
So now we have a formula which lets us take a single weight vector \mathbf{w} and apply it to our entire dataset at once, to generate all of our estimates at once:
X\mathbf{w} = \mathbf{\hat{y}}
Dataset
Lets take a second to talk about our dataset matrix X. Above, we did the following to look at our dataset:
plt.scatter(x,y);
At first glance, we don’t see any matrix here! Our x’s are only 1 dimensional! Well, recall the convention that x_0=1! So our dataset matrix is actually:
X = \begin{bmatrix} 1 & x_1 \\ \ldots \\ 1 & x_i \\ \ldots \\ 1 & x_N \end{bmatrix}
where now i is indexing from 1,\ldots,N - the size of our dataset. So lets go ahead and form it:
X = np.ones((N,2))
X[:,1] = x
X[0:5,:]
array([[ 1. , -2. ],
[ 1. , -1.95959596],
[ 1. , -1.91919192],
[ 1. , -1.87878788],
[ 1. , -1.83838384]])
Now lets assemble our output vector:
y = y[:,None] # or y = y[:, np.newaxis]
print(y.shape)
y[0:5,:]
(100, 1)
array([[-4.54437285],
[-6.75779464],
[-5.2087573 ],
[-4.73313111],
[-5.35793348]])
Broadcasting
a = np.random.rand(3,1)
a.shape, a
((3, 1),
array([[0.09708829],
[0.71928046],
[0.79971499]]))
b = np.random.rand(3)
b.shape, b
((3,), array([0.46586583, 0.67176581, 0.4836771 ]))
As we can see, at first glance these are both vectors of length 3…right? Lets subtract them and see what happens:
result = a-b
result.shape, result
((3, 3),
array([[-0.36877755, -0.57467753, -0.38658881],
[ 0.25341463, 0.04751465, 0.23560336],
[ 0.33384915, 0.12794917, 0.31603789]]))
You-should-know: This is not what you might expect as a mathematician, but you must familiarize yourself with, and come to love, broadcasting as a data scientist!
To prevent having to squeeze things after a dot product, I think it's just easier to ensure that y has the proper dimension.
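A minimal way to avoid this particular surprise (a sketch using the a and b from the cells above; b_col is just a throwaway name) is to give b an explicit trailing dimension before doing arithmetic with column vectors:

# give b an explicit column shape so the subtraction is elementwise, not broadcast out to (3, 3)
b_col = b[:, np.newaxis]        # or equivalently b.reshape(-1, 1)
print(b_col.shape)              # (3, 1)
print((a - b_col).shape)        # (3, 1) -- an elementwise difference, as a mathematician would expect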
Cost function
Now that we understand our model, we can talk about how exactly we’re going to learn what the best weights should be.
In order to learn, we need a measure of distance or cost or quality of our current guess, which tells us how good/bad we're doing so far. As we've discussed in class, a natural loss for regression is the regression loss, or the mean squared error (MSE) loss:
C(\mathbf{w}) = \underbrace{\frac{1}{N} \sum_{i=1}^N}_{\text{mean}} \;\; \underbrace{(\hat{y}_i-y_i)^2}_{\text{squared error}}
Pause and Ponder: Often times, you'll see this cost function defined with a \frac{1}{2N} instead of just \frac{1}{N}.
- Why do you think that is?
- Do you think that will affect our learning procedure?
The cost function given above looks a bit strange at first. I've written it as a function of our weight vector \mathbf{w}, yet \mathbf{w} does not appear on the right hand side, while \hat{y}_i does. Recall that all of our individual \hat{y}_i are generated from one weight vector. We could also rewrite the above as:
C(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^N \;\; (\mathbf{w}^T\mathbf{x}_i-y_i)^2
So, we can now calculate this for our entire dataset! Lets do so below, for a random weight vector:
w_rand = np.random.random(size=(2,1))
w_rand
array([[0.51374312],
[0.57865621]])
Now, lets loop through all N data points and calculate this cost:
cost = 0
for i in range(N):
    x_i = X[i,:]
    y_hat_i = w_rand.T.dot(x_i)
    error_i = y_hat_i - y[i]
    cost = cost + error_i**2

cost = np.squeeze(cost/N) # the squeeze just gets rid of empty dimensions
cost
array(10.13753696)
So this is our cost for a random weight vector. Lets define this as a function, so we can plot it over the space of w_0 and w_1.
Recall: this is the same as plotting the cost over the space of intercepts (w_0) and slopes (w_1)
def cost(dataset,w):
    N = dataset.shape[0]
    cost = 0
    for i in range(N):
        x_i = dataset[i,:]
        y_hat_i = w.T.dot(x_i)
        error_i = y_hat_i - y[i]
        cost = cost + error_i**2

    return np.squeeze(cost/N)
Lets verify it works as we expect:
cost(X,w_rand)
array(10.13753696)
Perfect - lets now calculate our costs over the entire grid of potential slopes and intercepts:
Note: The below is slow! But right now I'm purposefully not vectorizing anything so it follows our intuition
len_w = 100
w_0 = np.linspace(0,2,len_w) # possible bias
w_1 = np.linspace(2,4,len_w) # possible slopes
costs = np.zeros((len_w,len_w))
for i in range(len_w):
    for j in range(len_w):
        current_w = np.asarray([w_0[i],w_1[j]])
        costs[i,j] = cost(X,current_w)
Wow! That took a while!
Now lets actually plot our cost surface along with a contour plot and the real minimum highlighted.
Note: you can skip the plotting details in this cell if you want.
# plotting
W_0, W_1 = np.meshgrid(w_0, w_1)
Costs = costs.reshape(W_0.shape)

fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111,projection='3d')
ax.set_xlabel('w_0')
ax.set_ylabel('w_1')
ax.set_zlabel('cost')
ax.set_zlim([-2,3])

ax.plot_surface(W_0, W_1, Costs, cmap='inferno');
ax.contour(W_0, W_1, Costs, cmap='inferno', offset=-2, levels=15);
ax.scatter(1,3,-2,s=30,color='r'); # plot the known minimum
So now we can actually see our cost surface as a function of our weight vector!
Pause and Ponder: Why is it that our true minimum, shown in red, is not the minimum of our cost surface?
Vectorized Cost function
We’ve come a long way so far, and we haven’t even learned anything! Before we move forward with that, lets vectorize our cost function.
Vectorization is often used to mean “rewrite in linear-algebraic notation”. This results in much faster code, since our numerical libraries have been optimized to operate on vectors/matrices and not for loops!
The function we just explored was:
C(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^N \;\; (\mathbf{w}^T\mathbf{x}_i-y_i)^2
Before we proceed, lets take a quick aside to remind ourselves of the definition of the \mathcal{L}_2-norm:
The \mathcal{L}_2-norm of a vector \mathbf{a} is:
\|\mathbf{a}\|_2 = \sqrt{a_1^2 + \ldots + a_n^2}
We are often interested in the squared \mathcal{L}_2-norm of a vector:
\|\mathbf{a}\|_2^2 = a_1^2 + \ldots + a_n^2
We are also often interested in the squared \mathcal{L}_2-norm of some kind of distance or error vector:
\|\mathbf{a}-\mathbf{b}\|_2^2 = (a_1-b_1)^2 + \ldots + (a_n-b_n)^2
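As a quick numerical check of these definitions (an illustrative aside with a throwaway vector):

a = np.array([3.0, 4.0])
print(np.linalg.norm(a))       # 5.0, the L2 norm
print(np.linalg.norm(a)**2)    # 25.0, the squared L2 norm
print(np.sum(a**2))            # 25.0, the same thing written as a sum of squares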
Now, lets use the fact that we can calculate all of our \hat{y}_i at once, and rewrite the cost in terms of the \mathcal{L}_2-norm we just defined above:
\begin{align} C(\mathbf{w}) &= \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|_2^2 \\ &= \frac{1}{N} \| \mathbf{\hat{y}} - \mathbf{y} \|^2_2 \\ &= \frac{1}{N} \left[ (\hat{y}_1 - y_1)^2 + \ldots + (\hat{y}_N-y_N)^2 \right] \end{align}
which is exactly what we want! So now we can calculate our cost as:
cost = np.linalg.norm(X.dot(w_rand)-y)**2/X.shape[0]
cost
10.137536961317753
This is exactly what we had above! Lets now re-define our vectorized cost function:
def cost(X, y, w):
    return np.linalg.norm(X.dot(w)-y,axis=0)**2/X.shape[0]
Lets re-calculate our costs over a grid and replot it!
len_w = 100
w_0 = np.linspace(0,2,len_w)
w_1 = np.linspace(2,4,len_w)
W_0, W_1 = np.meshgrid(w_0, w_1)
W_0.shape, W_1.shape
((100, 100), (100, 100))
# combine and reshape mesh for calculation
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)
Pay attention to how much faster this cell is than the double for loop above!
cost(X,y,wgrid)
array([3.78144022, 3.74421484, 3.7078057 , ..., 3.3598471 , 3.40179731,
3.44456376])
# calculate on grid, and reshape back for plotting
costs = cost(X,y,wgrid).reshape(W_0.shape)
Wow! That was fast! Now lets plot it!
# plotting
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111,projection='3d')
ax.set_xlabel('w_0')
ax.set_ylabel('w_1')
ax.set_zlabel('cost')
ax.set_zlim([-2,3])

ax.plot_surface(W_0, W_1, costs, cmap='inferno');
ax.contour(W_0, W_1, costs, cmap='inferno', offset=-2, levels=10);
ax.scatter(1,3,-2,s=30,color='r'); # plot the known minimum
It works!
Gradient of our cost
Now, we can really get going! Lets start by giving the gradient of this cost function (recall: we calculated this in class!). Lets also call this cost what it is, the in-sample error:
E_{in}(\mathbf{w}) = C(\mathbf{w}) = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|_2^2
So its gradient is:
\begin{align} \nabla E_{in}(\mathbf{w}) &= \frac{2}{N}(X^TX\mathbf{w} - X^T\mathbf{y}) \end{align}
Recall an earlier pause-and-ponder where I asked about the definition of the cost function with a \frac{1}{2N} vs \frac{1}{N}!
Lets also note an underappreciated property of this function which will become relevant later:
Pause-and-ponder: This gradient is a function of our entire dataset! This means that at each point in weight space, to calculate the gradient we need to process our entire dataset!
Keep that point in mind as we will come back to it later!
Gradient Descent for Linear Regression
Now we have all the pieces together to start talking about and implement gradient descent! Recall the weight update equation we gave in class:
\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \eta \hat{\boldsymbol{v}}
where \hat{\boldsymbol{v}} is the unit-vector in the direction we want to step.
For gradient descent, this becomes:
The gradient descent algorithm is just a step in the direction of the negative gradient: \boldsymbol{w}(t+1) = \boldsymbol{w}(t) -\eta\cdot\nabla f(\boldsymbol{w}(t))
Lets look at this for the special case of linear regression, by starting with the weight update equation:
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla E_{in}
Lets move \mathbf{w}_t to the other side so we can focus on the change in our weights:
\begin{align} \Delta \mathbf{w} &= \mathbf{w}_{t+1} - \mathbf{w}_t \\ &= - \eta \nabla E_{in} \\ &= - \eta \frac{2}{N} (X^TX\mathbf{w} - X^T\mathbf{y}) \end{align}
We can now write the GD algorithm for linear regression:
The gradient descent algorithm for linear regression is:
\begin{align} \boldsymbol{w}(t+1) &= \boldsymbol{w}(t) -\eta \nabla E_{in} \\ &= \boldsymbol{w}(t) - \eta \frac{2}{N} (X^TX\mathbf{w}(t) - X^T\mathbf{y}) \end{align}
Lets first define a function to calculate this gradient:
def cost_grad(X,y,w):
    N = X.shape[0]
    return (2/N)*(X.T.dot(X).dot(w)-X.T.dot(y))
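As a quick sanity check (purely optional; a sketch that assumes the vectorized cost, the cost_grad we just defined, and the X, y, and w_rand from above are still in scope, and numerical_grad is just a throwaway helper name), we can compare this analytic gradient against a central finite-difference approximation:

def numerical_grad(X, y, w, h=1e-6):
    # approximate each partial derivative of the cost with a central difference
    grad = np.zeros_like(w, dtype=float)
    for j in range(w.shape[0]):
        e = np.zeros_like(w, dtype=float)
        e[j] = h
        grad[j] = (cost(X, y, w + e) - cost(X, y, w - e)) / (2*h)
    return grad

print(cost_grad(X, y, w_rand).ravel())       # analytic gradient
print(numerical_grad(X, y, w_rand).ravel())  # should agree to several decimal places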
Out of curiosity, lets visualize it!
len_w = 100
w_0 = np.linspace(0,2,len_w)
w_1 = np.linspace(2,4,len_w)
W_0, W_1 = np.meshgrid(w_0, w_1)

# combine and reshape mesh for calculation
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)
# calculate on grid, and reshape back for plotting
costs = cost(X,y,wgrid).reshape(W_0.shape)

# calculate grad on grid, and reshape back for plotting
grad_costs = cost_grad(X,y,wgrid).reshape(2,W_0.shape[0],W_0.shape[1])
# plotting
fig = plt.figure(figsize=(15,7))
ax1 = fig.add_subplot(131,projection='3d')
ax1.set_xlabel('w_0')
ax1.set_ylabel('w_1')
ax1.set_zlabel('cost')
ax1.set_zlim([-2,3])
ax1.set_title('Cost')

ax1.plot_surface(W_0, W_1, costs, cmap='inferno');
ax1.contour(W_0, W_1, costs, cmap='inferno', offset=-2, levels=10);
ax1.scatter(1,3,-2,s=30,color='r'); # plot the known minimum

ax2 = plt.subplot(132, projection='3d', sharex=ax1, sharey=ax1)
ax2.set_xlabel('w_0')
ax2.set_ylabel('w_1')
ax2.set_zlabel('grad')
ax2.set_title('Gradient wrt w_0')
ax2.plot_surface(W_0, W_1, grad_costs[0,:], cmap='inferno');
ax2.contour(W_0, W_1, costs, cmap='inferno', offset=-2, levels=10);
ax2.scatter(1,3,-2,s=30,color='r'); # plot the known minimum

ax3 = plt.subplot(133, projection='3d', sharex=ax1, sharey=ax1)
ax3.set_xlabel('w_0')
ax3.set_ylabel('w_1')
ax3.set_zlabel('grad')
ax3.set_title('Gradient wrt w_1')
ax3.plot_surface(W_0, W_1, grad_costs[1,:], cmap='inferno');
ax3.contour(W_0, W_1, costs, cmap='inferno', offset=-2, levels=10);
ax3.scatter(1,3,-2,s=30,color='r'); # plot the known minimum
Pause-and-ponder: What does this mean? 🧐 Think about this!
Implementation
Lets actually implement our weight updates now!
def gradient_descent(dataset, labels, w_init=None, eta=None, max_iter=300, quiet=False):
    if w_init is None:
        w_init = np.random.rand(dataset.shape[1],1)
    if eta is None:
        eta = 0.01

    # clean up initial point sizes:
    w_init = np.asarray(w_init).reshape(dataset.shape[1],1)

    # gradient descent params
    delta = 0.001
    delta_w = 1
    iterations = 0

    # save weights and costs along path
    wPath = [w_init]
    cPath = [cost(dataset,labels,w_init)]

    # take first step
    w_old = w_init

    # perform GD
    while True:
        # check to see if you should terminate
        if iterations > max_iter:
            if not quiet: print("terminating GD after max iter:",max_iter)
            break
        if delta_w < delta:
            if not quiet: print(f"terminating GD after {iterations} iterations with weight change of {delta}.")
            break

        # update your weights
        # STEP 1
        w_new = w_old - eta*cost_grad(dataset,labels,w_old)

        # update paths
        wPath.append(w_new)
        cPath.append(cost(dataset,labels,w_new))

        # update iteration and weight change
        iterations += 1
        delta_w = np.linalg.norm(w_new - w_old)

        # STEP 2
        # make your previous new weights the current weights
        w_old = w_new

    # clean up path variables
    wPath = np.squeeze(np.asarray(wPath))
    cPath = np.squeeze(np.asarray(cPath))

    return w_new, wPath, cPath
w_gd, wPath_gd, cPath_gd = gradient_descent(X, y, [0.25,2.4], eta=0.2)
w_gd
terminating GD after 12 iterations with weight change of 0.001.
array([[0.92994594],
[3.11226582]])
Plot Path
Now lets plot the path GD took!
# plotting
len_w = 100
w_0 = np.linspace(0,2,len_w)
w_1 = np.linspace(2,4,len_w)
W_0, W_1 = np.meshgrid(w_0, w_1)

# combine and reshape mesh for calculation
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)

# calculate on grid, and reshape back for plotting
costs = cost(X,y,wgrid).reshape(W_0.shape)

fig = plt.figure(figsize=(15,7))
ax1 = fig.add_subplot(121,projection='3d')
ax1.set_xlabel('w_0')
ax1.set_ylabel('w_1')
ax1.set_zlabel('cost')
ax1.set_zlim([-2,3])
ax1.set_title("cost surface")

ax1.scatter(wPath_gd[:,0],wPath_gd[:,1],-2,s=10,c=cPath_gd);
ax1.plot_surface(W_0, W_1, costs,cmap='inferno');
ax1.contour(W_0, W_1, costs, cmap='inferno', offset=-2,levels=10,alpha=0.5);

ax2 = fig.add_subplot(122);
ax2.plot(cPath_gd);
ax2.set_title("Cost during GD")
ax2.set_xlabel("Iteration");
Lets define a quick plotting function to make it easier to re-use:
def plot_path(wPath,cPath,colorby='cost'):
    if colorby == 'cost':
        colorer = cPath
    else:
        colorer = range(len(wPath))

    # plotting
    len_w = 100
    w_0 = np.linspace(0,2,len_w)
    w_1 = np.linspace(2,4,len_w)
    W_0, W_1 = np.meshgrid(w_0, w_1)

    # combine and reshape mesh for calculation
    wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)

    # calculate on grid, and reshape back for plotting
    costs = cost(X,y,wgrid).reshape(W_0.shape)

    fig = plt.figure(figsize=(15,7))
    ax1 = fig.add_subplot(121,projection='3d')
    ax1.set_xlabel('w_0')
    ax1.set_ylabel('w_1')
    ax1.set_zlabel('cost')
    ax1.set_zlim([-2,3])
    ax1.scatter(wPath[:,0],wPath[:,1],-2,s=10,c=colorer,cmap='inferno');
    ax1.plot_surface(W_0, W_1, costs,cmap='inferno');
    ax1.contour(W_0, W_1, costs, cmap='inferno', offset=-2,levels=10,alpha=0.5);

    ax2 = fig.add_subplot(122);
    ax2.plot(cPath);
    ax2.set_title("Cost during training")
    ax2.set_xlabel("Iteration");
Go back, and try rerunning gradient descent with different initial values and plot:
_,wPath_gd,cPath_gd = gradient_descent(X,y,[1.9,3.9])
plot_path(wPath_gd,cPath_gd)
terminating GD after 151 iterations with weight change of 0.001.
from ipywidgets import interact, interactive, fixed, interact_manual, FloatSlider

# @interact(w_0=(0,2,0.1), w_1=(2,4,0.1))
def interactive_gd(w_0,w_1):
    _,wPath_gd,cPath_gd = gradient_descent(X,y,[w_0,w_1])
    plot_path(wPath_gd,cPath_gd)

interact(interactive_gd,
         w_0=FloatSlider(min=0,max=2,step=0.1,continuous_update=False),
         w_1=FloatSlider(min=2,max=4,step=0.1,continuous_update=False));
Ok. Now we should have a much better grasp of Gradient Descent for Linear Regression!
Multivariate Linear Regression
The best part of the formalism above is that it carries over to higher order models! For example, lets run the same analysis as above, except now we are trying to fit a 3rd-order polynomial:
def f(x):
    return 4*x**3 + 4*x**2 + 3*x - 4

x = np.linspace(-2,2,100)
y = f(x) + np.random.randn(len(x))*5

plt.figure(figsize=(12,7))
plt.scatter(x,y);
plt.plot(x,f(x),'--r',alpha=0.5);
The model we are trying to fit here can now be written as:
h(x) = w_0 + w_1x + w_2x^2 + w_3x^3
So just like we learned 2 parameters for a line, for a 3rd-order polynomial we are learning 4:
h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}
For
\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{bmatrix}, \;\;\quad\;\; \mathbf{x} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix}
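As an aside, numpy can also build this kind of polynomial design matrix in one call with np.vander (using increasing=True so the columns come out as 1, x, x^2, x^3). This is just an alternative sketch, and the notebook sticks with the explicit hstack below; X_vander is a throwaway name:

# an equivalent design matrix: columns are 1, x, x^2, x^3
X_vander = np.vander(x, 4, increasing=True)
X_vander[0:3,:]   # compare with the first few rows of the hstack version below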
Lets form our dataset/design matrix as above by stacking:
X = np.hstack((x[:,np.newaxis]*0+1,
               x[:,np.newaxis],
               x[:,np.newaxis]**2,
               x[:,np.newaxis]**3))
So now each row of X is a sample of our dataset, and each column is a dimension:
print("sample:",X[1,:])
sample: [ 1. -1.95959596 3.84001632 -7.52488047]
Lets also redefine our y’s as a column vector:
y = y[:,np.newaxis]
y.shape
(100, 1)
Because of how we defined our cost and cost_grad functions, we should be able to use them without modification! Lets calculate the cost on a random 4-dimensional weight vector:
w_rand = np.random.rand(4,1)*5
w_rand
array([[4.40486757],
[2.19318454],
[2.83438899],
[4.02340168]])
cost(X,y,w_rand)
array([77.98811223])
Perfect! Unfortunately, we can’t easily visualize it, so lets instead plot the GD procedure as the polynomial we are fitting:
w_gd,wPath_gd,cPath_gd = gradient_descent(X,y)
w_gd
terminating GD after max iter: 300
array([[-3.4482133 ],
[ 3.50132037],
[ 3.62600224],
[ 4.17202869]])
Recall: this weight vector now defines a polynomial! Lets plot the polynomial given by our final weight vector:
Note: poly1d expects the coefficients in decreasing order, so we have to reverse our weight vector:
coefs = np.flip(np.squeeze(w_gd))
poly_fin = np.poly1d(coefs)
poly_fin
poly1d([ 4.17202869, 3.62600224, 3.50132037, -3.4482133 ])
fig = plt.figure(figsize=(10,7))
ax1 = fig.add_subplot(111)
ax1.scatter(x,y);
ax1.plot(x,poly_fin(x),'r');
Not bad! Lets plot the path that GD took, in terms of our polynomials!
if regenerate_anim:
    fig = plt.figure(figsize=(10,7))

    ax1 = fig.add_subplot(111)
    ax1.scatter(x,y);
    line, = ax1.plot([],[],'--r')

    def animate(i):
        coefs = np.flip(np.squeeze(wPath_gd[i]))
        poly_fin = np.poly1d(coefs)
        line.set_data(x,poly_fin(x));
        ax1.set_title('Iteration: %d - %.2f x^3 + %.2f x^2 + %.2f x + %.2f' % (i, coefs[0], coefs[1], coefs[2], coefs[3]))
        return line

    plt.close()
    anim = matplotlib.animation.FuncAnimation(fig, animate, frames=len(wPath_gd), interval=100)
    # HTML(anim.to_html5_video())
    anim.save('multi_lin_reg.mp4')

Video('multi_lin_reg.mp4',width=800)
Stochastic Gradient Descent
We just got done learning about Gradient Descent and implemented it for a multivariate linear regression. Lets restate a pause-and-ponder point I gave above here again:
Pause-and-ponder: Gradient descent, as given above, is a function of our entire dataset! This means that at each point in weight space, to calculate the gradient we need to process our entire dataset!
This is actually called Batch Gradient Descent, because it operates on the entire batch of data at once. We have to calculate the gradient at every point in our dataset and average them all together, every time. If our dataset is massive, this quickly becomes impractical!
So, at the other end of Batch GD is Stochastic Gradient Descent (SGD), which calculates the gradient at just one point!
Note: SGD as defined here works on just 1 sample. Mini-batch, as we will see below, works on mini-batches of our dataset. Often, when people say SGD they mean mini-batch, or mini-batch-SGD.
That is clearly an estimate for our gradient, but imagine just how noisy it must be!
Lets remind ourselves about our cost and cost_grad implementations by inspecting their source:
import inspect
print(inspect.getsource(cost))
print(inspect.getsource(cost_grad))
def cost(X, y, w):
    return np.linalg.norm(X.dot(w)-y,axis=0)**2/X.shape[0]

def cost_grad(X,y,w):
    N = X.shape[0]
    return (2/N)*(X.T.dot(X).dot(w)-X.T.dot(y))
Note, we didn’t redefine any new functions here, we just printed their original source code as defined above.
SGD Cost
So we see that they are indeed in terms of the entire dataset. Lets define new versions here, specifically for SGD.
Recall for GD, the cost was:
C(\mathbf{w})_{gd} = \frac{1}{N} \sum_{i=1}^N \;\; (\mathbf{w}^T\mathbf{x}_i-y_i)^2
Lets rewrite it to make it a bit more explicit:
C(\mathbf{w})_{gd} = \frac{1}{N} \sum_{i=1}^N \; \color{red}{C_i}(\mathbf{w})
So we see that in GD, we're evaluating a point-wise cost on our entire dataset. In SGD, we only want to evaluate it on one point. So lets write out the cost, and emphasize the dependence on 1 data point:
\begin{align} C(\mathbf{w},\color{red}{x_i,y_i})_{sgd} &= \color{red}{C_i}(\mathbf{w})\\ &= (\mathbf{w}^T\color{red}{\mathbf{x}_i-y_i})^2 \end{align}
Now we can implement this cost, and its gradient:
def cost_sgd(x_i, y_i, w):
    return (w.T.dot(x_i) - y_i)**2

def cost_grad_sgd(x_i, y_i, w):
    return 2*x_i*(w.T.dot(x_i)-y_i)
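As a quick sanity check that these per-sample pieces really do add up to the batch gradient (a sketch assuming the polynomial X, y, and the 4-dimensional w_rand from the cells above are still in scope; avg_grad is a throwaway name), the full-batch gradient should equal the average of the per-sample gradients:

# average the per-sample gradients over the whole dataset ...
avg_grad = np.zeros_like(w_rand, dtype=float)
for i in range(X.shape[0]):
    x_i = X[i,:][:,np.newaxis]                       # one sample, as a column vector
    avg_grad += cost_grad_sgd(x_i, y[i,:], w_rand)
avg_grad /= X.shape[0]

# ... and compare it to the full-batch gradient from before
print(avg_grad.ravel())
print(cost_grad(X, y, w_rand).ravel())               # these should agree up to floating point error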
Implementation
Now we are ready to describe the full SGD algorithm. Since we are estimating our gradient at each training sample, we now define a new term:
Def: An epoch is a full run of a learning procedure through the entire training set.
So we can (and often must) take multiple epochs of training. The tradeoff is that each iteration is now much faster!
Now we can describe the algorithm as:
The SGD algorithm is:
for each epoch:
- Randomly shuffle the training_set
- for each sample in training_set:
  - w_{t+1} = w_{t} - \eta \; \nabla C_i(\mathbf{w}_t)
Lets implement that below. First, a quick look at how np.random.permutation lets us shuffle a dataset by shuffling indices:
A = np.arange(9).reshape(3,3)
A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
indices = np.random.permutation(3)
indices
array([1, 2, 0])
A[indices]
array([[3, 4, 5],
[6, 7, 8],
[0, 1, 2]])
A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
def sgd(dataset, labels, w_init=None, eta=None, epochs=5):
    if w_init is None:
        w_init = np.random.rand(dataset.shape[1],1)
    if eta is None:
        eta = 1e-2

    # dataset
    N = dataset.shape[0]
    d = dataset.shape[1]

    # clean up initial point sizes:
    w_init = np.asarray(w_init).reshape(d,1)

    # SGD params - Note their difference from GD
    delta = 1e-10
    max_iter = 1000

    # initialize loop variables
    delta_w = 1
    iterations = 0

    # save weights and costs along path
    wPath = [w_init]

    # calc cost at that initial value
    idx = np.random.randint(N)
    x_i = dataset[np.newaxis,idx,:].T
    y_i = labels[idx,:]
    cPath = [cost_sgd(x_i,y_i,w_init)]

    # take first step
    w_old = w_init

    # epochs of training
    for epoch in range(epochs):
        # shuffle the dataset - note we do this by shuffling indices
        indices = np.random.permutation(N)
        shuffled_dataset = dataset[indices,:]
        shuffled_labels = labels[indices,:]
        # NOTE: try commenting out the shuffle above and seeing how it behaves

        # for each sample
        for i in range(N):
            # np.newaxis is there to deal with the 1 for the row size
            curr_sample = shuffled_dataset[i,:]
            curr_sample = curr_sample[np.newaxis].T
            curr_label = shuffled_labels[i,:]

            # update your weights with the grad at one sample
            grad = cost_grad_sgd(curr_sample,curr_label,w_old)
            w_new = w_old - eta * grad

            # update paths
            wPath.append(w_new)
            cPath.append(cost_sgd(curr_sample,curr_label,w_new))

            # calculate weight change
            # NOTE to me: ask a HW question here!!
            delta_w = np.linalg.norm(w_new - w_old)

            # make your previous new weights the current weights
            w_old = w_new

        # at end of epoch, check to see if you should terminate
        if delta_w < delta:
            break

    # clean up path variables
    wPath = np.squeeze(np.asarray(wPath))
    cPath = np.squeeze(np.asarray(cPath))

    print(f"terminating SGD after {epoch} epochs with weight change of {delta_w:0.5f}.")
    return w_new, wPath, cPath
SGD on Linear Regression
Lets re-define our simple linear regression from above and see how we do. This time, lets make a function to make it easier to setup comparisons below:
def lin_reg():
    def f(x):
        return 3*x + 1

    N = 100
    x = np.linspace(-2,2,N)
    y = f(x) + np.random.randn(N) # our zero mean gaussian noise

    X = np.ones((N,2))
    X[:,1] = x

    y = y[:,np.newaxis]

    return x,X,y,f
x,X,y,f = lin_reg()
plt.scatter(x,y);
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=[0.75,2.5],eta=1e-2,epochs=100)
w_sgd
terminating SGD after 99 epochs with weight change of 0.03233.
array([[1.04826202],
[3.13976719]])
Plot
Lets plot it!
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=[0.1,2],eta=1e-2,epochs=10)
print(w_sgd)
plot_path(wPath_sgd,cPath_sgd,colorby='iter')
terminating SGD after 9 epochs with weight change of 0.01851.
[[1.16979153]
[3.05990881]]
Pause-and-ponder: What do you notice about this procedure?
Go back and re-run this procedure with different values for your parameters of interest!
Comparison
Lets compare both of them!
w_init = [0,2]
w_gd,wPath_gd,cPath_gd = gradient_descent(X,y,w_init=w_init)
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=w_init,eta=1e-3/2,epochs=200)

plot_path(wPath_gd,cPath_gd);
plot_path(wPath_sgd,cPath_sgd);
terminating GD after 161 iterations with weight change of 0.001.
terminating SGD after 199 epochs with weight change of 0.00005.
Lets look at the contour plots:
num_points = 300
w_0 = np.linspace(0.7,1.3,num_points)
w_1 = np.linspace(2.75,3.25,num_points)
W_0,W_1 = np.meshgrid(w_0,w_1)
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)
costs = cost(X,y,wgrid).reshape(W_0.shape)

plt.figure(figsize=(15,7));
ax1 = plt.subplot(121)
ax1.contour(W_0, W_1, costs, cmap='inferno');
ax1.set_xlabel('w_0');
ax1.set_ylabel('w_1');
ax1.set_xlim([0.7,1.3])
ax1.set_ylim([2.75,3.25])
ax1.scatter(wPath_gd[:,0],wPath_gd[:,1], s=50,c=cPath_gd);

ax2 = plt.subplot(122, sharex=ax1, sharey=ax1)
ax2.contour(W_0, W_1, costs, cmap='inferno');
ax2.scatter(wPath_sgd[:,0],wPath_sgd[:,1], c=range(len(wPath_sgd)));
Pause-and-ponder: What do you notice?
SGD on Polynomial Regression
As above, lets redefine our polynomial regression problem. This time, lets wrap it in a function so it's easier to compare below.
def poly_reg():
    def f(x):
        return 4*x**3 + 4*x**2 + 3*x - 4

    x = np.linspace(-2,2,100)
    y = f(x) + np.random.randn(len(x))*5

    X = np.hstack((x[:,np.newaxis]*0+1,
                   x[:,np.newaxis],
                   x[:,np.newaxis]**2,
                   x[:,np.newaxis]**3))

    y = y[:,np.newaxis]

    return x,X,y,f
x,X,y,f = poly_reg()
plt.figure(figsize=(12,7))
plt.scatter(x,y);
plt.plot(x,f(x),'--r',alpha=0.5);
Now we can run them both and see!
w_init = [0,0,0,0]
w_gd,wPath_gd,cPath_gd = gradient_descent(X,y,w_init=w_init)
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=w_init,eta=1e-2,epochs=100)

w_gd,w_sgd
terminating GD after max iter: 300
terminating SGD after 99 epochs with weight change of 0.16115.
(array([[-3.37519071],
[ 2.35485563],
[ 3.40590893],
[ 4.0224176 ]]),
array([[-4.26754354],
[ 3.14637804],
[ 3.39303763],
[ 5.26249499]]))
Plot it
Before we plot the path, lets plot the overall costs:
fig = plt.figure(figsize=(15,7))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.plot(cPath_gd);
ax2.plot(cPath_sgd);

plt.figure(figsize=(15,10))
plt.plot(cPath_sgd)
Now we can plot the path:
if regenerate_anim:
    fig = plt.figure(figsize=(20,7))

    ax1 = fig.add_subplot(121)
    ax1.scatter(x,y);

    ax2 = fig.add_subplot(122)
    ax2.scatter(x,y);

    line1, = ax1.plot([],[],'--r')
    line2, = ax2.plot([],[],'--r')

    num_frames = 50
    gd_frames = np.linspace(0,len(wPath_gd)-1,num_frames).astype(int)
    sgd_frames = np.linspace(0,len(wPath_sgd)-1,num_frames).astype(int)

    # abs_cost_diff = np.abs(cPath_sgd[sgd_frames] - cPath_gd[gd_frames])
    # ax3 = fig.add_subplot(133)
    # ax3.plot(range(num_frames),abs_cost_diff)
    # scat3 = ax3.scatter([],[],s=100)

    def animate(i):
        gd_frame = gd_frames[i]
        sgd_frame = sgd_frames[i]

        # abs_cost_diff[i] = np.abs(cPath_sgd[sgd_frame] - cPath_gd[gd_frame])

        coefs_gd = np.flip(np.squeeze(wPath_gd[gd_frame]))
        coefs_sgd = np.flip(np.squeeze(wPath_sgd[sgd_frame]))

        poly_gd = np.poly1d(coefs_gd)
        poly_sgd = np.poly1d(coefs_sgd)

        line1.set_data(x,poly_gd(x));
        ax1.set_title('GD Iteration: %d - %.2f x^3 + %.2f x^2 + %.2f x + %.2f' % (i, coefs_gd[0], coefs_gd[1], coefs_gd[2], coefs_gd[3]))

        line2.set_data(x,poly_sgd(x));
        ax2.set_title('SGD Iteration: %d - %.2f x^3 + %.2f x^2 + %.2f x + %.2f' % (i, coefs_sgd[0], coefs_sgd[1], coefs_sgd[2], coefs_sgd[3]))

        return line1, line2

    plt.close()
    anim = matplotlib.animation.FuncAnimation(fig, animate, frames=num_frames, interval=100)
    # HTML(anim.to_html5_video())
    anim.save('multi_lin_reg_sgd.mp4')

Video('multi_lin_reg_sgd.mp4',width=800)
Pause-and-ponder: What do you observe? Go back and mess around with the setup and compare!
Note: Ask for a plot of the “error-in-each-dimension” like we discussed in class
fig = plt.figure(figsize=(15,7))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.plot(cPath_gd);
ax2.plot(cPath_sgd);
Mini-Batch GD
So far we've seen GD, which uses all N-many examples in each iteration, and SGD, which uses 1 example per iteration. Right in between is mini-batch GD, which uses a batch_size-many examples per iteration!
Typically, this batch_size could be anywhere from 2 to 512, sticking to powers of two, depending in the real world on the size of each example, how much memory you have, etc.
import inspect
print(inspect.getsource(cost))
print(inspect.getsource(cost_grad))
def cost(X, y, w):
    return np.linalg.norm(X.dot(w)-y,axis=0)**2/X.shape[0]

def cost_grad(X,y,w):
    N = X.shape[0]
    return (2/N)*(X.T.dot(X).dot(w)-X.T.dot(y))
Implementation
The implementation of mini-batch GD is:
The mini-batch GD algorithm is:
for each epoch:
- separate your data into N/batch_size-many batches
- for each batch in batches:
  - w_{t+1} = w_{t} - \eta \; \nabla C(\mathbf{w}_t, batch)
This looks familiar! And it should! This is exactly GD, if we only had N/batch_size-many examples! So the only difference between this algorithm and GD is that here, we process fewer examples before taking a gradient step!
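One practical aside before the implementation: slicing into int(N/batch_size) equal batches silently drops the last N % batch_size samples whenever batch_size does not divide N. A sketch of one way around that, np.array_split, on a throwaway index array (n_samples, bs, and idxs are made-up names):

n_samples, bs = 10, 3
idxs = np.arange(n_samples)

# int(n_samples/bs) equal slices drop the last n_samples % bs samples
print([idxs[b*bs:(b+1)*bs] for b in range(int(n_samples/bs))])

# np.array_split keeps every sample, allowing some batches to be slightly smaller
print(np.array_split(idxs, int(np.ceil(n_samples/bs))))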
def mini_batch_gd(dataset, labels, w_init=None, eta=None, batch_size=10, epochs=5, quiet=False):
    if w_init is None:
        w_init = np.random.rand(dataset.shape[1],1)
    if eta is None:
        eta = 1e-2

    # dataset
    N = dataset.shape[0]
    d = dataset.shape[1]

    # calculate our batches
    num_batches = int(N/batch_size)

    # clean up initial point sizes:
    w_init = np.asarray(w_init).reshape(d,1)

    # mini-batch params
    delta = 1e-5
    delta_w = 1
    iterations = 0

    # save weights and costs along path
    wPath_mbgd = []
    cPath_mbgd = []
    cPath_mbgd_epoch = []

    # take the first step
    w_curr = w_init

    # loop through all epochs
    for epoch in range(epochs):
        # loop through all batches
        for batch_num in range(num_batches):
            # get current batch
            batch_start = batch_num*batch_size
            batch_end = (batch_num+1)*batch_size
            batch = dataset[batch_start:batch_end,:]
            batch_labels = labels[batch_start:batch_end]

            # run gradient descent on the batch!
            w_curr,_,_ = gradient_descent(batch, batch_labels, w_init=w_curr, eta=eta, max_iter=1, quiet=True)

            wPath_mbgd.append(w_curr)
            cPath_mbgd.append(cost(batch,batch_labels,w_curr))
            #if not quiet: print(f"Batch:{batch_num+1}/{num_batches}, Loss:{cPath_mbgd[-1]}")
        if not quiet: print(f"Epoch:{epoch+1}/{epochs}, Loss:{cPath_mbgd[-1]}")

    # clean up path variables
    wPath_mbgd = np.squeeze(np.asarray(wPath_mbgd))
    cPath_mbgd = np.squeeze(np.asarray(cPath_mbgd))

    return w_curr, wPath_mbgd, cPath_mbgd
MBGD on Linear Regression
Lets use the setup function we defined above:
x,X,y,f = lin_reg()
plt.scatter(x,y);
w_gd, wPath_gd, cPath_gd = gradient_descent(X,y,quiet=True)
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y)
w_mbgd, wPath_mbgd, cPath_mbgd = mini_batch_gd(X,y,epochs=100,quiet=True,batch_size=1)
terminating SGD after 4 epochs with weight change of 0.01009.
Plot it
Lets take a look!
fig = plt.figure(figsize=(20,7))

ax1 = fig.add_subplot(131)
ax1.plot(cPath_gd);
ax1.set_title("Loss by iteration for GD")
ax1.set_xlabel("iteration");

ax2 = fig.add_subplot(132)
ax2.plot(cPath_sgd);
ax2.set_title('Loss by iteration for SGD');
ax2.set_xlabel("iteration");
ax2.set_ylabel("loss");

ax3 = fig.add_subplot(133)
ax3.plot(cPath_mbgd);
ax3.set_title('Loss by iteration for MBGD');
ax3.set_xlabel("iteration");
ax3.set_ylabel("loss");

fig = plt.figure(figsize=(15,7))

ax1 = fig.add_subplot(121)
ax1.plot(cPath_gd);
ax1.set_title("Loss by iteration for GD")
ax1.set_xlabel("iteration");

ax2 = fig.add_subplot(122)
ax2.plot(cPath_mbgd[0::10]);
ax2.set_title('Loss by EPOCH for MBGD');
ax2.set_xlabel("EPOCH");
ax2.set_ylabel("loss");
Lets take a look at the paths!
num_points = 300
w_0 = np.linspace(0.7,1.3,num_points)
w_1 = np.linspace(2.75,3.25,num_points)
W_0,W_1 = np.meshgrid(w_0,w_1)
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)
costs = cost(X,y,wgrid).reshape(W_0.shape)

plt.figure(figsize=(20,7));

# Gradient Descent
ax1 = plt.subplot(131)
ax1.contour(W_0, W_1, costs, cmap='inferno');
ax1.set_xlabel('w_0');
ax1.set_ylabel('w_1');
ax1.set_xlim([0.7,1.3])
ax1.set_ylim([2.75,3.25])
ax1.scatter(wPath_gd[:,0],wPath_gd[:,1], s=50,c=cPath_gd);
ax1.set_title("GD");

# SGD
ax2 = plt.subplot(132, sharex=ax1, sharey=ax1)
ax2.contour(W_0, W_1, costs, cmap='inferno');
ax2.scatter(wPath_sgd[:,0],wPath_sgd[:,1], c=range(len(wPath_sgd)));
ax2.set_title("SGD");

# MBGD
ax3 = plt.subplot(133, sharex=ax1, sharey=ax1)
ax3.contour(W_0, W_1, costs, cmap='inferno');
ax3.scatter(wPath_mbgd[:,0],wPath_mbgd[:,1], c=range(len(wPath_mbgd)));
ax3.set_title("Mini-batch GD");
Pause-and-ponder: What do you notice about the graphs above? What can you say about these graphs?
MBGD on Polynomial Regression
Lets run this on the polynomial problem
x,X,y,f = poly_reg()
plt.figure(figsize=(12,7))
plt.scatter(x,y);
plt.plot(x,f(x),'--r',alpha=0.5);
w_gd, wPath_gd, cPath_gd = gradient_descent(X,y,quiet=True)
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y)
w_mbgd, wPath_mbgd, cPath_mbgd = mini_batch_gd(X,y,epochs=100,quiet=True)
terminating SGD after 4 epochs with weight change of 0.01274.
Plot it!
fig = plt.figure(figsize=(20,7))

ax1 = fig.add_subplot(131)
ax1.plot(cPath_gd);
ax1.set_title("Loss by iteration for GD")
ax1.set_xlabel("iteration");

ax2 = fig.add_subplot(132)
ax2.plot(cPath_sgd);
ax2.set_title('Loss by iteration for SGD');
ax2.set_xlabel("iteration");
ax2.set_ylabel("loss");

ax3 = fig.add_subplot(133)
ax3.plot(cPath_mbgd[::10]);
ax3.set_title('Loss by iteration for MBGD');
ax3.set_xlabel("iteration");
ax3.set_ylabel("loss");
Now lets look at the loss by epoch for MBGD:
fig = plt.figure(figsize=(20,7))

ax1 = fig.add_subplot(131)
ax1.plot(cPath_gd);
ax1.set_title("Loss by iteration for GD")
ax1.set_xlabel("iteration");

ax2 = fig.add_subplot(132)
ax2.plot(cPath_sgd);
ax2.set_title('Loss by iteration for SGD');
ax2.set_xlabel("iteration");
ax2.set_ylabel("loss");

ax3 = fig.add_subplot(133)
ax3.plot(cPath_mbgd);
ax3.set_title('Loss by EPOCH for MBGD');
ax3.set_xlabel("EPOCH");
ax3.set_ylabel("loss");
How about those paths!
if regenerate_anim:
    fig = plt.figure(figsize=(30,7))

    ax1 = fig.add_subplot(131)
    ax1.scatter(x,y);

    ax2 = fig.add_subplot(132)
    ax2.scatter(x,y);

    ax3 = fig.add_subplot(133)
    ax3.scatter(x,y);

    line1, = ax1.plot([],[],'--r')
    line2, = ax2.plot([],[],'--r')
    line3, = ax3.plot([],[],'--r')

    num_frames = 50
    gd_frames = np.linspace(0,len(wPath_gd)-1,num_frames).astype(int)
    sgd_frames = np.linspace(0,len(wPath_sgd)-1,num_frames).astype(int)
    mbgd_frames = np.linspace(0,len(wPath_mbgd)-1,num_frames).astype(int)

    def animate(i):
        gd_frame = gd_frames[i]
        sgd_frame = sgd_frames[i]
        mbgd_frame = mbgd_frames[i]

        coefs_gd = np.flip(np.squeeze(wPath_gd[gd_frame]))
        coefs_sgd = np.flip(np.squeeze(wPath_sgd[sgd_frame]))
        coefs_mbgd = np.flip(np.squeeze(wPath_mbgd[mbgd_frame]))

        poly_gd = np.poly1d(coefs_gd)
        poly_sgd = np.poly1d(coefs_sgd)
        poly_mbgd = np.poly1d(coefs_mbgd)

        line1.set_data(x,poly_gd(x));
        ax1.set_title('GD Iteration: %d - %.2f x^3 + %.2f x^2 + %.2f x + %.2f' % (i, coefs_gd[0], coefs_gd[1], coefs_gd[2], coefs_gd[3]))

        line2.set_data(x,poly_sgd(x));
        ax2.set_title('SGD Iteration: %d - %.2f x^3 + %.2f x^2 + %.2f x + %.2f' % (i, coefs_sgd[0], coefs_sgd[1], coefs_sgd[2], coefs_sgd[3]))

        line3.set_data(x,poly_mbgd(x));
        ax3.set_title('MBGD Iteration: %d - %.2f x^3 + %.2f x^2 + %.2f x + %.2f' % (i, coefs_mbgd[0], coefs_mbgd[1], coefs_mbgd[2], coefs_mbgd[3]))

        return line1, line2, line3

    plt.close()
    anim = matplotlib.animation.FuncAnimation(fig, animate, frames=num_frames, interval=100)
    # HTML(anim.to_html5_video())
    anim.save('multi_lin_reg_mbgd.mp4')

Video('multi_lin_reg_mbgd.mp4',width=800)
Pause-and-ponder: What do you notice about these graphs? Reflect on it and answer here!
Momentum
Now we are finally armed with enough to consider modern modifications on these schemes! The first of which is adding some momentum to SGD.
The momentum-method (SGD+Momentum) algorithm is given by:
\begin{align} z_{k+1} &= \beta z_k + \nabla f(w_k) \\ w_{k+1} &= w_k - \eta z_{k+1} \end{align}
Its not as complicated as you might first think! Lets quickly discover a special case:
Pause-and-ponder: What do you get when you set the momentum parameter \beta = 0?
Quickly, we should see that we recover the original gradient descent iterates. So what does \beta \neq 0 give us? Well, in optimization this is called acceleration.
It gives our iterates some “memory”. Lets write out a few steps to see what we mean:
\begin{align} w_1 &= w_0 - \eta z_1, \quad\quad z_1 = \nabla f(w_0) \\ w_2 &= w_1 - \eta z_2, \quad\quad z_2 = \beta\cdot\nabla f(w_0) + \nabla f(w_1) \\ w_3 &= w_2 - \eta z_3, \quad\quad z_3 = \beta \cdot z_2 + \nabla f(w_2) = \beta \cdot (\beta\cdot\nabla f(w_0) + \nabla f(w_1)) + \nabla f(w_2)\\ \ldots \end{align}
It's named momentum by analogy to the concept of momentum in physics. We can think of the weight vector w as a particle traveling through parameter space. This particle incurs some acceleration from the gradient of the loss (which acts as the "force" on the particle).
This acceleration tends to keep the weight vector traveling in the same direction as it was, and can prevent it from succumbing to unwanted oscillations.
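To get a feel for what that memory does, here is a tiny toy comparison of plain GD and GD+momentum on the 1-D quadratic f(w) = w^2 (purely illustrative: the step size, momentum value, starting point, and iteration count are arbitrary choices, and eta_toy, beta_toy, w_plain, w_mom, z are throwaway names):

# plain GD vs GD+momentum on the toy cost f(w) = w^2, whose gradient is 2w
eta_toy, beta_toy = 0.01, 0.9
w_plain = w_mom = 10.0        # same starting point for both runs
z = 0.0                       # the momentum "memory"

for t in range(50):
    w_plain = w_plain - eta_toy*(2*w_plain)   # plain gradient descent step

    z = beta_toy*z + 2*w_mom                  # momentum update from above
    w_mom = w_mom - eta_toy*z

print(abs(w_plain), abs(w_mom))   # how far each final iterate is from the minimum at w = 0

With these (arbitrary) settings, the momentum run typically ends up noticeably closer to the minimum, at the cost of overshooting and oscillating around it along the way.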
Implementation
def sgd_momentum(dataset, labels, w_init=None, eta=None, beta=None, epochs=5):
    if w_init is None:
        w_init = np.random.rand(dataset.shape[1],1)
    if eta is None:
        eta = 1e-2
    if beta is None:
        beta = 0.1

    # dataset
    N = dataset.shape[0]
    d = dataset.shape[1]

    # clean up initial point sizes:
    w_init = np.asarray(w_init).reshape(d,1)

    # M-SGD params
    delta = 1e-10
    delta_w = 1
    max_iter = 1000
    iterations = 0

    # save weights and costs along path
    wPath = [w_init]

    # calc cost at that initial value
    idx = np.random.randint(N)
    x_i = dataset[np.newaxis,idx,:].T
    y_i = labels[idx,:]
    cPath = [cost_sgd(x_i,y_i,w_init)]

    # take first step
    w_old = w_init
    z_old = 0

    # epochs of training
    for epoch in range(epochs):
        # shuffle the dataset - note we do this by shuffling indices
        indices = np.random.permutation(N)
        shuffled_dataset = dataset[indices,:]
        shuffled_labels = labels[indices,:]

        for i in range(N):
            # np.newaxis is there to deal with the 1 for the row size
            curr_sample = shuffled_dataset[i,:]
            curr_sample = curr_sample[np.newaxis].T
            curr_label = shuffled_labels[i,:]

            # update your weights with the grad at one sample
            grad = cost_grad_sgd(curr_sample,curr_label,w_old)
            z_new = beta*z_old + grad
            w_new = w_old - eta * z_new

            # update paths
            wPath.append(w_new)
            cPath.append(cost_sgd(curr_sample,curr_label,w_new))

            # calculate weight change
            delta_w = np.linalg.norm(w_new - w_old)

            # make your previous new weights the current weights
            w_old = w_new
            z_old = z_new

        # at end of epoch, check to see if you should terminate
        if delta_w < delta:
            break

    # clean up path variables
    wPath = np.squeeze(np.asarray(wPath))
    cPath = np.squeeze(np.asarray(cPath))

    print(f"terminating M-SGD after {epoch} epochs with weight change of {delta_w:0.5f}.")
    return w_new, wPath, cPath
M-SGD on Linear Regression
def lin_reg():
    def f(x):
        return 3*x + 1

    N = 100
    x = np.linspace(-2,2,N)
    y = f(x) + np.random.randn(N) # our zero mean gaussian noise

    X = np.ones((N,2))
    X[:,1] = x

    y = y[:,np.newaxis]

    return x,X,y,f

x,X,y,f = lin_reg()
plt.scatter(x,y);
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=[0.75,2.5],eta=1e-2,epochs=100)
w_sgd_m, wPath_sgd_m, cPath_sgd_m = sgd_momentum(X,y,w_init=[0.75,2.5],eta=1e-5,epochs=100,beta=0.9)
terminating SGD after 99 epochs with weight change of 0.00242.
terminating M-SGD after 99 epochs with weight change of 0.00006.
Plot it!
plot_path(wPath_sgd_m,cPath_sgd_m,colorby='iter')
Comparison
w_init=[0,2]
eta=1e-3
beta=0.9

w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=w_init,eta=eta,epochs=100)
w_sgd_m, wPath_sgd_m, cPath_sgd_m = sgd_momentum(X,y,w_init=w_init,eta=eta,beta=beta,epochs=100)

plot_path(wPath_sgd,cPath_sgd);
plot_path(wPath_sgd_m,cPath_sgd_m);
terminating SGD after 99 epochs with weight change of 0.00094.
terminating M-SGD after 99 epochs with weight change of 0.00831.
Plot Paths!
w_init=[0,2]
eta=1e-4
beta=0.99

w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=w_init,eta=eta,epochs=100)
w_sgd_m, wPath_sgd_m, cPath_sgd_m = sgd_momentum(X,y,w_init=w_init,eta=eta,beta=beta,epochs=100)
terminating SGD after 99 epochs with weight change of 0.00001.
terminating M-SGD after 99 epochs with weight change of 0.00168.
# plot
num_points = 300

x_l = 0.5
x_r = 1.5
y_l = 2.5
y_r = 3.5

w_0 = np.linspace(x_l,x_r,num_points)
w_1 = np.linspace(y_l,y_r,num_points)
W_0,W_1 = np.meshgrid(w_0,w_1)
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)
costs = cost(X,y,wgrid).reshape(W_0.shape)

plt.figure(figsize=(20,7));

# SGD
ax1 = plt.subplot(121)
ax1.contour(W_0, W_1, costs, cmap='inferno');
ax1.set_xlabel('w_0');
ax1.set_ylabel('w_1');
ax1.set_xlim([x_l,x_r])
ax1.set_ylim([y_l,y_r])
ax1.scatter(wPath_sgd[:,0],wPath_sgd[:,1], s=50,c=cPath_sgd);
ax1.set_title("SGD");

# MSGD
ax2 = plt.subplot(122, sharex=ax1, sharey=ax1)
ax2.contour(W_0, W_1, costs, cmap='inferno');
ax2.scatter(wPath_sgd_m[:,0],wPath_sgd_m[:,1], c=range(len(wPath_sgd_m)));
ax2.set_title("M-SGD");
Pause-and-ponder: What do you notice? Go back and change the \eta and \beta parameters above and jot your observations down here!
GD + Momentum
See-and-do: Go back, and add some momentum to our stock GD implementation, and make your comparisons!
def gd_momentum():
    # EDIT HERE!
    return
MBGD + Momentum
See-and-do: Go back, and add some momentum to our Mini-batch GD implementation, and make your comparisons!
def mbgd_momentum():
    # EDIT HERE!
    return
Unified Implementation
So far, we’ve implemented the following:
- Gradient Descent
- Stochastic Gradient Descent
- Mini-batch GD
What about “Mini-batch SGD”? How would you implement it? What would you expect it to behave like?
To the methods above, we can imagine adding some momentum, giving us:
- GD + Momentum
- Stochastic Gradient Descent + Momentum
- Mini-batch GD + Momentum
Of which we’ve only given one, but tasked you with the rest.
Armed with everything you know, you should now be able to give me one function which can do all of these things! Among other things, this function should take a batch_size, an epochs, an eta, and a beta parameter.
def gd_optimizer():
    # EDIT HERE
    return
RMS Prop
Another extension which aims to curb our step sizes in "bad directions" is RMSProp, which stands for Root Mean Squared Propagation. The name makes it sound more complicated than it is! Its updates are:
\begin{align} v_{t+1} &= \gamma v_{t} + (1-\gamma)\left(\nabla f(w_t)\right)^2 \\ w_{t+1} &= w_{t} - \frac{\eta}{\sqrt{v_{t+1}}}\nabla f(w_t) \end{align}
Where \gamma is called the forgetting factor.
This algorithm scales the step size / learning rate differently for each parameter, and it does so automatically, through the running average v of the squared gradients!
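To see that per-parameter scaling in isolation, here is a single RMSProp-style update on a made-up gradient vector (purely illustrative; all the *_demo names are throwaways):

# one RMSProp-style update, starting from v_0 = 0, on a made-up gradient
eta_demo, gamma_demo, eps_demo = 0.1, 0.9, 1e-8
grad_demo = np.array([10.0, 0.1])                    # one large and one small gradient component

v_demo = (1 - gamma_demo)*grad_demo**2               # v_1 = gamma*0 + (1-gamma)*grad^2
step_demo = eta_demo*grad_demo/np.sqrt(v_demo + eps_demo)
print(step_demo)                                     # roughly [0.316, 0.316]: both coordinates move by about the same amount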
Factoid: This algorithm was first proposed by Geoff Hinton while teaching a Coursera course!
Implementation
def mini_batch_rmsprop(dataset, labels, w_init=None, eta=None, gamma=None, batch_size=10, epochs=5, quiet=False):
    if w_init is None:
        w_init = np.random.rand(dataset.shape[1],1)
    if eta is None:
        eta = 1e-2
    if gamma is None:
        gamma = 0.9

    # dataset
    N = dataset.shape[0]
    d = dataset.shape[1]

    num_batches = int(N/batch_size)

    # clean up initial point sizes:
    w_init = np.asarray(w_init).reshape(d,1)

    # mini-batch params
    delta = 1e-5
    delta_w = 1
    iterations = 0

    # save weights and costs along path
    wPath_rms = []
    cPath_rms = []

    # take the first step
    w_old = w_init
    v_old = 0

    for epoch in range(epochs):
        for batch_num in range(num_batches):
            # get current batch
            batch_start = batch_num*batch_size
            batch_end = (batch_num+1)*batch_size
            batch = dataset[batch_start:batch_end,:]
            batch_labels = labels[batch_start:batch_end]

            # update your weights
            grad = cost_grad(batch,batch_labels,w_old)

            v_new = gamma*v_old + (1-gamma)*grad**2
            w_new = w_old - (eta/np.sqrt(v_new + np.random.rand()))*grad

            # update old weights
            w_old = w_new
            v_old = v_new

            # append trajectory
            wPath_rms.append(w_new)
            cPath_rms.append(cost(batch,batch_labels,w_new))
        if not quiet: print(f"Epoch:{epoch+1}/{epochs}, Loss:{cPath_rms[-1]}")

    # clean up path variables
    wPath_rms = np.squeeze(np.asarray(wPath_rms))
    cPath_rms = np.squeeze(np.asarray(cPath_rms))

    return w_new, wPath_rms, cPath_rms
Plot it!
x,X,y,f = lin_reg()
plt.scatter(x,y);

gamma = 0.8
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=[0.75,2.5],eta=1e-2,epochs=100)
w_rms, wPath_rms, cPath_rms = mini_batch_rmsprop(X,y,w_init=[0.75,2.5],eta=1e-2,gamma=gamma,epochs=100,quiet=True)
terminating SGD after 99 epochs with weight change of 0.01311.
w_init=[0,2]
eta=1e-2
gamma=0.8

w_mbgd, wPath_mbgd, cPath_mbgd = mini_batch_gd(X,y,epochs=100,quiet=True)
w_sgd, wPath_sgd, cPath_sgd = sgd(X,y,w_init=w_init,eta=eta,epochs=100)
w_rms, wPath_rms, cPath_rms = mini_batch_rmsprop(X,y,w_init=w_init,eta=eta,gamma=gamma,epochs=100,quiet=True)

plot_path(wPath_mbgd,cPath_mbgd);
plot_path(wPath_sgd,cPath_sgd);
plot_path(wPath_rms,cPath_rms);
terminating SGD after 99 epochs with weight change of 0.04209.
Adam
Finally - we’ve made it! A modern algorithm! At a very basic level, ADAM is just momentum+rmsprop! You could implement it right now!
Before we give its update, lets explain what ADAM stands for: ADAptive Moment estimation. Well, if you think of the gradient as the thing we are estimating, then its average over the dataset is the "first moment", or the mean, and the average of its square is the (uncentered) "second moment".
So by combining momentum, which deals with the first moment, and RMSProp, which deals with the second moment, we have ADAM.
The ADAM algorithm updates are:
\require{cancel} \begin{align} z_{t+1} &= \frac{\beta_1 z_t + (1-\beta_1)\nabla f(w_t)}{\color{red}{\cancel{1-\beta_1^t}}} \\ v_{t+1} &= \frac{\beta_2 v_{t} + (1-\beta_2)\left(\nabla f(w_t)\right)^2}{\color{red}{\cancel{1-\beta_2^t}}} \\ w_{t+1} &= w_{t} - \eta\frac{z_{t+1}}{\color{red}{\sqrt{v_{t+1}} + \epsilon}}\color{red}{\cancel{\nabla f(w_t)}} \end{align}
Notes:
- Let \epsilon be a small, but fixed number like 10^{-8}.
- We are giving it in the "usual" convention in terms of parameters, so \beta_1 controls the amount of "momentum", while \beta_2 controls the amount of "RMSProp".
- We are also giving the usual ADAM implementation that includes the bias correction of dividing by \cancel{1-\beta^t}. As t grows, this corrective term disappears.
- "In the field", typical values for \beta_1,\beta_2 are: \beta_1=0.9, \beta_2=0.999, and \eta still needs to be tuned.
- Note: You should still try different values here! These are "in the field" values for "real" cost functions.
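To see what the bias correction is doing, take the very first step, which starts from z_0 = 0 (the same reasoning applies to v and \beta_2):
\begin{align} z_1 &= \beta_1 z_0 + (1-\beta_1)\nabla f(w_0) = (1-\beta_1)\nabla f(w_0) \\ \frac{z_1}{1-\beta_1^1} &= \nabla f(w_0) \end{align}
Without the correction, the first few z's (and v's) are shrunk toward zero simply because they start at zero; dividing by 1-\beta^t undoes exactly that early shrinkage, and since \beta^t \to 0, the correction fades away as t grows.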
def mini_batch_adam(dataset, labels, w_init=None, eta=None, beta1=0.9, beta2=0.999, batch_size=10, epochs=5, quiet=False):
    if w_init is None:
        w_init = np.random.rand(dataset.shape[1],1)
    if eta is None:
        eta = 1e-2
    epsilon = 1e-8

    # dataset
    N = dataset.shape[0]
    d = dataset.shape[1]

    num_batches = int(N/batch_size)

    # clean up initial point sizes:
    w_init = np.asarray(w_init).reshape(d,1)

    # mini-batch params
    delta = 1e-5
    delta_w = 1
    iterations = 0

    # save weights and costs along path
    wPath_adam = []
    cPath_adam = []

    # take the first step
    w_old = w_init
    z_old = 0
    v_old = 0

    t = 1
    for epoch in range(epochs):
        for batch_num in range(num_batches):
            # get current batch
            batch_start = batch_num*batch_size
            batch_end = (batch_num+1)*batch_size
            batch = dataset[batch_start:batch_end,:]
            batch_labels = labels[batch_start:batch_end]

            # update your weights
            grad = cost_grad(batch,batch_labels,w_old)

            z_new = beta1*z_old + (1-beta1)*grad
            z_corr = z_new/(1-beta1**t)
            v_new = beta2*v_old + (1-beta2)*grad**2
            v_corr = v_new/(1-beta2**t)
            w_new = w_old - eta*z_corr/(np.sqrt(v_corr) + epsilon)

            # update old weights
            w_old = w_new
            z_old = z_new
            v_old = v_new

            # increment t
            t += 1

            # append trajectory
            wPath_adam.append(w_new)
            cPath_adam.append(cost(batch,batch_labels,w_new))
        if not quiet: print(f"Epoch:{epoch+1}/{epochs}, Loss:{cPath_adam[-1]}")

    # clean up path variables
    wPath_adam = np.squeeze(np.asarray(wPath_adam))
    cPath_adam = np.squeeze(np.asarray(cPath_adam))

    return w_new, wPath_adam, cPath_adam
x,X,y,f = lin_reg()
plt.scatter(x,y);

w_init=[0.1,2.1]
eta=1e-2

w_adam, wPath_adam, cPath_adam = mini_batch_adam(X,y,w_init=w_init,eta=eta,epochs=100,quiet=True)
# plotting
len_w = 100
w_0 = np.linspace(0,2,len_w)
w_1 = np.linspace(2,4,len_w)
W_0, W_1 = np.meshgrid(w_0, w_1)

# combine and reshape mesh for calculation
wgrid = np.array([W_0,W_1]).reshape(2, len(w_0)**2)

# calculate on grid, and reshape back for plotting
costs = cost(X,y,wgrid).reshape(W_0.shape)

fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot()
ax1.set_xlabel('w_0')
ax1.set_ylabel('w_1')
# ax1.set_zlabel('cost')
# ax1.set_zlim([-2,3])
ax1.scatter(wPath_adam[:,0],wPath_adam[:,1],s=10,c=cPath_adam,cmap='inferno');
# ax1.plot(wPath_adam[:,0],wPath_adam[:,1]);
# ax1.plot_surface(W_0, W_1, costs,cmap='inferno');
ax1.contour(W_0, W_1, costs, cmap='inferno',levels=20,alpha=0.5);