Appendix F: Optimization#
Mahmood Amintoosi, Spring 2024
Computer Science Dept, Ferdowsi University of Mashhad
Notebook Introduction#
In this short notebook, we will see how to use the gradient obtained with Autograd to perform optimization of an objective function.
Then we will also present some off-the-shelf PyTorch optimizers and learning rate schedulers.
As eye candy, we will finish with some live optimization visualizations.
Google Colab only!#
# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/neural-networks'):
!git clone -q https://github.com/fum-cs/neural-networks.git /content/neural-networks
!pip --quiet install -r /content/neural-networks/requirements_colab.txt
%cd neural-networks/notebooks
# execute only if you're using Google Colab
# !wget -q https://raw.githubusercontent.com/ahug/amld-pytorch-workshop/master/binder/requirements.txt -O requirements.txt
# !pip install -qr requirements.txt
# !wget -q https://raw.githubusercontent.com/ahug/amld-pytorch-workshop/master/live_plot.py -O live_plot.py
import torch
import numpy as np
from utils import live_plot
%matplotlib notebook
torch.set_printoptions(precision=3)
Optimizing “by hand”#
We will start with a simple example: minimizing the square function.
def f(x):
    return x ** 2
We will minimize the function \(f\) “by hand” using the gradient descent algorithm.
As a reminder, the update step of the algorithm is: \(x_{t+1} = x_{t} - \lambda \nabla_x f(x_t)\)
Note:
- The gradient information \(\nabla_x f(x)\) is stored in `x.grad`. Once we have run the `backward` function, we can use it to do our update step.
- We need to do `x.data = ...` in the update step since we want to change `x` in place but don't want autograd to track this change.
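For instance, with \(x_0 = 8\) and \(\lambda = 0.1\), the first step gives \(\nabla_x f(x_0) = 2 \cdot 8 = 16\), so \(x_1 = 8 - 0.1 \cdot 16 = 6.4\) and \(f(x_1) = 40.96\), which is the second value printed below.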
# YOUR TURN
x0 = 8
lr = 0.1
iterations = 30
x = torch.Tensor([x0]).requires_grad_()
y = f(x)
for i in range(iterations):
    y = f(x)
    y.backward()
    x.data = x - lr * x.grad
    x.grad.zero_()  # Try commenting that line!
    print(y.data)
tensor([64.])
tensor([40.960])
tensor([26.214])
tensor([16.777])
tensor([10.737])
tensor([6.872])
tensor([4.398])
tensor([2.815])
tensor([1.801])
tensor([1.153])
tensor([0.738])
tensor([0.472])
tensor([0.302])
tensor([0.193])
tensor([0.124])
tensor([0.079])
tensor([0.051])
tensor([0.032])
tensor([0.021])
tensor([0.013])
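If you do comment out the `x.grad.zero_()` line, the gradients from successive `backward()` calls are accumulated into `x.grad` instead of replacing each other, so each update uses the sum of all past gradients and the descent behaves erratically. A minimal sketch of the accumulation itself (reusing the `f` defined above):

x = torch.Tensor([8.0]).requires_grad_()

f(x).backward()
print(x.grad)  # tensor([16.]) -- gradient of x**2 at x = 8

f(x).backward()
print(x.grad)  # tensor([32.]) -- the new gradient was added to the previous one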
Why do we have x.data?#
If you do `x = ...`, then `x` is not a leaf variable anymore and has a computation history. Since it is no longer a leaf after the first iteration, its gradient will not be available at the second iteration.
Workarounds:
- Define `x` as a new leaf variable requiring gradient at each iteration, using `detach()` and `requires_grad_()` (see the sketch below)
- Update `x.data` so that the change is not recorded by autograd
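As a rough sketch of the first workaround, re-creating `x` as a fresh leaf tensor at every iteration (assuming `f`, `x0`, `lr`, and `iterations` are still defined from the cells above):

x = torch.Tensor([x0]).requires_grad_()

for i in range(iterations):
    y = f(x)
    y.backward()
    # Build the updated value, cut it off from the computation history,
    # and mark it as a new leaf that requires gradients again.
    x = (x - lr * x.grad).detach().requires_grad_()
    print(y.data)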
Optimizing with an optimizer#
Different optimizers#
PyTorch provides the most common optimization algorithms encapsulated in "optimizer classes".
An optimizer is a useful object that automatically loops through all the (potentially numerous) parameters of your model and performs the (potentially complex) update step for you.
You first need to import the module: `import torch.optim as optim`.
import torch.optim as optim
Below are the most commonly used optimizers. Each of them has its own specific parameters that you can check in the PyTorch documentation.
parameters = [x] # This should be the list of model parameters
optimizer = optim.SGD(parameters, lr=0.01, momentum=0.9)
optimizer = optim.Adam(parameters, lr=0.01)
optimizer = optim.Adadelta(parameters, lr=0.01)
optimizer = optim.Adagrad(parameters, lr=0.01)
optimizer = optim.RMSprop(parameters, lr=0.01)
optimizer = optim.LBFGS(parameters, lr=0.01)  # Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm
# and there is more ...
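Whichever optimizer you pick, its hyperparameters are stored in `optimizer.param_groups`, which you can inspect (and even modify) after construction. A small illustrative snippet, reusing the `parameters` list from above:

optimizer = optim.SGD(parameters, lr=0.01, momentum=0.9)

# Each parameter group records the hyperparameters used for its tensors.
print(optimizer.param_groups[0]['lr'])        # 0.01
print(optimizer.param_groups[0]['momentum'])  # 0.9

# They can also be changed on the fly, e.g. to halve the learning rate by hand.
optimizer.param_groups[0]['lr'] *= 0.5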
Using an optimizer#
Now, let's use an optimizer to do the optimization!
You will need 2 new functions:
- `optimizer.zero_grad()`: this function sets the gradient of the parameters (`x` here) to 0 (otherwise it will get accumulated)
- `optimizer.step()`: this function applies an update step
# YOUR TURN
x0 = 8
lr = 0.01
iterations = 10
x = torch.Tensor([x0]).requires_grad_()
y = f(x)
# Define your optimizer
parameters = [x] # This should be the list of model parameters
optimizer = optim.SGD(parameters, lr=lr, momentum=0.9)
for i in range(iterations):
    optimizer.zero_grad()
    y = f(x)
    y.backward()
    optimizer.step()
    print(y.data)
tensor([64.])
tensor([61.466])
tensor([56.840])
tensor([50.662])
tensor([43.507])
tensor([35.934])
tensor([28.444])
tensor([21.452])
tensor([15.268])
tensor([10.096])
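Since the loop only talks to the optimizer through `zero_grad()` and `step()`, swapping the algorithm is a one-line change. An illustrative variant using Adam on the same function (the learning rate `0.1` here is just a choice for the example):

x = torch.Tensor([x0]).requires_grad_()
optimizer = optim.Adam([x], lr=0.1)  # only the optimizer line changes

for i in range(iterations):
    optimizer.zero_grad()
    y = f(x)
    y.backward()
    optimizer.step()
    print(y.data)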
Learning rate scheduler#
Learning rate schedulers seek to adjust the learning rate during training by reducing it according to a pre-defined schedule.
Below are some of the schedulers available in PyTorch.
optim.lr_scheduler.LambdaLR
optim.lr_scheduler.ExponentialLR
optim.lr_scheduler.MultiStepLR
optim.lr_scheduler.StepLR
# and some more ...
torch.optim.lr_scheduler.StepLR
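All of these wrap an existing optimizer and are advanced with `scheduler.step()`, typically once per epoch. As a rough sketch with `StepLR`, which multiplies the learning rate by `gamma` every `step_size` calls (the values below are just for illustration):

w = torch.zeros(1, requires_grad=True)  # dummy parameter, just to build an optimizer
optimizer = optim.SGD([w], lr=0.5)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for i in range(15):
    # ... zero_grad() / backward() / optimizer.step() would go here in a real loop ...
    scheduler.step()
    print(optimizer.param_groups[0]['lr'])  # halved every 5 calls: 0.5, 0.25, 0.125, 0.0625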
Let's try `optim.lr_scheduler.ExponentialLR`:
def f(x):
    return x.abs() * 5
x0 = 8
lr = 0.5
iterations = 15
x = torch.Tensor([x0]).requires_grad_()
optimizer = optim.SGD([x], lr=lr)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, 0.8)
for i in range(iterations):
    optimizer.zero_grad()
    y = f(x)
    y.backward()
    optimizer.step()
    scheduler.step()
    print(y.data, " | lr : ", optimizer.param_groups[0]['lr'])
tensor([40.]) | lr : 0.4
tensor([27.500]) | lr : 0.32000000000000006
tensor([17.500]) | lr : 0.25600000000000006
tensor([9.500]) | lr : 0.20480000000000007
tensor([3.100]) | lr : 0.16384000000000007
tensor([2.020]) | lr : 0.13107200000000005
tensor([2.076]) | lr : 0.10485760000000005
tensor([1.201]) | lr : 0.08388608000000004
tensor([1.421]) | lr : 0.06710886400000003
tensor([0.677]) | lr : 0.05368709120000003
tensor([1.001]) | lr : 0.042949672960000025
tensor([0.341]) | lr : 0.03435973836800002
tensor([0.733]) | lr : 0.027487790694400018
tensor([0.126]) | lr : 0.021990232555520017
tensor([0.561]) | lr : 0.017592186044416015
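Reading the current learning rate through `optimizer.param_groups[0]['lr']` works; recent PyTorch versions also expose it on the scheduler itself via `get_last_lr()` (a small illustrative check):

print(scheduler.get_last_lr())  # a one-element list, matching the last lr printed above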
Live Plots#
Below are some live plots to see what actually happens when you optimize a function.
You can play with learning rates, optimizers, and also define new functions to optimize!
2D Plot - Optimization process#
from live_plot import init_2dplot, add_point_2d
def function_2d(x):
    return x ** 2 / 20 + x.sin().tanh()
x0 = 8
lr = 3
iterations = 15
x_range = torch.arange(-10, 10, 0.1)
init_2dplot(x_range, function_2d, delta_=0.5)
x = torch.Tensor([x0]).requires_grad_()
optimizer = torch.optim.Adam([x], lr=lr)
for i in range(iterations):
    optimizer.zero_grad()
    f = function_2d(x)
    f.backward()
    add_point_2d(x, f)
    optimizer.step()
3D Plot - Optimization process#
from live_plot import init_3dplot, add_point_3d
Choose one of the functions below and run the corresponding cell
elev, azim = 40, 250
x0, y0 = 6, 0
x_range = torch.arange(-10, 10, 1).float()
y_range = torch.arange(-15, 10, 2).float()
def function_3d(x, y):
    return x ** 2 - y ** 2
elev, azim = 30, 130
x0, y0 = 10, -4
x_range = torch.arange(-10, 15, 1).float()
y_range = torch.arange(-15, 10, 2).float()
def function_3d(x, y):
    return x ** 3 - y ** 3
elev, azim = 80, 130
x0, y0 = 4, -5
x_range = torch.arange(-10, 10, .5).float()
y_range = torch.arange(-10, 10, 1).float()
def function_3d(x, y):
    return (x ** 2 + y ** 2).sqrt().sin()
elev, azim = 37, 120
x0, y0 = 6, -15
x_range = torch.arange(-10, 12, 1).float()
y_range = torch.arange(-25, 5, 1).float()
# lr 0.15 momentum 0.5
def function_3d(x, y):
    return (x ** 2 / 20 + x.sin().tanh()) * (y.abs()) ** 1.2 + 5 * x.abs() + (y + 7)**2 / 10
Optimize the function
init_3dplot(x_range, y_range, function_3d, elev, azim, delta_=0.1)
#x0 =
#y0 =
lr = 0.01
iterations = 15
x = torch.Tensor([x0]).requires_grad_()
y = torch.Tensor([y0]).requires_grad_()
optimizer = torch.optim.SGD([x, y], lr=lr)
for i in range(iterations):
    optimizer.zero_grad()
    f = function_3d(x, y)
    f.backward()
    add_point_3d(x, y, f)
    optimizer.step()