Pytorch Basics II
Working with Gradients
Weights and biases of individual neurons are determined during the training process.
Regression: the simplest neural network. It tries to find the best-fit line through the data.
y = Wx + b
- Minimize the sum of squares of the distance of the points from the regression line.
The actual training of a neural network happens via Gradient Descent Optimization.
MSE = Mean Squared Error. A metric to be minimized during the training of a regression model: the mean of the squared losses over all training points.
Loss = y_predicted - y_actual
Where y_predicted is the value of y that the model outputs for a given x, and y_actual is the actual label, available in the training data.
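As a small illustration of the metric described above, the following sketch computes the mean squared error of a line y = Wx + b in plain Python. The data points and the values tried for W and b are made up for this example and are not part of the original notes.

```python
# Hypothetical toy data: the points lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys_actual = [1.0, 3.0, 5.0, 7.0]

def predict(x, W, b):
    """Linear model y = Wx + b."""
    return W * x + b

def mse(xs, ys_actual, W, b):
    """Mean of the squared differences between predictions and labels."""
    squared_errors = [(predict(x, W, b) - y) ** 2 for x, y in zip(xs, ys_actual)]
    return sum(squared_errors) / len(squared_errors)

perfect_fit = mse(xs, ys_actual, W=2.0, b=1.0)  # the line that generated the data
poor_fit = mse(xs, ys_actual, W=1.0, b=0.0)     # a line that misses every point
print(perfect_fit, poor_fit)
```

The line that generated the data has zero error, while the poorer line has a larger MSE; training searches for the W and b that make this number as small as possible.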
There are three ways to calculate gradients:
- Symbolic Differentiation – Conceptually simple but hard to implement
- Numeric Differentiation – Easy to implement but won't scale
- Automatic Differentiation – Conceptually difficult but easy to implement
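To make the trade-off of numeric differentiation concrete, here is a minimal sketch that approximates a gradient with central differences. The loss function and the step size h are assumptions chosen only for illustration.

```python
def loss(params):
    """Hypothetical loss f(w, b) = (3w + b - 5)^2, chosen only for illustration."""
    w, b = params
    return (3.0 * w + b - 5.0) ** 2

def numeric_gradient(f, params, h=1e-5):
    """Approximate each partial derivative with a central difference.

    Every parameter needs two extra evaluations of f, which is why this
    approach becomes expensive for models with many parameters.
    """
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += h
        minus[i] -= h
        grads.append((f(plus) - f(minus)) / (2.0 * h))
    return grads

approx = numeric_gradient(loss, [1.0, 1.0])
# Analytic gradient for comparison: df/dw = 6(3w + b - 5), df/db = 2(3w + b - 5)
exact = [6.0 * (3.0 + 1.0 - 5.0), 2.0 * (3.0 + 1.0 - 5.0)]
print(approx, exact)
```

The approximation is close to the analytic gradient here, but the cost grows linearly with the number of parameters, which matches the remark that this approach is easy to implement but does not scale.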
PyTorch and TensorFlow rely on automatic differentiation.
In PyTorch, the package used to calculate gradients for backpropagation is Autograd.
- Optimizer uses the error function and tweaks the model parameters to minimize error.
- Backward pass: calculates the gradients that are used to update parameter values.
- Backpropagation is implemented using a technique called reverse-mode auto-differentiation.
- Gradient – vector of partial derivatives of the loss with respect to the parameters; these gradients apply to a specific time step t.
- Parameters(t+1) = Parameters(t) - learning_rate * Gradient(t)
- For the next time step: update parameter values. Move each parameter value in the direction that reduces the loss, that is, against the gradient.
- Learning rate is the size of the step taken in that direction. A small learning rate takes small steps toward the minimum of the loss, so the model takes longer to train and converge. A large learning rate takes bigger steps and trains faster, but it can overshoot the minimum.
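The update rule in the list above can be sketched in plain Python for the regression model y = Wx + b. The training data, the starting values, the learning rate, and the number of steps are all assumptions made for this example.

```python
# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def mse_and_gradients(W, b):
    """Return the MSE loss and its partial derivatives with respect to W and b."""
    n = len(xs)
    errors = [(W * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    grad_W = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2.0 * e for e in errors) / n
    return loss, grad_W, grad_b

W, b = 0.0, 0.0
learning_rate = 0.05
initial_loss, _, _ = mse_and_gradients(W, b)

for step in range(2000):
    loss, grad_W, grad_b = mse_and_gradients(W, b)
    # Parameters(t+1) = Parameters(t) - learning_rate * Gradient(t)
    W = W - learning_rate * grad_W
    b = b - learning_rate * grad_b

final_loss, _, _ = mse_and_gradients(W, b)
print(W, b, final_loss)
```

With this learning rate the parameters settle close to W = 2 and b = 1. In this sketch the gradients are written out by hand; in PyTorch, Autograd produces them.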
Automatic Differentiation
Reverse-mode auto-differentiation
- Used in PyTorch and TensorFlow
- Two passes in each training step
  - Forward step: Calculate loss
  - Backward step: Calculate gradients, which are then used to update parameter values
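The two passes can be illustrated with a very small reverse-mode sketch in plain Python. The Value class below is a hypothetical helper written for this example; it is not how PyTorch implements Autograd, and it only supports addition and multiplication of scalars.

```python
class Value:
    """A scalar that records how it was computed (hypothetical helper, not a PyTorch class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Value objects this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        """Walk the graph from the output back to the inputs, applying the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# Forward pass: loss = w * x + b
w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = w * x + b

# Backward pass: one traversal yields the gradient for every input
loss.backward()
print(loss.data, w.grad, x.grad, b.grad)
```

A single backward traversal fills in the gradient of every input at once, which is what makes the reverse mode practical for networks with many parameters.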
Automatic Differentiation:
- Relies on a mathematical trick
- Based on Taylor's Series Expansion
- Allows fast calculation of gradients
- It can be performed in two modes: Reverse and Forward mode
  - Forward-mode is similar to numeric differentiation in cost: it requires one pass per parameter and will not scale to complex networks.
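The cost of the forward mode can be seen in a short sketch that uses dual numbers, a common way to present forward-mode differentiation. The Dual class and the function f are assumptions made for this example, and only addition and multiplication are supported.

```python
class Dual:
    """A number carrying its value and its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def f(w, x, b):
    """Hypothetical function to differentiate: f = w * x + b."""
    return w * x + b

def forward_mode_gradient(func, inputs):
    """Run one forward pass per input, seeding that input's derivative with 1."""
    grads = []
    for i in range(len(inputs)):
        duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        grads.append(func(*duals).deriv)
    return grads

grads = forward_mode_gradient(f, [3.0, 2.0, 1.0])
print(grads)
```

Three inputs require three forward passes here, so a model with millions of parameters would need millions of passes per training step.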
Implementing Autograd in Pytorch
import torch
# Create two 2x3 tensors
tensor_1 = torch.Tensor([[1, 2, 3],
                         [4, 5, 6]])
tensor_1
tensor([[1., 2., 3.],
[4., 5., 6.]])
tensor_2 = torch.Tensor([[7, 8, 9],
[10, 11, 12]])
tensor_2
tensor([[ 7., 8., 9.],
[10., 11., 12.]])
# Every tensor created in PyTorch has the requires_grad property
tensor_1.requires_grad
False
When requires_grad = True, PyTorch tracks computations for a tensor in the forward phase and will calculate gradients for this tensor in the backward phase.
The default value is False.
tensor_2.requires_grad
False
# To enable tracking on the tensor:
tensor_1.requires_grad_()
tensor([[1., 2., 3.],
[4., 5., 6.]], requires_grad=True)
# Check property again
tensor_1.requires_grad
True
Gradients calculated using automatic differentiation with respect to any tensor are stored in the grad attribute associated with that tensor.
print(tensor_1.grad)
None
It returned None because no gradients have been calculated yet. The tensor can be part of a computation graph, but no forward or backward passes have been made: we have created a tensor but haven't used it to perform any calculations.
The computation graph in PyTorch is made up of tensors and functions, where tensors are the nodes and functions are the transformations performed along the edges. Every tensor that results from an operation holds a reference to the function that was used to create it, available as grad_fn:
print(tensor_1.grad_fn)
None
# Set up a graph by performing a calculation on the tensors
output_tensor = tensor_1 * tensor_2
The output tensor will inherit the requires_grad property that was given to tensor_1.
print(output_tensor.requires_grad)
True
print(output_tensor.grad)
None
Still no gradients, as we haven't made a backward pass yet.
But it will have a grad function because we used a specific multiplication operation (see MulBackward0 below) to create this output tensor. User-created tensors have no corresponding function.
print(output_tensor.grad_fn)
<MulBackward0 object at 0x7f53399177b8>
# Create another output tensor with a different operation
output_tensor_1 = (tensor_1 * tensor_2).mean()
print(output_tensor_1.grad_fn)
<MeanBackward0 object at 0x7f5339917588>
Notice that the function displayed now is MeanBackward0 even though we also used a multiplication to create this output tensor. The grad_fn references the last operation used to create this tensor, in this case the mean.
print(tensor_1.grad)
None
Although we used tensor_1 for several operations, it still does not have gradients. This is because we haven't performed the backward pass that calculates gradients. The gradients (a vector of partial derivatives) are only calculated when we call the backward function on an output.
output_tensor_1.backward()
print(tensor_1.grad)
tensor([[1.1667, 1.3333, 1.5000],
[1.6667, 1.8333, 2.0000]])
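The values above can be checked by hand. The output is the mean of tensor_1 * tensor_2 over six elements, so the partial derivative with respect to each element of tensor_1 is the matching element of tensor_2 divided by 6. A minimal plain-Python check, with the values of tensor_2 copied in as nested lists:

```python
tensor_2_values = [[7.0, 8.0, 9.0],
                   [10.0, 11.0, 12.0]]
n_elements = 6  # the mean is taken over all 2 x 3 elements

# d(mean(t1 * t2)) / d(t1[i][j]) = t2[i][j] / n_elements
expected_grad = [[round(v / n_elements, 4) for v in row] for row in tensor_2_values]
print(expected_grad)
```

The rounded results match the gradient tensor printed by PyTorch.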
Since the gradients are the partial derivatives with respect to the parameters in tensor_1, their shape will exactly match the shape of the tensor:
tensor_1.grad.shape, tensor_1.shape
(torch.Size([2, 3]), torch.Size([2, 3]))
A tensor passes its requires_grad property on to the output tensor, as seen above. If we do not want PyTorch to track the history of the tensors, we can use torch.no_grad:
with torch.no_grad():
    new_tensor = tensor_1 * 3
    print('new_tensor = ', new_tensor)
print('requires_grad for tensor = ', tensor_1.requires_grad)
print('requires_grad for tensor = ', tensor_2.requires_grad)
print('requires_grad for tensor = ', new_tensor.requires_grad)
new_tensor = tensor([[ 3., 6., 9.],
[12., 15., 18.]])
requires_grad for tensor = True
requires_grad for tensor = False
requires_grad for tensor = False
Here we can see that new_tensor did not inherit the requires_grad property of tensor_1, even though tensor_1 was used to create it, because it was created inside the torch.no_grad code block.
# Create a function that multiplies any number by 2
def calculate(t):
    return t * 2
# Create the same function but add the no_grad decorator
@torch.no_grad()
def calculate_no_grad(t):
    return t * 2
Both functions perform the same operation, but for the second function gradients will not be enabled: history tracking will not be turned on even if the input tensor has requires_grad set to True.
# Let's test the 1st function passing tensor_1
result_tensor = calculate(tensor_1)
result_tensor
tensor([[ 2., 4., 6.],
[ 8., 10., 12.]], grad_fn=<MulBackward0>)
# Test 2nd function
result_tensor_no_grad = calculate_no_grad(tensor_1)
result_tensor_no_grad
tensor([[ 2., 4., 6.],
[ 8., 10., 12.]])
result_tensor_no_grad.requires_grad
False
History tracking can be explicitly enabled even if the code is executed inside a torch.no_grad() code block.
with torch.no_grad():
    new_tensor_no_grad = tensor_1 * 3
    print('new_tensor_no_grad = ', new_tensor_no_grad)
    # Explicitly enable grad
    with torch.enable_grad():
        new_tensor_grad = tensor_1 * 3
        print('new_tensor_grad = ', new_tensor_grad)
new_tensor_no_grad = tensor([[ 3., 6., 9.],
[12., 15., 18.]])
new_tensor_grad = tensor([[ 3., 6., 9.],
[12., 15., 18.]], grad_fn=<MulBackward0>)
Also, the value for requires_grad can be defined when creating a new tensor:
tensor_1_1 = torch.tensor([[1.0, 2.0],
[3.0, 4.0]], requires_grad=True)
tensor_1_1
tensor([[1., 2.],
[3., 4.]], requires_grad=True)
If I create a new tensor tensor_1_2 without specifying it, the requires_grad parameter will be set to False by default.
tensor_1_2 = torch.tensor([[3.0, 4.0 ],
[5, 6]])
tensor_1_2
tensor([[3., 4.],
[5., 6.]])
# Set requires_grad to True for tensor_1_2
tensor_1_2.requires_grad_()
tensor([[3., 4.],
[5., 6.]], requires_grad=True)
# Perform a simple calculation, which runs the forward pass
final_tensor = (tensor_1_1 + tensor_1_2).mean()
final_tensor
tensor(7., grad_fn=<MeanBackward0>)
# Check the requires_grad property for the final tensor
final_tensor.requires_grad
True
There are no gradients for the two input tensors yet, as we have only performed the forward pass on the computation graph:
print(tensor_1_1.grad)
None
print(tensor_1_2.grad)
None
Since we have history tracking enabled for these two input tensors in this computation, we can call .backward() to calculate the gradients.
final_tensor.backward()
print(tensor_1_1.grad)
tensor([[0.2500, 0.2500],
[0.2500, 0.2500]])
print(tensor_1_2)
tensor([[3., 4.],
[5., 6.]], requires_grad=True)
Tensors involved in a computation are part of a larger computational graph. If we want to retrieve a tensor that is detached from the current computation graph, we can call .detach(). The detached tensor will always have requires_grad=False.
detached_tensor = tensor_1_1.detach()
detached_tensor
tensor([[1., 2.],
[3., 4.]])
tensor_1_1
tensor([[1., 2.],
[3., 4.]], requires_grad=True)
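# Use the detached and the original tensor in a computation
mean_tensor = (tensor_1_1 + detached_tensor).mean()
mean_tensor.backward()
# Gradients accumulate across backward() calls: the 0.25 from the earlier
# final_tensor.backward() plus 0.25 from this call gives 0.5
tensor_1_1.grad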
tensor([[0.5000, 0.5000],
[0.5000, 0.5000]])
print(detached_tensor.grad)
None
Autograd with Variables
Variables are no longer needed to work with autograd or to store gradients; those features are now part of the tensor itself. However, the Variable API still exists in PyTorch and can be used as follows:
from torch.autograd import Variable
# Instantiate a variable
var = Variable(torch.FloatTensor([9]))
var
tensor([9.])
All the attributes of a tensor we saw above are the same for the variable that holds the tensor.
# Update the requires_grad property
var.requires_grad_()
tensor([9.], requires_grad=True)
w1 = Variable(torch.FloatTensor([3]), requires_grad=True)
w2 = Variable(torch.FloatTensor([7]), requires_grad=True)
w1
tensor([3.], requires_grad=True)
w2
tensor([7.], requires_grad=True)
result_var = var * w1
result_var
tensor([27.], grad_fn=<MulBackward0>)
result_var.backward()
w1.grad
tensor([9.])
print(w2.grad)
None
var.grad
tensor([3.])