Yekun's Note

Machine learning notes and writeup.


PyTorch Notes

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

Basic op

numpy to/from tensor

# numpy -> tensor
np_array = np.ones((2, 3))
torch_tensor = torch.from_numpy(np_array)   # shares memory with np_array
# or (copies the data and casts to float32):
torch_tensor = torch.FloatTensor(np_array)

# tensor -> numpy
np_array = torch_tensor.numpy()             # shares memory for CPU tensors

contiguous

x.transpose(1, 2).contiguous().view(...)
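
transpose() (and permute()) returns a view with rearranged strides, so the result is no longer contiguous in memory, and view() only works on contiguous tensors. A minimal sketch of the typical failure and fix (shapes are arbitrary):

x = torch.randn(2, 3, 4)
y = x.transpose(1, 2)       # shape (2, 4, 3), non-contiguous view
y.is_contiguous()           # False
# y.view(2, 12)             # would raise a RuntimeError
y.contiguous().view(2, 12)  # copy into contiguous memory, then reshape
y.reshape(2, 12)            # alternative: reshape() copies only when needed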

Parameter vs. register_buffer

  • nn.Parameter is registered as a module parameter and appears in the parameters() iterator, so it receives gradients during backprop and is updated by the optimizer.

    nn.Parameter(data: Tensor, requires_grad: bool = True)
  • register_buffer adds a persistent buffer to the module. It registers a tensor as part of the module's state (saved in the state_dict and moved by .to(device)), not as a parameter, so it gets no gradients and is never updated by the optimizer. A minimal example follows this list.

    self.register_buffer(name:str, tensor: Tensor)
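
A minimal sketch contrasting the two (the names scale and running_mean are just illustrative):

class Scale(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # learnable: appears in parameters(), updated by the optimizer
        self.scale = nn.Parameter(torch.ones(dim))
        # plain state: saved in state_dict() and moved by .to(device),
        # but absent from parameters(), so never updated by the optimizer
        self.register_buffer('running_mean', torch.zeros(dim))

    def forward(self, x):
        return self.scale * (x - self.running_mean)

m = Scale(4)
[name for name, _ in m.named_parameters()]  # ['scale']
'running_mean' in m.state_dict()            # True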

Multiplication

torch.einsum

Evaluates multi-linear expressions (i.e., sums of products) using the Einstein summation convention.

torch.einsum(equation, *operands) → Tensor

As = torch.randn(3,2,5)
Bs = torch.randn(3,5,4)
torch.einsum('bij,bjk->bik', As, Bs) # batch matrix multiplication
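
A few more common einsum patterns (tensors are arbitrary examples):

a = torch.randn(3, 4)
b = torch.randn(4)

torch.einsum('ij->ji', a)                # transpose
torch.einsum('ii->', torch.randn(5, 5))  # trace
torch.einsum('i,j->ij', b, b)            # outer product
torch.einsum('ij,j->i', a, b)            # matrix-vector product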

torch.ger

torch.ger(input, vec2, out=None) → Tensor
Outer product of two 1-D vectors. (In recent PyTorch versions, torch.ger is deprecated in favor of torch.outer.)

v1 = torch.arange(1., 5.)
v2 = torch.arange(1., 4.)
torch.ger(v1, v2)
# tensor([[ 1.,  2.,  3.],
#         [ 2.,  4.,  6.],
#         [ 3.,  6.,  9.],
#         [ 4.,  8., 12.]])

dimension

t = torch.randn(4,5,6)

t.dim() # 3
t.size() # torch.Size([4, 5, 6])
t.shape # torch.Size([4, 5, 6])
t.size(0) # 4
t.size(-1) # 6

nn.Parameter

torch.nn.Parameter, a subclass of torch.Tensor, is automatically added to the module's parameter list when assigned as a Module attribute and appears in the Module.parameters() iterator, so the optimizer updates it whenever that list is passed in. Its arguments:

  • data (Tensor): parameter tensor.
  • requires_grad (bool, optional): if the parameter requires gradient. Default: True.

Tensor

byte()

t = torch.ones(2,3)

t.byte()
# equivalent to
t.to(torch.uint8)

topk()

torch.topk(input, k, dim=None, largest=True, sorted=True, out=None) -> (Tensor, LongTensor)

  • a namedtuple of (values, indices) is returned, where the indices are the indices of the elements in the original input tensor.
x = torch.arange(1., 6.)
# tensor([ 1., 2., 3., 4., 5.])
torch.topk(x, 3)
# torch.return_types.topk(values=tensor([5., 4., 3.]), indices=tensor([4, 3, 2]))

Loss functions

NLLLoss

torch.nn.NLLLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

  • The negative log likelihood loss. It is useful for training a classification problem with C classes.
  • size_average, reduce - deprecated
  • reduction: (‘none’, ‘mean’ (default), ‘sum’)
  • The input is expected to contain log-probabilities, so add a LogSoftmax layer as the last layer of the network. Combining LogSoftmax with NLLLoss is equivalent to using CrossEntropyLoss on raw logits (see the sketch after this list).
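
A minimal sketch of the equivalence (random logits and targets, just for illustration):

logits = torch.randn(4, 10)           # batch of 4, 10 classes
target = torch.randint(0, 10, (4,))

nll = F.nll_loss(F.log_softmax(logits, dim=-1), target)
ce = F.cross_entropy(logits, target)  # takes raw logits
torch.allclose(nll, ce)               # True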

Optim

Per-parameter optim

Pass in an iterable of dicts.
E.g. specify per-layer learning rates:

optim.SGD([
        {'params': model.base.parameters()},                   # default lr
        {'params': model.classifier.parameters(), 'lr': 1e-3}
    ],
    lr=1e-2,      # default
    momentum=.9   # for all params
)

Optim step

Optimizer.step()

Step once the gradients have been computed by loss.backward().

for input, target in dataset:
    optimizer.zero_grad()           # zero the gradients
    output = model(input)           # forward pass
    loss = loss_fn(output, target)  # calculate loss
    loss.backward()                 # backprop, compute gradients
    optimizer.step()                # update parameters

Optim algorithms

Optimization methods in deep learning

Optimizer.step(closure)

Some algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times, so pass in a closure that clears the gradients, computes the loss, and returns it.

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)  # pass in a closure

Adjust learning rate

Use torch.optim.lr_scheduler.

scheduler = ...
for epoch in range(100):
    train(...)
    evaluate(...)
    scheduler.step()
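
A concrete sketch using StepLR, which multiplies the learning rate by gamma every step_size epochs (the toy model and hyper-parameters are arbitrary):

from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # lr *= 0.1 every 30 epochs

for epoch in range(100):
    # ... one training epoch: forward, loss.backward(), optimizer.step() ...
    scheduler.step()                       # once per epoch, after optimizer.step()
    print(epoch, scheduler.get_last_lr())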

gradient clipping

  1. Clip by value: clamp each gradient element into a fixed range (torch.nn.utils.clip_grad_value_).
  2. Clip by norm: rescale all gradients when their total norm exceeds max_norm (torch.nn.utils.clip_grad_norm_, sketched below; a usage example follows).
def clip_grad_norm_(parameters, max_norm, norm_type=2):
    parameters = list(filter(lambda p: p.grad is not None, parameters))
    max_norm = float(max_norm)
    norm_type = float(norm_type)
    if norm_type == torch._six.inf:
        total_norm = max(p.grad.data.abs().max() for p in parameters)
    else:
        total_norm = 0
        for p in parameters:
            param_norm = p.grad.data.norm(norm_type)
            total_norm += param_norm.item() ** norm_type
        total_norm = total_norm ** (1. / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            p.grad.data.mul_(clip_coef)
    return total_norm
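
In practice, just call the library function between backward() and step(); a minimal sketch (max_norm=1.0 is an arbitrary choice):

optimizer.zero_grad()
loss = loss_fn(model(input), target)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the update
optimizer.step()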

Misc

Define layers

Layers should be set directly as attributes of a subclass of torch.nn.Module, so that model.parameters() collects all their parameters and can be passed straight to a torch.optim optimizer. Otherwise, the extra parameters have to be passed to the optimizer manually in addition to model.parameters().

  • In other words, if layers are wrapped in a plain container such as a dict(), model.parameters() cannot collect their parameters, so the first argument of the optimizer would have to be assembled manually (as the Python code below demonstrates).
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        d = {}
        d['f1'] = nn.Linear(20, 10)
        d['f2'] = nn.Linear(10, 10)
        self.d = d                    # plain dict: f1/f2 are NOT registered as submodules
        self.f3 = nn.Linear(10, 1)
        self.loss = nn.MSELoss()

    def forward(self, x):
        x = self.d['f1'](x)
        x = self.d['f2'](x)
        x = self.f3(x)
        return x


if __name__ == '__main__':
    net = Net()
    x = torch.rand(1, 20)
    y = torch.rand(1, 1)
    optimizer = optim.Adam(net.parameters(), lr=1e-3)  # only contains f3's parameters
    for _ in range(10):
        y_ = net(x)
        loss = F.mse_loss(y_, y)
        optimizer.zero_grad()
        loss.backward()       # do backprop
        optimizer.step()      # does not optimize the layers wrapped in net.d!
        # [p for p in net.d['f1'].parameters()] never changes!
  • Also, to run on GPU we usually call Net().to(device). But if there are layers held in a dict attribute of the module class, .to(device) will not move them, so we have to call .to(device) on each such layer individually (a cleaner fix with nn.ModuleDict is sketched in the bullet after this code).
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            d = {}
            d['f1'] = nn.Linear(20, 10).to(device)  # must be moved by hand
            d['f2'] = nn.Linear(10, 10).to(device)
            self.d = d
            self.f3 = nn.Linear(10, 1)
            self.loss = nn.MSELoss()

        ...

    if __name__ == '__main__':
        net = Net().to(device)            # moves f3, but not the layers in self.d
        x = torch.rand(1, 20).to(device)
        y = torch.rand(1, 1).to(device)
        ...
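
  • The idiomatic fix for both problems is nn.ModuleDict (or nn.ModuleList), which registers the wrapped layers as submodules, so that parameters() collects them and .to(device) moves them. A minimal sketch (device is assumed to be defined as above):

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.d = nn.ModuleDict({
                'f1': nn.Linear(20, 10),
                'f2': nn.Linear(10, 10),
            })
            self.f3 = nn.Linear(10, 1)

        def forward(self, x):
            x = self.d['f1'](x)
            x = self.d['f2'](x)
            return self.f3(x)

    net = Net().to(device)                             # moves f1/f2/f3 together
    optimizer = optim.Adam(net.parameters(), lr=1e-3)  # now sees all layers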

NaN

If NaN appears during training:

  1. If it appears within the first ~100 iterations, the learning rate may be too large. Try reducing it to 1/2~1/10 of its current value.
  2. If using RNNs, it may be caused by gradient explosion. Solution: add gradient clipping.
  3. Division by 0.
  4. Taking the logarithm of 0 or of a negative number, e.g. when calculating entropy or cross entropy.
  5. Overflow in exponential computations giving INF/INF, e.g. in softmax. Solution: subtract the maximum value before exponentiating, if possible (see the sketch below).
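
A minimal sketch of the max-subtraction trick for point 5 (illustrative only; F.softmax / F.log_softmax are already numerically stable):

x = torch.tensor([1000., 1001., 1002.])

naive = torch.exp(x) / torch.exp(x).sum()                       # exp overflows -> nan
stable = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()  # matches F.softmax(x, dim=0)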

Count the number of parameters

# approach 1
model_parameters = filter(lambda p: p.requires_grad, model.parameters())
tot_params = sum([np.prod(p.size()) for p in model_parameters])

# approach 2 (count the trainable params)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

Configuration error

macOS

import torch error

>>> import torch
Traceback (most recent call last):
File "/.../lib/python3.6/site-packages/torch/__init__.py", line 79, in <module>
from torch._C import *
...
ImportError: dlopen(/.../lib/python3.6/site-packages/torch/_C.cpython-36m-darwin.so, 9): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
Referenced from: /.../lib/python3.6/site-packages/torch/lib/libshm.dylib
Reason: image not found

# solution
$ brew install libomp