# Basic op

## Parameter vs. register_buffer

• nn.Parameter is a module parameter: it appears in the parameters() iterator and receives gradients during backprop.

• register_buffer adds a persistent buffer to the module. It registers part of the module's state that is not a parameter: the buffer is saved in the state_dict but does not receive gradients during backprop.
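A minimal sketch of the difference (the names `Demo`, `scale`, and `running_mean` are illustrative):

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # learnable: shows up in parameters() and gets gradients
        self.scale = nn.Parameter(torch.ones(3))
        # persistent state: saved in state_dict, but no gradients
        self.register_buffer("running_mean", torch.zeros(3))

m = Demo()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())
```

Both tensors are part of the module's state_dict, but only `scale` is seen by the optimizer.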

## Multiplication

### torch.einsum

Evaluates multi-linear expressions, i.e. sums of products, using the Einstein summation convention.

torch.einsum(equation, *operands) → Tensor
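Two common uses as a quick sketch: a repeated index shared between operands is summed over, and an index omitted from the output is also summed over.

```python
import torch

A = torch.arange(6.).reshape(2, 3)
B = torch.arange(12.).reshape(3, 4)

# matrix multiply: shared index k is summed over
C = torch.einsum('ik,kj->ij', A, B)

# trace: repeated index i with empty output sums the diagonal
t = torch.einsum('ii->', torch.eye(3))
```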

### torch.ger

torch.ger(input, vec2, out=None) → Tensor
Outer product of two 1-D vectors (newer releases deprecate it in favor of torch.outer).
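A small sketch of the outer product: for vectors of sizes n and m, the result is an n × m matrix with `result[i][j] = v1[i] * v2[j]`.

```python
import torch

v1 = torch.tensor([1., 2., 3.])
v2 = torch.tensor([4., 5.])

# result[i][j] = v1[i] * v2[j], shape (3, 2)
M = torch.ger(v1, v2)
```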

## nn.Parameter

torch.nn.Parameter, a subclass of torch.Tensor, automatically adds its data to the module's list of parameters, so it appears in the Module.parameters() iterator. It is optimized by the optimizer when included in the optimized parameter list. Its arguments:

• data (Tensor): parameter tensor.
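A minimal sketch (the module name `Scaler` and attribute `w` are illustrative): assigning an nn.Parameter as a module attribute is enough for `model.parameters()` to pick it up, so the optimizer sees it.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        # wrapping the tensor in nn.Parameter registers it with the module
        self.w = nn.Parameter(torch.tensor(2.0))

    def forward(self, x):
        return self.w * x

model = Scaler()
params = list(model.parameters())
# the parameter is handed to the optimizer via model.parameters()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
```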

# Tensor

## topk()

torch.topk(input, k, dim=None, largest=True, sorted=True, out=None) -> (Tensor, LongTensor)

• a namedtuple of (values, indices) is returned, where the indices are the indices of the elements in the original input tensor.
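A quick sketch of the returned namedtuple:

```python
import torch

x = torch.tensor([1., 5., 3., 2.])
# two largest values, with their positions in x
values, indices = torch.topk(x, k=2)
```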

# Loss functions

## NLLLoss

torch.nn.NLLLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

• Negative log likelihood loss. It is useful for training a classification problem with C classes.
• size_average, reduce: deprecated
• reduction: 'none', 'mean' (default), or 'sum'
• The input is expected to contain log-probabilities, so add a LogSoftmax layer as the last layer. Combining LogSoftmax with NLLLoss is equivalent to using CrossEntropyLoss.
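A sketch of the equivalence: NLLLoss applied to LogSoftmax output matches CrossEntropyLoss applied to the raw logits.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # 4 samples, C = 3 classes
target = torch.tensor([0, 2, 1, 0])   # class index per sample

# NLLLoss expects log-probabilities, hence the LogSoftmax
a = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
# CrossEntropyLoss fuses LogSoftmax + NLLLoss internally
b = nn.CrossEntropyLoss()(logits, target)
```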

# Optim

## Per-parameter optim

Pass in an iterable of dicts.
E.g. specify per-layer learning rates.
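A minimal sketch of per-layer learning rates (the two-layer model is illustrative): each dict defines a parameter group, and options missing from a group fall back to the keyword defaults.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters()},              # uses the default lr
        {"params": model[1].parameters(), "lr": 1e-3},  # per-group override
    ],
    lr=1e-2,  # default for groups that do not set their own lr
)
```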

## Optim step

### Optimizer.step()

Call step() once the gradients have been computed, e.g. by loss.backward().
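The usual training-step order, as a sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 3)
y = torch.randn(5, 1)

optimizer.zero_grad()               # clear stale gradients
loss = nn.MSELoss()(model(x), y)
loss.backward()                     # compute gradients
optimizer.step()                    # update parameters from the gradients
```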

## Optim algorithms

Optimization methods in deep learning

### Optimizer.step(closure)

Some algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times, so pass in a closure that clears the gradients, computes the loss, and returns it.
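A sketch of the closure pattern with LBFGS (the linear model and MSE loss are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)
x = torch.randn(8, 2)
y = torch.randn(8, 1)

def closure():
    optimizer.zero_grad()                 # clear the gradients
    loss = nn.MSELoss()(model(x), y)      # compute the loss
    loss.backward()                       # compute gradients
    return loss                           # LBFGS re-calls this as needed

final_loss = optimizer.step(closure)
```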

Use torch.optim.lr_scheduler to adjust the learning rate during training.
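A minimal scheduler sketch using StepLR, which multiplies the learning rate by `gamma` every `step_size` epochs:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
# halve the lr after every epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

optimizer.step()      # optimizer steps first...
scheduler.step()      # ...then the scheduler, once per epoch
lr_after = optimizer.param_groups[0]["lr"]
```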

1. Clip by value: set a threshold and clamp each gradient entry to it (clip_grad_value_).
2. Clip by norm: rescale the gradients so their total norm does not exceed a maximum (clip_grad_norm_).
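The two clipping variants can be sketched as follows; both are applied after `backward()` and before `step()`:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# 1. clip by value: clamp each gradient entry into [-0.5, 0.5]
nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

# 2. clip by norm: rescale so the total gradient norm is at most 1.0;
#    returns the norm measured before clipping
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```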

# Misc

## Define layers

Layers should be set directly as attributes of a subclass of torch.nn.Module, so that model.parameters() can be passed directly to the optimizer in torch.optim. Otherwise, the extra parameters must be passed in addition to model.parameters().

• In other words, if layers are wrapped in a plain container such as a dict(), model.parameters() cannot collect their parameters, so the first argument of the optimizer has to be assembled manually.
• Also, to run on a GPU we usually call Net().to(device). But if layers are held in a dict attribute of the module class, each layer has to be moved with layer.to(device) individually.
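A sketch of the problem and the usual fix, nn.ModuleDict, which registers its contents with the parent module (the class names `Bad`/`Good` are illustrative):

```python
import torch
import torch.nn as nn

class Bad(nn.Module):
    def __init__(self):
        super().__init__()
        # plain dict: the layer is NOT registered with the module
        self.layers = {"fc": nn.Linear(2, 2)}

class Good(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleDict registers the layer, so parameters() and .to() see it
        self.layers = nn.ModuleDict({"fc": nn.Linear(2, 2)})

n_bad = len(list(Bad().parameters()))    # misses the Linear's weight and bias
n_good = len(list(Good().parameters()))  # finds both
```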

## NaN

If there exists NaN:

1. If NaN appears within the first ~100 iterations, it may be due to a too-large learning rate. Try reducing the learning rate to 1/2–1/10 of its value.