A capsule is a group of neurons whose instantiation parameters represent specific properties of a specific type of entity. Here is a brief note on Capsule Networks [1][2].

Convolutional Neural Networks (CNNs) extract local-region features with fixed strides, followed by max-pooling, which retains only the most salient features and discards fine-grained information such as the precise poses of overlapping entities in images. Capsules are proposed as a better way to handle this.

# Capsules with dynamic routing

Standard neural layers take and produce scalar features, whereas Capsule Networks use vector-valued capsules to represent the properties of entities.

## Capsule

For each capsule, both the input $\mathbf{u}_i$ and the output $\mathbf{v}_j$ are vectors [1].

For all but the first layer of capsules, the total input to capsule $j$ is a weighted sum over all "prediction vectors" $\hat{\mathbf{u}}_{j \vert i}$, each obtained by linearly transforming the output $\mathbf{u}_i$ of a capsule in the layer below with a learnable weight matrix $\mathbf{W}_{ij}$:

$$\mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j \vert i}, \qquad \hat{\mathbf{u}}_{j \vert i} = \mathbf{W}_{ij} \mathbf{u}_i$$

where the coupling coefficients $c_{ij}$ between capsule $i$ and all capsules in the layer above are scaled with a "routing softmax" to sum to 1; the routing procedure is introduced below.
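A minimal NumPy sketch of this forward step, with uniform coupling coefficients as a stand-in before routing; the sizes and variable names are illustrative, not from the paper:

```python
import numpy as np

# Illustrative sizes: 6 lower capsules of dim 8, 3 upper capsules of dim 16.
n_in, d_in, n_out, d_out = 6, 8, 3, 16

u = np.random.randn(n_in, d_in)                # outputs u_i of the layer below
W = np.random.randn(n_in, n_out, d_out, d_in)  # learnable weight matrices W_ij
c = np.full((n_in, n_out), 1.0 / n_out)        # coupling coefficients (uniform here)

# Prediction vectors: u_hat[i, j] = W_ij @ u_i
u_hat = np.einsum('ijkl,il->ijk', W, u)

# Total input to capsule j: s_j = sum_i c_ij * u_hat[i, j]
s = np.einsum('ij,ijk->jk', c, u_hat)
```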

A squashing non-linearity rather than ReLU is applied to $\mathbf{s}_j$, shrinking short vectors to almost zero length and scaling long vectors to a length slightly below 1:

$$\mathbf{v}_j = \frac{\Vert \mathbf{s}_j \Vert^2}{1 + \Vert \mathbf{s}_j \Vert^2} \frac{\mathbf{s}_j}{\Vert \mathbf{s}_j \Vert}$$

Here $\mathbf{v}_j \approx \Vert \mathbf{s}_j \Vert \mathbf{s}_j$ when $\Vert \mathbf{s}_j \Vert$ is small, while $\mathbf{v}_j \approx \frac{\mathbf{s}_j}{\Vert \mathbf{s}_j \Vert}$ when it is large, so the length of $\mathbf{v}_j$ can be read as the probability that the entity exists.
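A minimal sketch of the squashing function in NumPy; the small `eps` term for numerical stability is an implementation choice, not part of the paper:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash vector(s) s: keeps direction, maps length into [0, 1)."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

v = squash(np.array([0.1, 0.2, 0.05]))  # short input -> output length near 0
```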

The coupling coefficients $c_{ij}$ are computed iteratively with the routing softmax:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

where the logits $b_{ij}$ are the log prior probabilities that capsule $i$ should be coupled to capsule $j$ in the layer above. They are initialized to zero and iteratively updated with:

$$b_{ij} \leftarrow b_{ij} + a_{ij}$$

where $a_{ij} = \hat{\mathbf{u}}_{j \vert i} \cdot \mathbf{v}_j$ measures the agreement between the current output $\mathbf{v}_j$ of capsule $j$ in the layer above and the prediction vector $\hat{\mathbf{u}}_{j \vert i}$ from capsule $i$.
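Putting the pieces together, a minimal NumPy sketch of the dynamic routing loop; the three-iteration default follows [1], the `squash` helper is the one sketched above, and the max-subtraction in the softmax is an added stability trick:

```python
import numpy as np

def squash(s, eps=1e-9):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: prediction vectors, shape (n_in, n_out, d_out)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, initialized to zero
    for _ in range(n_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)       # routing softmax over j
        s = np.einsum('ij,ijk->jk', c, u_hat)      # weighted sum per capsule j
        v = squash(s)                              # capsule outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # agreement update
    return v
```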

## Margin loss

For each class capsule $k$, a separate margin loss $L_k$ is used:

$$L_k = T_k \max(0, m^+ - \Vert \mathbf{v}_k \Vert)^2 + \lambda (1 - T_k) \max(0, \Vert \mathbf{v}_k \Vert - m^-)^2$$

where $T_k = 1$ iff class $k$ is present, and $m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$. The total loss is simply the sum of the losses of all class capsules.
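A minimal NumPy sketch of this loss; variable names are illustrative:

```python
import numpy as np

def margin_loss(v, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v: class capsule outputs (n_classes, d); T: one-hot labels (n_classes,)."""
    lengths = np.linalg.norm(v, axis=-1)
    L = (T * np.maximum(0.0, m_pos - lengths) ** 2
         + lam * (1 - T) * np.maximum(0.0, lengths - m_neg) ** 2)
    return L.sum()  # total loss sums over all class capsules
```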

# Matrix capsules with EM routing

In [2], each capsule consists of a pose matrix $\mathbf{M} \in \mathbb{R}^{4 \times 4}$ and an activation probability $a \in [0, 1]$. The Expectation-Maximization (EM) algorithm is used to iteratively cluster capsules with similar votes.

Let us denote the capsules in layer $L$ by $\Omega_L$, and the trainable weights between each capsule $i$ in layer $L$ and each capsule $j$ in layer $L+1$ by $\mathbf{W}_{ij} \in \mathbb{R}^{4 \times 4}$.

The pose matrices and activations of all capsules in layer $L+1$ are computed with a non-linear routing procedure (EM routing), whose inputs are the votes $\mathbf{V}_{ij} = \mathbf{M}_i \mathbf{W}_{ij}$ and the activations $a_i$ for all $i \in \Omega_L, j \in \Omega_{L+1}$.
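A minimal NumPy sketch of computing the votes that feed EM routing; the EM loop itself, which alternates an M-step fitting a Gaussian per higher-level capsule and an E-step reassigning routing probabilities, is omitted for brevity, and the sizes here are illustrative:

```python
import numpy as np

# Illustrative sizes: 5 capsules in layer L, 4 capsules in layer L+1.
n_in, n_out = 5, 4

M = np.random.randn(n_in, 4, 4)         # pose matrices of layer L
W = np.random.randn(n_in, n_out, 4, 4)  # trainable transforms W_ij
a = np.random.rand(n_in)                # activations of layer L

# Vote of capsule i for capsule j: V_ij = M_i @ W_ij
V = np.einsum('iab,ijbc->ijac', M, W)   # shape (n_in, n_out, 4, 4)
```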

Spread loss directly maximizes the gap between the activation of the target class $a_t$ and the activations of the other classes:

$$L_i = \left(\max(0, m - (a_t - a_i))\right)^2, \qquad L = \sum_{i \neq t} L_i$$

where the margin $m$ starts at $0.2$ and is linearly increased to $0.9$ during training; with $m = 1$ it is equivalent to a squared hinge loss.
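A minimal NumPy sketch of the spread loss; names are illustrative:

```python
import numpy as np

def spread_loss(a, target, m=0.2):
    """a: class activations (n_classes,); target: index of the true class."""
    L = np.maximum(0.0, m - (a[target] - a)) ** 2
    L[target] = 0.0  # sum only over i != t
    return L.sum()
```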

# References

1. Sabour, S., Frosst, N. and Hinton, G.E., 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (pp. 3856-3866).
2. Hinton, G.E., Sabour, S. and Frosst, N., 2018. Matrix capsules with EM routing. In International Conference on Learning Representations.