
An Introduction to Capsules

A capsule is defined as a group of neurons whose instantiation parameters represent specific properties of a specific type of entity. This is a brief note on capsule networks [1][2].

Convolutional Neural Networks (CNNs) extract local-region features with fixed strides, followed by max-pooling, which retains only the most salient features and discards fine-grained information such as overlapping entities in images. Capsules are regarded as a better way to handle this.

Capsules with dynamic routing

Standard neural layers take and produce scalar features, while capsule networks use vector-valued capsules to represent the features of entities.


For each capsule, both the input and the output are vectors [1].

For all except the first layer of capsules, the total input to a capsule $j$ is a weighted sum over all "prediction vectors" $\hat{\mathbf{u}}_{j|i}$, each obtained by linearly transforming the output $\mathbf{u}_i$ of a capsule $i$ in the layer below with a learnable weight matrix $\mathbf{W}_{ij}$:

$$\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}, \qquad \hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i,$$

where the coupling coefficients $c_{ij}$ between capsule $i$ and all capsules in the layer above are scaled with a "routing softmax" to sum to 1. We will introduce this in the next section.
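As a minimal NumPy sketch (the capsule counts and dimensions below are illustrative assumptions, not values from the papers), the prediction vectors and the weighted-sum input can be computed as:

```python
import numpy as np

# Assumed toy shapes: 3 input capsules of dim 8, 2 output capsules of dim 16.
num_in, dim_in, num_out, dim_out = 3, 8, 2, 16

rng = np.random.default_rng(0)
u = rng.normal(size=(num_in, dim_in))                    # outputs u_i of the layer below
W = rng.normal(size=(num_in, num_out, dim_out, dim_in))  # learnable matrices W_ij

# Prediction vectors: u_hat[i, j] = W_ij @ u_i
u_hat = np.einsum('ijkl,il->ijk', W, u)

# Routing softmax over output capsules j, so that sum_j c_ij = 1
b = np.zeros((num_in, num_out))                          # routing logits b_ij
c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)     # coupling coefficients c_ij

# Total input to capsule j: s_j = sum_i c_ij * u_hat[i, j]
s = np.einsum('ij,ijk->jk', c, u_hat)
```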

A squashing activation function, rather than ReLU, is applied to $\mathbf{s}_j$, so that short vectors are shrunk to almost zero length and long vectors to a length slightly below 1:

$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$

Here $\|\mathbf{v}_j\| \approx 0$ when $\|\mathbf{s}_j\|$ is small, while $\|\mathbf{v}_j\| \to 1$ when $\|\mathbf{s}_j\|$ is large.
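The squashing behavior can be sketched in NumPy (the `eps` term is an assumption added for numerical stability, not part of the original formula):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity: short vectors -> ~0, long vectors -> length just below 1."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

short = squash(np.array([0.01, 0.0, 0.0]))   # shrunk to almost zero length
long_ = squash(np.array([100.0, 0.0, 0.0]))  # length just below 1
```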

The coupling coefficients are computed iteratively with a routing softmax:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})},$$

where the logits $b_{ij}$ are log prior probabilities that capsule $i$ should be coupled to capsule $j$ in the layer above, and are iteratively updated with:

$$b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j,$$

where the scalar product $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$ measures the agreement between the current output $\mathbf{v}_j$ of capsule $j$ in the layer above and the prediction $\hat{\mathbf{u}}_{j|i}$ made by capsule $i$.

Dynamic routing
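The routing procedure of [1] can be sketched as follows in NumPy (the shapes and the default of 3 routing iterations follow the paper; the toy input sizes are illustrative assumptions):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity from [1]; eps added here for numerical stability."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Route prediction vectors u_hat[i, j, :] to output capsules v[j, :]."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                           # logits b_ij start at 0
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('ij,ijk->jk', c, u_hat)                 # weighted sum of predictions
        v = squash(s)                                         # output of capsule j
        b = b + np.einsum('ijk,jk->ij', u_hat, v)             # agreement update
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(3, 2, 16)))  # 3 input capsules, 2 output capsules
```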

Margin loss

For each capsule $k$, a separate margin loss is:

$$L_k = T_k\,\max(0,\, m^+ - \|\mathbf{v}_k\|)^2 + \lambda\,(1 - T_k)\,\max(0,\, \|\mathbf{v}_k\| - m^-)^2,$$

where $T_k = 1$ iff class $k$ is present, and $m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$. The total loss is simply the sum of the losses of all capsules.
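A minimal sketch of this loss in NumPy (the example activations are made up for illustration):

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norms: capsule lengths ||v_k|| per class; targets: T_k in {0, 1}."""
    present = targets * np.maximum(0.0, m_pos - v_norms) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2
    return np.sum(present + absent)

# Target capsule long (>= m_pos), all others short (<= m_neg) -> zero loss
loss = margin_loss(np.array([0.95, 0.05, 0.02]), np.array([1, 0, 0]))
```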

Matrix capsules with EM routing

Matrix capsules consist of a pose matrix $\mathbf{M} \in \mathbb{R}^{4 \times 4}$ and an activation probability $a \in [0, 1]$. The Expectation-Maximization (EM) algorithm is used to iteratively cluster capsules with similar votes.

Let us denote the set of capsules in layer $L$ as $\Omega_L$, and the trainable transformation matrix between each capsule $i$ in layer $L$ and each capsule $j$ in layer $L+1$ as $\mathbf{W}_{ij} \in \mathbb{R}^{4 \times 4}$.

The pose matrices and activations of all capsules in layer $L+1$ are computed with a non-linear EM routing procedure, whose inputs are the votes $\mathbf{V}_{ij} = \mathbf{M}_i\,\mathbf{W}_{ij}$ and the activations $a_i$ for all capsules $i$ in layer $L$.
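Computing the votes can be sketched in NumPy as below; the capsule counts are illustrative assumptions, and the full EM routing loop (the M-step/E-step clustering) is omitted:

```python
import numpy as np

# Assumed toy sizes: 5 capsules in layer L, 3 capsules in layer L+1.
num_in, num_out = 5, 3
rng = np.random.default_rng(0)
M = rng.normal(size=(num_in, 4, 4))           # pose matrix M_i of each capsule in layer L
W = rng.normal(size=(num_in, num_out, 4, 4))  # trainable matrices W_ij
a = rng.uniform(size=(num_in,))               # activation a_i of each capsule in layer L

# Vote of capsule i for capsule j: V_ij = M_i @ W_ij
V = np.einsum('iab,ijbc->ijac', M, W)
# V and a are the inputs to the EM routing procedure.
```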

Spread loss

Spread loss directly maximizes the gap between the activation $a_t$ of the target class and the activation $a_i$ of each other class:

$$L_i = \big(\max(0,\, m - (a_t - a_i))\big)^2, \qquad L = \sum_{i \neq t} L_i,$$

where $m = 0.2$ initially and linearly increases to 0.9 during training. The spread loss is equal to the squared Hinge loss when $m = 1$.
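A minimal NumPy sketch of this loss (the example activations are made up for illustration):

```python
import numpy as np

def spread_loss(a, target, m=0.2):
    """a: class activations; target: index of the target class; m: margin."""
    a_t = a[target]
    others = np.delete(a, target)  # activations a_i of all non-target classes
    return np.sum(np.maximum(0.0, m - (a_t - others)) ** 2)

# Target activation exceeds every other class by more than the margin -> zero loss
loss = spread_loss(np.array([0.9, 0.1, 0.2]), target=0, m=0.2)
```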


  1. Sabour, S., Frosst, N. and Hinton, G.E., 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (pp. 3856-3866).
  2. Hinton, G.E., Sabour, S. and Frosst, N., 2018. Matrix capsules with EM routing.
  3. Blog: Understanding Dynamic Routing between Capsules
  4. Blog: Understanding Matrix Capsules with EM Routing
  5. FreeCodeCamp: Understanding Capsule Networks
  6. TensorFlow code