Notes on clustering approaches.

# Gaussian Mixture Models (GMMs)

## EM algorithm

Let $\mathbf{x}_i$ denote the observed variables for case $i$, and let $\mathbf{z}_i$ be the hidden or missing variables. The objective is to maximize the log likelihood of the observed data:
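In standard notation this objective is:

$$
\ell(\theta) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \vert \theta) = \sum_{i=1}^{N} \log \left[ \sum_{\mathbf{z}_i} p(\mathbf{x}_i, \mathbf{z}_i \vert \theta) \right]
$$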

This is hard to optimize directly because the $\log$ cannot be pushed inside the inner sum.

EM defines the complete data log likelihood as:
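In standard notation:

$$
\ell_c(\theta) \triangleq \sum_{i=1}^{N} \log p(\mathbf{x}_i, \mathbf{z}_i \vert \theta)
$$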

This also cannot be computed since $\mathbf{z}_i$ is unknown.

Then EM defines the expected complete data log likelihood as:
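The usual definition is:

$$
Q(\theta, \theta^{t-1}) \triangleq \mathbb{E}\left[ \ell_c(\theta) \,\vert\, \mathcal{D}, \theta^{t-1} \right]
$$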

where $t$ is the current iteration number and $Q$ is the auxiliary function. The expectation is taken w.r.t. the old parameters $\theta^{t-1}$ and the observed data $\mathcal{D}$.

In the E step, compute the expected sufficient statistics (ESS) needed for $Q(\theta, \theta^{t-1})$.

In the M step, the $Q$ function is optimized w.r.t. $\theta$:
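In symbols:

$$
\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1})
$$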

To perform MAP estimation, the M step is modified as:
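The standard MAP variant adds the log prior:

$$
\theta^t = \arg\max_\theta \left[ Q(\theta, \theta^{t-1}) + \log p(\theta) \right]
$$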

• The EM algorithm monotonically increases the log likelihood of the observed data (plus the log prior when doing MAP estimation).

## EM for GMMs

Let $K$ be the number of mixture components.

### Auxiliary function

The expected complete data log likelihood is:
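In the standard derivation this takes the form:

$$
Q(\theta, \theta^{t-1}) = \sum_i \sum_k r_{ik} \log \pi_k + \sum_i \sum_k r_{ik} \log p(\mathbf{x}_i \vert \theta_k)
$$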

where $r_{ik} \triangleq p(z_i = k \vert \mathbf{x}_i, \theta^{t-1})$ is the responsibility that cluster $k$ takes for data point $i$, computed in the E step.
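The E step computes the responsibilities via Bayes' rule:

$$
r_{ik} = \frac{\pi_k \, p(\mathbf{x}_i \vert \theta_k^{t-1})}{\sum_{k'} \pi_{k'} \, p(\mathbf{x}_i \vert \theta_{k'}^{t-1})}
$$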

### M step

In the M step, EM optimizes $Q$ w.r.t. $\pi$ and $\theta_k$:
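The standard update for the mixing weights is:

$$
\pi_k = \frac{1}{N} \sum_i r_{ik} = \frac{r_k}{N}
$$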

where $r_k \triangleq \sum_i r_{ik}$ is the weighted number of points assigned to cluster $k$.

For Gaussian components, the part of the objective that depends on $\mu_k$ and $\Sigma_k$ is the weighted log likelihood:
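Expanding the Gaussian density (dropping constants independent of $\mu_k$ and $\Sigma_k$):

$$
\sum_i r_{ik} \log p(\mathbf{x}_i \vert \theta_k) = -\frac{1}{2} \sum_i r_{ik} \left[ \log \vert \Sigma_k \vert + (\mathbf{x}_i - \mu_k)^T \Sigma_k^{-1} (\mathbf{x}_i - \mu_k) \right]
$$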

Thus,
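Setting the derivatives to zero gives the standard updates:

$$
\mu_k = \frac{\sum_i r_{ik} \mathbf{x}_i}{r_k}, \qquad
\Sigma_k = \frac{\sum_i r_{ik} (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^T}{r_k}
= \frac{\sum_i r_{ik} \mathbf{x}_i \mathbf{x}_i^T}{r_k} - \mu_k \mu_k^T
$$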

After computing these new estimates, set $\theta^t = (\pi_k, \mu_k, \Sigma_k)$ for $k=1:K$ and go to the next E step.

• The mean of cluster $k$ is just the weighted average of all points assigned to cluster $k$.
• The covariance is proportional to the weighted empirical scatter matrix.
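The E and M steps above can be sketched in NumPy (a minimal illustration with my own function names, not a reference implementation):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def e_step(X, pi, mus, Sigmas):
    """Responsibilities r[i, k] = p(z_i = k | x_i, theta) via Bayes' rule."""
    r = np.column_stack([pi[k] * gaussian_pdf(X, mus[k], Sigmas[k])
                         for k in range(len(pi))])
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r):
    """MLE updates: pi_k = r_k / N, weighted means, weighted scatter matrices."""
    N = X.shape[0]
    rk = r.sum(axis=0)                       # weighted counts r_k
    pi = rk / N
    mus = (r.T @ X) / rk[:, None]            # weighted averages
    Sigmas = []
    for k in range(r.shape[1]):
        diff = X - mus[k]
        Sigmas.append((r[:, k, None] * diff).T @ diff / rk[k])
    return pi, mus, Sigmas

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
pi, mus = np.array([0.5, 0.5]), np.array([[-1.0, 0.0], [1.0, 0.0]])
Sigmas = [np.eye(2), np.eye(2)]
for _ in range(10):
    r = e_step(X, pi, mus, Sigmas)
    pi, mus, Sigmas = m_step(X, r)
```

On this toy data the means converge near the two blob centers within a few iterations.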

Multiplying many small probabilities together can cause arithmetic underflow, e.g. when evaluating the responsibilities $r_{ik}$ in high dimensions.

To circumvent this, a common trick called the log-sum-exp trick is used:
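With $m = \max_k a_k$, the identity is:

$$
\log \sum_k e^{a_k} = m + \log \sum_k e^{a_k - m}
$$

A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def log_sum_exp(a, axis=-1):
    """log(sum(exp(a))) computed stably by factoring out the max."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

log_probs = np.array([-1000.0, -1001.0])   # exp() of these underflows to 0.0
stable = log_sum_exp(log_probs)            # ≈ -999.687
```

The naive `np.log(np.sum(np.exp(log_probs)))` returns `-inf` here, because both exponentials underflow to zero.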

## K-means

The K-means algorithm is a variant of the EM algorithm for GMMs. It can be seen as a GMM with the following assumptions: $\Sigma_k = \sigma^2 \mathbf{I}_D$ and $\pi_k = \frac{1}{K}$ are fixed, and only the cluster centers $\boldsymbol{\mu}_k \in \mathbb{R}^D$ are estimated.

In the E step,
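The hard assignment approximates the posterior with a delta function:

$$
p(z_i = k \vert \mathbf{x}_i, \theta) \approx \mathbb{I}(k = z_i^*)
$$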

where $z_i^* = \arg\max_k p(z_i = k \vert \mathbf{x}_i, \theta)$. This is also called hard EM, since K-means makes hard assignments of points to clusters.

### E step

Since equal spherical covariance matrices are assumed, the most probable cluster for $\mathbf{x}_i$ can be computed using the Euclidean distance between the $N$ data points and the $K$ cluster centroids:
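That is:

$$
z_i^* = \arg\min_k \Vert \mathbf{x}_i - \boldsymbol{\mu}_k \Vert_2^2
$$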

This is equivalent to minimizing the pairwise squared deviations of points in the same cluster.

### M step

Given the hard cluster assignments, the M step updates each cluster centroid by computing the mean of all points assigned to it:
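In symbols, with $C_k = \{ i : z_i^* = k \}$ (my notation):

$$
\boldsymbol{\mu}_k = \frac{1}{\vert C_k \vert} \sum_{i \in C_k} \mathbf{x}_i
$$

Both steps can be sketched in NumPy (the function name is mine; it assumes no cluster goes empty):

```python
import numpy as np

def kmeans(X, mus, n_iters=10):
    """Alternate hard E and M steps; `mus` holds the (K, D) initial centroids."""
    for _ in range(n_iters):
        # E step: assign each point to its nearest centroid (squared Euclidean).
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z = d2.argmin(axis=1)
        # M step: recompute each centroid as the mean of its assigned points.
        mus = np.array([X[z == k].mean(axis=0) for k in range(len(mus))])
    return mus, z

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
centroids, labels = kmeans(X, np.array([[-1.0, 0.0], [1.0, 0.0]]))
```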

# Deep Embedded Clustering (DEC)

DEC clusters a set of $n$ points $\{ x_i \in X \}_{i=1}^n$ into $k$ clusters, each represented by a centroid $\mu_j$, $j=1, \cdots, k$.

It first applies a non-linear mapping $f_\theta : X \rightarrow Z$, where $\theta$ are learnable parameters and $Z$ is the latent feature space.

Deep Embedded Clustering (DEC)[1] has two steps:

1. parameter initialization with a deep autoencoder
2. clustering, by computing an auxiliary target distribution and minimizing the KL divergence

## DEC Clustering

Given initial cluster centroids $\{ \mu_j \}_{j=1}^k$, DEC first computes a soft assignment between the embedded points and the cluster centroids; it then updates the deep mapping $f_\theta$ with stochastic gradient descent and refines the cluster centroids using an auxiliary target distribution. This process is repeated until convergence.

### Soft Assignment

DEC applies Student’s $t$-distribution as a kernel to measure the similarity between embedded point $z_i$ and centroid $\mu_j$:
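The kernel from the DEC paper is:

$$
q_{ij} = \frac{\left(1 + \Vert z_i - \mu_j \Vert^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \left(1 + \Vert z_i - \mu_{j'} \Vert^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}
$$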

where $z_i = f_\theta (x_i) \in Z$ corresponds to $x_i \in X$ after embedding, $\alpha$ (set to 1) are the degrees of freedom of the Student’s $t$-distribution and $q_{ij}$ can be interpreted as the probability of assigning sample $i$ to cluster $j$ (i.e., soft assignment).

### KL Divergence

DEC iteratively refines the clusters by learning from its high-confidence assignments with the auxiliary target distribution. The KL divergence is computed between the soft assignments $q_i$ and the auxiliary distribution $p_i$:
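The objective is:

$$
L = \mathrm{KL}(P \Vert Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$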

The choice of target distribution $P$ is critical.
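In the DEC paper, $P$ is obtained by raising $q_{ij}$ to the second power and normalizing by the soft cluster frequency:

$$
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}
$$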

where $f_j = \sum_i q_{ij}$ are soft cluster frequencies.
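Both distributions can be sketched in NumPy (function names are mine; $\alpha = 1$ by default, as in the notes):

```python
import numpy as np

def soft_assign(Z, mus, alpha=1.0):
    """q[i, j]: Student's t kernel between embedded point z_i and centroid mu_j."""
    d2 = ((Z[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """p[i, j]: square q and normalize by the soft cluster frequencies f_j."""
    w = q ** 2 / q.sum(axis=0)               # q_ij^2 / f_j
    return w / w.sum(axis=1, keepdims=True)

# Toy usage on random embeddings.
rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 3))
mus = rng.normal(size=(4, 3))
q = soft_assign(Z, mus)
p = target_distribution(q)
kl = np.sum(p * np.log(p / q))               # DEC's KL(P || Q) objective
```

In DEC itself, the gradient of this loss w.r.t. $z_i$ and $\mu_j$ is backpropagated through $f_\theta$; the sketch only shows the forward computation.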

# References

1. Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised Deep Embedding for Clustering Analysis. ICML.