Clustering Methods: A Note

Notes on clustering approaches.

Gaussian Mixture Models (GMMs)

EM algorithm

Let $x_i$ denote the fully observed variables for the $i$-th data case, and $z_i$ the hidden or missing variables. The objective is to maximize the log likelihood of the observed data:

$$\ell(\theta) = \sum_{i=1}^N \log p(x_i \mid \theta) = \sum_{i=1}^N \log \left[ \sum_{z_i} p(x_i, z_i \mid \theta) \right]$$

This is hard to optimize directly because the $\log$ cannot be pushed inside the inner sum.

EM defines the complete data log likelihood as:

$$\ell_c(\theta) \triangleq \sum_{i=1}^N \log p(x_i, z_i \mid \theta)$$

This cannot be computed either, since $z_i$ is unknown.

Then EM defines the expected complete data log likelihood as:

$$Q(\theta, \theta^{t-1}) = \mathbb{E}\left[\ell_c(\theta) \mid \mathcal{D}, \theta^{t-1}\right]$$

where $t$ is the current iteration number and $Q$ is the auxiliary function. The expectation is taken w.r.t. the old parameters $\theta^{t-1}$ and the observed data $\mathcal{D}$.

In the E step, the expected sufficient statistics (ESS) needed to evaluate $Q(\theta, \theta^{t-1})$ are computed.

In the M step, the $Q$ function is optimized w.r.t. $\theta$:

$$\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1})$$

To perform MAP estimation, the M step is modified as:

$$\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1}) + \log p(\theta)$$

  • The EM algorithm monotonically increases the log likelihood of the observed data (plus the log prior, when doing MAP); a generic sketch of the alternation is shown below.
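The following is a minimal generic sketch of this alternation in Python; `e_step`, `m_step`, and `log_likelihood` are user-supplied callables (an illustrative API, not from the original note):

import numpy as np

def em(data, theta, e_step, m_step, log_likelihood, n_iters=100, tol=1e-8):
    """Generic EM loop: alternate E and M steps until the observed-data
    log likelihood stops improving."""
    prev_ll = -np.inf
    for t in range(n_iters):
        ess = e_step(data, theta)         # E step: expected sufficient statistics
        theta = m_step(data, ess)         # M step: theta^t = argmax_theta Q(theta, theta^{t-1})
        ll = log_likelihood(data, theta)  # monotonically non-decreasing
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return theta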

EM for GMMs

Let $K$ be the number of mixture components.

Auxiliary function

The expected complete data log likelihood is:

$$Q(\theta, \theta^{t-1}) = \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log p(x_i \mid \theta_k)$$

where $r_{ik} \triangleq p(z_i = k \mid x_i, \theta^{t-1})$ is the responsibility that cluster $k$ takes for data point $i$, computed in the E step.

E step

The responsibilities are computed with Bayes' rule:

$$r_{ik} = \frac{\pi_k \, p(x_i \mid \theta_k^{t-1})}{\sum_{k^\prime=1}^K \pi_{k^\prime} \, p(x_i \mid \theta_{k^\prime}^{t-1})}$$

M step

In the M step, EM optimizes $Q$ w.r.t. $\pi$ and $\theta_k = (\mu_k, \Sigma_k)$:

$$\pi_k = \frac{1}{N} \sum_{i=1}^N r_{ik} = \frac{r_k}{N}$$

where $r_k \triangleq \sum_{i=1}^N r_{ik}$ is the weighted number of points assigned to cluster $k$.

The part of the log likelihood that depends on $\mu_k$ and $\Sigma_k$ is:

$$\ell(\mu_k, \Sigma_k) = -\frac{1}{2} \sum_{i=1}^N r_{ik} \left[ \log |\Sigma_k| + (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) \right]$$

Thus,

$$\mu_k = \frac{\sum_{i=1}^N r_{ik} x_i}{r_k}, \qquad \Sigma_k = \frac{\sum_{i=1}^N r_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}{r_k} = \frac{\sum_{i=1}^N r_{ik} x_i x_i^\top}{r_k} - \mu_k \mu_k^\top$$

After computing these new estimates, set $\theta^t = \{\pi_k, \mu_k, \Sigma_k\}$ for $k = 1:K$ and go to the next E step.

  • The mean of cluster $k$ is just the weighted average of all points assigned to cluster $k$.
  • The covariance is proportional to the weighted empirical scatter matrix.
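Putting the E and M steps together, below is a minimal NumPy sketch of one EM iteration for a full-covariance GMM (function and variable names are illustrative; SciPy provides the Gaussian density). The TensorFlow implementation that follows applies the same updates with diagonal covariances.

import numpy as np
from scipy.stats import multivariate_normal

def em_step_gmm(X, pi, mu, Sigma):
    """One EM iteration for a full-covariance GMM.
    X: (N, D) data; pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    N, D = X.shape
    K = len(pi)
    # E step: responsibilities r_{ik} proportional to pi_k * N(x_i | mu_k, Sigma_k)
    r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                  for k in range(K)], axis=1)          # (N, K)
    r /= r.sum(axis=1, keepdims=True)
    # M step: closed-form updates
    r_k = r.sum(axis=0)                                # weighted counts, shape (K,)
    pi_new = r_k / N
    mu_new = (r.T @ X) / r_k[:, None]                  # weighted means, shape (K, D)
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]                           # (N, D)
        Sigma_new[k] = (r[:, k, None] * diff).T @ diff / r_k[k]  # weighted scatter
    return pi_new, mu_new, Sigma_new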
import numpy as np
import tensorflow as tf


class GMM(Clustering):  # `Clustering` is a base class defined elsewhere in the author's code
    def __init__(self, dim, n_clusters, init_centroids=None, seed=2020):
        super(GMM, self).__init__(dim, n_clusters, seed)
        self.points = tf.compat.v1.placeholder(tf.float64, [None, dim], name='points')
        num_points = tf.shape(self.points)[0]
        # randomly choose n_clusters centroids from the data -> (n_clusters, D)
        if init_centroids is None:
            self.mu = tf.Variable(tf.slice(tf.random.shuffle(self.points, seed=seed), [0, 0], [n_clusters, -1]))
        else:
            self.mu = tf.Variable(init_centroids, dtype=tf.float64, name='init')

        # expand dim -> (1, n_points, D)
        xs_expanded = tf.expand_dims(self.points, 0)
        # (n_clusters, 1, D)
        centroid_expanded = tf.expand_dims(self.mu, 1)
        # init variances (diagonal covariances)
        self.var = tf.Variable(tf.cast(tf.ones([n_clusters, self.dim]), tf.float64) / n_clusters)
        # init mixture weights
        self.weight = tf.Variable(tf.cast(tf.fill([n_clusters], 1. / n_clusters), tf.float64))

        log_2piD = tf.constant(np.log(2 * np.pi) * dim, dtype=tf.float64)
        # E step: recompute the responsibilities w.r.t. the current parameters
        distances = tf.math.squared_difference(xs_expanded, centroid_expanded)
        distance_x_inv_var = tf.reduce_sum(distances / tf.expand_dims(self.var, 1), 2)
        log_coef = tf.expand_dims(log_2piD + tf.reduce_sum(tf.math.log(self.var), 1), 1)
        log_comp = -.5 * (log_coef + distance_x_inv_var)
        log_weighted = log_comp + tf.expand_dims(tf.math.log(self.weight), 1)
        # log-sum-exp trick: shift by the max before exponentiating
        log_shift = tf.reduce_max(log_weighted, axis=0, keepdims=True)
        weight = tf.exp(log_weighted - log_shift)
        weight_sum = tf.reduce_sum(weight, 0)
        gamma = weight / weight_sum

        # M step: maximize the parameters w.r.t. the computed responsibilities
        gamma_sum = tf.reduce_sum(gamma, 1)
        gamma_weighted = gamma / tf.expand_dims(gamma_sum, 1)
        gamma_expanded = tf.expand_dims(gamma_weighted, 2)
        self.mu_ = tf.reduce_sum(xs_expanded * gamma_expanded, 1)
        distances = tf.math.squared_difference(xs_expanded, tf.expand_dims(self.mu_, 1))
        self.var_ = tf.reduce_sum(distances * gamma_expanded, 1)
        self.weight_ = gamma_sum / tf.cast(num_points, tf.float64)

        ll = tf.reduce_sum(tf.math.log(weight_sum)) + tf.reduce_sum(log_shift)
        self.mean_ll = ll / tf.cast(num_points * dim, tf.float64)

        self.train_op = tf.group(
            self.mu.assign(self.mu_),
            self.var.assign(self.var_),
            self.weight.assign(self.weight_)
        )

    def update(self, points, n_iters=1000, TOLERANCE=1e-8):
        prev_ll = -np.inf
        with tf.compat.v1.Session() as sess:
            # feed the points so that the placeholder-dependent initializer
            # (random centroids) can be evaluated
            sess.run(tf.compat.v1.global_variables_initializer(),
                     feed_dict={self.points: points})
            for i in range(n_iters):
                cur_ll, _ = sess.run(
                    [self.mean_ll, self.train_op],
                    feed_dict={self.points: points}
                )

                if i > 0:
                    difference = np.abs(cur_ll - prev_ll)
                    print(f'GMM step-{i}:\t mean likelihood {cur_ll} \t difference {difference}')
                    if difference < TOLERANCE:
                        break
                else:
                    print(f'GMM step-{i}:\t mean likelihood {cur_ll}')
                prev_ll = cur_ll
            mu = self.mu.eval(sess)
            var = self.var.eval(sess)
        return mu, var


if __name__ == '__main__':
    n_points = 200
    n_clusters = 3
    D = 100
    points = np.random.uniform(0, 10, (n_points, D))  # random data
    cluster = GMM(D, n_clusters, init_centroids=None)
    mu, var = cluster.update(points)

Multiplying many small probabilities can cause arithmetic underflow, i.e. the product becomes smaller than the smallest representable floating-point number and evaluates to $0$.

To circumvent this, a common trick called the log-sum-exp trick is used:

$$\log \sum_{c} e^{b_c} = \log \left[ \left( \sum_c e^{b_c - B} \right) e^{B} \right] = \left[ \log \left( \sum_c e^{b_c - B} \right) \right] + B, \quad B = \max_c b_c$$
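A minimal NumPy sketch of the trick (the shift $B$ makes the largest exponent zero, so the sum cannot underflow to $0$):

import numpy as np

def log_sum_exp(b, axis=-1):
    """Compute log(sum(exp(b))) in a numerically stable way."""
    b = np.asarray(b, dtype=float)
    B = np.max(b, axis=axis, keepdims=True)  # shift by the largest exponent
    return np.squeeze(B, axis=axis) + np.log(np.sum(np.exp(b - B), axis=axis))

# naive np.log(np.sum(np.exp([-1000., -1001.]))) underflows and returns -inf,
# while log_sum_exp([-1000., -1001.]) returns approximately -999.69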

K-means

The K-means algorithm is a variant of the EM algorithm for GMMs. It can be seen as a GMM with the assumptions that $\pi_k = 1/K$ and $\Sigma_k = \sigma^2 I_D$ are fixed, so that only the cluster centers $\mu_k \in \mathbb{R}^D$ are estimated.

In the E step,

$$p(z_i = k \mid x_i, \theta) \approx \mathbb{I}(k = z_i^*)$$

where $z_i^* = \arg\max_k p(z_i = k \mid x_i, \theta)$. This is also called hard EM, since K-means makes a hard assignment of points to clusters.

E step

Since an equal spherical covariance matrix is assumed, the most probable cluster for $x_i$ can be computed using the Euclidean distance between the $N$ data points and the $K$ cluster centroids:

$$z_i^* = \arg\min_k \Vert x_i - \mu_k \Vert_2^2$$

This is equivalent to minimizing the pairwise squared deviations of points within the same cluster.

M step

Given the hard cluster assignments, the M step updates each cluster centroid by computing the mean of all points assigned to it:

$$\mu_k = \frac{1}{N_k} \sum_{i : z_i^* = k} x_i$$

where $N_k$ is the number of points assigned to cluster $k$.
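Below is a minimal NumPy sketch of one K-means iteration under these updates (names are illustrative; the TensorFlow implementation that follows has the same structure):

import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration. X: (N, D) data; centroids: (K, D)."""
    # E step: assign each point to its nearest centroid (squared Euclidean distance)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, K)
    assignments = dists.argmin(axis=1)                              # (N,)
    # M step: recompute each centroid as the mean of its assigned points
    # (an empty cluster keeps its previous centroid)
    new_centroids = np.array([
        X[assignments == k].mean(axis=0) if np.any(assignments == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return new_centroids, assignments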

import numpy as np
import tensorflow as tf


class KMeans(object):
    """ KMeans clustering """

    def __init__(self, points, n_clusters, n_iter=300):
        self.points = points
        self.n_clusters = n_clusters
        self.n_iter = n_iter
        # randomly choose n_clusters centroids from the data -> (n_clusters, D)
        self.centroids = tf.Variable(tf.slice(tf.random.shuffle(points), [0, 0], [n_clusters, -1]))
        # expand dim -> (1, n_points, D)
        xs_expanded = tf.expand_dims(points, 0)
        # (n_clusters, 1, D)
        centroid_expanded = tf.expand_dims(self.centroids, 1)
        # squared distance between points and centroids -> (n_clusters, n_points)
        distances = tf.reduce_sum(tf.square(tf.subtract(xs_expanded, centroid_expanded)), -1)
        # E step: assign each point to its nearest centroid -> (n_points,)
        self.assignments = tf.argmin(distances, 0)
        # M step: mean of the points assigned to each cluster
        means = []
        for c in range(n_clusters):
            idx = tf.reshape(tf.where(tf.equal(self.assignments, c)), [1, -1])
            c_points = tf.gather(points, idx)
            means.append(tf.reduce_mean(c_points, axis=[1]))
        # update centroids -> (n_clusters, D)
        new_centroids = tf.concat(means, 0)
        self.update_centroids = tf.compat.v1.assign(self.centroids, new_centroids)

    def update(self):
        """
        KMeans iterate
        :return:
            points_vals: data points
            centroid_vals: centroid points
            assignment_vals: cluster categories
        """
        with tf.compat.v1.Session() as sess:
            sess.run(tf.compat.v1.global_variables_initializer())
            for step in range(self.n_iter):
                _, centroid_vals, points_vals, assignment_vals = sess.run(
                    [self.update_centroids, self.centroids, self.points, self.assignments])
                # print(f'centroids: {centroid_vals}')
        return centroid_vals, points_vals, assignment_vals


if __name__ == '__main__':
    n_points = 200
    n_clusters = 3
    n_iter = 100
    D = 100
    points = tf.constant(np.random.uniform(0, 10, (n_points, D)))  # random data
    cluster = KMeans(points, n_clusters, n_iter)
    centroid_vals, points_vals, assignment_vals = cluster.update()

Deep Embedded Clustering (DEC)

The goal is to cluster a set of $n$ points $\{x_i\}_{i=1}^n$ into $k$ clusters, each represented by a centroid $\mu_j$, $j = 1, \cdots, k$.

First, apply a non-linear mapping $f_\theta: X \rightarrow Z$, where $\theta$ are learnable parameters and $Z$ is the latent feature space.

Deep Embedded Clustering (DEC)[1] has two steps:

  1. parameter initialization with a deep autoencoder;
  2. clustering, by computing an auxiliary target distribution and minimizing the KL divergence to it.

DEC Clustering

Given initial cluster centroids $\{\mu_j\}_{j=1}^k$, DEC first computes a soft assignment between the embedded points and the cluster centroids; it then updates the deep mapping $f_\theta$ with Stochastic Gradient Descent (SGD) and refines the cluster centroids using an auxiliary target distribution. This process is repeated until convergence; a schematic training loop is sketched below.
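As a schematic Python sketch of this alternation (the helpers `soft_assign`, `target_distribution`, and `kl_gradient_step` are hypothetical, user-supplied callables corresponding to the formulas in the next two subsections; assignments are assumed to be NumPy arrays):

def dec_cluster(encoder, centroids, data, soft_assign, target_distribution,
                kl_gradient_step, n_iters=10000, update_interval=100, tol=0.001):
    """Schematic DEC clustering loop: alternate target-distribution updates
    and KL-minimization gradient steps."""
    prev_assign, p = None, None
    for it in range(n_iters):
        if it % update_interval == 0:
            q = soft_assign(encoder(data), centroids)   # soft assignments q_ij
            p = target_distribution(q)                  # auxiliary target p_ij
            assign = q.argmax(axis=1)
            # stop when only a small fraction of points change cluster
            if prev_assign is not None and (assign != prev_assign).mean() < tol:
                break
            prev_assign = assign
        # update encoder parameters and centroids by SGD on KL(P || Q)
        encoder, centroids = kl_gradient_step(encoder, centroids, data, p)
    return encoder, centroids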

Soft Assignment

DEC applies the Student’s $t$-distribution as a kernel to measure the similarity between embedded point $z_i$ and centroid $\mu_j$:

$$q_{ij} = \frac{(1+\Vert z_i - \mu_j \Vert^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^\prime} (1+\Vert z_i - \mu_{j^\prime} \Vert^2 / \alpha)^{-\frac{\alpha+1}{2}}}$$

where $z_i = f_\theta(x_i)$ corresponds to $x_i$ after embedding, $\alpha$ (set to 1) is the degrees of freedom of the Student’s $t$-distribution, and $q_{ij}$ can be interpreted as the probability of assigning sample $i$ to cluster $j$ (i.e., a soft assignment).

KL Divergence

DEC iteratively refines the clusters by learning from high-confidence assignments with the help of an auxiliary target distribution. The KL divergence is computed between the soft assignments $q_i$ and the auxiliary distribution $p_i$:

$$L = \mathrm{KL}(P \Vert Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The choice of the target distribution $P$ is critical. DEC squares and renormalizes the soft assignments:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j^\prime} q_{ij^\prime}^2 / f_{j^\prime}}$$

where $f_j = \sum_i q_{ij}$ are the soft cluster frequencies.

"""
q_{ij} = \frac{(1+\Vert z_i - u_j \Vert^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^\prime} (1+\Vert z_i - u_{j^\prime} \Vert^2 / \alpha)^{-\frac{\alpha+1}{2}}}
"""
# tensorflow code
# given data variables `zs` -> shape (?, D) ; `us` -> shape (K, D)
# where D is the dimension, K is the cluster number
zs = tf.expand_dims(zs, 1) # (?, 1, D)
us = tf.expand_dims(us, 0) # (1, K, D)
q = tf.pow(tf.reduce_sum(tf.math.squared_difference(zs, us), axis=-1) / model_config.alpha + 1, - (model_config.alpha + 1) / 2)
q /= tf.reduce_sum(q, axis=1, keepdims=True)


"""
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j^\prime q_{ij}^2 / f_{j^\prime}}}
"""
# norm along seq axis
p = tf.square(q) / tf.reduce_sum(q, axis=-1, keepdims=True)
# norm along cluster axis
p /= tf.reduce_sum(p, axis=1, keepdims=True)
kl_div = tf.reduce_sum(p * tf.math.log(p / q))
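To train with this loss, the target distribution is typically held fixed within each gradient step while the encoder parameters and the centroids are updated. A minimal hedged sketch in the same TF1 graph style (the Adam optimizer here is only an example; substitute any optimizer):

# treat the target distribution as constant during each gradient step
# (it is only recomputed periodically, as described above)
p = tf.stop_gradient(p)
kl_div = tf.reduce_sum(p * tf.math.log(p / q))
# minimize KL(P || Q) w.r.t. the encoder parameters and the centroids `us`
train_op = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3).minimize(kl_div)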

References


  1. Xie, J., Girshick, R.B., & Farhadi, A. (2015). Unsupervised Deep Embedding for Clustering Analysis. ICML.
  2. K-means clustering