K-means clustering with the Euclidean distance inherently assumes that each pair of clusters is linearly separable, which may not be the case in practice. In this problem you will derive a strategy for dealing with this limitation that we did not discuss in class. Specifically, you will show that like so many other algorithms we have discussed in class, K-means can be “kernelized.” In the following,
we consider a dataset $\{x_i\}_{i=1}^N$.
- Let $z_{ij} \triangleq \mathbb{1}\{x_i \in \mathcal{C}_j\}$, where $\mathcal{C}_j$ denotes the $j$th cluster. Show that the rule for updating the $j$th cluster center $m_j$ given this cluster assignment can be expressed as
$$m_j = \sum_{i=1}^{N} \alpha_{ij} x_i \tag{1}$$
by expressing $\alpha_{ij}$ as a function of the $z_{ij}$'s.
- Given two points $x_1$ and $x_2$, show that $\|x_1 - x_2\|_2^2$ can be computed using only linear combinations of inner products.
- Given the results of the previous parts, show how to compute $\|x_i - m_j\|_2^2$ using only (linear combinations of) inner products between the data points $\{x_i\}_{i=1}^N$.
- Describe how to use the results from the previous parts to “kernelize” the K-means clustering algorithm described in class; a sketch of the resulting procedure appears below.
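As an illustration only (not part of the required derivation), here is a minimal NumPy sketch of the resulting procedure. The function name, initialization, and stopping rule are our own choices; the only assumption is a precomputed kernel matrix $K$ with entries $K_{il} = k(x_i, x_l)$.

```python
import numpy as np

def kernel_kmeans(K, k, n_iters=100, seed=0):
    """Kernel K-means on a precomputed N x N kernel (Gram) matrix K.

    Cluster centers are never formed explicitly; distances to them are
    computed from inner products alone, as derived in the parts above.
    """
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    labels = rng.integers(0, k, size=N)           # random initial assignment
    for _ in range(n_iters):
        Z = np.eye(k)[labels]                     # N x k indicator matrix (z_ij)
        # alpha_ij = z_ij / sum_l z_lj, the weights from Eq. (1)
        alpha = Z / np.maximum(Z.sum(axis=0), 1)  # guard against empty clusters
        # ||phi(x_i) - m_j||^2 = K_ii - 2 (K alpha)_ij + (alpha^T K alpha)_jj
        dists = (np.diag(K)[:, None]
                 - 2.0 * K @ alpha
                 + np.diag(alpha.T @ K @ alpha)[None, :])
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # assignments stabilized
            break
        labels = new_labels
    return labels
```

Because only entries of $K$ are ever used, the same code handles any positive-semidefinite kernel (e.g., Gaussian), which is exactly what makes the approach applicable to clusters that are not linearly separable.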
In this problem we consider the scenario seen in class, where x is drawn uniformly on [−1, 1] and y = sin(πx), for which we are given N = 2 training samples. Here, we will consider an alternative approach to fitting a line to the data based on Tikhonov regularization. Specifically, we let
$$y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \qquad A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \end{bmatrix}, \qquad \theta = \begin{bmatrix} b \\ a \end{bmatrix}. \tag{2}$$
We will then consider Tikhonov-regularized least-squares estimators of the form
$$\hat{\theta} \triangleq (A^\intercal A + \Gamma^\intercal \Gamma)^{-1} A^\intercal y. \tag{3}$$
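For reference (a standard fact about Tikhonov regularization, stated here for clarity rather than given in the problem), (3) is the closed-form minimizer of
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^2} \; \|A\theta - y\|_2^2 + \|\Gamma\theta\|_2^2,$$
and this optimization problem remains solvable (though not via (3)) even when $A^\intercal A + \Gamma^\intercal \Gamma$ is singular; the hint in the first part below refers to exactly this situation.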
- How should we set $\Gamma$ to reduce this estimator to fitting a constant function (i.e., finding an $h(x)$ of the form $h(x) = b$)? (Hint: For the purposes of this problem, it is sufficient to set $\Gamma$ in a way that just makes $a \approx 0$. Making $a = 0$ exactly requires setting $\Gamma$ in a way that makes the matrix $A^\intercal A + \Gamma^\intercal \Gamma$ singular, but note that this does not mean that the regularized least-squares optimization problem cannot be solved; you must just use a different formula than the one in (3).)
- How should we set $\Gamma$ to reduce this estimator to fitting a line of the form $h(x) = ax + b$ that passes through the observed data points $(x_1, y_1)$ and $(x_2, y_2)$?
- (Optional) Play around and see if you can find a (diagonal) matrix $\Gamma$ that results in a smaller risk than either of the two approaches from the previous parts. You will need to do this numerically using Python or MATLAB; one possible starting point is sketched below. Report the $\Gamma$ that gives you the best results. (You can restrict your search to diagonal $\Gamma$ to simplify this.)
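As a starting point for the optional part (an illustrative sketch only: the Monte Carlo risk estimate, grid range, and trial count are arbitrary choices of ours, not specified by the problem), one possible search in Python:

```python
import numpy as np

def risk(gamma, n_trials=20000, seed=0):
    """Monte Carlo estimate of the risk E[(h(x) - sin(pi x))^2] for the
    Tikhonov estimator (3) with Gamma = diag(gamma)."""
    rng = np.random.default_rng(seed)
    G = np.diag(gamma)
    total = 0.0
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=2)       # N = 2 training samples
        y = np.sin(np.pi * x)
        A = np.column_stack([np.ones(2), x])     # A as in Eq. (2)
        theta = np.linalg.solve(A.T @ A + G.T @ G, A.T @ y)  # Eq. (3)
        x_test = rng.uniform(-1.0, 1.0)          # fresh test point
        h = theta[0] + theta[1] * x_test         # h(x) = b + a x
        total += (h - np.sin(np.pi * x_test)) ** 2
    return total / n_trials

# Crude grid search over diagonal Gamma = diag(g0, g1).
grid = np.linspace(0.0, 2.0, 11)
g_best = min(((g0, g1) for g0 in grid for g1 in grid),
             key=lambda g: risk(np.array(g)))
print("best diagonal Gamma found:", np.diag(g_best))
```

Since `risk` reuses the same seed for every candidate $\Gamma$, all candidates are compared on common random draws, which substantially reduces the noise in the grid search.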