Unsupervised Learning HW2

All homeworks (including this one) must be typeset and submitted in PDF format. Late homeworks or handwritten solutions will not be accepted. You must include your name and UNI in your homework submission. To receive credit, a typeset copy of the homework PDF must be uploaded to Gradescope by the due date. You must show your work to receive full credit. Discussing possible solutions to homework questions is encouraged on the course discussion board and with your peers, but everyone must write their own individual solutions. You must cite all external references you used (including the names of individuals you discussed the solutions with) to complete the homework.
1 Max Diameter Clustering
Consider clustering a set of n datapoints x_1, . . . , x_n in a metric space (X, ρ) into two clusters so as to minimize the maximum cluster diameter. (The diameter of a subset S of a metric space (X, ρ) is defined as sup_{x,x′∈S} ρ(x, x′).) Show that the “max diameter” decision problem for two clusters reduces to 2-SAT. That is, formally, show that D ≤_P 2-SAT, where the problem D is defined as
Input: x_1, . . . , x_n ∈ X, and r ∈ R
Output: Yes or No. Yes if and only if there exists a clustering of the n points into two clusters, each with diameter at most r.
(Hint: if a point x_i is assigned to cluster 1, then no point x_j at distance > r from x_i can be assigned to cluster 1.)
It is instructive to note that 2-SAT is “easy” (that is, 2-SAT ∈ P), which implies that D is too.
2 Mass-Spring Embeddings
In this problem, we work with a connected graph G. Consider the following one-dimensional graph embedding, motivated by the mass-spring problem. Each vertex corresponds to a point mass, and each edge corresponds to a spring (assume unit masses and unit spring constants). Let f ∈ R^V define an embedding of G into R^d. To define the problem, we need two concepts from physics:
• energy: a spring that is stretched a distance ℓ contributes ℓ² units of energy to the total spring system. Thus, if each spring e ∈ E is stretched to length ℓ_e, the total energy E required is
E = Σ_{e∈E} ℓ_e².
• moment of inertia: let L be a line passing through the center of mass. The moment of inertia with respect to L, denoted I_L, is defined as the sum of the points' squared distances to L. Sometimes we want axis-independence (e.g. when d = 1). The polar moment of inertia J is defined without reference to an axis: it is the sum of squared distances to the center of mass:
I_L = Σ_{i∈V} x_{i,L}²,   J = Σ_{i∈V} x_i²,
where x_{i,L} is the distance of point i from the line L and x_i is its distance from the center of mass.
These two concepts are quite useful from a clustering/embeddings perspective. Since more energy is required to separate vertices that share many edges, we want to minimize the energy, to keep similar points together. On the other hand, maximizing the moment of inertia allows us to discern different points better. Also, to make the embedding balanced, we want the moment of inertia to be equal with respect to any line L passing through the center of mass.
Thus, our optimization question: can we embed the graph in such a way that keeps similar points together (minimizes energy) while separating natural clusters (maximizes moment of inertia)? In particular, suppose that all the masses are initially at 0 (i.e. the position is described by 0 ∈ R^V, so the initial moment of inertia is 0). Prove a tight lower bound on the energy required to increase the polar moment of inertia of the spring system from 0 to 1, while keeping the moment of inertia equal with respect to all axes through the center of mass. What embedding f ∈ R^V achieves this bound? Hint: can you write the energy in terms of the Laplacian? For which f ∈ R^V is the center of mass at 0?
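As a worked form of the hint (a standard Laplacian identity, using only the definitions above; here L denotes the unweighted Laplacian L = D − A, not a line), for a one-dimensional embedding the stretched length of edge e = (u, v) is |f_u − f_v|:

\[
E \;=\; \sum_{(u,v)\in E} (f_u - f_v)^2 \;=\; f^{\top} L f ,
\qquad
\mathbf{1}^{\top} f = 0
\;\Longrightarrow\;
J \;=\; \sum_{v\in V} f_v^2 \;=\; \lVert f \rVert^2 ,
\]

where 1^T f = 0 is exactly the condition that the (unit-mass) center of mass sits at 0.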
3 Discovering Butterfly Species
For this problem, feel free to use any external libraries for ML, graph analysis, etc. As always, make sure to cite all external sources and references. The purpose of this problem is to gain hands-on programming experience with many clustering algorithms, so the questions are very open-ended. Don't worry about getting the “right” answer; our goal is to get a reasonable output.
In this problem, we apply some unsupervised learning techniques to real-world data. Download the files SS-Butterfly_labels.tsv.gz and SS-Butterfly_weights.tsv.gz.
To construct the graph, the networkx package in Python is useful:
import networkx as nx
# Build a weighted, undirected graph from the tab-separated edge list.
G = nx.read_weighted_edgelist('SS-Butterfly_weights.tsv', nodetype=int)
3.1 Visualizing the Data
There are many ways to visualize the graph. Since the focus of this question is not on dimension
reduction, we can use a built-in graph drawer. The details of how the drawer works can be found
in https://schneide.blog/tag/fruchterman-reingold/: it is an extension of the
“spring” idea from the previous question. However, the embedding is not aware of the edge weights.
pos = nx.spring_layout(G)  # dict: node -> 2D coordinates
Now, pos is a dictionary of the 2D embeddings of each point.
We obtain a figure by coloring each point according to its species, or “true cluster”, as provided in SS-Butterfly_labels.tsv.gz. (Since we are using unsupervised learning techniques, we should not use the label data during the clustering process itself.) Since many of the clusters are mixed, it does not seem feasible to run clustering directly on this 2D embedding. Still, points belonging to the same cluster are mapped to nearby areas, so this can be a useful tool to informally check how well our clustering algorithm works.
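A minimal sketch of this coloring step, assuming the labels file consists of '#'-prefixed comment lines followed by tab-separated (node id, integer species id) pairs; the column names node and species are hypothetical:

import matplotlib.pyplot as plt
import pandas as pd

# Assumed format: '#' comment lines, then one "node<TAB>species" pair per line.
labels = pd.read_csv('SS-Butterfly_labels.tsv', sep='\t', comment='#',
                     header=None, names=['node', 'species'])
species = dict(zip(labels['node'], labels['species']))

# Draw nodes only (the graph is dense, so drawing all edges clutters the plot).
nx.draw_networkx_nodes(G, pos, node_size=20,
                       node_color=[species[v] for v in G.nodes()], cmap='tab20')
plt.axis('off')
plt.show()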
(1a) What metric would you use to evaluate the performance of your clustering algorithm? You
may use information on the “true clusters”, but you don’t have to. Explain the pros and cons
of your metric.
3.2 Preprocessing
The input graph is very dense. There are two ways to sparsify it. First, we can make an ε-neighborhood graph, where we remove edges whose similarities are less than ε. Alternatively, we can make a k-nearest-neighbor graph.
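For concreteness, here is a minimal sketch of the two basic sparsifiers on a networkx graph (this is not the combination asked for in (2b); eps and k are hypothetical parameters):

import heapq

def epsilon_graph(G, eps):
    # Keep only edges whose similarity (edge weight) is at least eps.
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    H.add_weighted_edges_from((u, v, w)
                              for u, v, w in G.edges(data='weight') if w >= eps)
    return H

def knn_graph(G, k):
    # Keep, for each node, its k highest-similarity incident edges.
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    for u in G.nodes():
        top = heapq.nlargest(k, G[u].items(), key=lambda kv: kv[1]['weight'])
        H.add_weighted_edges_from((u, v, d['weight']) for v, d in top)
    return H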
(2a) What are the downsides of using ε-neighborhood graphs only, and of using k-nearest-neighbor graphs only?
(2b) Implement a “combination” of the above two methods that accounts for the problems you identified in (2a) to make a sparser graph.
Submit the spring layout of your graph, colored with the “true clusters”.
3.3 Spectral Clustering
We can perform spectral clustering based on the similarity matrix. However, there is a problem: we don't know the optimal number of clusters! We have to use heuristics to find K, the optimal number of clusters. One method we can use is the eigengap heuristic: list the eigenvalues λ_1, ..., λ_n of the generalized eigensystem in nondecreasing order, and choose the value of K that maximizes λ_{K+1} − λ_K.
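A minimal sketch of the eigengap heuristic, using the normalized Laplacian (whose spectrum coincides with the generalized eigenvalues of (L, D)); the cap k_max on candidate values of K is a hypothetical choice:

import numpy as np

# Eigenvalues of the normalized Laplacian, in nondecreasing order.
lap = nx.normalized_laplacian_matrix(G).toarray()
eigvals = np.linalg.eigvalsh(lap)

k_max = 20  # hypothetical cap on the number of clusters considered
gaps = np.diff(eigvals[:k_max])  # gaps[K-1] = lambda_{K+1} - lambda_K
K = int(np.argmax(gaps)) + 1     # the K maximizing the eigengap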
(3a) Plot the results of your clustering algorithm in (1) the spring layout you previously computed,
and (2) the 2-dimensional spectral layout of the graph.
(3b) Evaluate the performance of your algorithm using the metrics you mentioned in part 1a.
3.4 Agglomerative Clustering
In this subsection, we implement agglomerative clustering, also called hierarchical clustering. Here, we start with each data point as its own cluster, and we merge pairs of clusters until all elements belong to the same cluster. At each step, the two clusters that are closest to each other are combined. The result can be presented in a tree diagram called a dendrogram, which shows the sequence of clusters that are fused as well as the cluster distance involved in each fusion. An example dendrogram construction can be found in
https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html
What criteria can we use to choose which two clusters to merge? We need to specify how the distance between two clusters is computed. For example, in single-linkage clustering, the distance is defined to be the distance of the closest pair of elements belonging to different clusters. (Note that single linkage is equivalent to Kruskal's algorithm for computing minimum spanning trees!) In complete-linkage clustering, the farthest pair of elements is used instead.
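A minimal sketch of both linkages with scipy, assuming similarities are converted to distances by subtracting them from the maximum similarity (that conversion is an assumption, not part of the problem statement):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Dense similarity matrix of the graph; entries are 0 where there is no edge.
S = nx.to_numpy_array(G)
D = S.max() - S           # assumed similarity-to-distance conversion
np.fill_diagonal(D, 0.0)  # squareform requires a zero diagonal

Z_single = linkage(squareform(D), method='single')
Z_complete = linkage(squareform(D), method='complete')
dendrogram(Z_complete, truncate_mode='lastp', p=30)  # a simplified dendrogram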
(4a) Plot the results of your clustering algorithm in (1) the spring layout you previously computed, and in (2) a simplified dendrogram (not all data points need to be shown, as in the dendrogram at the example link above).
(4b) Explain how we can heuristically choose the number of clusters in agglomerative clustering.
Evaluate the performance of your clustering algorithm after you use this heuristic.
3.5 Density-Based Clustering
Density-based clustering algorithms collect points in “dense” areas together to form a bigger cluster. One example is DBSCAN (density-based spatial clustering of applications with noise), which, as the name suggests, differs from the other clustering algorithms we have seen because it also detects noise. Given hyperparameters ε and m, the algorithm classifies points as follows:
• A core point has at least m points in its ε-neighborhood.
• A border point has fewer than m points in its ε-neighborhood, but that neighborhood includes a core point.
• A noise point is a point that is neither a core point nor a border point.
After making this classification, we build a graph on the core points with an edge between two points whenever they are within distance ε of each other. The connected components of this graph form the clusters of core points. A border point joins the cluster of any core point within its ε-neighborhood; noise points belong to no cluster.
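A minimal sketch of the core/border/noise classification on a precomputed pairwise distance matrix D (whether a point counts itself in its own ε-neighborhood is a convention; here it does not, which is an assumption):

import numpy as np

def classify_points(D, eps, m):
    # D: (n, n) symmetric pairwise distance matrix.
    n = D.shape[0]
    nbr = (D <= eps) & ~np.eye(n, dtype=bool)  # eps-neighborhoods, excluding self
    core = nbr.sum(axis=1) >= m
    # Border: not core, but some core point lies in the eps-neighborhood.
    border = ~core & (nbr & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise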
(5a) Plot the results of your clustering algorithm in the spring layout you previously computed.
Mark the noise points separately.
(5b) Plot the border points: you can plot them on the previous part’s plot using a different format,
or submit a new plot. Is the plot consistent with the name “border” point? Evaluate the
performance of your clustering algorithm.
Submission Instructions for Question 3.
All of your plots, calculations, and analysis should be included in your pdf submission. Submit
all of your code in either (i) four .py files, one each for parts 3.2, 3.3, 3.4, 3.5; or (ii) a single
.ipynb notebook that has all the plots inside. Your .ipynb notebook should run sequentially, i.e. produce no errors under “Restart and Run All”. Submit your code as per the instructions on the course discussion board.
4 Semidefinite Programming
Recall that linear programs (LP) are optimization problems of the following standard form:
max_x c^T x
subject to Ax = b
x ≥ 0
Here, we explore a similar class of optimization problems, called semidefinite programs (SDP). We
make some changes to the LP problem:
1. Instead of a vector x in the vector space R^n, we use a matrix X in the vector space S^n of symmetric n × n matrices.
2. For the inner product, instead of c^T x we use the trace: Tr(CX) = Σ_{i,j∈[n]} C_ij X_ij.
3. For the positivity constraint, instead of the componentwise x ≥ 0, we use positive semidefiniteness of X, denoted X ⪰ 0.
We can now define the standard form for SDP:
max_X Tr(CX)
subject to A(X) = b
X ⪰ 0
where C is a symmetric n × n matrix and A : S^n → R^m is a linear operator. Alternatively, we can write the constraints as Tr(A_i X) = b_i for all i ∈ [m], where each A_i is a symmetric n × n matrix.
(i) The (2-way) Max Cut problem aims to maximize the weight of edges crossing a cut (S, V − S) of a weighted graph G(V, E) with weighted adjacency matrix W. Show that we can solve the Max Cut problem by partitioning the vertices according to the signs of the first row of X in the rank-constrained SDP below:
max_X Tr(−WX)
subject to diag(X) = 1
X ⪰ 0, rank(X) = 1
(ii) The above formulation is NP-hard to solve because of the rank constraint. We instead work with the relaxation of the problem in which the rank constraint is dropped. How do we recover the partition from X? First, we decompose X = V^T V. The method we use is randomized hyperplane rounding: we take a random vector h from the unit sphere and partition the vertices according to the signs of V^T h. Prove that, in expectation, the resulting weight of the cut is at least 7/8 of the optimal.
Hint: Let v_i be the i-th column of V. Consider a 0-centered circle containing v_i, v_j and use θ/π ≥ (7/8) · (1/2)(1 − cos θ).
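A minimal numpy sketch of the rounding step, assuming the relaxation has already been solved for X (the eigendecomposition below is one standard way to obtain a factor V with X = V^T V):

import numpy as np

def hyperplane_round(X, rng=None):
    rng = rng or np.random.default_rng()
    # Factor the PSD matrix X as X = V^T V via its eigendecomposition.
    w, U = np.linalg.eigh(X)
    V = np.sqrt(np.clip(w, 0, None))[:, None] * U.T
    # Random hyperplane through the origin: h uniform on the unit sphere.
    h = rng.standard_normal(X.shape[0])
    h /= np.linalg.norm(h)
    return np.sign(V.T @ h)  # +1/-1 side assignment for each vertex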
(iii) Let G be a clique of 8 vertices, with edge weights all 1. For the SDP relaxation, report (a) the optimal cut value, (b) the optimal X for the SDP, and (c) the expected size of the cut created. You may use a convex optimization library, such as cvx in Matlab or cvxopt in Python. For (c), you can approximate using 10000 trials. Alternatively, you may solve everything by hand. Submit your code or derivation in your pdf submission.
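For reference, a minimal sketch of setting up this relaxation in cvxpy (a different Python library from the suggested cvxopt; using it is an assumption, and any SDP-capable solver would do):

import cvxpy as cp
import numpy as np

n = 8
W = np.ones((n, n)) - np.eye(n)  # clique K_8 with unit edge weights

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0, cp.diag(X) == 1]
prob = cp.Problem(cp.Maximize(cp.trace(-W @ X)), constraints)
prob.solve()
# prob.value is the optimal Tr(-W X); reuse the rounding sketch above,
# over e.g. 10000 random h, to estimate the expected cut size for (c).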
5 Of Vectors and (Random) Matrices
Given a set of n vectors {x_1, . . . , x_n} = X ⊂ R^D, show the following.
Pick any ε > 0 and d ∈ Z_+. Let G be a d × D matrix with entries i.i.d. N(0, 1/d), and let c be a universal constant.
(i) Show that for any fixed pair of distinct vectors x_i and x_j in X,
Pr_G[ ‖Gx_i − Gx_j‖² > (1 + ε)² ‖x_i − x_j‖² ] < e^{−cdε²}.
(Hint: any non-zero vector v can be written as (v/‖v‖) · ‖v‖.)
(ii) Show that:
Pr_G[ ∃ a distinct pair x_i and x_j in X s.t. ‖Gx_i − Gx_j‖² > (1 + ε)² ‖x_i − x_j‖² ] < n² e^{−cdε²}.
(Hint: Recall the union bound, Pr[A or B] ≤ Pr[A] + Pr[B].)
You don't need to show the following, but it also holds true:
Pr_G[ ∃ a distinct pair x_i and x_j in X s.t. ‖Gx_i − Gx_j‖² < (1 − ε)² ‖x_i − x_j‖² ] < n² e^{−cdε²}.
(iii) Find a good (i.e. the smallest) value of d (as a function of n, ε, and c) for which the following statement holds true.
With probability at least 3/4 (over the random draw of G with entries i.i.d. N(0, 1/d)), for all x_i and x_j in X,
(1 − ε)² ‖x_i − x_j‖² ≤ ‖Gx_i − Gx_j‖² ≤ (1 + ε)² ‖x_i − x_j‖².
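A quick numerical illustration of the statement in (iii); the sizes n, D, d below are hypothetical and chosen only for the demo:

import numpy as np

rng = np.random.default_rng(0)
n, D, d = 50, 1000, 200  # hypothetical sizes
X = rng.standard_normal((n, D))
G = rng.standard_normal((d, D)) / np.sqrt(d)  # entries i.i.d. N(0, 1/d)

# Distortion ||G(x_i - x_j)|| / ||x_i - x_j|| over all distinct pairs.
ratios = [np.linalg.norm(G @ (X[i] - X[j])) / np.linalg.norm(X[i] - X[j])
          for i in range(n) for j in range(i + 1, n)]
print(min(ratios), max(ratios))  # both should be close to 1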