# スタンフォード/CS131/宿題5-1 クラスタ化アルゴリズム

Stanford University/CS131/宿題5-1をやる。今回の宿題はClustering Algorithm(クラスタ化アルゴリズム)のK-Means Clustering(k平均クラスタ化)とHierarchical Agglomerative Clustering(階層的凝集型クラスタリング)をカバーしている。

スポンサーリンク

## Introduction¶

In this assignment, you will use clustering algorithms to segment images. You will then use these segmentations to identify foreground and background objects.

• Clustering algorithms: Implement K-Means clustering and Hierarchical Agglomerative Clustering.
k平均法/階層的凝集クラスタリング実装

• Pixel-level features: Implement a feature vector that combines color and position information and implement feature normalization.
色と位置情報を結合する特徴ベクトルと特徴正規化を実装する。

• Quantitative Evaluation: Evaluate segmentation algorithms with a variety of parameter settings by comparing your computed segmentations against a dataset of ground-truth segmentations.
算出したセグメンテーションを正解セグメンテーションデータセットで比較することで、実装したセグメンテーションアルゴリズムを、複数のパラメータ設定を用いて評価する。

## Clustering Algorithms¶

from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
from skimage import io
from __future__ import print_function

%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 10.0) # set default size of plots
plt.rcParams["font.size"] = "17"
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Generate random data points for clustering

# Cluster 1
mean1 = [-1, 0]
cov1 = [[0.1, 0], [0, 0.1]]
X1 = np.random.multivariate_normal(mean1, cov1, 100)

# Cluster 2
mean2 = [0, 1]
cov2 = [[0.1, 0], [0, 0.1]]
X2 = np.random.multivariate_normal(mean2, cov2, 100)

# Cluster 3
mean3 = [1, 0]
cov3 = [[0.1, 0], [0, 0.1]]
X3 = np.random.multivariate_normal(mean3, cov3, 100)

# Cluster 4
mean4 = [0, -1]
cov4 = [[0.1, 0], [0, 0.1]]
X4 = np.random.multivariate_normal(mean4, cov4, 100)

# Merge two sets of data points
X = np.concatenate((X1, X2, X3, X4))

# Plot data points
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal')
plt.show()


### K-Means Clustering¶

As discussed in class, K-Means is one of the most popular clustering algorithms. We have provided skeleton code for K-Means clustering in the file segmentation.py. Your first task is to finish implementing kmeans in segmentation.py. This version uses nested for loops to assign points to the closest centroid and compute new mean of each cluster.

import random
from scipy.spatial.distance import squareform, pdist
from skimage.util import img_as_float

### Clustering Methods
def kmeans(features, k, num_iters=100):
""" Use kmeans algorithm to group features into k clusters.
K-Means algorithm can be broken down into following steps:
1. Randomly initialize cluster centers
2. Assign each point to the closest center
3. Compute new center of each cluster
4. Stop if cluster assignments did not change
5. Go to step 2
Args:
features - Array of N features vectors. Each row represents a feature
vector.
k - Number of clusters to form.
num_iters - Maximum number of iterations the algorithm will run.
Returns:
assignments - Array representing cluster assignment of each point.
(e.g. i-th point is assigned to cluster assignments[i])
"""
N, D = features.shape
assert N >= k, 'Number of clusters cannot be greater than number of points'
# Randomly initalize cluster centers
idxs = np.random.choice(N, size=k, replace=False)
centers = features[idxs]
assignments = np.zeros(N)
for n in range(num_iters):
for i in range(N):
assignments[i] = np.argmin(np.sum((features[i]\
-centers)**2,axis=1))
tmp = centers.copy()
for j in range(k):
centers[j] = np.mean(features[assignments==j],axis=0)
if np.allclose(tmp, centers):
break
return assignments

np.random.seed(0)
start = time()
assignments = kmeans(X, 4)
end = time()

kmeans_runtime = end - start

print("kmeans running time: %f seconds." % kmeans_runtime)

for i in range(4):
cluster_i = X[assignments==i]
plt.scatter(cluster_i[:, 0], cluster_i[:, 1])

plt.axis('equal')
plt.show()

kmeans running time: 0.046777 seconds.


We can use numpy functions and broadcasting to make kmeans faster. Implement kmeans_fast in segmentation.py. This should run at least 10 times faster than the previous implementation.
numpy関数とブロードキャストを使ってk平均法を高速化できる。segmentation.pyにkmeans_fastを実装せよ。このバージョンは、前の実装に比べて少なくとも10倍高速なはずである。

def kmeans_fast(features, k, num_iters=100):
""" Use kmeans algorithm to group features into k clusters.
This function makes use of numpy functions and broadcasting to speed up the
first part(cluster assignment) of kmeans algorithm.
Hints
- You may find np.repeat and np.argmin useful
Args:
features - Array of N features vectors. Each row represents a feature
vector.
k - Number of clusters to form.
num_iters - Maximum number of iterations the algorithm will run.
Returns:
assignments - Array representing cluster assignment of each point.
(e.g. i-th point is assigned to cluster assignments[i])
"""
N, D = features.shape
assert N >= k, 'Number of clusters cannot be greater than number of points'
# Randomly initalize cluster centers
idxs = np.random.choice(N, size=k, replace=False)
centers = features[idxs]
assignments = np.zeros(N)
for n in range(num_iters):
a = np.tile(features,(k, 1))
b = np.repeat(centers,N,axis=0)
assignments = np.argmin(np.sum((a-b)**2,\
axis=1).reshape(k,N),axis=0)
c = centers.copy()
for j in range(k):
centers[j] = \
np.mean(features[assignments==j],axis=0)
if np.allclose(c,centers):
break
return assignments

np.random.seed(0)
start = time()
assignments = kmeans_fast(X, 4)
end = time()

kmeans_fast_runtime = end - start
print("kmeans running time: %f seconds." % kmeans_fast_runtime)
print("%f times faster!" % (kmeans_runtime / kmeans_fast_runtime))

for i in range(4):
cluster_i = X[assignments==i]
plt.scatter(cluster_i[:, 0], cluster_i[:, 1])

plt.axis('equal')
plt.show()

kmeans running time: 0.002029 seconds.
23.057469 times faster!


### Hierarchical Agglomerative Clustering¶

Another simple clustering algorithm is Hierarchical Agglomerative Clustering, which is sometimes abbreviated as HAC. In this algorithm, each point is initially assigned to its own cluster. Then cluster pairs are merged until we are left with the desired number of predetermined clusters (see Algorithm 1).
もう一つの簡単なクラスタ化アルゴリズムは、HACと略されることもある階層的凝集型クラスタリングだ。このアルゴリズムでは、各点は初めに自己のクラスタに割り当てられている。その後、クラスタペアは、所定クラスタの所望の数になるまでマージされる。

Implement hiererachical_clustering in segmentation.py.
hiererachical_clusteringを実装する。

def hierarchical_clustering(features, k):
""" Run the hierarchical agglomerative clustering algorithm.
The algorithm is conceptually simple:
Assign each point to its own cluster
While the number of clusters is greater than k:
Compute the distance between all pairs of clusters
Merge the pair of clusters that are closest to each other
We will use Euclidean distance to defeine distance between two clusters.
Recomputing the centroids of all clusters and the distances between all
pairs of centroids at each step of the loop would be very slow. Thankfully
most of the distances and centroids remain the same in successive
iterations of the outer loop; therefore we can speed up the computation by
only recomputing the centroid and distances for the new merged cluster.
Even with this trick, this algorithm will consume a lot of memory and run
very slowly when clustering large set of points. In practice, you probably
do not want to use this algorithm to cluster more than 10,000 points.
Args:
features - Array of N features vectors. Each row represents a feature
vector.
k - Number of clusters to form.
Returns:
assignments - Array representing cluster assignment of each point.
(e.g. i-th point is assigned to cluster assignments[i])
"""
N, D = features.shape
assert N >= k, 'Number of clusters cannot be greater than number of points'
# Assign each point to its own cluster
assignments = np.arange(N)
centers = np.copy(features)
n_clusters = N
while n_clusters > k:
d = pdist(centers)
dMat = squareform(d)
dMat = np.where(dMat!=0.0,dMat,1e5)
mi,mj = np.unravel_index(dMat.argmin(),\
dMat.shape)
a = min(mi,mj)
b = max(mi,mj)
assignments = np.where(assignments!=b, \
assignments,a)
assignments = np.where(assignments<b, \
assignments,assignments-1)
centers = np.delete(centers,b,axis=0)
c = np.where(assignments==a)
centers[a] = np.mean(features[c],axis=0)
n_clusters -= 1
return assignments

start = time()
assignments = hierarchical_clustering(X, 4)
end = time()

print("hierarchical_clustering running time: %f seconds." % (end - start))

for i in range(4):
cluster_i = X[assignments==i]
plt.scatter(cluster_i[:, 0], cluster_i[:, 1])

plt.axis('equal')
plt.show()

hierarchical_clustering running time: 0.107632 seconds.

スポンサーリンク

フォローする