スタンフォード/CS131/宿題5-1 クラスタ化アルゴリズム

Stanford University/CS131/宿題5-1をやる。今回の宿題はClustering Algorithm(クラスタ化アルゴリズム)のK-Means Clustering(k平均クラスタ化)とHierarchical Agglomerative Clustering(階層的凝集型クラスタリング)をカバーしている。

スポンサーリンク

Introduction

In this assignment, you will use clustering algorithms to segment images. You will then use these segmentations to identify foreground and background objects.
今回の宿題では、クラスタリングアルゴリズムを使用して画像を分割する。その後、分割した画像を使って前景と背景の物体を識別する。

Your assignment will involve the following subtasks:
宿題には以下のサブタスクを含む。

  • Clustering algorithms: Implement K-Means clustering and Hierarchical Agglomerative Clustering.
    k平均法/階層的凝集クラスタリング実装

  • Pixel-level features: Implement a feature vector that combines color and position information and implement feature normalization.
    色と位置情報を結合する特徴ベクトルと特徴正規化を実装する。

  • Quantitative Evaluation: Evaluate segmentation algorithms with a variety of parameter settings by comparing your computed segmentations against a dataset of ground-truth segmentations.
    算出したセグメンテーションを正解セグメンテーションデータセットで比較することで、実装したセグメンテーションアルゴリズムを、複数のパラメータ設定を用いて評価する。

Clustering Algorithms

from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
from skimage import io
from __future__ import print_function

%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 10.0) # set default size of plots
plt.rcParams["font.size"] = "17"
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Generate random data points for clustering

# Cluster 1
mean1 = [-1, 0]
cov1 = [[0.1, 0], [0, 0.1]]
X1 = np.random.multivariate_normal(mean1, cov1, 100)

# Cluster 2
mean2 = [0, 1]
cov2 = [[0.1, 0], [0, 0.1]]
X2 = np.random.multivariate_normal(mean2, cov2, 100)

# Cluster 3
mean3 = [1, 0]
cov3 = [[0.1, 0], [0, 0.1]]
X3 = np.random.multivariate_normal(mean3, cov3, 100)

# Cluster 4
mean4 = [0, -1]
cov4 = [[0.1, 0], [0, 0.1]]
X4 = np.random.multivariate_normal(mean4, cov4, 100)

# Merge two sets of data points
X = np.concatenate((X1, X2, X3, X4))

# Plot data points
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal')
plt.show()

K-Means Clustering

As discussed in class, K-Means is one of the most popular clustering algorithms. We have provided skeleton code for K-Means clustering in the file segmentation.py. Your first task is to finish implementing kmeans in segmentation.py. This version uses nested for loops to assign points to the closest centroid and compute new mean of each cluster.
授業で論じたように、k平均法は最も人気のあるクラスタリングアルゴリズムの1つだ。segmentation.pyファイルにk平均法クラスタリングのスケルトンコードを用意してあるので、kmeans関数の実装を完成させる。このバージョンは、ネスト化forループを使って最短重心に点を割り当て、各クラスタの新しい平均を算出する。

import random
from scipy.spatial.distance import squareform, pdist
from skimage.util import img_as_float

### Clustering Methods
def kmeans(features, k, num_iters=100):
    """ Use kmeans algorithm to group features into k clusters.
    K-Means algorithm can be broken down into following steps:
        1. Randomly initialize cluster centers
        2. Assign each point to the closest center
        3. Compute new center of each cluster
        4. Stop if cluster assignments did not change
        5. Go to step 2
    Args:
        features - Array of N features vectors. Each row represents a feature
            vector.
        k - Number of clusters to form.
        num_iters - Maximum number of iterations the algorithm will run.
    Returns:
        assignments - Array representing cluster assignment of each point.
            (e.g. i-th point is assigned to cluster assignments[i])
    """
    N, D = features.shape
    assert N >= k, 'Number of clusters cannot be greater than number of points'
    # Randomly initalize cluster centers
    idxs = np.random.choice(N, size=k, replace=False)
    centers = features[idxs]
    assignments = np.zeros(N)
    for n in range(num_iters):
        ### YOUR CODE HERE
        for i in range(N):
            assignments[i] = np.argmin(np.sum((features[i]\
                                      -centers)**2,axis=1))
        tmp = centers.copy()
        for j in range(k):
            centers[j] = np.mean(features[assignments==j],axis=0)
        if np.allclose(tmp, centers):
            break
        ### END YOUR CODE
    return assignments
np.random.seed(0)
start = time()
assignments = kmeans(X, 4)
end = time()

kmeans_runtime = end - start

print("kmeans running time: %f seconds." % kmeans_runtime)

for i in range(4):
    cluster_i = X[assignments==i]
    plt.scatter(cluster_i[:, 0], cluster_i[:, 1])

plt.axis('equal')
plt.show()
kmeans running time: 0.046777 seconds.

We can use numpy functions and broadcasting to make kmeans faster. Implement kmeans_fast in segmentation.py. This should run at least 10 times faster than the previous implementation.
numpy関数とブロードキャストを使ってk平均法を高速化できる。segmentation.pyにkmeans_fastを実装せよ。このバージョンは、前の実装に比べて少なくとも10倍高速なはずである。

def kmeans_fast(features, k, num_iters=100):
    """ Use kmeans algorithm to group features into k clusters.
    This function makes use of numpy functions and broadcasting to speed up the
    first part(cluster assignment) of kmeans algorithm.
    Hints
    - You may find np.repeat and np.argmin useful
    Args:
        features - Array of N features vectors. Each row represents a feature
            vector.
        k - Number of clusters to form.
        num_iters - Maximum number of iterations the algorithm will run.
    Returns:
        assignments - Array representing cluster assignment of each point.
            (e.g. i-th point is assigned to cluster assignments[i])
    """
    N, D = features.shape
    assert N >= k, 'Number of clusters cannot be greater than number of points'
    # Randomly initalize cluster centers
    idxs = np.random.choice(N, size=k, replace=False)
    centers = features[idxs]
    assignments = np.zeros(N)
    for n in range(num_iters):
        ### YOUR CODE HERE
        a = np.tile(features,(k, 1))
        b = np.repeat(centers,N,axis=0)
        assignments = np.argmin(np.sum((a-b)**2,\
                   axis=1).reshape(k,N),axis=0)
        c = centers.copy()
        for j in range(k):
            centers[j] = \
            np.mean(features[assignments==j],axis=0)
        if np.allclose(c,centers):
            break
        ### END YOUR CODE
    return assignments
np.random.seed(0)
start = time()
assignments = kmeans_fast(X, 4)
end = time()

kmeans_fast_runtime = end - start
print("kmeans running time: %f seconds." % kmeans_fast_runtime)
print("%f times faster!" % (kmeans_runtime / kmeans_fast_runtime))

for i in range(4):
    cluster_i = X[assignments==i]
    plt.scatter(cluster_i[:, 0], cluster_i[:, 1])

plt.axis('equal')
plt.show()
kmeans running time: 0.002029 seconds.
23.057469 times faster!

Hierarchical Agglomerative Clustering

Another simple clustering algorithm is Hierarchical Agglomerative Clustering, which is sometimes abbreviated as HAC. In this algorithm, each point is initially assigned to its own cluster. Then cluster pairs are merged until we are left with the desired number of predetermined clusters (see Algorithm 1).
もう一つの簡単なクラスタ化アルゴリズムは、HACと略されることもある階層的凝集型クラスタリングだ。このアルゴリズムでは、各点は初めに自己のクラスタに割り当てられている。その後、クラスタペアは、所定クラスタの所望の数になるまでマージされる。

Implement hiererachical_clustering in segmentation.py.
hiererachical_clusteringを実装する。

def hierarchical_clustering(features, k):
    """ Run the hierarchical agglomerative clustering algorithm.
    The algorithm is conceptually simple:
    Assign each point to its own cluster
    While the number of clusters is greater than k:
        Compute the distance between all pairs of clusters
        Merge the pair of clusters that are closest to each other
    We will use Euclidean distance to defeine distance between two clusters.
    Recomputing the centroids of all clusters and the distances between all
    pairs of centroids at each step of the loop would be very slow. Thankfully
    most of the distances and centroids remain the same in successive
    iterations of the outer loop; therefore we can speed up the computation by
    only recomputing the centroid and distances for the new merged cluster.
    Even with this trick, this algorithm will consume a lot of memory and run
    very slowly when clustering large set of points. In practice, you probably
    do not want to use this algorithm to cluster more than 10,000 points.
    Args:
        features - Array of N features vectors. Each row represents a feature
            vector.
        k - Number of clusters to form.
    Returns:
        assignments - Array representing cluster assignment of each point.
            (e.g. i-th point is assigned to cluster assignments[i])
    """
    N, D = features.shape
    assert N >= k, 'Number of clusters cannot be greater than number of points'
    # Assign each point to its own cluster
    assignments = np.arange(N)
    centers = np.copy(features)
    n_clusters = N
    while n_clusters > k:
        ### YOUR CODE HERE
        d = pdist(centers)
        dMat = squareform(d)
        dMat = np.where(dMat!=0.0,dMat,1e5)
        mi,mj = np.unravel_index(dMat.argmin(),\
                                    dMat.shape)
        a = min(mi,mj)
        b = max(mi,mj)
        assignments = np.where(assignments!=b, \
                           assignments,a)
        assignments = np.where(assignments<b, \
                           assignments,assignments-1)
        centers = np.delete(centers,b,axis=0)
        c = np.where(assignments==a)
        centers[a] = np.mean(features[c],axis=0)
        n_clusters -= 1
        ### END YOUR CODE
    return assignments
start = time()
assignments = hierarchical_clustering(X, 4)
end = time()

print("hierarchical_clustering running time: %f seconds." % (end - start))

for i in range(4):
    cluster_i = X[assignments==i]
    plt.scatter(cluster_i[:, 0], cluster_i[:, 1])

plt.axis('equal')
plt.show()
hierarchical_clustering running time: 0.107632 seconds.