Summary about Clustering Algorithms

Objects in the same cluster should be similar to each other, while in different clusters should be dissimilar.
k-center: find the k center set with the smallest radius r*
- NP-hard
- an optimal k-circle: a 2-approximate k circle cover [1]
  - returning a k-center set with radius at most 2 _ r_
  - choose a random point firstly, then choose the MAX distance to the points
k-mean
- k random points as the initial centroid, form k clusters by assigning all points to the closest centroid + the centroid is the average of all the coordinates of the points in this cluster + terminate until the the centorid set don't update.
- k-means alg always terminates
  - only a finite number of centroid sets that can possibily be produced at the end of each round
  - after each round, the cost (the distance) of the centroid set is strictly lower than that of the old centroid set
- the accuracy guarantee [1]
  - k-seeding : the seed choice (David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding. SODA 2007: 1027-1035.) each point is chosen as the centroid with a probability proportional to ( D(p)^2 ).
    - if 100%, that's k-center
    - this gives the fact that the initial centroid set is picked too arbitrarily. By doing so more carefully, we can significantly improve the approximation ratio.
- the limitation of k-mean
  - differing sizes, differing density, Non-globular shapes

Why
- when a clustering needs, different users can explore the hierarchy to obtain various clustering results efficiently
How: the agglomerative method
- merge the most similar two clusters until only one cluster is left
Given a dendrogram (the merging history can be represented as a tree), we can obtain k clusters
- the alg:
  - binary search tree (BST) T is used to store the distances of all pairs of the current clusters
  - each time, remove the smallest cluster-pair distance from T, and merge them into a new cluster
  - O(n^2 * log n)
- distance function is the key
  - distance graph G(V,E) (TODO)