Clustering - divide a set of objects into meaningful groups
Centroid-based partitioning
-
Objects in the same cluster should be similar to each other, while in different clusters should be dissimilar.
-
k-center: find the k center set with the smallest radius r*
- NP-hard
-
an optimal k-circle: a 2-approximate k circle cover [1]
-
returning a k-center set with radius at most 2 _ r_
-
choose a random point firstly, then choose the MAX distance to the points
-
-
k-mean
-
k random points as the initial centroid, form k clusters by assigning all points to the closest centroid + the centroid is the average of all the coordinates of the points in this cluster + terminate until the the centorid set don't update.
-
k-means alg always terminates
- only a finite number of centroid sets that can possibily be produced at the end of each round
- after each round, the cost (the distance) of the centroid set is strictly lower than that of the old centroid set
-
the accuracy guarantee [1]
-
k-seeding : the seed choice (David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding. SODA 2007: 1027-1035.) each point is chosen as the centroid with a probability proportional to ( D(p)^2 ).
-
if 100%, that's k-center
-
this gives the fact that the initial centroid set is picked too arbitrarily. By doing so more carefully, we can significantly improve the approximation ratio.
-
-
-
the limitation of k-mean
- differing sizes, differing density, Non-globular shapes
-
Hierarchical Methods
-
Why
- when a clustering needs, different users can explore the hierarchy to obtain various clustering results efficiently
-
How: the agglomerative method
- merge the most similar two clusters until only one cluster is left
-
Given a dendrogram (the merging history can be represented as a tree), we can obtain k clusters
-
the alg:
- binary search tree (BST) T is used to store the distances of all pairs of the current clusters
- each time, remove the smallest cluster-pair distance from T, and merge them into a new cluster
- O(n^2 * log n)
-
distance function is the key
- distance graph G(V,E) (TODO)
-
Density-based
- TODO