Summary about Clustering Algorithms Clustering - divide a set of objects into meaningful groups Centroid-based partitioning Objects in the same cluster should be similar to each other, while in different clusters should be dissimilar. k-center: find the k center set with the smallest radius r* NP-hard an optimal k-circle: a 2-approximate k circle cover [1] returning a k-center set with radius at most 2 _ r_ choose a random point firstly, then choose the MAX distance to the points k-mean k random points as the initial centroid, form k clusters by assigning all points to the closest centroid + the centroid is the average of all the coordinates of the points in this cluster + terminate until the the centorid set don't update. k-means alg always terminates only a finite number of centroid sets that can possibily be produced at the end of each round after each round, the cost (the distance) of the centroid set is strictly lower than that of the old centroid set the accuracy guarantee [1] k-seeding : the seed choice (David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding. SODA 2007: 1027-1035.) each point is chosen as the centroid with a probability proportional to ( D(p)^2 ). if 100%, that's k-center this gives the fact that the initial centroid set is picked too arbitrarily. By doing so more carefully, we can significantly improve the approximation ratio. the limitation of k-mean differing sizes, differing density, Non-globular shapes Hierarchical Methods Why when a clustering needs, different users can explore the hierarchy to obtain various clustering results efficiently How: the agglomerative method merge the most similar two clusters until only one cluster is left Given a dendrogram (the merging history can be represented as a tree), we can obtain k clusters the alg: binary search tree (BST) T is used to store the distances of all pairs of the current clusters each time, remove the smallest cluster-pair distance from T, and merge them into a new cluster O(n^2 * log n) distance function is the key distance graph G(V,E) (TODO) Density-based TODO reference Data Mining and Knowledge Discovery FLANN lib