Similarity Search Algorithms
Similarity search refers to finding objects that have similar characteristics to the query object.
Similarity Search
-
Similarity Search in high-dimensional spaces becomes increasingly important in databases, data mining, and search engines,particularly for content-based search of feature-rich data such as audio recordings, digital photos, digital videos and other sensor data. Since feature-rich data objects are typically represented as high-dimensional feature vectors.
-
The problem of similarity search refers to finding objects that have similar characteristics to the query object. Similarity search is usually implemented as K-Nearest Neighbor (KNN) or Approximate Nearest Neighbors (ANN) search in high-dim feature-vector space.
- KNN: find the K objects that are closest to q according to a distance function
- ANN: find K objects whose distances are within a small factor (1 + x) of the true K-nearest neighbors's distances
-
An ideal indexing scheme for similarity search:
- Accurate: very close to those of the brute-force, linear-scan approach
- Time efficient: O(logN)
- Space efficient: the index data structure may even fit into main memory
- High-dimensional: the indexing scheme should work well for datasets with very high intrinsic
dimensionalities
The related approaches