-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Searching with ANNOY index
ANNOY (Approximate Nearest Neighbors Oh Yeah) is an index that uses a hyperplane to divide a high-dimensional space into multiple subspaces, and then stores them in a tree structure.
When searching for vectors, ANNOY follows the tree structure to find subspaces closer to the target vector, and then compares all the vectors in these subspaces (The number of vectors being compared should not be less than search_k
) to obtain the final result. Obviously, when the target vector is close to the edge of a certain subspace, sometimes it is necessary to greatly increase the number of searched subspaces to obtain a high recall rate. Therefore, ANNOY uses n_trees
different methods to divide the whole space, and searches all the dividing methods simultaneously to reduce the probability that the target vector is always at the edge of the subspace.
-
Index building parameters
Parameter Description Range n_trees
The number of methods of space division. [1, 1024] -
Search parameters
Parameter Description Range search_k
The number of nodes to search. -1 means 5% of the whole data. {-1} ∪ [ top_k
, n ×n_trees
]
For example:
In python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
# create a collection
collection_name = "milvus_ANNOY"
default_fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=d)
]default_schema = CollectionSchema(fields=default_fields, description="test collection")
print(f"\nCreate collection...")
collection = Collection(name= collection_name, schema=default_schema)
# insert data
import random
vectors = [[random.random() for _ in range(8)] for _ in range(10)]
entities = [vectors]
mr = collection.insert(entities)
print(collection.num_entities)
# create index
collection.create_index(field_name=field_name,
index_params={'index_type': 'ANNOY',
'metric_type': 'L2',
'params': {
"n_trees": 8 # int. 1~1024
}})
collection.load()
# search
top_k = 10
results = collection.search(vectors[:5], anns_field="vector",param={
"search_k": -1 # int. {-1} U [top_k, n*n_trees], n represents vectors count.
}, limit=top_k) # show results
for result in results:
print(result.ids)
print(result.distance)