-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integer array type conversion in array_dist to compute hamming distance #191
Integer array type conversion in array_dist to compute hamming distance #191
Conversation
Can you update the tests? I believe there's a test case in Curious, did |
It seems to me that This also means we are computing distances twice (recomputing them for the top candidates returned after usearch), but I think this is the nature of integrating with postgres. Maybe could be a possible optimization later to keep the distances returned by usearch.
Will do. |
@@ -330,7 +356,7 @@ Datum hamming_dist(PG_FUNCTION_ARGS) | |||
{ | |||
ArrayType *a = PG_GETARG_ARRAYTYPE_P(0); | |||
ArrayType *b = PG_GETARG_ARRAYTYPE_P(1); | |||
PG_RETURN_INT32(array_dist(a, b, usearch_metric_hamming_k)); | |||
PG_RETURN_INT32((int32)array_dist(a, b, usearch_metric_hamming_k)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is returning the distance between vectors and not the vectors themselves. Why should this distance be an integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hamming distance is always an integer and we also pass it to PG_RETURN_INT32 which returns an integer in postgres. So I added this explicit cast since array_dist
returns a float4
src/hnsw.c
Outdated
bx = (float4*)ARR_DATA_PTR(b); | ||
} | ||
|
||
float4 result = usearch_dist(ax, bx, metric_kind, a_dim, usearch_scalar_f32_k); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this really solves the issue. We should not be measuring hamming distance of float arrays. We just get lucky here that when we cast ints to floats, they end up having the same byte arrangement and hamming distance calculation (an inherently, bitwise operation) results in the same number it would result in if we directly passed int arrays.
the right approach is to have usearch_dist
support integer or byte typed arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I can modify usearch_dist
to take an integer typed array and compute hamming distance appropriately. Let me know
…l use in the index (builds, inserts, scans), and updated tests
It turns out that usearch's hamming distance function doesn't care about scalar type, and we can pass I updated tests as well. Note: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! just small nit.
@@ -44,8 +44,8 @@ SELECT ROUND(hamming_dist(v, '{0,0}')::numeric, 2) FROM small_world_ham ORDER BY | |||
------- | |||
0.00 | |||
2.00 | |||
4.00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there still anything wrong with this query?
If not, it should be removed from hnsw_todo.sql
(and moved to hnsw_dist_func.sql unless an equivalent test is already there)
If something is still wrong, please add a relevant comment.
The
hamming_dist
SQL function takes inINTEGER[]
parameters, but we immediately cast the parameter arrays to afloat4
array in the functionarray_dist
. This can cause the hamming distance to be improperly computed.I first noticed this because I saw the
hamming_dist
distance values in certain queries would not be ordered non-decreasingly down the column, as they should be if we're ordering vectors by nearest distance. This indicates that the distance computed by this SQL function was not the same as what usearch internally calls on our vectors and uses to sort distances. I believe this PR fixes that.For an example of the current hamming_dist function not computing distances that are properly ordered in queries, change the last vector of
small_world_array.sql
in the tests to be{1,4,5}
instead, and run thehnsw_dist_func
test.Tests might fail because hamming_distance values displayed in queries will change.