GetCollStatResponseWrapper randomly returns 0 size for collections in 2.3.x #1038

Open
r0x07k opened this issue Aug 13, 2024 · 4 comments

r0x07k commented Aug 13, 2024

Hi,

The GetCollStatResponseWrapper randomly returns a zero row count for some collections. For others it still works ok, so it's unclear what the reason is.

For example, here is the collection in a format compatible with LangChain:

{'collection_name': 'test',
 'auto_id': False,
 'num_shards': 1,
 'description': '',
 'fields': [{'field_id': 100,
   'name': 'id',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 36},
   'is_primary': True},
  {'field_id': 101,
   'name': 'text',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 65535}},
  {'field_id': 102,
   'name': 'metadata',
   'description': '',
   'type': <DataType.JSON: 23>,
   'params': {}},
  {'field_id': 103,
   'name': 'vector',
   'description': '',
   'type': <DataType.FLOAT_VECTOR: 101>,
   'params': {'dim': 768}}],
 'aliases': [],
 'collection_id': 451819797554279738,
 'consistency_level': 0,
 'properties': {},
 'num_partitions': 1,
 'enable_dynamic_field': True}

The real row count:

[{'count(*)': 27}]

The Java code that returns 0:

R<GetCollectionStatisticsResponse> respCollectionStatistics = milvusClient.getCollectionStatistics(
    GetCollectionStatisticsParam.newBuilder()
      .withCollectionName(name)
      .build()
    );
GetCollStatResponseWrapper wrapperCollectionStatistics = new GetCollStatResponseWrapper(respCollectionStatistics.getData());
System.out.println(wrapperCollectionStatistics.getRowCount());

0

I use SDK 2.3.4 which is tied to LangChain4J.

Author

r0x07k commented Aug 13, 2024

I tried to debug it further, and now I have two identical collections of size 27 (with different names), but wrapperCollectionStatistics returns 0 for one and the correct 27 for the other.

Contributor

yhmo commented Aug 14, 2024

MilvusClient.getCollectionStatistics() in the Java SDK is equivalent to Collection.num_entities in the Milvus Python SDK. This API returns a raw entity count, which it reads from Etcd by summing the row counts of all sealed segments.

As we know, when users call insert() to insert entities into a collection, the insert request is passed to Pulsar and consumed asynchronously by the querynode and datanode. The datanode accumulates entities in a memory buffer; once the buffer size exceeds a threshold, the datanode flushes the buffer into a sealed segment. A segment's row count is recorded in Etcd only after the sealed segment has been persisted.
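The buffering described above can be illustrated with a toy model. This is only a sketch and not Milvus code: `ToyCollection`, `flushThreshold`, and every method name are hypothetical, and the threshold is counted in rows to keep the example small. It shows why a statistics-style count can report 0 while a query-style count reports 27.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the write path described above. Hypothetical names, not Milvus internals. */
class ToyCollection {
    private final int flushThreshold;                                  // assumed: rows the buffer may hold before sealing
    private final List<Integer> sealedSegments = new ArrayList<>();    // row counts "recorded in Etcd"
    private int bufferedRows = 0;                                      // rows still in the in-memory buffer

    ToyCollection(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Buffers rows and seals a segment only once the buffer exceeds the threshold. */
    void insert(int rows) {
        bufferedRows += rows;
        if (bufferedRows > flushThreshold) {
            flush();
        }
    }

    /** Turns the current buffer into a sealed segment and records its row count. */
    void flush() {
        if (bufferedRows > 0) {
            sealedSegments.add(bufferedRows);
            bufferedRows = 0;
        }
    }

    /** Statistics-style count: sums sealed segments only. */
    long statisticsRowCount() {
        long total = 0;
        for (int rows : sealedSegments) {
            total += rows;
        }
        return total;
    }

    /** Query-style count: sealed segments plus rows that are still buffered. */
    long queryCount() {
        return statisticsRowCount() + bufferedRows;
    }

    public static void main(String[] args) {
        ToyCollection collection = new ToyCollection(1000);
        collection.insert(27);
        System.out.println(collection.statisticsRowCount()); // prints 0
        System.out.println(collection.queryCount());         // prints 27
        collection.flush();
        System.out.println(collection.statisticsRowCount()); // prints 27
    }
}
```

In this model, two collections holding the same 27 rows can report different statistics purely because one has been sealed and the other has not, which matches the behaviour reported in this issue.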

So the number returned by MilvusClient.getCollectionStatistics() is not accurate: rows that have not yet been flushed into a sealed segment are not counted.
To get an accurate number, use "count(*)".

The following MilvusClientV2 example gets the row count.
It is a query request, so use the ConsistencyLevel to control data visibility. "ConsistencyLevel.STRONG" means the query waits until all data has been consumed by the querynode.
Note: data that is still in Pulsar cannot be queried.

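        // Count rows with a query; STRONG consistency waits until the querynode has consumed all inserted data.
        QueryResp queryResp = client.query(QueryReq.builder()
                .collectionName(collectionName)
                .filter("")
                .outputFields(Collections.singletonList("count(*)"))
                .consistencyLevel(ConsistencyLevel.STRONG)
                .build());
        // The result contains a single row whose "count(*)" entry holds the row count.
        List<QueryResp.QueryResult> queryResults = queryResp.getQueryResults();
        return (long) queryResults.get(0).getEntity().get("count(*)");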
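The cast in the last line of the snippet relies on the result containing at least one row and on the "count(*)" value being a Long. A more defensive variant might look like the sketch below. It assumes each result row can be viewed as a `Map<String, Object>` (as the `getEntity().get("count(*)")` call above suggests); `CountExtractor` and `extractCount` are hypothetical names and not part of the SDK.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reads the "count(*)" value from query result rows without a blind cast. */
class CountExtractor {
    static long extractCount(List<Map<String, Object>> rows) {
        if (rows == null || rows.isEmpty()) {
            throw new IllegalStateException("count(*) query returned no rows");
        }
        Object value = rows.get(0).get("count(*)");
        if (!(value instanceof Number)) {
            throw new IllegalStateException("unexpected count(*) value: " + value);
        }
        // Accepts Long, Integer, or any other Number without a ClassCastException.
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        // Stand-in for the entity map of the first query result.
        List<Map<String, Object>> rows =
                Collections.singletonList(Collections.<String, Object>singletonMap("count(*)", 27L));
        System.out.println(extractCount(rows)); // prints 27
    }
}
```

With the real client, the list could be built by mapping each QueryResp.QueryResult to its entity map before calling the helper.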

Author

r0x07k commented Aug 14, 2024

Thank you, @yhmo. We’ll proceed with this approach.

Could you also let me know if there are any plans to deprecate MilvusClient.getCollectionStatistics()?

Contributor

yhmo commented Aug 15, 2024

getCollectionStatistics() is much faster than query("count(*)"): getCollectionStatistics() simply reads the number from Etcd, whereas query() requires the collection to be loaded and iterates over all segments to sum up the count. Sometimes users only want a raw number and don't intend to load the collection, so I think getCollectionStatistics() should not be marked as deprecated.

In the Python SDK, Collection.num_entities is not deprecated either:
https://github.com/milvus-io/pymilvus/blob/master/pymilvus/orm/collection.py#L265
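
The trade-off between the two calls could be captured in a small helper like the sketch below. Everything here is hypothetical: `RowCountStrategy`, `rowCount`, and the two suppliers are illustrative names, and in real code the suppliers would wrap getCollectionStatistics() and the count(*) query respectively.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper that picks between a cheap raw count and an exact count. */
class RowCountStrategy {
    static long rowCount(boolean needExact,
                         boolean collectionLoaded,
                         LongSupplier statistics,   // assumed to wrap getCollectionStatistics()
                         LongSupplier exactCount) { // assumed to wrap the count(*) query
        if (!needExact) {
            // Fast path: no load required, but unflushed rows may be missing.
            return statistics.getAsLong();
        }
        if (!collectionLoaded) {
            throw new IllegalStateException("an exact count requires the collection to be loaded");
        }
        return exactCount.getAsLong();
    }

    public static void main(String[] args) {
        LongSupplier statistics = () -> 0L;   // stand-in: nothing sealed yet
        LongSupplier exactCount = () -> 27L;  // stand-in: count(*) result
        System.out.println(rowCount(false, false, statistics, exactCount)); // prints 0
        System.out.println(rowCount(true, true, statistics, exactCount));   // prints 27
    }
}
```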
