All Map serializers must extend AbstractMapSerializer.
Format:
| length(unsigned varint) | key value chunk data | ... | key value chunk data |
map key-value chunk data
Iterating over a map is expensive, so Fury won't compute the header up front as it does for lists, since that would introduce considerable overhead.
Users can use the `MapFieldInfo` annotation to provide the header in advance. Otherwise, Fury uses the first key-value pair
to predict the header optimistically and updates the chunk header if the prediction fails at some pair.
Fury serializes the map chunk by chunk; every chunk has at most 255 pairs.
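The chunking rule described above can be sketched as follows. This is a minimal illustration, not Fury's implementation: `pair_header` and `split_into_chunks` are hypothetical names, and the per-pair header here models only two assumed "has null" bits rather than the full KV header.

```python
from typing import Any, Iterable, List, Tuple

MAX_CHUNK_SIZE = 255  # from the text: every chunk has at most 255 pairs

Pair = Tuple[Any, Any]


def pair_header(key: Any, value: Any) -> int:
    """Hypothetical per-pair meta: only the 'has null' bits are modelled."""
    header = 0
    if key is None:
        header |= 0b10       # assumed "key has null" bit
    if value is None:
        header |= 0b100000   # assumed "value has null" bit
    return header


def split_into_chunks(pairs: Iterable[Pair]) -> List[Tuple[int, List[Pair]]]:
    """Group pairs into (header, pairs) chunks.

    The header of each chunk is predicted from its first pair. A new chunk
    starts when a pair's meta differs from the prediction or when the
    current chunk already holds MAX_CHUNK_SIZE pairs.
    """
    chunks: List[Tuple[int, List[Pair]]] = []
    current: List[Pair] = []
    current_header = 0
    for key, value in pairs:
        header = pair_header(key, value)
        if current and (header != current_header or len(current) == MAX_CHUNK_SIZE):
            chunks.append((current_header, current))
            current = []
        if not current:
            current_header = header  # optimistic prediction from the first pair
        current.append((key, value))
    if current:
        chunks.append((current_header, current))
    return chunks
```

With 300 uniform pairs this yields two chunks (255 and 45 pairs); a single pair with a null value in the middle of a map ends up in its own chunk.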
`KV header` is the header marked by `MapFieldInfo` in Java. For languages such as Go, it can usually be computed in
advance for non-interface types. The implementation can generate different deserialization code based on the header it reads
and look up the generated code from a linear map/list.
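One way to picture the header-based lookup is a 256-slot table indexed by the header byte. The sketch below is only an assumption about how such a lookup could be organised: `build_reader` stands in for real code generation, and the flat `[k0, v0, k1, v1, ...]` input is a hypothetical, already-decoded shape.

```python
from typing import Callable, List, Optional, Tuple

Reader = Callable[[List[object], int], List[Tuple[object, object]]]

# Linear lookup table: one slot per possible 1-byte KV header.
_READERS: List[Optional[Reader]] = [None] * 256


def build_reader(header: int) -> Reader:
    """Stand-in for code generation: returns a closure specialised on header."""
    def read_pairs(flat: List[object], n: int) -> List[Tuple[object, object]]:
        # A real generated reader would branch on the header bits here.
        return [(flat[2 * i], flat[2 * i + 1]) for i in range(n)]

    read_pairs.header = header  # remember which header this reader was built for
    return read_pairs


def reader_for(header: int) -> Reader:
    """Look up the reader for a header, generating it on first use."""
    reader = _READERS[header]
    if reader is None:
        reader = _READERS[header] = build_reader(header)
    return reader
```

Each distinct header pays the generation cost once; later chunks with the same header reuse the cached reader.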
Why serialize chunk by chunk?
When Fury uses the first key-value pair to predict the header optimistically, it can't know how many pairs share the same
meta (tracking key ref, key has null, and so on). If we don't write chunk by chunk with a max chunk size, we must reserve at
least `X` bytes as a placeholder so that the number of pairs sharing the same meta can be filled in later, where `X` is the number of bytes
needed for the varint encoding of the map size.
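To make `X` concrete, the helper below computes how many bytes an unsigned varint takes. It assumes a LEB128-style encoding with 7 payload bits per byte; `varint_size` is a hypothetical name, not a Fury API.

```python
def varint_size(n: int) -> int:
    """Bytes needed to encode n as an unsigned varint (7 payload bits per byte)."""
    if n < 0:
        raise ValueError("unsigned varint cannot encode negative numbers")
    size = 1
    while n >= 0x80:  # more than 7 bits left, so another byte is needed
        n >>= 7
        size += 1
    return size
```

Under this assumption, a placeholder for a map of up to 127 pairs takes 1 byte, up to 16383 pairs takes 2 bytes, and so on, whereas a chunk size capped at 255 always fits in a single fixed byte.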
Most maps have fewer than 255 entries, so if all pairs share the same meta, there will be only one chunk. This is common in Go/Rust,
where objects are not references by default.
Also, if only one or two keys have different meta, we can put them into a separate chunk so that most pairs can still share
meta.
The implementation can accumulate the number of pairs read and compare it with the map size to decide whether to read more chunks.
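A reader following this rule might look like the sketch below. The chunk representation `(chunk_size, header, flat_objects)` is a hypothetical, already-decoded stand-in for the byte stream, and `read_map` is an illustrative name.

```python
from typing import Any, Dict, Iterator, List, Tuple

Chunk = Tuple[int, int, List[Any]]  # (chunk_size, kv_header, [k0, v0, k1, v1, ...])


def read_map(total_size: int, chunks: Iterator[Chunk]) -> Dict[Any, Any]:
    """Read chunks until the accumulated pair count reaches the map size."""
    result: Dict[Any, Any] = {}
    read_count = 0
    while read_count < total_size:
        chunk_size, _header, flat = next(chunks)
        for i in range(chunk_size):
            result[flat[2 * i]] = flat[2 * i + 1]
        read_count += chunk_size  # accumulate, then compare against the map size
    return result
```

Because the loop is driven by the map size written up front, no end-of-map marker is needed and any data after the last chunk is left untouched.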
Feature Request
The chunk-by-chunk predictive map serialization protocol can be 2x faster than the current one in pyfury. We should implement this new protocol.
See #925 for more details
Is your feature request related to a problem? Please describe
No response
Describe the solution you'd like
https://fury.apache.org/docs/specification/fury_xlang_serialization_spec/#map has a formalized spec:
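KV header:
- If tracking key ref, use the first bit `0b1` of the header to flag it.
- If the key has null, use the second bit `0b10` of the header to flag it. If ref tracking is enabled for this key type, this flag is invalid.
- If the actual key type of the map is not the declared key type, use the 3rd bit `0b100` of the header to flag it.
- If the key types of the map are different, use the 4th bit `0b1000` of the header to flag it.
- If tracking value ref, use the 5th bit `0b10000` of the header to flag it.
- If the value has null, use the 6th bit `0b100000` of the header to flag it. If ref tracking is enabled for this value type, this flag is invalid.
- If the value type of the map is not the declared value type, use the 7th bit `0b1000000` of the header to flag it.
- If the value types of the map are different, use the 8th bit `0b10000000` of the header to flag it.

If streaming write is enabled, Fury can't update the written `chunk size`. In such cases, the map key-value data format differs; see the spec linked above for the layout.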
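The KV header bits could be modelled as a one-byte flag set, as in the sketch below. The member names are hypothetical, and the meaning attached to each bit follows the ordering in the linked spec as understood here, so treat it as an assumption rather than the reference definition.

```python
from enum import IntFlag
from typing import List


class KVHeader(IntFlag):
    """Hypothetical names for the eight KV header bits."""
    TRACK_KEY_REF = 0b1
    KEY_HAS_NULL = 0b10
    KEY_NOT_DECLARED_TYPE = 0b100
    KEY_TYPES_DIFFER = 0b1000
    TRACK_VALUE_REF = 0b10000
    VALUE_HAS_NULL = 0b100000
    VALUE_NOT_DECLARED_TYPE = 0b1000000
    VALUE_TYPES_DIFFER = 0b10000000


def flags_in(header: int) -> List[str]:
    """Names of the flags set in a header byte, lowest bit first."""
    return [flag.name for flag in KVHeader if header & flag]
```

For example, `KVHeader.KEY_HAS_NULL | KVHeader.VALUE_TYPES_DIFFER` packs into the single byte `0b10000010`, and `flags_in` recovers the two names from it.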
Describe alternatives you've considered
No response
Additional context
#925