[Python] Schema evolution support for type backward/forward compatibility #1938

Open
chaokunyang opened this issue Nov 8, 2024 · 0 comments

Comments

@chaokunyang
Collaborator

chaokunyang commented Nov 8, 2024

Feature Request

If schema evolution mode is enabled globally when creating Fury, and enabled for the current type, type meta will be
written using one of the following modes. Which mode to use is configured when creating Fury.

  • Normal mode (meta share not enabled):

    • If the type meta hasn't been written before, add the type def
      to captured_type_defs: captured_type_defs[type def] = map size.
    • Get the index of the meta in captured_type_defs and write that index as | unsigned varint: index |.
    • After the serialization of the object graph finishes, Fury writes captured_type_defs:
      • First, record the current write position as the meta start offset in the Fury header.

      • Then write captured_type_defs one by one:

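        # Write how many type defs carry real meta (stubs are skipped).
        buffer.write_var_uint32(len(writing_type_defs) - len(schema_consistent_type_def_stubs))
        for type_meta in writing_type_defs:
            if not type_meta.is_stub():
                type_meta.write_type_def(buffer)
        # Keep only the schema-consistent stubs for the next serialization.
        writing_type_defs = copy(schema_consistent_type_def_stubs)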
  • Meta share mode: the writing steps are the same as in normal mode, but captured_type_defs is shared across
    multiple serializations of different objects. For example, suppose we have a batch to serialize:

    captured_type_defs = {}
    stream = ...
    # add `Type1` to `captured_type_defs` and write `Type1`
    fury.serialize(stream, [Type1()])
    # add `Type2` to `captured_type_defs` and write `Type2`; `Type1` was already written.
    fury.serialize(stream, [Type1(), Type2()])
    # `Type1` and `Type2` were already written, so no meta needs to be written.
    fury.serialize(stream, [Type1(), Type2()])
  • Streaming mode (streaming mode doesn't support meta share):

    • If type meta hasn't been written before, the data will be written as:

      | unsigned varint: 0b11111111 | type def |
      
    • If type meta has been written before, the data will be written as:

      | unsigned varint: written index << 1 |
      

      written index is the id in captured_type_defs.

    • With this mode, meta start offset can be omitted.
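
The streaming-mode encoding above can be sketched as follows. This is a minimal illustration, not the Fury implementation: `StreamingTypeMetaWriter` and `encode_varuint` are hypothetical names, type defs are stood in for by opaque bytes, and the varint is assumed to be the usual 7-bits-per-byte encoding with a continuation bit.

```python
def encode_varuint(value: int) -> bytes:
    """Unsigned varint: 7 data bits per byte, high bit set when more bytes follow.
    (Assumed encoding; the issue only says "unsigned varint".)"""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


NEW_TYPE_MARKER = 0b11111111  # marker from the issue: the type def follows inline


class StreamingTypeMetaWriter:
    """Hypothetical helper that tracks which type defs were already written."""

    def __init__(self) -> None:
        self.captured_type_defs: dict[bytes, int] = {}  # type def -> written index

    def write_type_meta(self, out: bytearray, type_def: bytes) -> None:
        index = self.captured_type_defs.get(type_def)
        if index is None:
            # First occurrence: remember its index, then write marker + type def.
            self.captured_type_defs[type_def] = len(self.captured_type_defs)
            out += encode_varuint(NEW_TYPE_MARKER)
            out += type_def
        else:
            # Seen before: write only the index, shifted so the low bit is 0.
            out += encode_varuint(index << 1)
```

Because the low bit of `written index << 1` is always 0 and the low bit of the marker is 1, a reader can tell the two cases apart from the first varint alone, which is why no meta start offset is needed.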

Normal mode and meta share mode forbid streaming writes, because the writer needs to look back and update the meta
start offset after the whole object graph has been written and meta collection is finished. Only in this way can we
ensure that a deserialization failure in meta share mode doesn't lose shared meta.
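
A rough sketch of the look-back step for normal and meta share mode is shown below. Everything here is an assumption made for illustration: the `TypeMetaCollector` class, the 4-byte little-endian offset placeholder at position 0, and the use of opaque bytes as type defs. Schema-consistent stubs are left out to keep the example short.

```python
import struct


def encode_varuint(value: int) -> bytes:
    """Unsigned varint, 7 data bits per byte (assumed encoding)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


class TypeMetaCollector:
    """Hypothetical holder for captured_type_defs.
    Create a new one per call for normal mode; reuse one for meta share mode."""

    def __init__(self) -> None:
        self.captured_type_defs: dict[bytes, int] = {}
        self.pending: list[bytes] = []  # captured but not yet written type defs

    def index_of(self, type_def: bytes) -> int:
        index = self.captured_type_defs.get(type_def)
        if index is None:
            index = len(self.captured_type_defs)  # captured_type_defs[type def] = map size
            self.captured_type_defs[type_def] = index
            self.pending.append(type_def)
        return index


def serialize(type_defs_in_graph: list[bytes], collector: TypeMetaCollector) -> bytes:
    out = bytearray(4)  # placeholder for the meta start offset (assumed 4 bytes)
    for type_def in type_defs_in_graph:
        out += encode_varuint(collector.index_of(type_def))
        # ... the object's field data would be written here ...
    # Look back: patch the placeholder now that the graph is fully written.
    struct.pack_into("<I", out, 0, len(out))
    out += encode_varuint(len(collector.pending))
    for type_def in collector.pending:
        out += type_def
    collector.pending.clear()
    return bytes(out)
```

Reusing one collector across calls reproduces the batch example above: the first call writes `Type1`, the second writes only `Type2`, and the third writes no type defs at all. The patch to position 0 is the reason these modes cannot write to a forward-only stream.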

Type Def

Here we mainly describe the meta layout for schema evolution mode:

|      8 bytes meta header      |   variable bytes   |  variable bytes   | variable bytes |
+-------------------------------+--------------------+-------------------+----------------+
| 7 bytes hash + 1 bytes header |  current type meta |  parent type meta |      ...       |

Type meta is encoded from parent type to leaf type; only types with serializable fields are encoded.

Meta header

The meta header is a 64-bit number encoded in little-endian order.

  • The lowest 4 bits, 0b0000~0b1110, record num classes. 0b1111 is reserved to indicate that Fury needs to
    read more bytes for the length using Fury unsigned int encoding. If the current type doesn't have a parent type,
    or the parent type doesn't have fields to serialize, or we're in a context that serializes fields of the current
    type only, num classes will be 1.
  • The 5th bit indicates whether this type needs schema evolution.
  • The other 56 bits store the unique hash of flags + all layers type meta.
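
As an illustration of the layout above, here is a minimal packing sketch. The issue does not pin down every bit position, so this makes three assumptions: num classes is stored directly in the low 4 bits, the 56-bit hash occupies the upper 7 bytes (bits 8 to 63), and bits 5 to 7 are left as zero. The function names are hypothetical.

```python
HASH_BITS = 56
MORE_CLASSES = 0b1111          # reserved value: real count is read separately
SCHEMA_EVOLUTION_BIT = 1 << 4  # "the 5th bit"


def pack_meta_header(num_classes: int, schema_evolution: bool, type_hash: int) -> bytes:
    """Pack the 8-byte meta header in little-endian order (assumed bit positions)."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    header = min(num_classes, MORE_CLASSES)
    if schema_evolution:
        header |= SCHEMA_EVOLUTION_BIT
    # Assumption: the hash is truncated to 56 bits and stored in the upper 7 bytes.
    header |= (type_hash & ((1 << HASH_BITS) - 1)) << 8
    return header.to_bytes(8, "little")


def unpack_meta_header(data: bytes) -> tuple[int, bool, int]:
    """Return (num_classes field, schema evolution flag, 56-bit hash)."""
    header = int.from_bytes(data[:8], "little")
    return header & 0b1111, bool(header & SCHEMA_EVOLUTION_BIT), header >> 8
```

When the returned num classes field equals `0b1111`, a reader would go on to read the extra length bytes; that step is omitted here.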

Single layer type meta

| unsigned varint | var uint |  field info: variable bytes   | variable bytes  | ... |
+-----------------+----------+-------------------------------+-----------------+-----+
|   num_fields    | type id  | header + type id + field name | next field info | ... |
  • num fields: encode num fields as unsigned varint.
    • If the current type is schema consistent, then num_fields will be 0 to flag it.
    • If the current type isn't schema consistent, then num_fields will be the number of compatible fields. For example,
      users can use a tag id to mark some fields as compatible fields in a schema consistent context. In such cases,
      schema consistent fields are serialized first, then compatible fields. At deserialization, Fury uses the field
      info of the fields that aren't annotated with a tag id to deserialize schema consistent fields, then uses the
      field info in the meta to deserialize compatible fields.
  • type id: the registered id for the current type, which will be written as an unsigned varint.
  • field info:
    • header (8 bits): 3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag +
      ref tracking flag. Users can use annotations to provide this info.
      • 2 bits field name encoding:
        • encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID
        • If a tag id is used, i.e. the field name is written as an unsigned varint tag id, the 2-bit encoding is 11.
      • size of field name:
        • The 3 bits size: 0~6 indicate length 1~7; the value 7 indicates that more bytes must be read,
          in which case size - 7 is encoded as a varint next.
        • If the encoding is TAG_ID, then the num_bytes of the field name is used to store the tag id.
      • ref tracking: when set to 1, ref tracking will be enabled for this field.
      • nullability: when set to 1, this field can be null.
      • polymorphism: when set to 1, the actual type of the field will be the declared field type even if the type is
        not final.
    • field name: If a tag id is set, the tag id is used instead. Otherwise the meta string encoding [length] and data
      are written.
    • type id:
      • For registered type-consistent classes, it will be the registered type id.
      • Otherwise it will be encoded as OBJECT_ID if the type isn't final and FINAL_OBJECT_ID if it's final. The
        meta for such types is written separately instead of being inlined here, which reduces meta space cost when
        objects of this type are serialized multiple times in the current object graph; the field value may be null too.
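
The 8-bit field header described above can be sketched like this. The bit order is an assumption taken from the order in which the parts are listed (size in the top 3 bits, ref tracking in the lowest bit), and so are the numeric values of the first three encodings; only TAG_ID = 11 is stated in the text. `pack_field_header` is a hypothetical name, and "size" is taken to mean the field name length.

```python
from enum import IntEnum
from typing import Optional


class FieldNameEncoding(IntEnum):
    # Only TAG_ID = 0b11 is given in the text; the other values are assumed
    # to follow the order in which the encodings are listed.
    UTF8 = 0b00
    ALL_TO_LOWER_SPECIAL = 0b01
    LOWER_UPPER_DIGIT_SPECIAL = 0b10
    TAG_ID = 0b11


def pack_field_header(
    name_len: int,
    encoding: FieldNameEncoding,
    polymorphic: bool,
    nullable: bool,
    ref_tracking: bool,
) -> tuple[int, Optional[int]]:
    """Return (header byte, extra size to write as a varint, or None)."""
    if name_len < 1:
        raise ValueError("name_len must be at least 1")
    # Size values 0~6 stand for lengths 1~7; 7 means "read more bytes".
    size_bits = min(name_len - 1, 7)
    extra = name_len - 7 if size_bits == 7 else None
    header = (
        (size_bits << 5)
        | (int(encoding) << 3)
        | (int(polymorphic) << 2)
        | (int(nullable) << 1)
        | int(ref_tracking)
    )
    return header, extra
```

For example, a 3-character UTF8 name that is nullable but neither polymorphic nor ref-tracked packs to `0b010_00_0_1_0`, with no extra size varint. The TAG_ID case, where the size bits hold the tag id rather than a length, is not modeled separately here.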

Field order is left as an implementation detail and is not exposed in the specification; the deserializer needs to
re-sort fields based on the Fury field comparator. This way, Fury can compute statistics for field names or types and
use a more compact encoding.
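
To illustrate why the wire format does not need to carry field order, here is a small sketch in which writer and reader each sort their own field list with the same comparator. The sort key used here (type id, then field name) is purely hypothetical and is not the actual Fury field comparator; the type ids are made up as well.

```python
Field = tuple[str, int]  # (field name, type id); hypothetical representation


def field_sort_key(field: Field) -> tuple[int, str]:
    """Stand-in comparator: order by type id, then by name (an assumption)."""
    name, type_id = field
    return (type_id, name)


def canonical_order(fields: list[Field]) -> list[Field]:
    """Both sides call this, so they agree on order without writing it out."""
    return sorted(fields, key=field_sort_key)


# The writer and the reader declare the same fields in different orders.
writer_fields = [("name", 12), ("id", 4), ("age", 4)]
reader_fields = [("age", 4), ("name", 12), ("id", 4)]
```

Since `canonical_order(writer_fields)` and `canonical_order(reader_fields)` yield the same list, the implementation is free to pick whatever ordering compresses best, as long as both sides apply it.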

Other layers type meta

Same encoding algorithm as the previous layer.

Is your feature request related to a problem? Please describe

No response

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

#1556
