Support for serialization of records with no default constructor #510

danielearwicker · 2024-05-06T13:24:56Z

C# 9 added records, which formalise a pattern where a type has a "primary constructor" whose parameter names and types exactly match the names and types of its public properties, which are init-only. This implies a corresponding serialisation/deserialisation pattern, where serialisation reads the public properties and deserialisation calls the primary constructor.
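As a rough illustration of that pattern (not code from this PR), the sketch below round-trips a record through plain reflection. The `Point` record and the `RecordRoundTrip` helper are hypothetical names, and it assumes a positional record whose only public constructor is the primary one.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical record, for illustration only.
public record Point(int X, string Label);

public static class RecordRoundTrip
{
    // A positional record exposes one public constructor (the compiler-generated
    // copy constructor is not public), so Single() picks the primary constructor.
    private static ConstructorInfo PrimaryConstructor(Type t) =>
        t.GetConstructors().Single();

    // "Serialise": read the public property matching each constructor parameter.
    public static object[] Serialise(object record)
    {
        Type t = record.GetType();
        return PrimaryConstructor(t).GetParameters()
            .Select(p => t.GetProperty(p.Name).GetValue(record))
            .ToArray();
    }

    // "Deserialise": pass the values back into the primary constructor.
    public static T Deserialise<T>(object[] values) =>
        (T)PrimaryConstructor(typeof(T)).Invoke(values);
}
```

For example, `RecordRoundTrip.Deserialise<Point>(RecordRoundTrip.Serialise(new Point(3, "a")))` yields a new `Point` that compares equal to the original.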

This PR extends Parquet.NET to support serialising/deserialising records, as well as ordinary classes that have no default constructor and follow the same pattern as records in the naming of constructor parameters.

Altering the Parquet.NET serialisation code to support this pattern directly is next to impossible, because deserialisation necessarily reads whole columns of a row-group and updates the corresponding properties of objects that have already been allocated, i.e. it needs objects with write-enabled properties.

But if each record type R had a corresponding placeholder type P that had the necessary properties, deserialisation could perform a first pass that constructs a set of P, and then a second pass that constructs each R from a P. The drawback of this is that it implies a lot of additional allocation for large row-groups.

But there is a solution that avoids this: we can use R itself as the placeholder type. The CLR provides a way to allocate an instance of a type without yet calling its constructor (call this a "pre-constructed" object). Its fields/properties will have default values, exactly as they do at the start of the constructor. So a pre-constructed R can be safely deserialised into.

The second pass executes the constructor on every R instance in the row-group, passing the property values into the primary constructor. A ConstructorInfo can be called via reflection exactly like an instance method, running the constructor on a pre-constructed object, so no new object is allocated. The parameters are re-assigned to the properties that already hold those values. This redundancy is unavoidable, and the constructor call is still necessary because the primary constructor of a record can contain additional user-defined code to initialise fields from the parameter values (see tests in this PR).
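A minimal sketch of the two passes using reflection only. It assumes `RuntimeHelpers.GetUninitializedObject` as the allocation mechanism (the description above does not name the API the PR uses), and the `Person` record and `PostConstructDemo` class are hypothetical.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

// Hypothetical record with user-defined initialisation that depends on a
// primary constructor parameter.
public record Person(string Name, int Age)
{
    public string Initial { get; } = Name.Substring(0, 1);
}

public static class PostConstructDemo
{
    public static Person Build(string name, int age)
    {
        // Pass 1: allocate without running any constructor, then populate the
        // properties. Reflection is able to call init-only setters.
        var p = (Person)RuntimeHelpers.GetUninitializedObject(typeof(Person));
        typeof(Person).GetProperty(nameof(Person.Name)).SetValue(p, name);
        typeof(Person).GetProperty(nameof(Person.Age)).SetValue(p, age);
        // At this point p.Initial is still null: no initialiser has run yet.

        // Pass 2: run the primary constructor on the existing instance.
        // MethodBase.Invoke(obj, args) on a ConstructorInfo allocates nothing.
        ConstructorInfo ctor = typeof(Person).GetConstructor(
            new[] { typeof(string), typeof(int) });
        ctor.Invoke(p, new object[] { p.Name, p.Age });
        return p;
    }
}
```

After `Build("Ada", 36)` returns, `Initial` is `"A"`, showing that the user-defined initialiser ran on the pre-constructed instance.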

To make this fast, code-generation can be used, as it is in existing Parquet.NET serialisation. The Expression-based approach has a limitation: it can't invoke a constructor on a pre-constructed object. But IL-generation (like reflection) has no such limitation. So a "post-constructor" operation can be generated for a type.
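The following is one possible shape for such a generated operation, written with `DynamicMethod`. It is a sketch, not the PR's implementation: `Employee` and `PostConstructorGen` are hypothetical, it handles a single flat class with exactly one public constructor, and it ignores nested types.

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Reflection.Emit;
using System.Runtime.CompilerServices;

// Hypothetical record, for illustration only.
public record Employee(string Name, int Grade)
{
    public string Initial { get; } = Name.Substring(0, 1);
}

public static class PostConstructorGen
{
    // Emits the equivalent of: obj..ctor(obj.Name, obj.Grade)
    public static Action<T> Build<T>() where T : class
    {
        Type t = typeof(T);
        ConstructorInfo ctor = t.GetConstructors().Single();
        var method = new DynamicMethod(
            "PostConstruct_" + t.Name, typeof(void), new[] { t }, t, true);
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);                 // 'this' for the constructor call
        foreach (ParameterInfo p in ctor.GetParameters())
        {
            MethodInfo getter = t.GetProperty(p.Name).GetGetMethod();
            il.Emit(OpCodes.Ldarg_0);             // load the object again
            il.Emit(OpCodes.Callvirt, getter);    // push the current property value
        }
        il.Emit(OpCodes.Call, ctor);              // run the ctor on the existing object
        il.Emit(OpCodes.Ret);

        return (Action<T>)method.CreateDelegate(typeof(Action<T>));
    }

    // Usage: populate a pre-constructed object, then post-construct it.
    public static Employee Demo()
    {
        var e = (Employee)RuntimeHelpers.GetUninitializedObject(typeof(Employee));
        typeof(Employee).GetProperty(nameof(Employee.Name)).SetValue(e, "Ada");
        typeof(Employee).GetProperty(nameof(Employee.Grade)).SetValue(e, 7);
        Build<Employee>()(e);
        return e;
    }
}
```

In practice the delegate would be built once per type and cached, so the reflection cost is paid only at generation time.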

If a type has a default (no-params) constructor, that constructor continues to be used and the type does not require post-construction.

Even so, a type may contain nested type references within it (e.g. a property that is a list of records), and this case must also be handled by generated code that visits records nested within the hierarchy.

If the type's full hierarchy does not contain any types requiring post-construction, the post-constructor operation is generated as a no-op. This should be the case for all existing client code of Parquet.NET.

Until now, serialisation methods have constrained the type with T : new(). This restriction is removed in this PR, but a suitably relaxed check is performed at runtime.
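What such a relaxed check might look like is sketched below. The exact rule applied by the PR is not spelled out here, so this assumes the simplest reading: a type is accepted if it has a public parameterless constructor, or a public constructor whose parameter names and types all match public properties. All type and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical types, for illustration only.
public class Mutable { public int Id { get; set; } }
public record Reading(string Sensor, double Value);
public class Opaque { public Opaque(int hidden) { } }

public static class ConstructibilityCheck
{
    public static bool IsSupported(Type t)
    {
        // Types with a default constructor keep working as before.
        if (t.GetConstructor(Type.EmptyTypes) != null) return true;
        // Otherwise require a record-style constructor.
        return t.GetConstructors().Any(MatchesProperties);
    }

    // True when every parameter has a public property of the same name and type.
    // The name comparison is case-sensitive, as it is for records.
    private static bool MatchesProperties(ConstructorInfo ctor)
    {
        ParameterInfo[] ps = ctor.GetParameters();
        return ps.Length > 0 && ps.All(p =>
        {
            PropertyInfo prop = ctor.DeclaringType.GetProperty(p.Name);
            return prop != null && prop.CanRead && prop.PropertyType == p.ParameterType;
        });
    }

    // Runtime replacement for the compile-time T : new() constraint.
    public static void EnsureSupported<T>()
    {
        if (!IsSupported(typeof(T)))
            throw new NotSupportedException(
                typeof(T).Name + " has no default or record-style constructor.");
    }
}
```

Under these assumptions `Mutable` and `Reading` pass the check, while `Opaque` is rejected because its constructor parameter has no matching property.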

Note that recursive types (e.g. tree of nodes) cannot be serialised, but the code-gen could be enhanced to allow this if required.

aloneguid · 2024-05-22T11:08:50Z

I think this is great, but I need to think about it and maybe postpone it till v5. Records are not a natural fit yet due to the limitations you have mentioned, but I'd love to support this.


danielearwicker · 2024-05-25T13:17:29Z

I understand, it's a fairly hefty bit of new code and a quirky way of constructing objects.

In the meantime I have some simple methods (also using code gen) in my own codebase that work with records, but they can only handle simple properties, i.e. they don't implement Dremel at all. It would be great to have a single serialize/deserialize system that fully supports hierarchical data and works with immutable records as well.

Support for serialization of records with no default constructor

a7fabe7

Merge branch 'refs/heads/master' into fork/serializeRecords

ca364b9

# Conflicts: # src/Parquet/Serialization/ParquetSerializer.cs

Merge fixups

cdec978

aloneguid added the future improvement label Sep 23, 2024
