Null values for non active union members #466

Open
quartox opened this issue Dec 9, 2023 · 3 comments

Comments

@quartox
Contributor

quartox commented Dec 9, 2023

I am converting a series of capnp messages into a columnar format (Arrow specifically). One of the challenges with unions is non-active union fields. I recursively create a vector of dynamic value readers for each field and then convert that into Arrow arrays. When the field is a member of a union and active, this works fine. When the field is not active, this produces fake data (the default value) instead of null.

For example the schema:

struct TestUnion {
  union {
    foo @0 :UInt16;
    bar @1 :UInt32;
  }
}

With data [{"foo": 1}, {"bar": 1}], this generates the output [{"foo": 1, "bar": 0}, {"foo": 0, "bar": 1}]. What I would expect is [{"foo": 1, "bar": null}, {"foo": null, "bar": 1}]. I have tried creating dynamic_value::Reader::Void when the field is non-active, but this is a challenge with nested struct and list types.
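The expected shape can be sketched without capnp at all. The following is a minimal std-only illustration, assuming a hypothetical enum TestUnion that mirrors the schema above and a hypothetical helper to_columns. Neither is part of the capnp-rust API; they only show the row-to-column mapping being asked for:

```rust
/// Hypothetical stand-in for one decoded TestUnion message (not a capnp type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TestUnion {
    Foo(u16),
    Bar(u32),
}

/// Split rows into one nullable column per union member.
/// A member that is not active in a row becomes None instead of 0.
fn to_columns(rows: &[TestUnion]) -> (Vec<Option<u16>>, Vec<Option<u32>>) {
    let mut foo = Vec::with_capacity(rows.len());
    let mut bar = Vec::with_capacity(rows.len());
    for row in rows {
        match *row {
            TestUnion::Foo(v) => {
                foo.push(Some(v));
                bar.push(None);
            }
            TestUnion::Bar(v) => {
                foo.push(None);
                bar.push(Some(v));
            }
        }
    }
    (foo, bar)
}
```

For the rows [Foo(1), Bar(1)] this gives foo = [Some(1), None] and bar = [None, Some(1)], which is the null-bearing output described above.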

For structs I have tried creating a new empty dynamic_struct::Reader using the private layout:

match capnp_field.get_type().which() {
    introspect::TypeVariant::Struct(st) => {
        dynamic_value::Reader::Struct(dynamic_struct::Reader::new(
            layout::StructReader::new_default(),
            schema::StructSchema::new(st),
        ))
    }
}

This still leads to primitive ints with 0 value.

Is it possible to create readers with null values?
Would it make sense for non-active union fields to have null values? (I assume the expectation is that users check has() to find active values and ignore non-active values.)

@dwrensha
Member

dwrensha commented Dec 9, 2023

What are you using to convert your dynamic values into JSON?

The stringify.rs logic is an example of how to iterate through the fields of a dynamic struct while accounting for union fields:

dynamic_value::Reader::Struct(st) => {
    let schema = st.get_schema();
    let union_fields = cvt(schema.get_union_fields())?;
    let non_union_fields = cvt(schema.get_non_union_fields())?;
    if union_fields.len() + non_union_fields.len() == 0 {
        return formatter.write_str("()");
    }
    formatter.write_str("(")?;
    let indent2 = indent.next();
    let mut union_field = match cvt(st.which())? {
        None => None,
        Some(field) => {
            // If it's not the default discriminant, then we always need to print it.
            if field.get_proto().get_discriminant_value() != 0 || cvt(st.has(field))? {
                Some(field)
            } else {
                None
            }
        }
    };
    let mut first = true;
    for field in non_union_fields {
        if let Some(ff) = union_field {
            if ff.get_index() < field.get_index() {
                // It's time to print the union field.
                if first {
                    first = false
                } else {
                    indent2.comma(formatter)?;
                }
                indent2.maybe_newline(formatter)?;
                formatter.write_str(cvt(cvt(ff.get_proto().get_name())?.to_str())?)?;
                formatter.write_str(" = ")?;
                print(cvt(st.get(ff))?, formatter, indent2)?;
                union_field = None;
            }
        }
        if cvt(st.has(field))? {
            if first {
                first = false
            } else {
                indent2.comma(formatter)?;
            }
            indent2.maybe_newline(formatter)?;
            formatter.write_str(cvt(cvt(field.get_proto().get_name())?.to_str())?)?;
            formatter.write_str(" = ")?;
            print(cvt(st.get(field))?, formatter, indent2)?;
        }
    }
    if let Some(ff) = union_field {
        // Union field comes last.
        if !first {
            indent2.comma(formatter)?;
        }
        indent2.maybe_newline(formatter)?;
        formatter.write_str(cvt(cvt(ff.get_proto().get_name())?.to_str())?)?;
        formatter.write_str(" = ")?;
        print(cvt(st.get(ff))?, formatter, indent2)?;
    }
    indent.maybe_newline(formatter)?;
    formatter.write_str(")")
}
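The key step in that snippet is st.which(): only the field it returns is active, and every other union member should be treated as absent. A minimal std-only sketch of that rule follows, using hypothetical FieldInfo and StructValue types in place of the capnp schema and reader. These are assumptions for illustration, not the capnp-rust API:

```rust
/// Hypothetical field description: `discriminant` is Some(n) for union
/// members and None for ordinary (non-union) fields.
struct FieldInfo {
    name: &'static str,
    discriminant: Option<u16>,
}

/// Hypothetical decoded struct: the active discriminant plus one raw slot
/// value per field. Inactive union slots still hold readable (zero) data.
struct StructValue {
    which: Option<u16>,
    raw: Vec<u64>,
}

/// Return each field's value, or None if it is a union member that is not
/// the active one. Non-union fields are always reported as present.
fn nullable_values(
    fields: &[FieldInfo],
    value: &StructValue,
) -> Vec<(&'static str, Option<u64>)> {
    fields
        .iter()
        .zip(value.raw.iter())
        .map(|(field, raw)| {
            let active = match field.discriminant {
                None => true,
                Some(d) => value.which == Some(d),
            };
            (field.name, if active { Some(*raw) } else { None })
        })
        .collect()
}
```

The same check, applied per row while filling each column, would let inactive members be recorded as null rather than as their zero default.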

@quartox
Contributor Author

quartox commented Dec 10, 2023

The JSON was just a visual example. I am actually converting into Arrow arrays and then Polars series for a Polars dataframe.

I will dig into the stringify to see if that has the logic I am missing. My problem may be different because I am going from row-wise into columnar.

My real inputs are binary files with an unknown number of messages. I create the Arrow schema with all of the same fields as the capnp schema, then iterate through the fields in the schema to create a vector of capnp readers. The capnp readers are then converted into Arrow arrays.

The main problem with nested types is that I need to represent a struct and all the types within it even if it is not active in the union.

struct OuterStruct {
  struct InnerStruct {
    textField @0 :Text;
  }
  union {
    intField @0 :UInt16;
    structField @1 :InnerStruct;
  }
}

If we have three messages (I would actually convert to binary before running them in tests):

{"structField": {"textField": "first"}}
{"intField": 2}
{"structField": {"textField": "third"}}

I need to create three arrow arrays: a UInt16Array for intField, a Utf8Array for textField, and a StructArray for structField. This gives a dataframe that looks basically like this json:

{
"intField": [null, 2, null],
"structField": [{"textField": "first"}, {"textField": null}, {"textField": "third"}] 
}
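As a rough sketch of that target layout, here is a std-only version that builds the three columns row by row. OuterStruct, Columns, and push_row are hypothetical stand-ins rather than capnp or Arrow types, and the struct_valid mask is an extra assumption: Arrow struct arrays carry their own validity bitmap, so the struct itself could also be marked null for the row where intField is active.

```rust
/// Hypothetical stand-in for one decoded OuterStruct message (not a capnp type).
enum OuterStruct {
    IntField(u16),
    StructField { text_field: String },
}

/// Hypothetical column set: one nullable column per leaf, plus a validity
/// mask for the struct column itself.
#[derive(Debug, Default, PartialEq)]
struct Columns {
    int_field: Vec<Option<u16>>,
    struct_valid: Vec<bool>,
    text_field: Vec<Option<String>>,
}

/// Append one row to every column so all columns stay the same length.
fn push_row(cols: &mut Columns, row: &OuterStruct) {
    match row {
        OuterStruct::IntField(v) => {
            cols.int_field.push(Some(*v));
            // The struct member is inactive: mark it and its child as null.
            cols.struct_valid.push(false);
            cols.text_field.push(None);
        }
        OuterStruct::StructField { text_field } => {
            cols.int_field.push(None);
            cols.struct_valid.push(true);
            cols.text_field.push(Some(text_field.clone()));
        }
    }
}
```

The point of the sketch is that every branch pushes to every column, including the children of an inactive struct, which is the part that is hard to express with dynamic readers alone.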

To help make these arrays, my plan is to create the following capnp readers, shown here in pseudocode (values of primitives in comments):

use capnp::dynamic_value::Reader;
let int_field = vec![Reader::Void, Reader::UInt16, Reader::Void]; // null, 2, null
let struct_field = vec![Reader::Struct(Reader::Text), Reader::Struct(Reader::Void), Reader::Struct(Reader::Text)]; // "first", null, "third"

The challenge is getting a struct with a null textField. I don't know how to make a struct reader in which all the primitive types are replaced by Void readers. The only reason I am working with Void readers at all is that my recursive traversal of the schema gives the following output:

use capnp::dynamic_value::Reader;
let int_field = vec![Reader::UInt16, Reader::UInt16, Reader::UInt16]; // 0, 2, 0
let struct_field = vec![Reader::Struct(Reader::Text), Reader::Struct(Reader::Text), Reader::Struct(Reader::Text)]; // "first", "", "third"

Another option would be to have the primitive readers that are non-active fields yield null values. This is the line in my code that extracts the primitive values. Note that the code I am testing on unions has not been pushed.

Does this help explain the problem?

@tv42

tv42 commented Feb 2, 2024

Have you tried making the first member of your union a dummy unset @0 :Void?

See https://capnproto.org/language.html#unions

By default, when a struct is initialized, the lowest-numbered field in the union is “set”. If you do not want any field set by default, simply declare a field called “unset” and make it the lowest-numbered field.

Said differently, in Cap'n Proto unions are not messages: the union is not a pointer that can be left null. The union members are stored inline at that place in the message, and leaving it as all-zeroes just means @0 with zero values for all fields. See "Wait, why aren’t unions first-class types?" in https://capnproto.org/language.html#unions
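A toy model of that behavior, as a std-only sketch: RawUnion, WithoutUnset, WithUnset, and the two decode functions are hypothetical and do not reflect the real wire layout or the capnp-rust API. They only illustrate why an all-zero union reads back as the lowest-numbered member, and what an unset @0 :Void member changes:

```rust
/// Hypothetical model of a union slot: a discriminant plus inline data.
/// A freshly initialized (all-zero) struct has discriminant 0 and data 0.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct RawUnion {
    discriminant: u16,
    data: u32,
}

/// Union without a placeholder member: foo is the lowest-numbered field.
#[derive(Debug, PartialEq)]
enum WithoutUnset {
    Foo(u16),
    Bar(u32),
}

/// Union whose lowest-numbered member is a Void placeholder.
#[derive(Debug, PartialEq)]
enum WithUnset {
    Unset,
    Foo(u16),
    Bar(u32),
}

fn decode_without_unset(raw: RawUnion) -> Option<WithoutUnset> {
    match raw.discriminant {
        0 => Some(WithoutUnset::Foo(raw.data as u16)),
        1 => Some(WithoutUnset::Bar(raw.data)),
        _ => None, // unknown discriminant
    }
}

fn decode_with_unset(raw: RawUnion) -> Option<WithUnset> {
    match raw.discriminant {
        0 => Some(WithUnset::Unset),
        1 => Some(WithUnset::Foo(raw.data as u16)),
        2 => Some(WithUnset::Bar(raw.data)),
        _ => None, // unknown discriminant
    }
}
```

With the placeholder, a zeroed union decodes to Unset rather than to Foo(0), so "nothing was set" becomes distinguishable from "foo was set to 0".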


3 participants