Retaining byte string serialization variants #132

Open
chrysn opened this issue Feb 15, 2023 · 2 comments

chrysn commented Feb 15, 2023

Byte strings have two widespread serialization variants, 'text' and h'74657874', plus the rarer b32, h32 and b64 prefixes (which I personally don't care about, but hey, they're there). It would be nice if the variant could be preserved, maybe as an extra Option property of ByteString.
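To make the request concrete, here is a minimal sketch of what such an optional property could look like. The names `ByteStringEncoding`, `ByteString` and `to_diag` below are hypothetical stand-ins written for this comment, not the actual cbor-diag-rs types, and the serializer only handles the two common variants (everything else falls back to hex):

```rust
/// Hypothetical hint describing how a byte string was (or should be) written
/// in diagnostic notation. Not part of any existing crate API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ByteStringEncoding {
    Text,      // 'text'
    Hex,       // h'74657874'
    Base32,    // b32'...'
    Base32Hex, // h32'...'
    Base64,    // b64'...'
}

/// Hypothetical byte string node: raw bytes plus an optional encoding hint.
/// `None` means "no preference recorded", e.g. when parsed from binary CBOR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ByteString {
    pub data: Vec<u8>,
    pub encoding: Option<ByteStringEncoding>,
}

fn hex_form(data: &[u8]) -> String {
    let digits: String = data.iter().map(|b| format!("{:02x}", b)).collect();
    format!("h'{}'", digits)
}

/// Serialize to diagnostic notation, honoring the hint where possible.
/// Only `Text` and `Hex` are implemented in this sketch; the other
/// variants and `None` fall back to hex.
pub fn to_diag(bs: &ByteString) -> String {
    match bs.encoding {
        Some(ByteStringEncoding::Text) => match std::str::from_utf8(&bs.data) {
            // Use the quoted form only when no escaping would be needed.
            Ok(s) if s.chars().all(|c| !c.is_control() && c != '\'' && c != '\\') => {
                format!("'{}'", s)
            }
            _ => hex_form(&bs.data),
        },
        _ => hex_form(&bs.data),
    }
}
```

With this shape, the hint is purely advisory: a `Text` hint on bytes that are not valid UTF-8 silently degrades to the hex form instead of producing invalid output.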

RFC 8610 Appendix G (Extended Diagnostic Notation) provides even more options, including internal whitespace and embedded CBOR. They are more complex and not really on my wish list, but it is worth being aware of them when implementing, so that work is not duplicated if they become relevant later.

This would be especially convenient when building a diagnostic notation programmatically.

This would probably share patterns with #117, in that it is a property that is set when coming from DN but unavailable when coming from CBOR. Filling those gaps when going from arbitrary CBOR to DN could be done by the user at the AST stage by applying arbitrary heuristics. Some of these may be provided by cbor-diag-rs, but the choice is ultimately application-specific. (For example, a simple universal heuristic would be the ratio of printable ASCII characters; a more application-specific choice might be guided by CDDL.)
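The printable-ASCII heuristic mentioned above could be sketched roughly as follows. The function names and the idea of a caller-supplied threshold are assumptions made for illustration; nothing like this is claimed to exist in cbor-diag-rs:

```rust
/// Fraction of bytes that are printable ASCII (space through `~`).
/// Returns 0.0 for an empty slice.
pub fn printable_ratio(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let printable = data
        .iter()
        .filter(|&&b| b == b' ' || b.is_ascii_graphic())
        .count();
    printable as f64 / data.len() as f64
}

/// Hypothetical heuristic: suggest the 'text' form when at least
/// `threshold` of the bytes are printable ASCII. Empty strings are
/// never suggested as text here; that choice is arbitrary.
pub fn prefers_text(data: &[u8], threshold: f64) -> bool {
    !data.is_empty() && printable_ratio(data) >= threshold
}
```

A threshold below 1.0 means some bytes would need escaping in the quoted form, so an application that wants escape-free output would pass 1.0.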

chrysn commented Jul 26, 2023

Looking at the implementation a bit more closely, I noticed that when serializing into diagnostic notation, the tags that indicate special handling on the JSON side are conveniently also used to guide display in the diagnostic output.

While it's perfectly practical to keep handling of those tags in there, a full solution to retaining serialization details would allow moving that code out into a more heuristic annotation step. It could look like this: binary CBOR doesn't get any diagnostic-format hints at parsing time, and all unannotated byte strings are expressed using the diagnostic encoder's default. But if the tree is passed through a mutating walker in between that fills in hints (some of them by interpreting the tags), then that step would fill the gaps. As a benefit, the user would have the option to preserve the serialization types for data ingested from diagnostic notation, to clear them all out and purely apply the encoder's preferences, or to replace the original versions with what the (or, moreover, some) annotator sets.
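A rough sketch of that walker idea, under the assumption of a heavily simplified tree: `Item`, `Hint`, `walk_bytes`, `clear_hints`, `fill_missing` and `override_hints` are all hypothetical names, and the tag-driven hints are not modelled here (the visitor only sees the bytes, not the enclosing tag):

```rust
/// Hypothetical display hint for a byte string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Hint {
    Text,
    Hex,
}

/// Heavily simplified stand-in for a parsed data item tree.
#[derive(Debug, Clone, PartialEq)]
pub enum Item {
    Bytes { data: Vec<u8>, hint: Option<Hint> },
    Array(Vec<Item>),
    Tagged(u64, Box<Item>),
}

/// Visit every byte string in the tree, giving the visitor mutable
/// access to its hint.
pub fn walk_bytes<F>(item: &mut Item, f: &mut F)
where
    F: FnMut(&[u8], &mut Option<Hint>),
{
    match item {
        Item::Bytes { data, hint } => f(data.as_slice(), hint),
        Item::Array(items) => {
            for child in items {
                walk_bytes(child, f);
            }
        }
        Item::Tagged(_, inner) => walk_bytes(inner, f),
    }
}

/// Drop all hints so that only the encoder's default applies.
pub fn clear_hints(item: &mut Item) {
    walk_bytes(item, &mut |_: &[u8], hint: &mut Option<Hint>| *hint = None);
}

/// Keep hints that came from diagnostic notation; fill only the gaps.
pub fn fill_missing<A: Fn(&[u8]) -> Hint>(item: &mut Item, annotate: A) {
    walk_bytes(item, &mut |data: &[u8], hint: &mut Option<Hint>| {
        if hint.is_none() {
            *hint = Some(annotate(data));
        }
    });
}

/// Replace every hint with whatever the annotator decides.
pub fn override_hints<A: Fn(&[u8]) -> Hint>(item: &mut Item, annotate: A) {
    walk_bytes(item, &mut |data: &[u8], hint: &mut Option<Hint>| {
        *hint = Some(annotate(data));
    });
}
```

The three helper functions correspond to the three user choices described above: preserve and fill, clear, or replace.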

chrysn commented Jul 27, 2023

One aspect of this is that not only do the strings have serialization variants, their diagnostic notation may also be a concatenation of differently encoded chunks. I'm not sure what would be a good level of modelling here. Full round-tripping of arbitrary diagnostic notation strings may or may not be desirable; if it is not (and I wouldn't need it), preservation of diagnostic notation would be best-effort. (So for example, h'4141' and 'AA' might round-trip, but h'41' 'A' would become either of them.)

If we went for full round-tripping, the options would be to have ByteString contain a single Vec and a parallel Vec<(length, encoding)> of hints (easy to manipulate on the CBOR side), or a Vec of individually encoded byte strings (easy to manipulate on the diagnostic notation side). But I'm not sure we need it, and I hope we don't.
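For comparison, the two layouts could be sketched as below. All type and method names are hypothetical, the flat buffer is assumed to be a `Vec<u8>`, and only two encodings are listed to keep it short:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Text,
    Hex,
}

/// Option A: one flat buffer plus parallel (length, encoding) hints.
/// Invariant (not enforced by the type): the lengths sum to `data.len()`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FlatByteString {
    pub data: Vec<u8>,
    pub hints: Vec<(usize, Encoding)>,
}

/// Option B: a list of individually encoded chunks.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub data: Vec<u8>,
    pub encoding: Encoding,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkedByteString {
    pub chunks: Vec<Chunk>,
}

impl ChunkedByteString {
    /// Concatenate the chunks; this direction cannot fail.
    pub fn flatten(&self) -> FlatByteString {
        let mut data = Vec::new();
        let mut hints = Vec::new();
        for chunk in &self.chunks {
            data.extend_from_slice(&chunk.data);
            hints.push((chunk.data.len(), chunk.encoding));
        }
        FlatByteString { data, hints }
    }
}

impl FlatByteString {
    /// Split the buffer along the hints. Returns `None` when the hint
    /// lengths do not cover the buffer exactly.
    pub fn split(&self) -> Option<ChunkedByteString> {
        let mut rest = self.data.as_slice();
        let mut chunks = Vec::new();
        for &(len, encoding) in &self.hints {
            if len > rest.len() {
                return None;
            }
            let (head, tail) = rest.split_at(len);
            chunks.push(Chunk { data: head.to_vec(), encoding });
            rest = tail;
        }
        if rest.is_empty() {
            Some(ChunkedByteString { chunks })
        } else {
            None
        }
    }
}
```

The sketch makes one trade-off visible: Option A keeps the bytes directly usable on the CBOR side but carries an invariant that can be violated when the data is edited, while Option B cannot get out of sync but needs a concatenation step before encoding.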
