parsing the serialization_format field requires prior knowledge of the serialization format #435
Comments
The text states
So in theory, a tool could try to regex match or magic number match for a binary format, based on these things. It is a bit of a chicken-and-egg thing though. Currently, my parser just looks at the file extension, then does some trial decoding with a fallback.
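For illustration, the "file extension, then trial decode, then fallback" approach could look roughly like the sketch below. It is not the code of any tool mentioned in this thread: the decoder names, the extension table and the `trial_decode` helper are all hypothetical. The only format fact it relies on is that JSON-SEQ (RFC 7464) records start with the 0x1E record separator byte.

```python
import json
from pathlib import Path

RS = b"\x1e"  # record separator that starts every JSON-SEQ (RFC 7464) record


def decode_json(data: bytes) -> list:
    # A plain JSON file holds exactly one JSON value.
    return [json.loads(data)]


def decode_json_seq(data: bytes) -> list:
    # Each record is RS + JSON text + LF; reject input that lacks the leading RS.
    if not data.lstrip().startswith(RS):
        raise ValueError("input does not start with a record separator")
    return [json.loads(rec) for rec in data.split(RS) if rec.strip()]


def decode_ndjson(data: bytes) -> list:
    # One JSON value per non-empty line.
    return [json.loads(line) for line in data.splitlines() if line.strip()]


# Hypothetical tables: the labels and the extension mapping are assumptions.
DECODERS = {"json": decode_json, "json-seq": decode_json_seq, "ndjson": decode_ndjson}
EXTENSION_HINTS = {".json": "json", ".qlog": "json", ".sqlog": "json-seq"}


def trial_decode(filename: str, data: bytes):
    """Try the decoder hinted by the extension first, then fall back to the rest."""
    hint = EXTENSION_HINTS.get(Path(filename).suffix.lower())
    order = ([hint] if hint else []) + [name for name in DECODERS if name != hint]
    for name in order:
        try:
            return name, DECODERS[name](data)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
    raise ValueError("no decoder accepted the input")
```

Note that every failed attempt here parses the input from the start, which is the cost the peeking approach tries to avoid.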
So, to an extent you're right @marten-seemann, you need some idea of what you're dealing with (i.e., binary vs plaintext vs something like JSON). However, basing this only on, for example, the file extension is too naive (especially since people use .json EVERYWHERE). The current code in qvis imo shows this well, since I want to support both netlog files (also annoyingly .json) and various JSON-based qlog variants (normal JSON, JSON-SEQ, NDJSON). I don't want to have to "trial parse" each of those options, since the parsers might be somewhat expensive to initialize / don't necessarily work in a streaming fashion / might even have to be offloaded to the backend. Having the While I agree it's not ideal, it's a good practical solution that I'd like to keep. The recent switch to making it equivalent with the media types I feel is elegant and clear and consistent in ways that the previous
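As a rough sketch of the peeking idea (not the actual qvis code), a tool could inspect only the first bytes of a file to pick a parser before initializing any of them. The function name and the returned labels are hypothetical. It assumes `head` contains at least the first complete line, that the log's top-level value is a JSON object, and it uses the fact that JSON-SEQ (RFC 7464) records begin with the 0x1E record separator.

```python
import json

RS = b"\x1e"  # record separator that starts every JSON-SEQ (RFC 7464) record


def sniff_json_flavor(head: bytes) -> str:
    """Guess the serialization from the first bytes of a log file.

    Returns one of the hypothetical labels "json-seq", "ndjson", "json"
    or "unknown".
    """
    stripped = head.lstrip(b" \t\r\n")
    if stripped.startswith(RS):
        return "json-seq"
    if not stripped.startswith(b"{"):
        return "unknown"  # possibly a binary format with its own magic number
    first_line, newline, rest = stripped.partition(b"\n")
    if newline and rest.strip():
        try:
            json.loads(first_line)
        except ValueError:
            # The first line is not a complete value: multi-line (pretty-printed) JSON.
            return "json"
        # A complete value followed by more data: one value per line.
        return "ndjson"
    # A single line with nothing after it is treated as plain JSON.
    return "json"
```

Once the flavor is known, only the matching parser needs to be initialized to read the header field from the first record.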
I totally get that, though I'd argue that this is not the situation we should optimize for:
My tool also supports netlog JSON. New documents could define new log formats or serialization formats. The peeking code that Robin suggests is a simple way to accommodate a range of future possibilities.
Is there precedent for this kind of peeking logic in other IETF serialization formats, or other data formats outside of the IETF? This seems extremely hacky to me.
Yes, this is typically referred to as "content sniffing" or "MIME sniffing" - it has drawbacks (that we should document further if we are keeping the text) but to my knowledge is commonly used to work around imperfect or incorrect metadata. Wikipedia highlights that the unix
and
As far as I understand, they accept new contributions, so it's entirely feasible we could add this for qlog serialization formats. In future, a binary encoding format could define some additional magic numbers to aid such sniffing. Registration requests for media types use this template, which includes an optional Magic Numbers part. Beyond
The MIME sniffing doc itself is very detailed. To pick some relevant parts, there's
There looks to have been an attempt to write something up in an I-D, but it seems it wasn't adopted: https://datatracker.ietf.org/doc/html/draft-abarth-mime-sniff-06. Not sure why, maybe someone more familiar with the history knows. But AFAIK web content sniffing is commonplace.
Very interesting, thank you for researching this @LPardue! It's kind of sad that we don't have a better way of determining the file type, but it seems like we're not doing something that's outrageously out of line here.
Discussed during the editors' meeting. Seems like @marten-seemann agrees this is acceptable? If so: please close the issue :)
From section 5:
If I don't know that the file is serialized as JSON-SEQ, I won't be able to parse the header that tells me that it's serialized as JSON-SEQ.
What would we lose by not including the serialization_format?