-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comply with RFC 8259 in strict mode #194
base: master
Are you sure you want to change the base?
Conversation
Pull request is in draft mode for now as I need to make a couple changes to fix parsing multiple JSON objects at once and improve validation for primitives, but I wanted to go ahead and get it out there in case there is any potential discussion. |
e910987
to
82f378a
Compare
82f378a
to
cedd377
Compare
Ensure primitives are "true", "false", "null", or an RFC 8259 compliant number. (Still need to add test cases.)
String parsing previously did not differ between strict and non-strict modes, but was not fully compliant with RFC 8259. RFC 8259 requires that control characters (code points < 0x20) be escaped. This is now enforced in strict mode. In addition, non-strict mode now does *no* validations on string contents, much like primitives in non-strict mode.
fd2e0e4
to
a3168f0
Compare
The remaining changes are now complete, and I've added test cases for the new functionality. Marking as "Ready to review." |
This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser. My jsmn_t ends up being:
and the jsmn_parser ends up as
With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:
and then check parser->expected against what the next item actually is
This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what. |
I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.
|
That is an interesting approach. Do you plan to share the implementation in a pull request? |
It seems you have already noticed, but for the benefit of other readers, I use a new
I probably could have used explicit constants for individual bits to make the code more readable and maintainable (in fact, I probably should add it still):
The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time. |
One other thing I forgot to put in the description: calling I will add an extra bullet point to the description for this. |
@themobiusproject I made some improvements based on your suggestion to use bitwise logic for the enum values. I think it helps readability a lot. |
I've added a branch with a fourth non-strict extension in my own fork:
I already have a bunch of pull requests pending here, so I'm not going to submit another one right now, but I thought I should at least mention it since this was originally allowed under non-strict mode. |
Let me finish up cleaning up a few bits so all of the previous tests pass again and I will at least put in a link to my fork. I have a lot of changes beyond this (cmake, visual studio, xcode projects, doxygen comments, tests in cmocka, extra json.{c,h} for wrapping jsmn) that don't necessarily go with just this issue and would have to pull the correct pieces for a clean pull request. |
@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests. master, right now, only has two tests left as FIXME and they are non-strict tests which I haven't been looking into. I cleared out some of the original strict type checks as the parser->expected versions covered them. I also like your idea of setting a flag to avoid strict checks a second time if it has already been done once to find the number of tokens.
Also, regardless of how long the enum is, it won't touch change compile size or performance, so having each and every option does seem like a good idea; I'll just ignore looking at it in the code. |
Just want to clarify: if you weren't already planning to create pull requests, don't feel like you have to just because I brought it up. I was just curious. |
I was looking at it previously because I like what I have come up with and would be happy to share it with others, especially through the official repo. Since this pushed me to finish the bugs that I had, I was going to get back to it. |
case '\r': | ||
case '\n': | ||
case ' ': | ||
case ',': | ||
case ':': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the rfc, I don't believe that a primitive can be followed by a ':'
, '{'
, or '['
and can only be followed by '\0'
when it is the lone element in the json string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch's purpose is simply to ensure there aren't extra characters in the primitive, like 1.234true
. As long as the character after is not allowed in a primitive, the check is satisfied. Proper validation of the next character happens elsewhere (in the main parsing loop).
This is important because jsmn will parse multiple "JSON texts" (in RFC 8259 terminology) in one string, if there are more than one present. For example, this is perfectly valid input for jsmn and will simply parse to two separate arrays:
["a", "b", "c"] [1, 2, 3]
Since this pull request adds the ability for strings and primitives to be complete JSON texts on their own, this check ensures that 1.234true
will be rejected rather than parsed as two separate primitives. (Strictly speaking, you could say they should be parsed as two primitives, since consecutive objects and arrays need not be separated by whitespace to be parsed by jsmn. But, consider a case like 1.234.5
: this would become ambiguous under those rules. It could be 1.2 34.5
or 1.23 4.5
. For that matter, even 1.234true
could be 1.234 true
, 1.23 4 true
, 1.2 34 true
, etc. Requiring at least one non-primitive character as separation resolves this ambiguity.)
Parsing multiple JSON texts is also the reason we might see a {
or [
following a primitive. :
cannot follow a primitive in strict RFC 8259 compliance, but it can in non-strict mode. In another pull request (#195), I add the ability to select specific non-strict features, so it is possible to have strict primitive validation but allow primitives as object keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing multiple JSON texts is also the reason we might see a { or [ following a primitive. : cannot follow a primitive in strict RFC 8259 compliance, but it can in non-strict mode. In another pull request (#195), I add the ability to select specific non-strict features, so it is possible to have strict primitive validation but allow primitives as object keys.
For the cases of :
, {
, and [
, your non-strict mode starts down on line 367 and the code up here is all for strict RFC 8259 compliance. It is true that \0
should be caught later if the end of the js string is deeper in an object or array, but it is an easy and inexpensive check right here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all valid input strings in strict mode:
true["a", "b", "c"]
(Parses as a primitive followed by an array.[
follows a primitive.)true{"a": 1}
(Parses as a primitive followed by an object.{
follows a primitive.)true
(Parses as an unenclosed primitive.\0
follows a primitive.)
Then, regarding the colon:
:
can never follow a primitive in strict mode, but I put it there anticipating enhancements like PR #195, which is built on top of this one. I encourage anyone reading to take a look at that PR for the full details, but to summarize the relevant part: It takes each use of #if[n]def JSMN_STRICT
and replaces it with #if[n]def JSMN_PERMISSIVE_PRIMITIVES
, JSMN_PERMISSIVE_STRINGS
, or JSMN_PRIMITIVE_KEYS
, allowing specific non-strict features to be selected. Notably, this means if JSMN_PRIMITIVE_KEYS
is defined but JSMN_PERMISSIVE_PRIMITIVES
is not, that :
in the switch statement becomes important.
I have a quick question regarding the improvements to strict mode: do these changes also fix the following?
Even with strict mode enabled, this results in the same tokens as if a comma had been present between the two items. jsonlint.com rejects the JSON above as not being valid. If it's fixed in this PR then great; otherwise, I'll make a separate issue. |
@x6herbius, this PR should fix that. That said, I'm not sure if this repository has been abandoned. The maintainers seem to have disappeared. Thus, while I had some other ideas for improvements and even submitted PRs for two of them (#195, #196), I've given up on making further contributions. I do have a couple more changes I made on my own fork (see PR #197 may also be worth checking out. They've made similar sorts of improvements (including proper RFC compliance, I believe) through a different style of implementation. |
@dominickpastore I'm still here. I haven't done much as it's been difficult for me due to how big the changes feel. While JSMN is not in the best shape at the moment, it is being dependent on by quite some people and I honestly fear breaking anything just because I did not analyse the changes properly. To give this repo a bit of life I will try to make a new experimental branch and possibly merge some of the PRs there. Maybe that way it will be possible to get better feedback from community about the changes. And I hope you won't feel like your contributions are a waste. |
@pt300 That's good to hear. I was a bit concerned when @themobiusproject and I added some discussion to some of the roadmap issues (#159 and #154) and it seemed to go unnoticed. I don't want to assume my PRs are the best for the project, either (especially now that there is also PR #197 for a lot of the same stuff). I made the changes because they were what I needed. But they did align with some parts of the roadmap, so I figured I'd submit them to help kickstart the discussion again. Other thing I would say: This is just my opinion, but it sounds like you are putting too much pressure on yourself. I'd bet, chances are, as long as JSMN can parse standard JSON, 90% of people will be happy. For the rest, it's up to them to speak up if they have a problem with the roadmap. And even if they don't speak up, nothing is forcing them to stop using the current version--that's the beauty of open source. |
@dominickpastore @pt300 I am still hanging around here as well. Both @dominickpastore's PR #194, #195, and #196 and my PR #197 are all valid options. I had once tried to break my PR into smaller changes but continued having chicken and egg problems so all of the changes are lumped into one. I also will comment, @pt300, dominick's and my changes won't merge together as we have both made many fundamental changes that won't jive together (though I know I have swiped some ideas from dominick's code to fix issues in mine). I certainly don't feel like my contributions have been a waste. There was someone who posted in my PR with a problem that he was having so it seems like someone has been able to benefit what I have done. |
it may be useful to note which branch is the most up to date and recommended for new projects, on whatever repository/fork, while issues of backwards compatibility and breaking get resolved. |
This PR brings strict mode in compliance with RFC 8259 (as well as some other changes):
There is an important caveat with primitives at the root level related to Incremental parsing doesn't work with unquoted stuff when JSMN_STRICT is not defined #179: If a number (or any primitive in non-strict mode) is split across incremental parses at the root level, an incomplete token will likely be produced on the first pass instead of
JSMN_ERROR_PART
. The parser has no way of knowing the token is incomplete. It will be corrected on the second pass when the full input string is available.jsmn_parse()
withtokens
set toNULL
skips a lot of the validation, even in strict mode. (The assumption being that this is almost always called right before ajsmn_parse()
withtokens
allocated, so for almost all use cases, this saves from doing the validations twice.)At the same time, this led to some significant changes to non-strict mode (consistent with comments I previously made on issue #154). In summary, non-strict mode is now limited to the following three deviations from RFC 8259: 1) Primitives can be made of any character that doesn't have special meaning in JSON (whitespace and any of
{}[],:"
), 2) Strings can contain any character and invalid backslash escapes, and 3) Primitives can be object keys, not just strings. Reasoning for these changes is below. (But see also comment later.)Unenclosed objects, missing values, and missing commas
Non-strict mode previously allowed the following examples to parse:
I believe this was intentional; however, this relaxed validation carried over into strict mode in several places (see issues #52, #131, #158). I added a simple state machine to the parser to fix the issues in strict mode, but fixing strict mode without affecting non-strict mode would have been difficult. As a result, non-strict mode no longer allows the above examples. I can think of ways we could make it work, but the added complexity didn't seem worth it.
Arrays and objects as keys
Non-strict mode no longer allows arrays and objects as object keys. This eliminates issue #193, but breaks backward compatibility for some non-RFC-8259-compliant JSON. Here is the reasoning:
size
1size
and its children'ssize
s, plus one for the object itself (very convenient for skipping array items or members of an object)size
is already used for their own contents. This is the cause of issue Parsing issues with objects and arrays as keys in non-strict mode #193. I could think of three potential fixes:size
for a key when it's an array or object. This fixes Parsing issues with objects and arrays as keys in non-strict mode #193 but still breaks the promises above, and adds noticeable complexity to the code. I don't like this solution.size
: either a newkey
field injsmntok_t
or modifyjsmntype_t
:keytok
field alongsidesupertok
injmsn_parser
. This could be enabled only whenJSMN_STRICT
is not defined. I don't hate this idea, but it is extra work for what is ultimately non-RFC-8259-compliant JSON. It still doesn't seem with the effort.Conclusion
These are some pretty significant changes to jsmn, but seeing the number of issues about invalid input being accepted, I hope they are well received. I have some ideas for future improvements also, but I will discuss that on the Roadmap issue (#159).