
JSONReader needs to fail on bad input #107

Closed
dustmop opened this issue Apr 2, 2018 · 7 comments

dustmop commented Apr 2, 2018

In order to proceed with fuzz testing (#103) the parser needs to reject invalid input so that the fuzzer library can properly mutate inputs and find the best edge cases. Currently, the JSONReader does not fail on most bad input text, instead returning some harmless entries before eventually returning EOF.

Here are some example invalid inputs (found through fuzzing) that are currently accepted without error:

{"""
{""
{""\n\r

[{"":0.891359201686479766013060971498190079908139321726943530014330540939446345918554318339765605212255964066145455497729631139948085803712198719971664381257402829111505715176981172691289708366:}

{""^M:""}

{"":78767417684105446,"":161921607789870092:"":81524627972118562,"":83456047028550530,

[{}{}

[itZatzVtFXY
fFUtEbc6Iuyt4MQJBAItQ+phy2U
fQtuUE9tcPg/FvytQDU
y2ptGsuSmgUtAqVXVlxt9Oeos0UUothgiDktJBzV+5Otd input"":"

[225940565588577426,97785349712147642s:111079331564477517I:63464490706047245:90488487091652254F"6xBlO8psh1cf4TONZUS1N5gIWD"

Specific things that need to be checked:

  • Objects have quoted keys, followed by a colon, then a value, followed by either a comma or the end of the object.
  • Objects end with closing curly braces and arrays end with closing square brackets.
  • Invalid characters don't appear (the stray byte in the last example is the hex character 0xff).
  • Braces and brackets are properly balanced.
  • Objects have unique keys.
  • Invalid alphabetic characters don't appear in number constants.

dustmop self-assigned this Apr 2, 2018

b5 commented Apr 2, 2018

Fantastic. Seems like you're already working out some classifications of errors (invalid characters, unbalanced braces, etc.). In a perfect world these types of errors would be codified & reflected in godoc, but I think we need to have a larger conversation about errors & error types before we get there. In the meantime it'd be great to see the different types of errors called out in comments or through consistent error message conventions; it'll make our lives easier down the road when we up the rigor of our error system.


dustmop commented Apr 2, 2018

Seems like we can get all of this by replacing the current parser with one based on Decoder.Token. It tokenizes JSON, returning delimiters, integers, booleans, floats, or strings, and automatically handles separators (commas, colons) and balanced braces.

See ExampleDecoder_Token: https://golang.org/src/encoding/json/example_test.go#L87
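For reference, a minimal sketch of the idea (an assumed shape, not the actual JSONReader rewrite): drain Decoder.Token and surface the first syntax error the decoder reports.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// scanTokens walks every token in the input and returns the first
// syntax error the decoder reports, or nil if the stream is clean.
func scanTokens(input string) error {
	dec := json.NewDecoder(strings.NewReader(input))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	fmt.Println(scanTokens(`{"a": [1, 2, true]}`)) // <nil>
	fmt.Println(scanTokens(`[{}{}`) != nil)        // true: missing comma between array elements
}
```

One caveat: Decoder.Token may still report plain io.EOF for some truncated streams (e.g. an object opened but never closed), so an end-of-input balance check is still worth keeping on top of this loop.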


b5 commented Apr 2, 2018

so much this. I was wondering if this didn't already exist for json in the standard lib, clearly it does. Let's do it!

I haven't been this pumped for a PR that deletes code and returns more errors in a while.

b5 added the bug label Apr 3, 2018

dustmop commented Apr 3, 2018

I've been investigating using Decoder.Token to rewrite the JSON parser code, but unfortunately, running the benchmarks shows the new implementation to be much slower, about twice as slow. Perhaps encoding/json is not performant, or there's still too much memory copying going on. I'm going to investigate other JSON parsing libraries (https://github.com/json-iterator/go claims to be high-performance) to see if they do better. For now, the new implementation with Decoder.Token does detect more errors; is it worth checking in just for that, or is the performance loss not worth it?

One other thing: there were two failing unit tests I didn't understand: TestEntryBuffer in entry_buffer_test.go and TestEachEntry in entry_test.go. These seem to be something like a pipe that connects a reader and a writer? With the new implementation I wrote, which verifies that JSON opens with a curly brace and closes with the matching one, it seems like these types won't initialize correctly, since they start out with empty buffers. What are their exact semantics?


b5 commented Apr 3, 2018

Innnnnnnnteresting.

re: slowness, my guess is encoding/json's Decoder is much slower because it makes extensive use of the reflect package. I'd like to keep the current version and instead focus on improving error handling for a few reasons:

  • In nearly all cases the input data is being passed through rigid parsers like json.Unmarshal or a csv.Reader before the data is recorded as a dataset, so we have a degree of insulation here (for now at least)
  • Slowness is a real concern given that these codepaths are intended to behave like a database; we don't have the constraint of needing to encode to various target types, because we know ahead of time what we're decoding to
  • thinking long term, I'd rather stick with our own home-rolled implementation to have the freedom to continue to form-fit the solution to our needs

TestEntryBuffer tests the real-world use case of using a buffer to do data format conversion (read from a csv-formatted EntryReader, and write to a json-formatted EntryBuffer). This is our current trick for getting out of one format and into another. TestEachEntry is a test that needs work, but is supposed to confirm the EachEntry package func works as expected, iterating each entry in a reader. Both need to work as they're currently implemented, which is another reason I'm happy to leave things as is.


dustmop commented Apr 4, 2018

Sounds good. One thing about TestEntryBuffer: the failing tests were due to my changing the semantics of the JSONReader, simply because I didn't understand this particular use case. My new version assumed that the constructor NewJSONReader should begin validating the underlying JSON text by looking for the opening "{" or "[". However, that doesn't need to happen; validation can start lazily the first time ReadEntry is called, so that TestEntryBuffer works correctly. That can be fixed in any future implementation, regardless of how we choose to proceed.


b5 commented Aug 20, 2018

This has been mostly addressed in #113 & #114. Closing!
