
Doesn't work on huge or improperly formatted json files #10

Open
davidawad opened this issue Feb 28, 2018 · 6 comments

davidawad commented Feb 28, 2018

I've been having a lot of specific problems that your tool would be PERFECT for, but I might be running against some bugs that are unique to my problem.

Normally I'd use jj to take a file that might not be perfect JSON (meaning it may contain trailing commas), parse it, and print it in the proper format for something else to consume. I'm finding that this isn't working with larger JSON files: jj is blowing up their size (taking files from 44MB to 5GB).

To be more specific, here's what I'm doing:

$ echo '[{"id": 1, "name": "Arthur", "age": "21"},{"id": 2, "name": "Richard", "age": "32"}]' | jj -p
# works fine
$ echo '[{"id": 1, "name": "Arthur", "age": "21"},{"id": 2, "name": "Richard", "age": "32"}]' | jq --color-output
# works fine

This works perfectly fine. I input a correct json object and both jq and jj are able to handle that.

jj is better for me because I have what I'll call lossy JSON: JSON with things like trailing commas that cause it to fail typical validation.

echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jj -p
# works fine

echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jq --color-output
# jq errors out
parse error: Expected another key-value pair at line 1, column 43

I need to read these JSON files in other programs after parsing them and making sure that they don't have these commas.
So I've been passing them through JJ and then sending them to jq to validate them.

$ echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jj -p  | jq --color-output 
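For what it's worth, if trailing-comma removal is all that's needed, a quick regex pass can sidestep the two-tool pipeline. This is a naive sketch, not a real JSON repair: it doesn't understand string literals, so a value that itself contains `,}` or `,]` would be corrupted.

```python
import json
import re

def strip_trailing_commas(text):
    # Drop any comma that is immediately followed (possibly after
    # whitespace) by a closing brace or bracket. Naive: does not
    # respect commas inside quoted string values.
    return re.sub(r',\s*([}\]])', r'\1', text)

lossy = ('[{"id": 1, "name": "Arthur", "age": "21", },'
         '{"id": 2, "name": "Richard", "age": "32",}]')
clean = strip_trailing_commas(lossy)
print(json.loads(clean)[1]["name"])  # -> Richard
```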

Now I've noticed that it's only when I use jj -p that jj performs the convenient cleanup I'm leveraging to clear the trailing commas.

I want the file to be small but I also want it to be valid.
So I'm now doing something like this:

$ echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jj -p  | jj -u | jq --color-output 

Now I'm getting the best of both worlds. It's expensive, but I don't care since I'm more interested in making sure the data is formatted correctly and don't care about my cpu grinding this out for a few more minutes.

The problem is that passing larger files through jj -p blows up their file size like crazy (43M -> 732M).

Here's a screenshot of what I'm seeing that I think is causing these huge size increases.

[screenshot from 2018-02-27: terminal output showing the inflated file size]

Size, of course, is no issue if the data is usable; however, it doesn't seem to be.

When attempting to parse the output file from just jj -p I get unusual problems.

>>> import json
>>> data = json.load(open('alabama_2012_expanded.json'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
RuntimeError: maximum recursion depth exceeded while calling a Python object

I've even tried raising Python's recursion limit to 10,000 and it still doesn't work.
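As an aside on the recursion limit: Python's json decoder recurses once per nesting level, so the limit has to be raised before calling json.load, and even then very deep documents can exhaust the interpreter's C stack. A minimal sketch of the mechanism (the nesting depth here is illustrative, not taken from the actual file):

```python
import json
import sys

# A document nested 2000 levels deep overflows the default recursion
# limit (typically 1000); raising the limit lets the parse finish,
# but extreme limits risk a hard interpreter crash instead of a
# catchable RecursionError.
deeply_nested = "[" * 2000 + "]" * 2000

sys.setrecursionlimit(10000)
data = json.loads(deeply_nested)
print(isinstance(data, list))  # -> True
```

If a 10,000 limit still isn't enough, that suggests the jj -p output is nested far deeper than the original data, which would also be consistent with the size blow-up.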

I've tried going from jj -p | jj -u and then passing that to jq to validate it. But when I validate it I get problems like this.

$ cat lossy_alabama_2012.json | jj -p | jj -u  > alabama_2012_cleaned.json
$ cat alabama_2012_cleaned.json | jq --color-output
parse error: Unfinished JSON term at EOF at line 1, column 42370384

Here's an example of one of these huge JSON files I'm working with; it's 43M.

TL;DR: jj -p is blowing up file sizes, and jj -p | jj -u is producing files that aren't valid.

My question for you: do you have any idea what could be going on here? I can't seem to get a proper version of this file saved the way I need it.

@pkoppstein

@davidawad - I have some good news regarding jj -u and a tool called rjson (a script that can be installed e.g. by: yarn global add relaxed-json).

@tidwall - Apart from the good news mentioned above, there is quite a lot of what might be regarded as "bad news" if jj is supposed to handle the same file in a uniform manner.

In the following:

  • the original alabama_2012 "non-JSON JSON" file is named alabama_2012.qjson
  • other file names are derived by appending a string reflecting subsequent processing.

GOOD NEWS: rjson produces valid JSON (an array of length 46) in about 3s.

 jq length alabama_2012.qjson.rjson
 46

GOOD NEWS: rjson can be applied to the output of jj -u to yield valid JSON that agrees with the rjson output:

alabama_2012.qjson.jj-u.rjson == alabama_2012.qjson.rjson

GOOD NEWS: jj -p is IDEMPOTENT on this file.

BAD NEWS: jj -u produces invalid JSON (though it can be repaired by rjson, as mentioned above).

BAD NEWS: "jj -p" produces invalid JSON that cannot be repaired by rjson:

rjson alabama_2012.json.jj-p > alabama_2012.json.jj-p.rjson
Error on line 318939: Unexpected token: end-of-file, expected json object

BAD NEWS: jsonlint -S fails

@davidawad
Author

davidawad commented Feb 28, 2018

So I actually resorted to a cheap trick that solved my problem. I was able to modify the program creating this lossy JSON and now I just include an extra empty {} at the end of every array being generated.

This wastes space (and is generally terrible), but saves massive time since I no longer have to parse the files to fix the messiness after the fact.

I still contend that this is an issue for jj, given that it inflates file size like crazy, but I no longer need it to be fixed.

Thanks for your help

@pkoppstein

@davidawad - Glad to hear you have an easy way out of the messiness. Having an extra {} is a small price to pay for conforming to a standard, which is the whole point of JSON: a simple but expressive format.

I've come to the conclusion that although in some cases JJ handles wonky JSON the way you'd want, there's usually some kind of "gotcha" -- which isn't surprising -- see above about having a worthwhile standard.

@tidwall
Owner

tidwall commented Feb 28, 2018

While jj can handle broken JSON in many cases, that is not its intended purpose. Under the hood it uses the gjson Get function, and the README states:

"The Get* and Parse* functions expects that the json is well-formed. Bad json will not panic, but it may return back unexpected results."

There are some kinds of bad json that jj can handle consistently, such as missing or trailing tokens. But I doubt that every scenario is recoverable at the moment, unterminated elements being one example.

Perhaps in the future I'll build a more defined approach to handling undefined JSON. 🤔

@pkoppstein

@tidwall - Thanks for the clarification. The surprising and unfortunate thing, I think, is that the -u and -p options do sometimes give inconsistent results when presented with quasi-JSON, thus inevitably raising the question of whether they always give consistent results when presented with strictly valid JSON. The good news (at least for me) is that the testing I've done on some very large (valid) JSON data files (up to 1,271,577,470 bytes when formatted) yields no surprises.

@tidwall
Owner

tidwall commented Mar 1, 2018

Valid json will not give inconsistent results. If you find that they do then please let me know.

@davidawad davidawad changed the title Doesn't work on huge json files Doesn't work on huge or improperly formatted json files Mar 12, 2018