
Doesn't work on huge or improperly formatted json files #10

Open
davidawad opened this issue Feb 28, 2018 · 6 comments

davidawad commented Feb 28, 2018

I've been having a lot of specific problems that your tool would be PERFECT for, but I might be running against some bugs that are unique to my problem.

Normally I'd use jj to take a file that might not be perfect JSON (meaning it may contain trailing commas), parse it, and print it in the proper format for something else to consume. I'm finding that this isn't working with larger JSON files: jj is blowing up their size (taking files from 44MB to 5GB).

To be more specific, here's what I'm doing:

$ echo '[{"id": 1, "name": "Arthur", "age": "21"},{"id": 2, "name": "Richard", "age": "32"}]' | jj -p
# works fine
$ echo '[{"id": 1, "name": "Arthur", "age": "21"},{"id": 2, "name": "Richard", "age": "32"}]' | jq --color-output
# works fine

This works perfectly fine. I input a correct json object and both jq and jj are able to handle that.

jj is better for me because I have what I'll call lossy JSON: JSON with things like trailing commas that cause it to fail typical validation.

echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jj -p
# works fine

echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jq --color-output
# jq errors out
parse error: Expected another key-value pair at line 1, column 43

I need to read these JSON files in other programs after parsing them and making sure that they don't have these commas.
So I've been passing them through JJ and then sending them to jq to validate them.

$ echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jj -p  | jq --color-output 
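For what it's worth, if trailing-comma removal is all that's needed, a quick regex pass can sidestep the two-tool pipeline. This is a naive sketch, not a real JSON repair: it doesn't understand string literals, so a value that itself contains `,}` or `,]` would be corrupted.

```python
import json
import re

def strip_trailing_commas(text):
    # Drop any comma that is immediately followed (possibly after
    # whitespace) by a closing brace or bracket. Naive: does not
    # respect commas inside quoted string values.
    return re.sub(r',\s*([}\]])', r'\1', text)

lossy = ('[{"id": 1, "name": "Arthur", "age": "21", },'
         '{"id": 2, "name": "Richard", "age": "32",}]')
clean = strip_trailing_commas(lossy)
print(json.loads(clean)[1]["name"])  # -> Richard
```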

Now I've noticed that it's only when I use jj -p that jj performs the convenient cleanup I'm leveraging to clear the trailing commas.

I want the file to be small but I also want it to be valid.
So I'm now doing something like this:

$ echo '[{"id": 1, "name": "Arthur", "age": "21", },{"id": 2, "name": "Richard", "age": "32",}]' | jj -p  | jj -u | jq --color-output 

Now I'm getting the best of both worlds. It's expensive, but I don't care since I'm more interested in making sure the data is formatted correctly and don't care about my cpu grinding this out for a few more minutes.

The problem is that passing larger files through jj -p blows up their file size like crazy (43M -> 732M).

Here's a screenshot of what I'm seeing that I think is causing these huge size increases.

[screenshot from 2018-02-27: terminal output showing the inflated file size]

Size, of course, is no issue if the data is usable; however, it doesn't seem to be.

When attempting to parse the output file from just jj -p I get unusual problems.

>>> import json
>>> data = json.load(open('alabama_2012_expanded.json'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
RuntimeError: maximum recursion depth exceeded while calling a Python object

I've even tried raising Python's recursion limit to 10,000 and it still doesn't work.
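As an aside on the recursion limit: Python's json decoder recurses once per nesting level, so the limit has to be raised before calling json.load, and even then very deep documents can exhaust the interpreter's C stack. A minimal sketch of the mechanism (the nesting depth here is illustrative, not taken from the actual file):

```python
import json
import sys

# A document nested 2000 levels deep overflows the default recursion
# limit (typically 1000); raising the limit lets the parse finish,
# but extreme limits risk a hard interpreter crash instead of a
# catchable RecursionError.
deeply_nested = "[" * 2000 + "]" * 2000

sys.setrecursionlimit(10000)
data = json.loads(deeply_nested)
print(isinstance(data, list))  # -> True
```

If a 10,000 limit still isn't enough, that suggests the jj -p output is nested far deeper than the original data, which would also be consistent with the size blow-up.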

I've tried going from jj -p | jj -u and then passing that to jq to validate it. But when I validate it I get problems like this.

$ cat lossy_alabama_2012.json | jj -p | jj -u  > alabama_2012_cleaned.json
$ cat alabama_2012_cleaned.json | jq --color-output
parse error: Unfinished JSON term at EOF at line 1, column 42370384

Here's an example of one of these huge JSON files I'm working with; it's 43M.

TL;DR: jj -p is blowing up file sizes, and jj -p | jj -u is producing files that aren't valid.

My question for you: do you have any idea what could be going on here? I can't seem to get a proper version of this file saved the way I need it.

@pkoppstein

@davidawad - I have some good news regarding jj -u and a tool called rjson (a script that can be installed e.g. by: yarn global add relaxed-json).

@tidwall - Apart from the good news mentioned above, there is quite a lot of what might be regarded as "bad news" if jj is supposed to handle the same file in a uniform manner.

In the following:

  • the original alabama_2012 "non-JSON JSON" file is named alabama_2012.qjson
  • other file names are derived by appending a string reflecting subsequent processing.

GOOD NEWS: rjson produces valid JSON (an array of length 46) in about 3s.

 jq length alabama_2012.qjson.rjson
 46

GOOD NEWS: rjson can be applied to the output of jj -u to yield valid JSON that agrees with the rjson output:

alabama_2012.qjson.jj-u.rjson == alabama_2012.qjson.rjson

GOOD NEWS: jj -p is IDEMPOTENT on this file.

BAD NEWS: jj -u produces invalid JSON (though it can be repaired by rjson, as mentioned above).

BAD NEWS: "jj -p" produces invalid JSON that cannot be repaired by rjson:

rjson alabama_2012.json.jj-p > alabama_2012.json.jj-p.rjson
Error on line 318939: Unexpected token: end-of-file, expected json object

BAD NEWS: jsonlint -S fails

@davidawad
Author

davidawad commented Feb 28, 2018

So I actually resorted to a cheap trick that solved my problem. I was able to modify the program creating this lossy JSON and now I just include an extra empty {} at the end of every array being generated.

This wastes space (and is generally terrible), but saves massive time since I no longer have to parse the files to fix the messiness after the fact.

I still contend that this is an issue for jj, given that it inflates file size like crazy, but I no longer need it to be fixed.

Thanks for your help

@pkoppstein

@davidawad - Glad to hear you have an easy way out of the messiness. Having an extra {} is a small price to pay for conforming to a standard, which is the whole point of JSON: a simple but expressive format.

I've come to the conclusion that although in some cases JJ handles wonky JSON the way you'd want, there's usually some kind of "gotcha" -- which isn't surprising -- see above about having a worthwhile standard.

@tidwall
Owner

tidwall commented Feb 28, 2018

While jj can handle broken JSON in many cases, that is not its intended purpose. Under the hood it uses the gjson Get function, and the README states:

"The Get* and Parse* functions expects that the json is well-formed. Bad json will not panic, but it may return back unexpected results."

There are some kinds of bad json that jj can handle consistently, such as missing or trailing tokens. But I doubt that every scenario is recoverable at the moment, unterminated elements being one example.

Perhaps in the future I'll build a more defined approach to handling undefined JSON. 🤔

@pkoppstein

@tidwall - Thanks for the clarification. The surprising and unfortunate thing, I think, is that the -u and -p options do sometimes give inconsistent results when presented with quasi-JSON, thus inevitably raising the question of whether they always give consistent results when presented with strictly valid JSON. The good news (at least for me) is that the testing I've done on some very large (valid) JSON data files (up to 1,271,577,470 bytes when formatted) yields no surprises.

@tidwall
Owner

tidwall commented Mar 1, 2018

Valid json will not give inconsistent results. If you find that they do then please let me know.

@davidawad davidawad changed the title Doesn't work on huge json files Doesn't work on huge or improperly formatted json files Mar 12, 2018