Fix line splitting from ripgrep --json output #16

danipozo · 2023-09-22T16:13:52Z

ripgrep --json results may contain characters that str.splitlines treats as line boundaries, whereas ripgrep itself only treats \n or \r\n as newlines (this file is an example where that happens). The JSON records in ripgrep's output are separated by standard newlines only.

Tested with SeaGOAT, which depends on ripgrepy.
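To illustrate the mismatch, here is a minimal sketch with a made-up string (not taken from the example file). Python's str.splitlines breaks on characters such as U+0085, U+2028 and U+2029, while splitting on "\n" does not:

```python
# U+0085 (NEL), U+2028 and U+2029 are line boundaries for str.splitlines,
# but they are ordinary characters for a plain split on "\n".
text = "first\x85still first\u2028and still first\nsecond\n"

by_splitlines = text.splitlines()
by_newline = text.split("\n")

print(by_splitlines)  # four pieces: the first "line" is cut in three
print(by_newline)     # two lines plus a trailing empty string
```

Note that split("\n") leaves an empty string after the final newline, which a caller has to skip.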

securisec · 2023-09-24T07:29:42Z

@danipozo could you give an example of how you expect ripgrep's response to look with the --json flag for the example file you provided, versus how ripgrepy is currently outputting the result?

danipozo · 2023-09-24T09:58:54Z

ripgrepy currently throws an exception with the --json flag for the example file:

>>> Ripgrepy('.', '0').json().run().as_dict
ERROR:root:
Traceback (most recent call last):
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

This exception is caused by splitting the output of ripgrep --json at a newline-like character inside a JSON object. This fragment of a match object returned by ripgrep:

{
  "type": "match",
  "data": {
    "path": {
      "text": "01666_blns_long.reference"
    },
    "lines": {
      "text": "['undefined','undef','null','NULL','(null)','nil','NIL','true','false','True','False','TRUE','FALSE','None','hasOwnProperty','then','\\\\','\\\\\\\\','0','1','1.00','$1.00','1/2','1E2','1E02','1E+02','-1','-1.00','-$1.00','-1/2','-1E2','-1E02','-1E+02','1/0','0/0','-2147483648/-1','-9223372036854775808/-1','-0','-0.0','+0','+0.0','0.00','0..0','.','0.0.0','0,00','0,,0',',','0,0,0','0.0/0','1.0/0.0','0.0/0.0','1,0/0,0','0,0/0,0','--1','-','-.','-,','999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','NaN','Infinity','-Infinity','INF','1#INF','-1#IND','1#QNAN','1#SNAN','1#IND','0x0','0xffffffff','0xffffffffffffffff','0xabad1dea','123456789012345678901234567890123456789','1,000.00','1 000.00','1\\'000.00','1,000,000.00','1 000 000.00','1\\'000\\'000.00','1.000,00','1 000,00','1\\'000,00','1.000.000,00','1 000 000,00','1\\'000\\'000,00','01000','08','09','2.2250738585072011e-308',',./;\\'[]\\\\-=','<>?:\"{}|_+','!@#$%^&*()`~','\\\\u0001\\\\u0002\\\\u0003\\\\u0004\\\\u0005\\\\u0006\\\\u0007\\b\\\\u000e\\\\u000f\\\\u0010\\\\u0011\\\\u0012\\\\u0013\\\\u0014\\\\u0015\\\\u0016\\\\u0017\\\\u0018\\\\u0019\\\\u001a\\\\u001b\\\\u001c\\\\u001d\\\\u001e\\\\u001f\u007f','<U+0080><U+0081><U+0082><U+0083>
<U+0084><U+0086><U+0087><U+0088><U+0089><U+008A><U+008B><U+008C><U+008D><U+008E><U+008F><U+0090><U+0091><U+0092><U+0093><U+0094><U+0095><U+0096><U+0097>
<U+0098><U+0099><U+009A><U+009D><U+009E><U+009F>','\\t\\\\u000b\\f <U+0085>             <U+2028><U+2029>  　','؀؁؂؃؄؅؜۝܏᠎

is broken at the <U+0085> character, which leaves an invalid JSON object and causes the exception above. The JSON objects representing matches are themselves separated only by standard newlines, which shouldn't appear inside matches because ripgrep processes input line by line.
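The failure can be reproduced in isolation. The sketch below uses a hypothetical, much shorter record than the one above and assumes, as the report describes, that U+0085 reaches Python as a raw character inside the JSON string. The exact error message may differ from the traceback shown earlier:

```python
import json

# Hypothetical, shortened stand-in for one line of `rg --json` output.
record = {"type": "match", "data": {"lines": {"text": "a\x85b\n"}}}
# ensure_ascii=False keeps U+0085 as a raw character instead of escaping it.
line = json.dumps(record, ensure_ascii=False)
output = line + "\n"

# splitlines() cuts the record in two at U+0085 ...
pieces = output.splitlines()
try:
    json.loads(pieces[0])
    broke = False
except json.JSONDecodeError:
    broke = True  # ... so the first piece is not valid JSON.

# Splitting on "\n" only keeps the record intact.
records = [json.loads(chunk) for chunk in output.split("\n") if chunk]
print(len(pieces), broke, records == [record])
```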

securisec · 2023-09-24T15:31:37Z

I am not confident that the PR will solve this issue except on systems that use \n, like Linux. In its current implementation it may create an issue on Windows. Furthermore, the sample data also seems to cause issues with anything written in Python, e.g. the kitty terminal, which hangs on rg '0' against the file.

I think a better PR would be to pass the split_by character via a function parameter for as_dict and as_json.

For example:

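from typing import Union

def as_dict(self, split_by: Union[str, None] = None):
    ...
    # Use the caller-supplied separator if given, else fall back to splitlines().
    if split_by is not None:
        out = self._output.split(split_by)
    else:
        out = self._output.splitlines()
    ...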
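A runnable version of this idea, as a sketch only: parse_json_lines is a hypothetical standalone helper, not ripgrepy's actual as_dict, and the sample records are invented. It keeps splitlines() as the default and lets the caller opt in to a specific separator:

```python
import json
from typing import List, Optional


def parse_json_lines(output: str, split_by: Optional[str] = None) -> List[dict]:
    """Hypothetical helper: decode newline-delimited JSON records."""
    if split_by is not None:
        chunks = output.split(split_by)
    else:
        chunks = output.splitlines()
    # split() leaves an empty chunk after the final separator; skip blanks.
    return [json.loads(chunk) for chunk in chunks if chunk.strip()]


# Invented sample: the second record contains a raw U+0085 character.
raw = '{"type": "begin"}\n{"type": "match", "text": "a\x85b"}\n'
parsed = parse_json_lines(raw, split_by="\n")
print(len(parsed))
```

With the default split_by=None, the same input would still raise json.JSONDecodeError, so callers would need to know to pass "\n".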

Please also help me understand for line in out[:-1]: in the PR: why are we ignoring the last line?
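One plausible reason, which the PR author does not confirm in this thread: when output ends with a newline, str.split("\n") returns a trailing empty string that str.splitlines() would not, and that empty string is not valid JSON. A small sketch with invented records:

```python
# Invented two-record output that ends with a newline.
output = '{"type": "begin"}\n{"type": "end"}\n'

by_newline = output.split("\n")
by_splitlines = output.splitlines()

print(by_newline)     # last element is an empty string
print(by_splitlines)  # no trailing empty element
```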

RecRanger · 2024-05-27T08:15:39Z

I believe I've encountered this problem, as I too am getting decoding errors. It seems like a bug if ripgrep is emitting newlines within JSON, though?

What about an approach where it splits by newline and tries to decode each chunk as it goes? If decoding fails, it stores the chunk in a running undecoded_partial_json variable (adding the stripped newline character back too), and then retries decoding undecoded_partial_json each time it gets a new chunk.
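A minimal sketch of that accumulate-and-retry idea, under the assumption that the whole output is already available as one string. The function name iter_json_objects is hypothetical, and the sample data is invented:

```python
import json
from typing import Iterator


def iter_json_objects(output: str) -> Iterator[dict]:
    """Hypothetical helper: rejoin pieces until they decode as JSON."""
    undecoded_partial_json = ""
    # keepends=True preserves the separator, so nothing is lost when rejoining.
    for piece in output.splitlines(keepends=True):
        undecoded_partial_json += piece
        try:
            obj = json.loads(undecoded_partial_json)
        except json.JSONDecodeError:
            continue  # incomplete record: wait for the next piece
        undecoded_partial_json = ""
        yield obj
    if undecoded_partial_json.strip():
        raise ValueError("trailing data is not valid JSON")


# Invented sample: the second record contains raw U+0085 and U+2028.
sample = '{"type": "begin"}\n{"text": "a\x85b\u2028c"}\n'
objects = list(iter_json_objects(sample))
print(objects)
```

This tolerates any splitting rule at the cost of re-parsing the buffer after every piece, which can become slow for records that contain many newline-like characters.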

Fix line splitting from ripgrep --json output

7173271

danipozo mentioned this pull request Sep 22, 2023

Try to ignore binary files and detect proper encoding kantord/SeaGOAT#240

Merged
