Fix line splitting from ripgrep --json output #16

Open
wants to merge 1 commit into
base: master

danipozo

ripgrep --json results may contain characters that str.splitlines considers line boundaries, because ripgrep itself only treats \n or \r\n as newlines (this file is an example where that happens). The records in ripgrep's output are themselves separated by standard newlines.
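A minimal sketch of the mismatch described above, using a made-up record rather than real ripgrep output. The record contents are an assumption for illustration; `ensure_ascii=False` is used so that U+0085 stays a raw character in the serialized text, as in the report.

```python
import json

# Hypothetical record: a single JSON object whose text contains U+0085 (NEL).
record = json.dumps(
    {"type": "match", "data": {"lines": {"text": "a\x85b"}}},
    ensure_ascii=False,  # keep U+0085 as a raw character instead of an escape
)
output = record + "\n"  # records are terminated by a standard newline

# str.splitlines() also splits on U+0085, cutting the object in two.
by_splitlines = output.splitlines()

# str.split("\n") splits only on the standard newline.
by_newline = output.split("\n")

print(len(by_splitlines))                      # 2 pieces, neither is valid JSON
print(json.loads(by_newline[0])["type"])       # the whole object decodes
```

Besides U+0085, `str.splitlines` also treats characters such as U+2028 and U+2029 as line boundaries, and both appear in the example file's match text.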

Tested with dependent SeaGOAT.

@securisec
Owner

@danipozo could you give an example of what you expect ripgrep's response to look like with the --json flag for the example file you provided, compared with what ripgrepy currently outputs?

@danipozo
Author

ripgrepy currently throws an exception with the --json flag for the example file:

>>> Ripgrepy('.', '0').json().run().as_dict
ERROR:root:
Traceback (most recent call last):
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

This exception is caused by splitting the output of ripgrep --json at a newline-like character inside a JSON object. This fragment of a match object returned by ripgrep:

{
  "type": "match",
  "data": {
    "path": {
      "text": "01666_blns_long.reference"
    },
    "lines": {
      "text": "['undefined','undef','null','NULL','(null)','nil','NIL','true','false','True','False','TRUE','FALSE','None','hasOwnProperty','then','\\\\','\\\\\\\\','0','1','1.00','$1.00','1/2','1E2','1E02','1E+02','-1','-1.00','-$1.00','-1/2','-1E2','-1E02','-1E+02','1/0','0/0','-2147483648/-1','-9223372036854775808/-1','-0','-0.0','+0','+0.0','0.00','0..0','.','0.0.0','0,00','0,,0',',','0,0,0','0.0/0','1.0/0.0','0.0/0.0','1,0/0,0','0,0/0,0','--1','-','-.','-,','999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','NaN','Infinity','-Infinity','INF','1#INF','-1#IND','1#QNAN','1#SNAN','1#IND','0x0','0xffffffff','0xffffffffffffffff','0xabad1dea','123456789012345678901234567890123456789','1,000.00','1 000.00','1\\'000.00','1,000,000.00','1 000 000.00','1\\'000\\'000.00','1.000,00','1 000,00','1\\'000,00','1.000.000,00','1 000 000,00','1\\'000\\'000,00','01000','08','09','2.2250738585072011e-308',',./;\\'[]\\\\-=','<>?:\"{}|_+','!@#$%^&*()`~','\\\\u0001\\\\u0002\\\\u0003\\\\u0004\\\\u0005\\\\u0006\\\\u0007\\b\\\\u000e\\\\u000f\\\\u0010\\\\u0011\\\\u0012\\\\u0013\\\\u0014\\\\u0015\\\\u0016\\\\u0017\\\\u0018\\\\u0019\\\\u001a\\\\u001b\\\\u001c\\\\u001d\\\\u001e\\\\u001f\u007f','<U+0080><U+0081><U+0082><U+0083>
<U+0084><U+0086><U+0087><U+0088><U+0089><U+008A><U+008B><U+008C><U+008D><U+008E><U+008F><U+0090><U+0091><U+0092><U+0093><U+0094><U+0095><U+0096><U+0097>
<U+0098><U+0099><U+009A><U+009D><U+009E><U+009F>','\\t\\\\u000b\\f <U+0085>             ​<U+2028><U+2029>   ','­؀؁؂؃؄؅؜۝܏᠎

is broken at the <U+0085> character, which leaves an invalid JSON object and causes the exception above. The JSON objects representing matches are themselves only separated by standard newlines, which shouldn't appear inside matches because ripgrep processes its input line by line.
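One way the idea above could be sketched is a parser that splits on "\n" only. `parse_json_lines` is a hypothetical helper, not part of ripgrepy's API, and the sample records are invented. Stripping a trailing "\r" is an assumption meant to tolerate \r\n-terminated output.

```python
import json


def parse_json_lines(output: str) -> list:
    # Hypothetical helper: decode newline-delimited JSON, splitting on LF only.
    records = []
    for line in output.split("\n"):
        line = line.rstrip("\r")  # assumption: tolerate CRLF-terminated records
        if not line:
            continue  # skip the empty piece after the final newline
        records.append(json.loads(line))
    return records


# Invented sample: two records whose text contains U+0085 and U+2028.
sample = (
    json.dumps({"type": "match", "text": "a\x85b"}, ensure_ascii=False)
    + "\r\n"
    + json.dumps({"type": "match", "text": "c\u2028d"}, ensure_ascii=False)
    + "\r\n"
)
parsed = parse_json_lines(sample)
print(len(parsed))
```

Raw "\n" and "\r" cannot occur inside a valid JSON string, so splitting on them never cuts a record in the middle.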

@securisec
Owner

I am not confident that the PR will solve this issue except on systems that use \n, like Linux. In its current implementation it may create an issue on Windows. Furthermore, the sample data also seems to cause issues with anything written in Python, e.g. the kitty terminal, which hangs on rg '0' against the file.

I think a better PR would be to pass the split_by character via a function parameter for as_dict and as_json.

For example:

def as_dict(self, split_by: Union[str, None] = None):
    ...
    if split_by is not None:
        out = self._output.split(split_by)
    else:
        out = self._output.splitlines()
    ...

Please also help me understand for line in out[:-1]: in the PR. Why are we ignoring the last line?
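A possible explanation for the out[:-1] slice, assuming the PR switches from splitlines() to split("\n"): output that ends with a newline produces a trailing empty string under split("\n"), which is not valid JSON. The sample output below is invented for illustration.

```python
# Invented sample: two records, each terminated by a standard newline.
output = '{"type": "begin"}\n{"type": "end"}\n'

parts = output.split("\n")
print(parts)        # the last element is an empty string
print(parts[:-1])   # dropping it leaves only the two records

# splitlines() does not produce that trailing empty element.
lines = output.splitlines()
print(lines)
```

Dropping the last element unconditionally would lose a record if the output did not end with a newline, so filtering out empty pieces may be the safer variant.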

@RecRanger

I believe I've encountered this problem as I too am getting decoding errors. It seems like a bug if ripgrep is emitting newlines within JSON though?

What about an approach that splits by newline and tries to decode each chunk as it goes? If decoding fails, it stores the chunk in a running undecoded_partial_json variable (adding the stripped newline character back too), and then retries decoding undecoded_partial_json each time it gets a new chunk.
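The buffering approach described above could look roughly like this. `decode_with_buffer` is a hypothetical name, and the sketch assumes the whole output is already available as one string. Using `splitlines(keepends=True)` keeps each separator attached to its chunk, so nothing has to be added back by hand.

```python
import json


def decode_with_buffer(output: str) -> list:
    # Hypothetical helper: retry decoding until the buffered text is valid JSON.
    records = []
    undecoded_partial_json = ""
    for chunk in output.splitlines(keepends=True):
        undecoded_partial_json += chunk
        try:
            records.append(json.loads(undecoded_partial_json))
        except json.JSONDecodeError:
            continue  # incomplete object: keep buffering
        undecoded_partial_json = ""  # decoded successfully, start a new record
    if undecoded_partial_json.strip():
        raise ValueError("trailing data is not valid JSON")
    return records


# Invented sample: the first record is cut by splitlines() at U+0085.
sample = (
    json.dumps({"type": "match", "text": "a\x85b"}, ensure_ascii=False)
    + "\n"
    + json.dumps({"type": "end"})
    + "\n"
)
decoded = decode_with_buffer(sample)
print(len(decoded))
```

The trade-off is that each failed attempt re-decodes the whole buffer, so a record containing many newline-like characters is parsed repeatedly. Splitting on "\n" alone avoids that cost.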
