Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffered reader is default for jsonl #446

Closed
markcurtis1970 opened this issue Oct 10, 2024 · 6 comments · Fixed by #482
Closed

Buffered reader is default for jsonl #446

markcurtis1970 opened this issue Oct 10, 2024 · 6 comments · Fixed by #482

Comments

@markcurtis1970
Copy link

markcurtis1970 commented Oct 10, 2024

Describe the bug
When using the jsonl source larger files might exceed the standard bufio limit

%  % sq add 131467comments.json
sq: bufio.Scanner: token too long

To Reproduce
Read in a large file

Expected behavior
Ideally we should cope with larger lines

sq version

Paste the output of sq version --yaml into the code block below:

# $ sq version --yaml
# REPLACE THIS WITH YOUR OUTPUT
version: v0.48.3
commit: 3ae4104
timestamp: 2024-03-12T02:55:43Z
latest_version: v0.48.3
host:
  platform: darwin
  arch: amd64
  kernel: Darwin
  kernel_version: 23.3.0
  variant: macOS
  variant_version: 14.3.1

Source details

Source didnt load so had no further detail

Logs

ESC[2m21:46:07.829955ESC[0m ESC[33mINFESC[0m ESC[34mfiles.(*Files).doCacheSweepESC[0m ESC[2mfiles/cache.go:448ESC[0m ESC[33mSwept cache dirESC[0m ESC[2mpid=ESC[0mESC[36;2m60415ESC[0m ESC[2mdir=ESC[0mESC[36;2m/Users/mc/Library/Caches/sq/60f4d1f1ESC[0m ESC[2mdeleted_dirs=ESC[0mESC[36;2m0EESC[0m
ESC[2m21:46:07.830702ESC[0m ESC[1;91mERRESC[0m ESC[34mcli.PrintErrorESC[0m ESC[2mcli/error.go:55ESC[0m ESC[1;91mEXECUTION FAILEDESC[0m ESC[2mpid=ESC[0mESC[36;2m60415ESC[0m ESC[2merr.msg=ESC[0mESC[36;2m"bufio.Scanner: token too long"ESC[0m ESC[2merr.type=ESC[0mESC[36;2m*errz.errzESC[0m EESC[2merr.cause=ESC[0mESC[36;2m*errors.errorStringESC[0m ESC[2mcmd=ESC[0mESC[36;2maddESC[0m
ESC[92;2m*errz.errz: *errz.errz: *errors.errorStringESC[0m
ESC[1;33mbufio.Scanner: token too longESC[0m
ESC[0;35mgithub.com/neilotoole/sq/libsq/files.(*Files).detectTypeESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/libsq/files/detect.go:180ESC[0m
ESC[0;35mgithub.com/neilotoole/sq/libsq/files.(*Files).DetectTypeESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/libsq/files/detect.go:73ESC[0m
ESC[0;35mgithub.com/neilotoole/sq/cli.execSrcAddESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/cli/cmd_add.go:217ESC[0m
ESC[0;35mgithub.com/neilotoole/sq/cli.addCmd.func2ESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/cli/cli.go:355ESC[0m
ESC[0;35mgithub.com/spf13/cobra.(*Command).executeESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:983ESC[0m
ESC[0;35mgithub.com/spf13/cobra.(*Command).ExecuteCESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115ESC[0m
ESC[0;35mgithub.com/spf13/cobra.(*Command).ExecuteESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039ESC[0m
ESC[0;35mgithub.com/spf13/cobra.(*Command).ExecuteContextESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:1032ESC[0m
ESC[0;35mgithub.com/neilotoole/sq/cli.ExecuteWithESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/cli/cli.go:217ESC[0m
ESC[0;35mgithub.com/neilotoole/sq/cli.ExecuteESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/cli/cli.go:76ESC[0m
ESC[0;35mmain.mainESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/main.go:56ESC[0m
ESC[0;35mruntime.mainESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/proc.go:271ESC[0m
ESC[0;35mruntime.goexitESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695ESC[0m

ESC[92;2m*errz.errz: *errors.errorStringESC[0m
ESC[1;33mbufio.Scanner: token too longESC[0m
ESC[0;35mgithub.com/neilotoole/sq/cli.FinishRunInit.DetectJSONL.func5ESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/drivers/json/ingest_jsonl.go:79ESC[0m
ESC[0;35mgithub.com/neilotoole/sq/libsq/files.(*Files).detectType.func3ESC[0m
ESC[0;35m  /Users/runner/work/sq/sq/libsq/files/detect.go:161ESC[0m
ESC[0;35mgolang.org/x/sync/errgroup.(*Group).Go.func1ESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78ESC[0m
ESC[0;35mruntime.goexitESC[0m
ESC[0;35m  /Users/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695ESC[0m

Screenshots
n/a

Additional context

When using a bufio.NewScanner the default is 64k

https://pkg.go.dev/bufio#Scanner.Buffer

I have used this example in code before with success

https://stackoverflow.com/questions/39859222/golang-how-to-overcome-scan-buffer-limit-from-bufio/66876416#66876416

However in the this part of the code the file isnt present so we cant really get the info from file.Stat()

https://github.com/neilotoole/sq/blob/master/drivers/json/ingest_jsonl.go#L38

@neilotoole
Copy link
Owner

@markcurtis1970 When I wrote that ingest code using bufio.Scanner, in my soul I knew I'd have to answer to that 64K limit some day...

If you have a specific (cleansed) file you can share that breaks the ingester, it could be added as a test case.

I don't think this is an insurmountable problem. The Scanner is basically only being used for line breaks.

Note, in general: that jsonl ingester code is pretty awful (and the other JSON ingesters too). There's been very little attempt at any sort of optimization (or even doing it right). The whole thing should probably be thrown away and rewritten.

I'm very open to a PR, and contributors in general, not just for this part of the codebase, but for the entire project.

@markcurtis1970
Copy link
Author

...in my soul I knew I'd have to answer to that 64K limit some day...

Haha yeah I know that feeling! (im gonna have to circle back to this)

Sure, thanks - I was going to try a PR but there wasnt a file object to get the stat from and my "go-fu" is intermediate at best ;-) otherwise I would love to help!

I'll see if I can sanitise that file in the meantime!

@YeharaDananjaya

This comment has been minimized.

@neilotoole neilotoole linked a pull request Jan 19, 2025 that will close this issue
neilotoole added a commit that referenced this issue Jan 20, 2025
* JSON TypeDetectFuncs: additional checks to exit early if not possible to be valid JSON
* Refactoring of `bufio.Scanner` usage
@neilotoole
Copy link
Owner

@markcurtis1970: The latest sq version v0.48.5 addresses this issue.

  • The default buffer max size is significantly larger.

  • There's a new config option, tuning.scan-buffer-limit, that allows the user to increase the buffer limit even higher. E.g.

    $ sq config set tuning.scan-buffer-limit 64MB   # or 1024B, 64KB, 1GB, etc.
    

Can you give the new version a try, and re-open this issue if there's still trouble?

Thanks for your patience.

@markcurtis1970
Copy link
Author

markcurtis1970 commented Jan 21, 2025

Ok I tested this with the original file that gave the issue

% sq version
sq v0.48.5

% sq config set tuning.scan-buffer-limit 1B
% sq add ./131467comments.json
sq: detect: jsonl: bufio.Scanner: token too long: maybe adjust config 'tuning.scan-buffer-limit'

% sq config set tuning.scan-buffer-limit 64MB
% sq add ./131467comments.json
@h131467comments  jsonl  131467comments.json
%

Looks good to me thanks!

@neilotoole
Copy link
Owner

Excellent, thank you @markcurtis1970 🙏

tmeijn pushed a commit to tmeijn/dotfiles that referenced this issue Jan 28, 2025
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [neilotoole/sq](https://github.com/neilotoole/sq) | patch | `v0.48.4` -> `v0.48.5` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>neilotoole/sq (neilotoole/sq)</summary>

### [`v0.48.5`](https://github.com/neilotoole/sq/blob/HEAD/CHANGELOG.md#v0485---2025-01-19)

[Compare Source](neilotoole/sq@v0.48.4...v0.48.5)

##### Fixed

-   \[[#&#8203;446](neilotoole/sq#446)]: A [`bufio.ErrTooLong`](https://pkg.go.dev/bufio#ErrTooLong) was being returned
    by [`bufio.Scanner`](https://pkg.go.dev/bufio#Scanner), when splitting
    lines from input that was too long (larger than
    [`bufio.MaxScanTokenSize`](https://pkg.go.dev/bufio#MaxScanTokenSize), i.e. `64KB`). This meant that
    `sq` wasn't able to parse large JSON files, amongst other problems. The maximum buffer size is
    now configurable via the new [`tuning.scan-buffer-limit`](https://sq.io/docs/config/#tuningscan-buffer-limit)
    option. Note that the buffer will start small and grow as needed, up to the limit.

    ```plaintext
    $ sq config set tuning.scan-buffer-limit 64MB   # or 1024B, 64KB, 1GB, etc.
    ```

    A more useful error message is also now returned when the buffer limit is exceeded
    (the error suggests adjusting `tuning.scan-buffer-limit`).

##### Changed

-   Renamed config option `tuning.buffer-mem-limit` to [`tuning.buffer-spill-limit`](https://sq.io/docs/config/#tuningbuffer-spill-limit).
    The new name better reflects the purpose of the option.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this MR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box

---

This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4xMjYuMSIsInVwZGF0ZWRJblZlciI6IjM5LjEyNi4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJSZW5vdmF0ZSBCb3QiXX0=-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
@neilotoole @markcurtis1970 @YeharaDananjaya and others