
feature: give kinda helpful message if too many open files #1110

Open · wants to merge 8 commits into base: main
Conversation

@leondz (Collaborator) commented Feb 24, 2025

The OS can get upset if parallel_attempts goes too high. Give a clearer error message about this.

(garak) 09:13:05 x1:~/dev/garak [main] $ python -m garak -m nim -n meta/llama-3.2-3b-instruct -p phrasing.PastTenseMini --parallel_attempts 1000 -g 5
garak LLM vulnerability scanner v0.10.2.post1 ( https://github.com/NVIDIA/garak ) at 2025-02-24T09:13:12.943850
📜 logging to /home/lderczynski/.local/share/garak/garak.log
🦜 loading generator: NIM: meta/llama-3.2-3b-instruct
📜 reporting to /home/lderczynski/.local/share/garak/garak_runs/garak.fb21a28e-16c8-4496-bd9e-b0f694333003.report.jsonl
🕵️  queue of probes: phrasing.PastTenseMini
probes.phrasing.PastTenseMini:   0%|                                                                                                                        | 0/200 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lderczynski/dev/garak/garak/__main__.py", line 14, in <module>
    main()
  File "/home/lderczynski/dev/garak/garak/__main__.py", line 9, in main
    cli.main(sys.argv[1:])
  File "/home/lderczynski/dev/garak/garak/cli.py", line 594, in main
    command.probewise_run(
  File "/home/lderczynski/dev/garak/garak/command.py", line 237, in probewise_run
    probewise_h.run(generator, probe_names, evaluator, buffs)
  File "/home/lderczynski/dev/garak/garak/harnesses/probewise.py", line 107, in run
    h.run(model, [probe], detectors, evaluator, announce_probe=False)
  File "/home/lderczynski/dev/garak/garak/harnesses/base.py", line 123, in run
    attempt_results = probe.probe(model)
                      ^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/dev/garak/garak/probes/base.py", line 219, in probe
    attempts_completed = self._execute_all(attempts_todo)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/dev/garak/garak/probes/base.py", line 181, in _execute_all
    with Pool(_config.system.parallel_attempts) as attempt_pool:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/context.py", line 282, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/popen_fork.py", line 65, in _launch
    child_r, parent_w = os.pipe()
                        ^^^^^^^^^
OSError: [Errno 24] Too many open files
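
Not this PR's exact diff, but a minimal sketch of the kind of guard described above: catch the EMFILE case where the pool is created and re-raise with advice. The wrapper name and message text here are illustrative assumptions.

import errno
from multiprocessing import Pool

def open_attempt_pool(n_workers):
    """Open a worker pool, translating EMFILE into an actionable message (illustrative)."""
    try:
        return Pool(n_workers)
    except OSError as e:
        if e.errno == errno.EMFILE:  # [Errno 24] Too many open files
            raise OSError(
                e.errno,
                "Too many open files while spawning workers: parallel_attempts may be "
                "set higher than this system allows; reduce it or raise the open-file "
                "limit (e.g. ulimit -n).",
            ) from e
        raise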

Verification

List the steps needed to make sure this thing works

  • try garak -m test -p test.Test --parallel_attempts 1000; the new error should pop up on the CLI and in the log. If it doesn't, try a higher number, or reduce the open-file ulimit (see the sketch below).
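
A quick POSIX-only way (the resource module is unavailable on Windows) to inspect and lower the per-process open-file limit so the error is easier to trigger:

import resource

# current soft/hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# lower the soft limit for this process (shell equivalent: ulimit -n 256)
resource.setrlimit(resource.RLIMIT_NOFILE, (256, hard))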

@jmartin-tech (Collaborator) left a comment

This looks reasonable to me for parallel_attempts; what are your thoughts on adding a similar guard in generators/base.py for parallel_requests as well?

In theory, if both were set, the error would bubble up from the generator sub-processes. However, since parallel_requests is independent, a generator that requires a single request per call could produce a similar error even when parallel_attempts was not set.

At the same time, I wonder about the value of catching OSError like this: are we going down a path that will require additional handlers for various resource-limitation errors across supported operating systems?

Consider the command used to test this: run on a Windows install with only 4 GB of RAM, it can raise:

  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\popen_spawn_win32.py", line 75, in __init__
    hp, ht, pid, tid = _winapi.CreateProcess(
                       ^^^^^^^^^^^^^^^^^^^^^^
OSError: [WinError 1455] The paging file is too small for this operation to complete
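
One hypothetical way to keep this from growing into per-OS handlers is to branch on the error code inside a single except clause. The helper below is illustrative, not part of the PR; the 1455 code is taken from the Windows traceback above.

import errno

def resource_hint(e: OSError) -> str:
    """Map a resource-exhaustion OSError to a parallelism hint (illustrative)."""
    if e.errno == errno.EMFILE:
        return "Too many open files: lower parallel_attempts/parallel_requests or raise ulimit -n."
    if getattr(e, "winerror", None) == 1455:  # Windows: paging file too small
        return "Paging file too small: lower parallelism or increase virtual memory."
    if e.errno == errno.ENOMEM:
        return "Out of memory while spawning workers: lower the parallelism settings."
    return "OS resource limit hit while spawning workers: try lowering parallelism."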

@leondz (Author) commented Feb 26, 2025

Amendments:

  • add the help for parallel_requests also
  • set configurable max_workers value and check this during CLI validation
  • cap worker pool sizes for parallel_requests, parallel_attempts

Validation:

  • set config.system.max_workers to 1000 (on the high side) first

  • requests:

    • garak -m test -p test.Test --parallel_requests 2000 - rejected before run starts
    • garak -m test -p test.Test --parallel_requests 1000 - no crash (linux (sometimes))
    • garak -m test -p test.Test --parallel_requests 1000 -g 1000 - crash
  • attempts:

    • garak -m test -p test.Test --parallel_attempts 2000 - rejected before run starts
    • garak -m test -p test.Test --parallel_attempts 1000 - no crash (linux (sometimes))
    • garak -m test -p continuation.ContinueSlursReclaimedSlurs --parallel_attempts 1000 - crash

-- I hope the Windows message is alright; I don't have a great idea of how this goes wrong there.
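
As an illustration of the "check this during CLI validation" amendment (not the PR's exact code), an argparse type function can enforce both the positive-int requirement and the cap. The hardcoded 500 is a placeholder for the config-driven config.system.max_workers value.

import argparse

MAX_WORKERS = 500  # placeholder; in garak this comes from config.system.max_workers

def workercount(value):
    workers = int(value)
    if workers <= 0:
        raise argparse.ArgumentTypeError("Need >0 workers (int), got %s" % value)
    if workers > MAX_WORKERS:
        raise argparse.ArgumentTypeError(
            "Parallel worker count capped at %d (config.system.max_workers)" % MAX_WORKERS
        )
    return workers

parser = argparse.ArgumentParser()
parser.add_argument("--parallel_attempts", type=workercount)
parser.add_argument("--parallel_requests", type=workercount)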

@leondz leondz requested a review from jmartin-tech February 26, 2025 09:15
@leondz leondz changed the title give kinda helpful message if too many open files feature: give kinda helpful message if too many open files Feb 26, 2025
@jmartin-tech jmartin-tech self-assigned this Feb 28, 2025
@jmartin-tech (Collaborator) left a comment
Thanks for extending this to parallel_requests; this looks ready.

Comment on lines +166 to +170
pool_size = min(
    generations_this_call,
    _config.system.parallel_requests,
    _config.system.max_workers,
)

Direct access to config here suggests we should have a helper that owns process or thread pools and references _config.system.
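
A hypothetical shape for such a helper, with illustrative names: sizing and the _config.system lookup live in one place instead of in each generator or probe.

from multiprocessing import Pool

from garak import _config

def sized_pool(requested, *other_limits):
    """Build a worker pool capped by config.system.max_workers (illustrative helper)."""
    limits = [requested, _config.system.max_workers, *other_limits]
    return Pool(min(limits))

# e.g. the snippet above could become:
# pool = sized_pool(_config.system.parallel_requests, generations_this_call)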

@jmartin-tech (Collaborator) left a comment

Sorry for the churn; final validation identified that the new system.max_workers is not taking overrides into account.

garak.site.yaml:

system:
  max_workers: 2000
python -m garak: error: argument --parallel_attempts: Parallel worker count capped at 500 (config.system.max_workers)

iworkers = int(workers)
if iworkers <= 0:
    raise argparse.ArgumentTypeError("Need >0 workers (int), got %s" % workers)
if iworkers > _config.system.max_workers:

Testing shows this is not really configurable, as _config has not yet loaded garak.site.yaml and also has not loaded a --config-supplied file if one is passed on the CLI.

@leondz (Author) replied:

Great catch, thanks; will amend, and I guess write a test for it.

@leondz (Author) added:

I guess, without undue gymnastics, we can either:

  1. have a hardcoded, non-configurable cap in CLI param validation (but where? garak.cli isn't universal; on the other hand, argparse only applies within this module, and top-level values in garak._config are the opposite of the direction we're trying to move in)
  2. drop the CLI max_workers validation and let the config-based check be enforced instead

Having params that are only configurable in garak.core.yaml isn't a good option.

What do you think? Are there other options that make sense? I'm currently leaning toward (2).

@jmartin-tech (Collaborator) replied:

Can we just make the CLI-provided validation call after config is loaded and before starting to instantiate things, say around here?

garak/garak/cli.py, lines 369 to 371 at 27d4554:

# base config complete
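
A rough sketch of what that deferred check might look like, with a hypothetical function name; the point is only that it runs once the merged config (core, site, and any --config file) is available.

def validate_worker_counts(config):
    """Reject parallelism settings above the configured cap (hypothetical placement)."""
    cap = config.system.max_workers
    for name in ("parallel_attempts", "parallel_requests"):
        value = getattr(config.system, name, None)
        if value and int(value) > cap:
            raise ValueError(
                f"--{name}={value} exceeds config.system.max_workers={cap}"
            )

# in cli.main(), just after the "# base config complete" point:
# validate_worker_counts(_config)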
