Worker can enter a state where there is constant model swapping #336

Open
tazlin opened this issue Nov 2, 2024 · 0 comments
Labels
bug Something isn't working

tazlin commented Nov 2, 2024

This is a regression present since at least #309, reported by a beta tester of that branch: the worker can enter a state where two models are constantly loaded and unloaded on a non-working process. The reported scenario was:

  • max_threads == 2
  • flux fp8 was running inference (other inference was blocked by the large model blocking logic)
  • Two queued models were alternately loaded into a single process.

Likely candidates for root cause include the keep_single_inference(...) and the get_next_job_and_process(...) functions.

See this log excerpt:

reGen  | 2024-11-02 19:20:44.556 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:45.568 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model Juggernaut XL to system RAM. Loading took 1.14 seconds.
reGen  | 2024-11-02 19:20:45.669 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model Juggernaut XL
reGen  | 2024-11-02 19:20:45.772 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading Juggernaut XL
reGen  | 2024-11-02 19:20:46.684 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model AlbedoBase XL (SDXL) to system RAM. Loading took 1.11 seconds.
reGen  | 2024-11-02 19:20:46.887 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:46.990 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:47.702 | INFO     | [HWRPM]:print_status_method:4061 - Process info:
reGen  | 2024-11-02 19:20:47.702 | INFO     | [HWRPM]:print_status_method:4063 - Process 0: (SAFETY) WAITING_FOR_JOB 
reGen  | 2024-11-02 19:20:47.702 | INFO     | [HWRPM]:print_status_method:4063 - Process 1 (INFERENCE_STARTING):  (Flux.1-Schnell fp16 (Compact) [last event: 171.25 secs ago: START_INFERENCE]
reGen  | 2024-11-02 19:20:47.702 | INFO     | [HWRPM]:print_status_method:4063 - Process 2 (PRELOADING_MODEL):  (Juggernaut XL [last event: 0.81 secs ago: PRELOAD_MODEL]
reGen  | 2024-11-02 19:20:47.702 | INFO     | [HWRPM]:print_status_method:4063 - Process 3 (WAITING_FOR_JOB):  (AbsoluteReality [last event: 302.43 secs ago: START_INFERENCE]
reGen  | 2024-11-02 19:20:47.702 | INFO     | [HWRPM]:print_status_method:4066 - dreamer_name: worker-23423432432432 | (v9.2.0) | horde user: worker-23423432432432#312152 | num_models: 187 | max_power: 64 (1448x1448) | max_threads: 2 | queue_size: 1 | safety_on_gpu: True
reGen  | 2024-11-02 19:20:47.703 | INFO     | [HWRPM]:print_status_method:4121 - Jobs: <aa1e6144-8234-4a9d-9b56-a3bfef10fee6: Flux.1-Schnell fp16 (Compact)>, <e6319011-a7b0-43d1-9e75-dea10e0022c1: Juggernaut XL>, <31dcabdb-10b1-4248-9a25-531e52af9029: AlbedoBase XL (SDXL)>
reGen  | 2024-11-02 19:20:47.703 | INFO     | [HWRPM]:print_status_method:4129 - Active models: {'Flux.1-Schnell fp16 (Compact)', 'Juggernaut XL', 'AbsoluteReality'}
reGen  | 2024-11-02 19:20:47.703 | SUCCESS  | [HWRPM]:print_status_method:4145 - Session job info: currently popped: 3 (eMPS: 112) | submitted: 23 | faulted: 0 | slow_jobs: 0 | process_recoveries: 0 | 0.00 seconds without jobs
reGen  | 2024-11-02 19:20:47.804 | INFO     | [HWRPM]:_process_control_loop:3862 - Blocking further inference because batch or slow_model inference in process.
reGen  | 2024-11-02 19:20:48.006 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model Juggernaut XL to system RAM. Loading took 1.14 seconds.
reGen  | 2024-11-02 19:20:48.109 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model Juggernaut XL
reGen  | 2024-11-02 19:20:48.211 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading Juggernaut XL
reGen  | 2024-11-02 19:20:49.430 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model AlbedoBase XL (SDXL) to system RAM. Loading took 1.36 seconds.
reGen  | 2024-11-02 19:20:49.533 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:49.635 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:50.650 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model Juggernaut XL to system RAM. Loading took 1.13 seconds.
reGen  | 2024-11-02 19:20:50.754 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model Juggernaut XL
reGen  | 2024-11-02 19:20:50.856 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading Juggernaut XL
reGen  | 2024-11-02 19:20:51.867 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model AlbedoBase XL (SDXL) to system RAM. Loading took 1.09 seconds.
reGen  | 2024-11-02 19:20:51.970 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:52.073 | INFO     | [HWRPM]:preload_models:2002 - Already preloading 1 models, waiting for one to finish before preloading AlbedoBase XL (SDXL)
reGen  | 2024-11-02 19:20:53.083 | INFO     | [HWRPM]:receive_and_handle_process_messages:1748 - Process 2 moved model Juggernaut XL to system RAM. Loading took 1.16 seconds.
reGen  | 2024-11-02 19:20:53.186 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model Juggernaut XL
@tazlin tazlin added the bug Something isn't working label Nov 3, 2024