Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add service user to Dockerfile #517

Open
wants to merge 9 commits into
base: rc_054
Choose a base branch
from

Conversation

theobjectivedad
Copy link
Contributor

This PR introduces several enhancements to the primary Dockerfile including:

  • Added a user to /etc/passwd to run aphrodite-engine
  • Separated the user home from the application home, this way all dotfiles are written to /home/aphrodite-engine and avoid permission issues
  • Moved all build arguments to the top of the file for readability
  • Added the ability to set the UID and GID of the service account at build time, useful for reading host-mounted volumes
  • Added the ability to set the aphrodite-engine branch at build time
  • Added apt-get clean to (slightly) reduce layer size
  • Added build argument for TORCH_CUDA_ARCH_LIST, added sm_87 to the default list (Jetson, DRIVE, Clara)
  • Parameterized ENTRYPOINT with APP_HOME
  • Formatting changes for consistency & readability

No application paths were changed as part of this enhancement to maintain backwards compatibility, however please be aware that /aphrodite-engine is now read only.

Motivation

When loading a GPTQ quantized version of CommandR+, aphrodite-engine exited after building the KV cache with the following message: "KeyError: 'getpwuid(): uid not found: 1000'". The root cause was ultimately running the process under a UID that didn't have a corresponding entry in /etc/passwd. I'm not clear why this wasn't happening with other models I've tested with.

(RayWorkerAphrodite pid=3988) INFO:     Model weights loaded. Memory usage: 13.97 GiB x 4 = 55.88 GiB [repeated 2x across cluster]
...
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 63, in codegen_nodes
[rank0]:     return self._triton_scheduling.codegen_nodes(nodes)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3255, in codegen_nodes
[rank0]:     return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3427, in codegen_node_schedule
[rank0]:     kernel_name = self.define_kernel(src_code, node_schedule)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3537, in define_kernel
[rank0]:     basename, _, kernel_path = get_path(code_hash(src_code.strip()), "py")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 349, in get_path
[rank0]:     subdir = os.path.join(cache_dir(), basename[1:3])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/utils.py", line 739, in cache_dir
[rank0]:     sanitized_username = re.sub(r'[\\/:*?"<>|]', "_", getpass.getuser())
[rank0]:   File "/usr/lib/python3.10/getpass.py", line 169, in getuser
[rank0]:     return pwd.getpwuid(os.getuid())[0]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]: KeyError: 'getpwuid(): uid not found: 1000'
[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]: You can suppress this exception and fall back to eager by setting:
[rank0]:     import torch._dynamo
[rank0]:     torch._dynamo.config.suppress_errors = True

As stated above, this introduces a better design where the service account user's home directory is different than the application directory. This allows any dotfiles (such as .cache) created by that user to reside outside the main application directory.

@AlpinDale AlpinDale changed the base branch from dev to rc_054 July 1, 2024 21:11
docker/Dockerfile Outdated Show resolved Hide resolved
@AlpinDale
Copy link
Member

@theobjectivedad thanks for the PR and sorry for the late review! I made some changes including rebasing it to the release candidate branch. Please let me know if this is ready to merge. Hopefully I didn't mess up the Dockerfile when rebasing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants