
Model conversion process failed. Unable to find bin files #2365

Open
joshight opened this issue Sep 5, 2024 · 0 comments
Labels
bug Something isn't working

Comments


joshight commented Sep 5, 2024

Description


Seeing the following error during conversion when attempting to deploy a v1.4_llama3 fine-tuned LLM with TensorRT-LLM.

LLM Inference Container:
763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124

The .bin files exist at the S3 path, but they cannot be found by the conversion process.

Please note this works fine for vLLM, but not for TensorRT-LLM.

vLLM properties:

engine=Python
option.model_id=s3_path
option.tensor_parallel_degree=1
option.trust_remote_code=true
option.rolling_batch=vllm
option.entryPoint=djl_python.huggingface
option.max_model_len=16384
option.max_rolling_batch_size=16
option.enable_streaming=false

TensorRT-LLM properties:

engine=Python
option.model_id=s3_path
option.tensor_parallel_degree=1
option.trust_remote_code=true
option.rolling_batch=trtllm
option.entryPoint=djl_python.huggingface
option.max_model_len=16384
option.max_rolling_batch_size=128
option.enable_streaming=false

Expected Behavior


Expect the model conversion process to succeed, just as it does with the vLLM configuration.

Error Message


[INFO ] LmiUtils - convert_py: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/.djl.ai/download/cffe5246b14faa11e217a6f21535dff1719c39ba/pytorch_model-00001-of-00004.bin'
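For anyone debugging the same failure: the shard named in the traceback (pytorch_model-00001-of-00004.bin) is part of a sharded Hugging Face checkpoint, and the full shard list is declared in the checkpoint's pytorch_model.bin.index.json. A minimal diagnostic sketch (the cache path is taken from the error message; the helper name is hypothetical) to see which shards are actually missing from the download directory:

```python
import json
import os

def missing_shards(download_dir):
    """Return the shard filenames listed in pytorch_model.bin.index.json
    that are absent from download_dir.

    Sharded Hugging Face checkpoints record every shard file
    (e.g. pytorch_model-00001-of-00004.bin) in the index's weight_map.
    """
    index_path = os.path.join(download_dir, "pytorch_model.bin.index.json")
    with open(index_path) as f:
        index = json.load(f)
    expected = set(index["weight_map"].values())
    present = set(os.listdir(download_dir))
    return sorted(expected - present)

# Hypothetical usage against the cache directory from the error message:
# print(missing_shards(
#     "/tmp/.djl.ai/download/cffe5246b14faa11e217a6f21535dff1719c39ba"))
```

If the list is non-empty, the converter's local download is incomplete even though the objects exist in S3, which would point at the download step rather than the S3 path itself.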

How to Reproduce?

Can be reproduced in SageMaker.

Steps to reproduce


  1. Deploy the model and review the output logs of the deployment process

What have you tried to solve it?

  1. Tried changing the instance size/type
  2. Validated that the .bin files are present at the correct S3 path
@joshight joshight added the bug Something isn't working label Sep 5, 2024