
Bug when I run on single GPU #1694 (Open)

kailashg26 opened this issue Sep 26, 2024 · 9 comments
kailashg26 commented Sep 26, 2024

Command: tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
Output:

INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 64
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /tmp/lora_finetune_output
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  apply_lora_to_mlp: false
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  lora_dropout: 0.0
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /tmp/lora_finetune_output
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /tmp/lora_finetune_output/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 5
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3188944798. Local seed is seed + rank = 3188944798 + 0
Writing logs to /tmp/lora_finetune_output/log_1727379753.txt
Traceback (most recent call last):
  _File "/home/kailash/miniconda3/envs/llm/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/run.py", line 185, in _run_cmd
    self._run_single_device(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/run.py", line 94, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 739, in <module>
    sys.exit(recipe_main())
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 733, in recipe_main
    recipe.setup(cfg=cfg)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 215, in setup
    checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 148, in load_checkpoint
    self._checkpointer = config.instantiate(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 106, in instantiate
    return _instantiate_node(OmegaConf.to_object(config), *args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 31, in _instantiate_node
    return _create_component(_component_, args, kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_checkpointer.py", line 348, in __init__
    self._checkpoint_paths = self._validate_hf_checkpoint_files(checkpoint_files)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_checkpointer.py", line 389, in _validate_hf_checkpoint_files
    checkpoint_path = get_path(self._checkpoint_dir, f)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_utils.py", line 95, in get_path
    raise ValueError(f"No file with name: {filename} found in {input_dir}.")
ValueError: No file with name: model-00001-of-00004.safetensors found in /tmp/Meta-Llama-3.1-8B-Instruct.

Can anyone help me with this?

@joecummings (Contributor)

Hi @kailashg26 - did you download the model to the location specified at the top of the config?

tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"

@kailashg26 (Author)

Hi @joecummings

I just tried that, but unfortunately I get an error:

tune download: error: Failed to download meta-llama/Meta-Llama-3.1-8B-Instruct with error: 'An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.' and traceback: Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/0e9e39f249a16976918f6564b8830bc894c89659/.gitattributes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1746, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 468, in hf_raise_for_status
    raise _format(HfHubHTTPError, message, response) from e
huggingface_hub.errors.HfHubHTTPError: (Request ID: Root=1-66f5ccd5-3d403ea47109c46d5f04a9c4;0a3f40fb-d4e3-400c-b9b6-18fd89e310b7)

403 Forbidden: Please enable access to public gated repositories in your fine-grained token settings to view this repository..
Cannot access content at: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/0e9e39f249a16976918f6564b8830bc894c89659/.gitattributes.
Make sure your token has the correct permissions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/download.py", line 126, in _download_cmd
    true_output_dir = snapshot_download(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 290, in snapshot_download
    thread_map(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 264, in _inner_hf_hub_download
    return hf_hub_download(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1212, in hf_hub_download
    return _hf_hub_download_to_local_dir(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1461, in _hf_hub_download_to_local_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1857, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

@joecummings (Contributor)

The Llama models are "gated". This simply means you need to fill out some information in order to download the model. This can all be done from the model card page on the Hugging Face Hub: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

After you do this, you should see a tag like below that says you've been granted access to the model. The whole process should take less than 10 minutes.
[Screenshot: access-granted tag on the Hugging Face model card]

After this, try the above command again. Should work!
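
If it helps, here's a rough sketch of one way to authenticate for gated repos (the huggingface-cli login step and the --hf-token flag are assumptions about your local huggingface_hub / torchtune install; <YOUR_HF_TOKEN> is a placeholder for a token with read access to gated repositories):

# Option 1: cache your token once (assumes huggingface_hub's CLI is installed)
huggingface-cli login
# Option 2: pass the token directly to the download command
tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth" --hf-token <YOUR_HF_TOKEN>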


kailashg26 commented Sep 27, 2024

Hi @joecummings, I did see that I have been granted access. I had to change some token permissions and it works now. Thanks!
Btw, how long does it take to fine-tune Llama 3.1 on the Alpaca dataset? Also, is there any way I could monitor model-level and system-level events such as memory, energy, and memory bandwidth, along with the following?

  • Throughput (tokens per second)
  • Latency (total response time, TRT): the number of seconds it takes to output 100 tokens
  • Latency (time to first chunk, TTFC)

@ebsmothers (Contributor)

@kailashg26 training time will depend on a bunch of things, including whether you're running a full/LoRA/QLoRA finetune, whether you have activation checkpointing enabled, whether you're using sample packing, your batch size and sequence length, how many (and what kind of) devices you're running on, and more. I recently ran some tests and was able to run an epoch of QLoRA training on an A100 (with Llama 3, not 3.1, but it should be similar) in as little as 36 minutes (though again, that depends on all of the above). You can check out slides 23-29 of this presentation for more details. As a starting point, I'd recommend setting compile=True and dataset.packed=True if you want to reduce your training time. For a packed dataset you also need to set tokenizer.max_seq_len; this may require some experimentation depending on how much memory you have, but you can try e.g. 2048 as a starting point.
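
For illustration, a minimal sketch of those overrides on the command line (this assumes your torchtune version accepts key=value config overrides; the max_seq_len value is just a starting point, not a recommendation for your hardware):

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device compile=True dataset.packed=True tokenizer.max_seq_len=2048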

For monitoring system-level metrics, we log tokens/second by default (see here) and will also log peak memory stats if you set log_peak_memory_stats=True. We support different logging backends (WandB, TensorBoard, Comet, etc.) if you use any of those. If you also want to log time-to-first-batch or other custom metrics we don't currently support, I'd recommend copying the recipe and then modifying your local version (happy to provide pointers on where/how any particular metric should be logged).
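
As a rough sketch, the overrides might look like this (the WandBLogger component path and the project name are assumptions; check the metric_logging module in your installed version for the exact backends available):

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device log_peak_memory_stats=True metric_logger._component_=torchtune.training.metric_logging.WandBLogger metric_logger.project=my-torchtune-runs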

@kailashg26 (Author)

Thanks @ebsmothers
I'll take a look at that. Thanks for the pointers. I'll try them and get back if I have questions.


kailashg26 commented Sep 27, 2024

Hi @ebsmothers ,

I successfully executed this command:
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device

I'm wondering how I should use and evaluate the output. Any suggestions or documentation that would walk me through it?
Additionally, I would also like to understand how the workload is balanced on a single CPU-GPU system, i.e. which portions of the code are CPU- or GPU-bound. It would be great if you could point me to the appropriate documentation for this :)

Thanks and appreciate the help!

@ebsmothers (Contributor)

@kailashg26 that depends on what you're trying to do. If the full training loop completed, there should be a checkpoint saved on your local filesystem. You can evaluate the quality of your fine-tuned model by using it to generate some text or by evaluating it on a common benchmark (e.g. using our integration with EleutherAI's eval harness). You can check out our Llama 3 tutorial here, which has sections on both of these (everything in there should be equally applicable to Llama 3.1).
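
For illustration, a rough sketch of both entry points (the recipe and config names here are assumptions based on recent torchtune versions, so check tune ls for what your install actually ships, and point the checkpointer in those configs at your fine-tuned output directory):

# List the recipes and configs available in your install
tune ls
# Generate text with the fine-tuned checkpoint (after editing the generation config to point at it)
tune run generate --config generation
# Score the model on common benchmarks via the EleutherAI eval harness integration
tune run eleuther_eval --config eleuther_evaluation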

Regarding determining whether code is CPU- or GPU-bound, you can use our integration with the PyTorch profiler. Just set profiler.enabled=True in your config. This will output a trace file that you can then view in e.g. Perfetto. The full set of profiler configurations we support can be seen for example here.
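
For example, a minimal sketch of turning the profiler on without editing the config file (again assuming key=value overrides work in your torchtune version; profile_memory is optional and adds memory information to the trace):

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device profiler.enabled=True profiler.profile_memory=True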


kailashg26 commented Oct 3, 2024

Hi @ebsmothers,

Thanks for your response. I’m trying to understand the interaction between hardware parameters and workload characteristics. This is my first step toward learning and fine-tuning LLMs. I’m currently focusing on parameters that could help me study the trade-offs between training time, power consumption, energy, and cache metrics (including misses) from a systems perspective.

From an algorithmic perspective, I’m working on a resource-constrained device (a single Nvidia 4090 GPU) and am interested in observing trade-offs between these metrics. I was thinking about varying parameters like batch size and sequence length, but I’m not entirely sure how these affect the underlying hardware. Any insights or documentation on these metrics would be really helpful!

I’m also experimenting with data types, using bf16, but I assume switching to fp32 would result in running out of memory when training a Llama 3.1 8B model. I’m also considering different training approaches (full, LoRA, and QLoRA) and how they impact performance and resource utilization.

Since I’m just getting started, are there any other knobs or parameters you’d recommend studying to better understand LLMs from a systems perspective?

Also, could you please help me find the two metrics below?

  1. Throughput (tokens per second)
  2. Latency (total response time, TRT): in this case, the number of seconds it takes to output 100 tokens

Thanks in advance.
