Bug when I run on single GPU #1694
Hi @kailashg26 - did you download the model to the location specified at the top of the config?
Hi @joecummings I did try that now, but unfortunately I get an error:

    tune download: error: Failed to download meta-llama/Meta-Llama-3.1-8B-Instruct with error:
    'An error happened while trying to locate the file on the Hub and we cannot find the
    requested files in the local cache. Please check your connection and try again or make
    sure your Internet connection is on.'
    Traceback (most recent call last):
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
    403 Forbidden: Please enable access to public gated repositories in your fine-grained
    token settings to view this repository.
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
The Llama models are "gated". This simply means you need to fill out some information in order to download the model. This can all be done from the model card page on the Hugging Face Hub: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. After you do this, you should see a tag on the model card saying you've been granted access to the model. The whole process should take less than 10 minutes. After this, try the above command again. Should work!
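For reference, once access has been granted, a typical download flow might look like the sketch below. The `--output-dir` path and the `--ignore-patterns` value are example values only, and `--hf-token` can be supplied instead of `huggingface-cli login` if you prefer passing the token explicitly:

```shell
# Authenticate with the Hugging Face Hub (stores your access token locally)
huggingface-cli login

# Download the gated model; the output directory here is just an example
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth"
```

The fine-grained token used for login must have "access to public gated repositories" enabled, which was the cause of the 403 above.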
Hi @joecummings I did see that I have been granted access. I had to change some permissions and it worked now. Thanks!
@kailashg26 training time will depend on a bunch of things, including whether you're running a full/LoRA/QLoRA finetune, whether you have activation checkpointing enabled, whether you're performing sample packing, batch size, sequence length, how many (and what kind of) devices you're running on, and more. I recently ran some tests and was able to run an epoch of QLoRA training on an A100 (with Llama 3, not 3.1, but it should be similar) in as fast as 36 minutes (but again, that depends on all of the above). You can check out slides 23-29 of this presentation for more details. As a starting point, I'd recommend setting [...]. For monitoring system-level metrics, we log tokens/second by default (see here) and will also log peak memory stats if you set [...].
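Most of the knobs listed above can be flipped straight from the command line as config overrides, without editing the YAML file. The key names in this sketch (`batch_size`, `enable_activation_checkpointing`, `dataset.packed`) are assumed from the stock torchtune configs and may differ slightly between versions, so verify them against your config before running:

```shell
# Example: LoRA finetune with a larger batch size, activation checkpointing,
# and sample packing enabled via CLI config overrides
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  batch_size=4 \
  enable_activation_checkpointing=True \
  dataset.packed=True
```

Running `tune cp llama3_1/8B_lora_single_device my_config.yaml` gives you a local copy of the config whose keys you can inspect directly.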
Thanks @ebsmothers
Hi @ebsmothers, I successfully executed this command. I'm wondering how I should use the output — any suggestions or documentation that would walk me through it? Thanks, and I appreciate the help!
@kailashg26 that depends on what you are trying to do. If the full training loop completed, there should be a checkpoint saved on your local filesystem. You can evaluate the quality of your fine-tuned model by using it to generate some text, or by evaluating on a common benchmark (e.g. using our integration with EleutherAI's eval harness). You can check out our Llama3 tutorial here, which has sections on both of these (everything in there should be equally applicable to Llama 3.1). Regarding determining whether code is CPU- or GPU-bound: you can use our integration with the PyTorch profiler. Just set [...].
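Concretely, the post-training workflow described above might look something like the following. The recipe and config names (`generate`, `generation`, `eleuther_eval`, `eleuther_evaluation`) are assumed from the stock torchtune recipes; check them against `tune ls` for your installed version, and point the configs' checkpoint fields at your fine-tuned output directory first:

```shell
# List the recipes and configs available in your install
tune ls

# Generate some text with the fine-tuned checkpoint
tune run generate --config generation

# Evaluate on a common benchmark via the EleutherAI eval harness integration
tune run eleuther_eval --config eleuther_evaluation
```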
Hi @ebsmothers, thanks for your response. I'm trying to understand the interaction between hardware parameters and workload characteristics. This is my first step toward learning and fine-tuning LLMs.

I'm currently focusing on parameters that could help me study the trade-offs between training time, power consumption, energy, and cache metrics (including misses) from a systems perspective. From an algorithmic perspective, I'm working on a resource-constrained device (a single Nvidia 4090 GPU) and am interested in observing the trade-offs between these metrics. I was thinking about varying parameters like batch size and sequence length, but I'm not entirely sure how these affect the underlying hardware. Any insights or documentation on these metrics would be really helpful!

I'm also experimenting with data types, using bf16, but I assume switching to fp32 would result in running out of memory when training a Llama 3.1 8B model. I'm also considering different training approaches (full, LoRA, and QLoRA) and how they impact performance and resource utilization.

Since I'm just getting started, are there any other knobs or parameters you'd recommend studying to better understand LLMs from a systems perspective? Also, could you please help me find the below two metrics:
Thanks in advance.
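For the power/energy side of a systems study like the one described above, one simple approach (independent of torchtune) is to sample GPU counters with `nvidia-smi` while a training run is in progress, then correlate the log with torchtune's tokens/second output:

```shell
# Sample power draw, GPU utilization, and memory use once per second,
# writing CSV rows for later analysis
nvidia-smi \
  --query-gpu=timestamp,power.draw,utilization.gpu,memory.used \
  --format=csv -l 1 > gpu_metrics.csv
```

Integrating `power.draw` over the sampled interval gives a rough energy estimate for a run; cache-level metrics would need a profiler such as Nsight Compute rather than `nvidia-smi`.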
Command: tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
Output:
Can anyone help me with this?