Efficient Fine-Tuning of Open Source Large Language Models (LLMs) Repository

This repository contains Jupyter notebooks for fine-tuning open-source Large Language Models (LLMs). The models include, but are not limited to, Gemma, Falcon7B, Phi-2, and Zephyr7B.

Each notebook provides a detailed walkthrough of the fine-tuning process, including data preprocessing and model training. The notebooks serve as guides for those interested in understanding and applying LLMs in their projects.

Here are the major steps involved in each notebook:

1. Set Up the Development Environment: Prepare the necessary software and libraries for the project.
2. Load and Prepare the Dataset: Import the dataset and preprocess it for the model.
3. Load the Base Model: Load the model that will be fine-tuned.
4. Fine-Tune the LLM: Adjust the model parameters on our dataset.
5. Push the Fine-Tuned Model to the Hugging Face Hub.

The Efficient Techniques Used Are:

Quantization

Quantization is the process of constraining an input from a continuous, or otherwise large, set of values to a discrete set.

Mathematically, linear quantization is an affine mapping of integers to floating-point values. We can write it as:

$$ r = (q - Z) \times S $$

Where:

- $r$: floating-point value
- $q$: integer value
- $Z$: zero point
- $S$: scale
- $r_{max}$: maximum float in the tensor
- $r_{min}$: minimum float in the tensor
- $q_{max}$: maximum integer (for $N$ bits, $2^{N-1} - 1$)
- $q_{min}$: minimum integer (for $N$ bits, $-2^{N-1}$)

[TinyML and Efficient Deep Learning Computing, Song Han]

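The scale maps the floating-point range onto the integer range:

$$ S = \dfrac{r_{max} - r_{min}}{q_{max} - q_{min}} $$

The zero point is the integer that corresponds to $r = 0$. Solving $r_{min} = (q_{min} - Z) \times S$ for $Z$ gives:

$$ Z = q_{min} - \dfrac{r_{min}}{S} $$

Because $Z$ must itself be an integer, the result is rounded:

$$ Z = \mathrm{round}\left(q_{min} - \dfrac{r_{min}}{S}\right) $$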
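The formulas above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic for a signed $N$-bit range, not the code used in the notebooks; the function names and the sample `weights` list are made up for the example.

```python
def quantization_params(r_min, r_max, n_bits=8):
    """Compute scale S and zero point Z for signed N-bit affine quantization."""
    q_min = -(2 ** (n_bits - 1))
    q_max = 2 ** (n_bits - 1) - 1
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point, q_min, q_max


def quantize(r, scale, zero_point, q_min, q_max):
    """Map a float to the nearest integer level, clamped to [q_min, q_max]."""
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Apply r = (q - Z) * S to recover an approximate float."""
    return (q - zero_point) * scale


# Hypothetical tensor values, chosen only for illustration.
weights = [-1.0, -0.3, 0.0, 0.5, 1.5]
scale, zero_point, q_min, q_max = quantization_params(min(weights), max(weights))

quantized = [quantize(w, scale, zero_point, q_min, q_max) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in quantized]

for w, q, r in zip(weights, quantized, restored):
    print(f"{w:+.4f} -> {q:4d} -> {r:+.4f}")
```

For values inside $[r_{min}, r_{max}]$, the round trip loses at most about half a quantization step ($S/2$) per value, which is the precision given up in exchange for storing 8-bit integers instead of floats.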

Neural network quantization converts the weights and activations of a network into a limited, discrete set of values. The two best-known approaches are:

- Post-Training Quantization (PTQ): quantization is applied after the model has been trained.
- Quantization-Aware Training (QAT): quantization is applied to the model, which is then retrained or fine-tuned.

[Olivia Weng]

In our notebooks, we follow the QAT approach described above: the model is quantized with the bitsandbytes library and then fine-tuned.

LoRA

For efficient fine-tuning, we use LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that trains only small adapters rather than the entire model. These adapters are two low-rank matrices, $A$ and $B$. The idea builds on Aghajanyan et al. (2020), who showed that pre-trained language models have a low intrinsic dimension when adapting to specific tasks. Inspired by that, Edward Hu et al. (2021) hypothesized that the updates to the weights of an LLM also have a low "intrinsic rank" during adaptation.

For a given weight matrix $W$ and its update $\Delta W$, we have:

$$W + \Delta W = W + BA$$

Where:

- $W$: matrix of shape $(d, k)$
- $B$: matrix of shape $(d, r)$
- $A$: matrix of shape $(r, k)$
- $r \ll \min(d, k)$: the rank
- $B, A$: the low-rank adapters

For the forward pass:

$$ (W + \Delta W) x = W x + BA x $$
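The identity above can be checked numerically with a small pure-Python sketch. The sizes, the helper functions, and the random initialization are assumptions made for illustration only; in particular, LoRA as described by Hu et al. initializes $B$ to zero, whereas both adapters are random here so that the comparison is not trivial.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, r = 6, 5, 2  # toy sizes; real layers have d and k in the thousands

W = random_matrix(d, k, rng)  # frozen pre-trained weight, shape (d, k)
B = random_matrix(d, r, rng)  # trainable adapter, shape (d, r)
A = random_matrix(r, k, rng)  # trainable adapter, shape (r, k)
x = random_matrix(k, 1, rng)  # input column vector, shape (k, 1)

# Merged path: materialise W + BA, then apply it to x.
merged = matmul(add(W, matmul(B, A)), x)

# Adapter path: leave W untouched and add the low-rank correction B(Ax).
adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

max_diff = max(abs(m[0] - a[0]) for m, a in zip(merged, adapter))
full_params = d * k        # parameters updated by full fine-tuning
lora_params = r * (d + k)  # parameters updated by LoRA

print(max_diff, full_params, lora_params)
```

Both paths give the same output up to floating-point error. The saving comes from the parameter count: with $d = k = 4096$ and $r = 8$, for example, full fine-tuning updates $d \times k = 16{,}777{,}216$ values per matrix, while LoRA trains only $r \times (d + k) = 65{,}536$.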

[Edward Hu et al. (2021)]

Flash Attention

Flash Attention is a technique used to accelerate the computation of attention. The standard attention algorithm can be decomposed as follows:

For each block of Q and K:

- Load the block from High Bandwidth Memory (HBM) to Static Random-Access Memory (SRAM).
- Compute the corresponding block of the similarity matrix S ($S = QK^T$).
- Write the computed block of S to HBM.

For each block of S:

- Load the block from HBM to SRAM.
- Compute the corresponding block of P ($P = \mathrm{softmax}(S)$).
- Write the computed block of P to HBM.

For each block of P and V:

- Load the block from HBM to SRAM.
- Compute the corresponding block of O ($O = PV$).
- Write the computed block of O to HBM.

At the end of this process, S, P, and O are all stored in HBM, resulting in a space complexity of $O(N^2) + O(N \times d_{h})$, where $N$ is the sequence length and $d_{h}$ is the head dimension.
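As a rough illustration of how the two terms compare, assume a sequence length of $N = 4096$ and a head dimension of $d_{h} = 64$ (values picked for this example, not taken from the notebooks):

$$ N^2 = 4096^2 = 16{,}777{,}216 \qquad N \times d_{h} = 4096 \times 64 = 262{,}144 $$

Under these assumptions, each of S and P holds 64 times as many entries as O, per attention head, so the quadratic term dominates the memory footprint.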

Flash Attention takes a different approach: it never writes S and P to HBM. Instead, it processes the inputs block by block in SRAM and stores only O and the softmax normalization values. During the backward pass, it recomputes the blocks of S and P from Q, K, V, and the stored normalization values. This reduces the extra memory from quadratic to linear in $N$ and, because far fewer HBM reads and writes are needed, speeds up the computation.
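The forward-pass idea of never materialising the full S and P can be sketched in plain Python with a running ("online") softmax. This is a minimal illustration of the arithmetic only: the real Flash Attention is a fused GPU kernel that also tiles Q and keeps blocks in SRAM. The function names and toy sizes are hypothetical, and the usual $1/\sqrt{d_h}$ scaling is left out to match $S = QK^T$ as written in this README.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference: build each full row of S and P, then compute O = PV."""
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(v - m) for v in s]
        total = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / total
                    for c in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Visit K and V block by block, keeping only a running max, a running
    normaliser, and an unnormalised output for each query row."""
    d_v = len(V[0])
    out, stats = [], []
    for q in Q:
        m = -math.inf      # running row maximum
        l = 0.0            # running softmax denominator
        acc = [0.0] * d_v  # running unnormalised output row
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            s = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(s))
            correction = math.exp(m - m_new)  # rescales earlier partial sums
            p = [math.exp(v - m_new) for v in s]
            l = l * correction + sum(p)
            acc = [acc[c] * correction
                   + sum(p[j] * v_blk[j][c] for j in range(len(v_blk)))
                   for c in range(d_v)]
            m = m_new
        out.append([a / l for a in acc])
        stats.append(m + math.log(l))  # per-row normalization value to store
    return out, stats


rng = random.Random(0)
N, d_h = 5, 3  # toy sequence length and head dimension
Q = [[rng.uniform(-1, 1) for _ in range(d_h)] for _ in range(N)]
K = [[rng.uniform(-1, 1) for _ in range(d_h)] for _ in range(N)]
V = [[rng.uniform(-1, 1) for _ in range(d_h)] for _ in range(N)]

reference = naive_attention(Q, K, V)
tiled, stats = tiled_attention(Q, K, V, block=2)
max_diff = max(abs(a - b) for ra, rb in zip(reference, tiled) for a, b in zip(ra, rb))
print(max_diff)
```

The tiled version matches the reference up to floating-point error while holding only one block of scores at a time, plus one normalization value per row.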

[Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré]

You can access my Hugging Face account here.
