This repository contains a reference PyTorch implementation of the LDAdam optimizer, as proposed in the paper LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics.
Abstract: We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., for estimating the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
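The abstract describes three ingredients: low-rank projection of the gradient, a projection-aware transition of the optimizer moments when the subspace changes, and a generalized error feedback term. The sketch below is a deliberately simplified illustration of how these pieces can fit together for a single 2-D parameter; the function name, the SVD-based choice of subspace, the handling of the second moment, and the omission of bias correction are all simplifying assumptions made here for brevity. Please refer to the paper and to the reference implementation for the exact update rules.

import torch

def ldadam_like_step(W, grad, state, lr=1e-3, betas=(0.908, 0.99), eps=1e-8, rank=16):
    """Illustrative low-rank adaptive step on a plain 2-D tensor W (not the reference method)."""
    # Generalized error feedback (simplified): fold the previously accumulated
    # compression residual back into the incoming gradient.
    acc = grad + state.get("error", torch.zeros_like(grad))

    # Pick a rank-r subspace from the error-corrected gradient. Here we take the
    # top-r left singular vectors; the paper uses a cheaper block power iteration.
    U, _, _ = torch.linalg.svd(acc, full_matrices=False)
    P = U[:, :rank]                                    # d x r projection

    # Projection-aware transition: re-express the old first moment in the new subspace.
    P_prev = state.get("P", P)
    m = P.T @ (P_prev @ state.get("m", torch.zeros(rank, W.shape[1])))
    v = state.get("v", torch.zeros(rank, W.shape[1]))  # second moment carried over (simplification)

    # Adam-style statistics on the projected gradient (r x n instead of d x n).
    g_low = P.T @ acc
    m = betas[0] * m + (1 - betas[0]) * g_low
    v = betas[1] * v + (1 - betas[1]) * g_low.pow(2)
    update_low = m / (v.sqrt() + eps)                  # bias correction omitted

    # Remember what the rank-r subspace could not represent (gradient part only here;
    # the paper's mechanism also accounts for optimizer state compression).
    state["error"] = acc - P @ g_low

    # Map the update back to the full space and apply it.
    W -= lr * (P @ update_low)
    state.update({"P": P, "m": m, "v": v})

# Toy usage on a random 64 x 32 parameter with stand-in gradients.
W, state = torch.randn(64, 32), {}
for _ in range(3):
    ldadam_like_step(W, torch.randn_like(W), state)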
To integrate the optimizer into your own pipeline, please use the following snippet:
from LowDimensionalAdam import LDAdamW

# define param groups as fullrank_params and lowrank_params
optimizer = LDAdamW(
    params=[{'params': fullrank_params, 'enable_lowrank': False},
            {'params': lowrank_params, 'enable_lowrank': True, 'rank': 16, 'rho': 0.908}],
    lr=0.001,
    betas=(0.908, 0.99),
    eps=1e-8,
    weight_decay=0.0,
)

# you can then use the variable `optimizer` as any other PyTorch optimizer
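For context, a minimal end-to-end sketch is shown below. The toy model and the convention used to split the parameters (2-D weight matrices go to the low-rank group, 1-D parameters stay full-rank) are illustrative assumptions, not requirements of the optimizer; only the LDAdamW constructor arguments are taken from the snippet above.

import torch
from LowDimensionalAdam import LDAdamW

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# One possible split: 2-D weight matrices get the low-rank treatment,
# 1-D parameters (biases, norms) stay full-rank.
lowrank_params = [p for p in model.parameters() if p.dim() >= 2]
fullrank_params = [p for p in model.parameters() if p.dim() < 2]

optimizer = LDAdamW(
    params=[{'params': fullrank_params, 'enable_lowrank': False},
            {'params': lowrank_params, 'enable_lowrank': True, 'rank': 16, 'rho': 0.908}],
    lr=1e-3,
)

# one standard PyTorch training step
inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()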
To install the latest stable version of LDAdam from source, please run:
pip3 install git+https://github.com/IST-DASLab/LDAdam.git
To clone the project and install it as a Python package in a new conda environment named LDAdam, please run:
git clone https://github.com/IST-DASLab/LDAdam.git
cd LDAdam
source install.sh
conda activate LDAdam
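After activating the environment, a quick sanity check (assuming the import path used in the snippet above) is:

python3 -c "from LowDimensionalAdam import LDAdamW; print('LDAdam import OK')"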
To conduct experiments on fine-tuning the RoBERTa-base model on the GLUE benchmark, we rely on the Hugging Face Transformers project. The additional dependencies required can be found in the file glue_requirements.txt. To install them, please run:
pip3 install -r experiments/glue_finetuning/glue_requirements.txt
For reproducibility purposes, we provide the scripts we used to run our experiments.
To conduct our experiments on Llama pre-training on the C4 dataset, we follow the training procedure provided by the ReLoRA project, as adapted for the GaLore project. The additional dependencies required can be found in the file c4_requirements.txt. To install them, please run:
pip3 install -r experiments/c4_pretraining/c4_requirements.txt
For reproducibility purposes, we provide the scripts we used to run our experiments.
To run our experiments on fine-tuning the Llama2 7B model on the GSM8K dataset, we use the training pipeline provided by the MicroAdam project, which is based on MosaicML's LLM Foundry framework.
If you find LDAdam useful in your work, please consider citing the paper:

@misc{robert2024ldadamadaptiveoptimizationlowdimensional,
      title={LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics},
      author={Thomas Robert and Mher Safaryan and Ionut-Vlad Modoranu and Dan Alistarh},
      year={2024},
      eprint={2410.16103},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.16103},
}