Ruisi Cai1, Yuandong Tian2, Zhangyang Wang1, Beidi Chen3
1University of Texas at Austin, 2Meta AI (FAIR), 3Carnegie Mellon University
LoCoCo supports two modes: (1) inference mode and (2) post-training tuning mode. Please see the paper for more details.
To train the model with a sequence length of 4096 and a chunk size of 512:
torchrun --nproc_per_node=8 train.py \
--dataset_name togethercomputer/RedPajama-Data-1T-Sample \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--block_size 512 \
--clean_period 8 \
--method conv \
--kernel_size 21 \
--n_convlayer 1 \
--mem_size 512 \
--max_train_steps 1000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 128 \
--eval_iter 20 \
--eval_interval 50 \
--stream_tokenizer \
--normalizer_init 0.5 \
--memory_lr_scale 1000 \
--norm_lr_scale 5 \
--rope_change \
--checkpointing_steps 100 \
--output_dir ${save_dir} \
--auto_resume
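For reference, the flags in the two example commands appear to follow a simple relation (an observation from these commands, not a documented contract of train.py): --block_size is the chunk size, and --clean_period scales with the target sequence length (512 x 8 = 4096 above, 512 x 16 = 8192 below). Below is a minimal Python sketch of that arithmetic; the helper name is hypothetical.

def clean_period_for(sequence_length: int, block_size: int = 512) -> int:
    """Hypothetical helper: derive --clean_period for a target sequence length,
    assuming sequence_length = block_size * clean_period (inferred from the example commands)."""
    assert sequence_length % block_size == 0, "sequence length must be a multiple of the chunk size"
    return sequence_length // block_size

print(clean_period_for(4096))  # 8  -> matches the 4096-token command above
print(clean_period_for(8192))  # 16 -> matches the 8192-token command below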
To train the model with a sequence length of 8192 and a chunk size of 512:
torchrun --nproc_per_node=8 train.py \
--dataset_name togethercomputer/RedPajama-Data-1T-Sample \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--block_size 512 \
--clean_period 16 \
--method conv \
--kernel_size 21 \
--n_convlayer 1 \
--mem_size 512 \
--max_train_steps 1000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 128 \
--eval_iter 20 \
--eval_interval 50 \
--stream_tokenizer \
--normalizer_init 0.5 \
--memory_lr_scale 1000 \
--norm_lr_scale 5 \
--rope_change \
--lora_finetuning \
--checkpointing_steps 100 \
--output_dir ${save_dir} \
--auto_resume
Remember to enable LoRA finetuning in this case by passing --lora_finetuning.
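After LoRA finetuning finishes, a common way to use such a checkpoint is to attach the saved adapter to the base model with the peft library. The snippet below is a minimal sketch under that assumption; the adapter path is hypothetical, and the actual checkpoint layout depends on how train.py writes to ${save_dir}.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the same base model used for training.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical adapter path: attach the LoRA weights saved during finetuning.
model = PeftModel.from_pretrained(base, "path/to/lora_checkpoint")

# Optionally fold the LoRA weights back into the base model for plain inference.
model = model.merge_and_unload()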
The model checkpoints are coming soon!
If you find this useful, please cite the following paper:
@article{cai2024lococo,
  title={LoCoCo: Dropping In Convolutions for Long Context Compression},
  author={Cai, Ruisi and Tian, Yuandong and Wang, Zhangyang and Chen, Beidi},
  journal={arXiv preprint arXiv:2406.05317},
  year={2024}
}