This code is adapted from here. If you use this code, please consider citing the original paper.
export MODEL_NAME=JunxiongWang/Llama3.2-Mamba2-3B-distill
export RESULT_SAVE_PATH=Llama3.1-Mamba-distill
python -u needle_in_haystack.py --s_len 0 --e_len 65536 \
    --model_provider Mamba \
    --model_path ${MODEL_NAME} \
    --test_name ${RESULT_SAVE_PATH}
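Once the sweep finishes, the per-test outputs can be summarized into a single depth-by-context-length grid. Below is a minimal post-processing sketch, not part of the repo: it assumes the script writes one JSON file per test containing context_length, depth_percent, and score fields (as in the original needle-in-a-haystack harness) under a directory named after RESULT_SAVE_PATH; adjust the path to match where your run actually saves results.

# Post-processing sketch (assumptions noted above): aggregate per-test JSON
# results into a context-length x needle-depth score grid.
import glob
import json

import pandas as pd

RESULTS_DIR = "Llama3.1-Mamba-distill"  # assumed to match ${RESULT_SAVE_PATH}; adjust as needed

records = []
for path in glob.glob(f"{RESULTS_DIR}/*.json"):
    with open(path) as f:
        result = json.load(f)
    records.append(
        {
            "context_length": result["context_length"],
            "depth_percent": result["depth_percent"],
            "score": result["score"],
        }
    )

# Rows = needle depth (%), columns = context length (tokens), values = mean score.
df = pd.DataFrame(records)
grid = df.pivot_table(index="depth_percent", columns="context_length", values="score")
print(grid.round(2))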
Note that, during distillation, we only train the model with a 2K context length.
Here are the results: