A demo of fine-tuning an LLM on coding data to improve autocomplete predictions for continue.dev within an organization, using dlt, Hugging Face, and Ollama.
The continue.dev tool has an autocomplete feature and records, in an autocomplete.jsonl file, whether the user accepted or rejected each suggestion. The continue-hf-pipeline.py file contains a custom dlt destination that takes this data, converts it to a Parquet file, and pushes it to a Hugging Face dataset repo in a format ready for fine-tuning an LLM.
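The core transformation the destination performs can be sketched as below. This is a hedged, stdlib-only sketch: the field names (`prefix`, `suffix`, `completion`, `accepted`) are assumptions about the autocomplete.jsonl record shape, not continue.dev's exact schema, and the StarCoder-style fill-in-the-middle tokens are one plausible training format. In the real pipeline this logic would live inside a `@dlt.destination`-decorated function that writes Parquet and uploads it with `huggingface_hub`.

```python
import json
from typing import Iterable, Iterator


def to_training_records(lines: Iterable[str]) -> Iterator[dict]:
    """Keep accepted completions and shape them for fine-tuning.

    Field names below are illustrative, not continue.dev's exact
    schema. Uses StarCoder-style fill-in-the-middle tokens so the
    model learns to complete code given surrounding context.
    """
    for line in lines:
        rec = json.loads(line)
        if not rec.get("accepted"):
            continue  # rejected suggestions are dropped
        yield {
            "content": (
                f"<fim_prefix>{rec['prefix']}"
                f"<fim_suffix>{rec['suffix']}"
                f"<fim_middle>{rec['completion']}"
            )
        }


# Two synthetic records: one accepted, one rejected.
sample = [
    json.dumps({"prefix": "def add(a, b):\n    ", "suffix": "",
                "completion": "return a + b", "accepted": True}),
    json.dumps({"prefix": "x = ", "suffix": "",
                "completion": "None", "accepted": False}),
]
records = list(to_training_records(sample))
```

Only the accepted suggestion survives the filter; each surviving record is a single text field, which keeps the downstream Parquet schema and fine-tuning setup simple.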
I fine-tuned the starcoder2:3b model using the SFTTrainer from Hugging Face's TRL library, based on the fine-tuning code open-sourced by the model's creators.
I tried fine-tuning on both the dlt GitHub repository and the autocomplete dataset mentioned above. The code for the fine-tuning process can be found here: https://colab.research.google.com/drive/1jjb14BDlEeGjRmeXnfm41gDBlTNvsscn?usp=sharing