An introduction to large language models for scientific research - how do they work, how can they be used, and how can they be trained?
Check out the notebooks · Check out the Slides
Report Bug · Request Feature
Head to the hands-on course »
This repository is Part I of a two-part course:
- An introduction to large language models for scientific research - how do they work, how can they be used, and how can they be trained?
- A hands-on tutorial on how to use large language models for scientific research.
Part 1 consists of slides and notebooks that introduce LLMs and how they can be used for scientific research. We look at finetuning a BERT model for classification tasks, finetuning GPT-2 to make it sound like the President of the United States, and using RAG to build a simple question-answering system with Streamlit.
Part 2 is a much more comprehensive hands-on tutorial. It is designed as a reference source for our hands-on LLM workshop, but may also be useful to others getting started with LLMs. To get started, follow the link above and head to the website.
In terms of prerequisites, you should be proficient with Python and PyTorch; some understanding of git is also helpful.
Development of this material is ongoing, and given the rapid advancement of LLM libraries, it may contain bugs or out-of-date information. If you find any issues, please raise an Issue via GitHub and we will endeavour to address it as soon as possible. Thank you!
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under an MIT License. See `LICENSE` for more information.
Slides for the course can be found in the `Slides` directory. The notebooks that cover the content listed below are in the `notebooks` directory.
Download the code from GitHub, unzip the file, and upload it to your Google Drive. At the beginning of each notebook there is an optional cell that will mount your Drive and put you in the correct directory.
We recommend using Docker to run the code locally. To pull the Docker image, run:

```bash
docker pull acceleratescience/large-language-models:latest
```

and then start a container with:

```bash
docker run -it acceleratescience/large-language-models:latest /bin/bash
```
Alternatively, to run without Docker, just install the requirements:

```bash
pip install -r requirements.txt
```
Set up an OpenAI account and a Hugging Face account. For the OpenAI account, you will need to enter credit card information in order to actually use the API!
You have some options when using the OpenAI API. You can either initialize the OpenAI client directly:

```python
client = OpenAI(api_key='YOUR_API_KEY')
```

or you can create a separate file called `.env` and store your API key in it:

```
OPENAI_API_KEY = 'sk-1234567890'
```

When you then call `OpenAI()`, your key will be read automatically using `os.environ.get("OPENAI_API_KEY")`.
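Putting this together, here is a minimal sketch of both approaches. It assumes the `openai` (v1+) and `python-dotenv` packages are installed; `load_dotenv()` is one common way to read a `.env` file into the environment.

```python
from dotenv import load_dotenv  # assumes python-dotenv is installed
from openai import OpenAI

# Option 1: pass the key explicitly (avoid hard-coding it in shared code).
client = OpenAI(api_key="YOUR_API_KEY")

# Option 2: keep OPENAI_API_KEY in a .env file. load_dotenv() copies it
# into the environment, and OpenAI() then picks it up automatically via
# os.environ.get("OPENAI_API_KEY").
load_dotenv()
client = OpenAI()
```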
A walkthrough of using and finetuning a Hugging Face model (GPT-2) can be found in the notebook `finetuning.ipynb`. This notebook also contains code detailing the construction of a very simple RAG system.
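For a quick orientation before opening the notebook, here is a minimal sketch of loading GPT-2 with the `transformers` library and sampling some text; the prompt and generation settings are illustrative, and the finetuning itself is covered in the notebook.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained GPT-2 checkpoint and its tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode a prompt and sample a short continuation.
inputs = tokenizer("My fellow Americans,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```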
The notebook `BERT_classification.ipynb` contains some code for finetuning smaller models for classification or regression tasks using a simple dataset. It can be modified relatively easily to include your own data.
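As a rough sketch of the setup (the checkpoint name and `num_labels` below are placeholder choices, not necessarily those used in the notebook):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # set to the number of classes in your own data
)

# Tokenize one example; in practice you would map this over a whole dataset
# and train with the Trainer API or a standard PyTorch loop.
batch = tokenizer("An example sentence to classify.", return_tensors="pt")
logits = model(**batch).logits  # raw scores, one per class
```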
In the workshop, we covered some no-code options:
The easiest to get up and running is LMStudio. If you have a Macbook, it should be very easy to install. Your experience on Windows may vary.
GPT4All is also relatively easy to install and get up and running.
Textgen-webui is capable of both inference and some fine-tuning. Getting Textgen-webui up and running on your local machine is not too challenging. It is also possible to run high-parameter models on the HPC or another remote cluster and access the UI from your local machine. This can be more challenging, so if you're interested in doing this and get stuck, get in touch with us.
You can find a very brief introduction to producing images with Stable Diffusion in the notebook `introduction_to_stable_diffusion.ipynb`. This should run on a Macbook or Colab.
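For reference, a minimal text-to-image sketch using the `diffusers` library; the model ID and device below are illustrative assumptions, not necessarily what the notebook uses.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint; fp16 keeps GPU memory usage down.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # use "mps" on a Macbook with Apple Silicon

image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```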
In addition to the above no-code options, there is also ComfyUI, a UI for running Stable Diffusion model checkpoints and LoRAs. This will be slow when running on a laptop, but as with Textgen-webui, ComfyUI can also be run on a remote GPU. There are numerous tutorials online and on YouTube for ComfyUI (here for example).