
Get an introduction to how large language models work, and how to get up and running quickly.


Large Language Models for Scientific Research

An introduction to large language models for scientific research - how do they work, how can they be used, and how can they be trained?

Check out the notebooks · Check out the Slides
Report Bug · Request Feature

Head to the hands-on course »

Table of Contents
  1. Overview
  2. Prerequisites
  3. Contributing
  4. License
  5. Getting Started
    1. Colab
    2. Local
  6. Setting up API Keys
  7. Training and Augmenting GPT-2
  8. Finetuning for classification
  9. No-code
  10. Stable Diffusion

Overview

This repository is Part I of a two-part course:

  1. An introduction to large language models for scientific research - how do they work, how can they be used, and how can they be trained?
  2. A hands-on tutorial on how to use large language models for scientific research.

Introduction to LLMs

Part 1 consists of slides and notebooks that introduce LLMs and how they can be used for scientific research. We look at finetuning a BERT model for classification tasks, finetuning GPT-2 to make it sound like the President of the United States, and using RAG to produce a simple question-answering system with Streamlit.

Hands-on LLM workshop

Part 2 is a much more comprehensive hands-on tutorial. It is designed as a reference source for our hands-on LLM workshop, but may be useful for others to get started with LLMs. To get started, follow the link and head to the website.

Prerequisites

This course assumes some prerequisite skills: you should be proficient with Python and PyTorch, and some understanding of git would be helpful.

Development of this material is ongoing, and given the rapid advancement of LLM libraries, it may contain bugs or out-of-date information. If you find any issues, please raise an Issue via GitHub, and we will endeavour to address it as soon as possible. Thank you!

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under an MIT License. See LICENSE for more information.

(back to top)

Getting Started

Slides for the course can be found in the Slides directory. The notebooks that cover the content listed below are in the notebooks directory.

Colab

Download the code from GitHub. Unzip the file and upload it to your Google Drive. At the beginning of each notebook there is an optional cell which will mount your Drive and put you in the correct directory.
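That optional cell looks something like the following sketch. The folder name below is an assumption on our part, so adjust it to wherever you uploaded the unzipped code:

```python
# Optional first cell: mount Google Drive when running in Colab,
# then change into the course folder. Does nothing when run locally.
import os
import sys

def mount_and_cd(folder="large-language-models"):
    """Mount Drive in Colab and cd into `folder`; no-op outside Colab."""
    if "google.colab" not in sys.modules:
        return None  # not running in Colab, nothing to mount
    from google.colab import drive
    drive.mount("/content/drive")
    os.chdir(os.path.join("/content/drive/MyDrive", folder))
    return os.getcwd()

mount_and_cd()
```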

Local

Docker

We recommend using Docker to run the code locally. To pull the prebuilt Docker image, run:

docker pull acceleratescience/large-language-models:latest

and then

docker run -it acceleratescience/large-language-models:latest /bin/bash

Alternatively, if you'd rather run without Docker, install the dependencies directly:

pip install -r requirements.txt

(back to top)

Setting up API Keys

Set up an OpenAI account and a Hugging Face account. Note that for the OpenAI account, you will need to enter credit card information in order to actually use the API!

You have two options when using the OpenAI API. You can initialize the OpenAI client directly:

client = OpenAI(api_key='YOUR_API_KEY')

or you can create a separate file called .env and store your API key in this way:

OPENAI_API_KEY = 'sk-1234567890'

and when you call OpenAI(), your key will be read automatically via os.environ.get("OPENAI_API_KEY").
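As a sketch of how that second option works under the hood (the helper name here is our own invention, and we parse the .env file by hand to stay dependency-free; in practice a library such as python-dotenv does this):

```python
import os

def load_openai_key(env_file=".env"):
    """Return OPENAI_API_KEY from the environment, else from a .env file."""
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    if os.path.exists(env_file):
        with open(env_file) as f:
            for line in f:
                name, _, value = line.partition("=")
                if name.strip() == "OPENAI_API_KEY":
                    # strip whitespace and surrounding quotes
                    return value.strip().strip("'\"")
    return None

# client = OpenAI(api_key=load_openai_key())  # requires the openai package
```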

(back to top)

Training and Augmenting GPT-2

A walkthrough of using and finetuning a Hugging Face model (GPT-2) can be found in the notebook finetuning.ipynb.

This notebook also contains code detailing the construction of a very simple RAG system.
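As a toy illustration of the idea behind RAG (this is not the notebook's code, and a real system would use learned embeddings rather than word counts): retrieve the stored document most similar to the question, then prepend it to the prompt as context.

```python
# Toy retrieve-then-generate sketch: score documents against a query by
# bag-of-words cosine similarity, then build a context-augmented prompt.
import math
from collections import Counter

def tokenize(text):
    # crude bag-of-words: lowercase and strip basic punctuation
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents):
    """Return the document most similar to the query."""
    q = tokenize(query)
    return max(documents, key=lambda d: cosine(q, tokenize(d)))

docs = [
    "GPT-2 is a causal language model released by OpenAI.",
    "BERT is an encoder model often finetuned for classification.",
]
question = "Which model is finetuned for classification?"
prompt = f"Context: {retrieve(question, docs)}\n\nQuestion: {question}"
```

The prompt would then be sent to the language model, which answers using the retrieved context rather than its parametric memory alone.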

(back to top)

Finetuning for classification

The notebook BERT_classification.ipynb contains some code for finetuning smaller models for classification or regression tasks using a simple dataset. It can be modified relatively easily to include your own data.

(back to top)

No-code

In the workshop, we covered some no-code options:

The easiest to get up and running is LMStudio. If you have a MacBook, it should be very easy to install. Your experience on Windows may vary.

GPT4All is also relatively easy to install and get up and running.

Textgen-webui is capable of both inference and some finetuning. Getting Textgen-webui up and running on your local machine is not too challenging. It is also possible to run high-parameter models on the HPC or another remote cluster and access the UI from your local machine. This can be more challenging, so if you're interested in doing this and get stuck, get in touch with us.

(back to top)

Stable Diffusion

You can find a very brief introduction to producing images with Stable Diffusion in the notebook titled introduction_to_stable_diffusion.ipynb. This should run on a Macbook or Colab.

In addition to the above no-code options, there is also ComfyUI, a UI for running Stable Diffusion model checkpoints and LoRAs. It will be slow on a laptop, but as with Textgen-webui, ComfyUI can also be run on a remote GPU. There are numerous tutorials for ComfyUI online and on YouTube.

(back to top)
