Welcome to the LLaMA Text Generation API! This API is implemented in Python using Flask and uses a pre-trained LLaMA model to generate text from user input.
- Setting Up Virtual Environment
- Installing Requirements
- Choosing a GPU
- Running API with Torch Command
- Running API using Gunicorn
- API Usage with cURL
- Request/Response Objects
- Using Postman
## Setting Up Virtual Environment

- **Install Virtual Environment:** Ensure that Python 3 and `pip` are installed, then run:

  ```bash
  pip install virtualenv
  ```

- **Create Virtual Environment:** Navigate to the project directory and run:

  ```bash
  virtualenv venv
  ```

- **Activate Virtual Environment:**
  - Windows: `.\venv\Scripts\activate`
  - Linux/Mac: `source venv/bin/activate`
## Installing Requirements

```bash
# Clone the repo
git clone https://github.com/Lightning-AI/lit-llama
cd lit-llama

# Install dependencies
pip install -r requirements.txt
```

You are all set! 🎉
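If you want a quick sanity check that the installation worked (this assumes `torch` is pulled in by `requirements.txt`, as it is for lit-llama), you can run:

```bash
# Optional sanity check: confirm torch imports and whether a CUDA GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```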
## Choosing a GPU

- **LLaMA 7B and 13B models:** run comfortably on a GPU with 24 GB of VRAM, such as an NVIDIA RTX A5000.
- **LLaMA 30B model:** requires a more powerful GPU, such as an NVIDIA A40 with 48 GB of VRAM.
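If you are not sure what your machine provides, a small PyTorch snippet (assuming `torch` is installed from the previous step) will list each visible GPU and its total VRAM so you can match it against the guidance above:

```python
import torch

# List every CUDA device PyTorch can see along with its total memory.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU visible to PyTorch.")
```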
## Running API with Torch Command

Navigate to the project directory and run:

```bash
python app.py
```

The API will be hosted at `http://0.0.0.0:5000/complete`.
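This README does not reproduce `app.py`; the sketch below is only an illustration of how a Flask app serving `/complete` on `0.0.0.0:5000` could be wired. The `generate_text` function and the default parameter values are placeholders, not the project's actual implementation:

```python
# Illustrative sketch only -- the real app.py loads the LLaMA checkpoint
# and runs actual inference instead of generate_text below.
import time
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_text(text, top_p, top_k, temperature, length):
    # Placeholder for the LLaMA generation call.
    return text + " ..."

@app.route("/complete", methods=["POST"])
def complete():
    payload = request.get_json()
    start = time.time()
    output = generate_text(
        payload["text"],
        payload.get("top_p", 0.9),
        payload.get("top_k", 50),
        payload.get("temperature", 0.8),
        payload.get("length", 30),
    )
    # Response shape matches the example documented below.
    return jsonify({
        "completion": {
            "generation_time": f"{time.time() - start}s",
            "text": [output],
        }
    })

if __name__ == "__main__":
    # Matches the advertised address: http://0.0.0.0:5000/complete
    app.run(host="0.0.0.0", port=5000)
```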
## Running API using Gunicorn

- **Install Gunicorn:** Ensure the virtual environment is activated, then run:

  ```bash
  pip install gunicorn
  ```

- **Run the API:** Use Gunicorn to serve the Flask app:

  ```bash
  gunicorn -w 4 -b 0.0.0.0:5000 app:app
  ```

- **Set Up Gunicorn as a Service:**
  - Create a Gunicorn systemd service file:

    ```bash
    sudo nano /etc/systemd/system/llama-api.service
    ```

  - Add the following content, adjusting the paths accordingly:

    ```ini
    [Unit]
    Description=Gunicorn instance to serve LLaMA API
    After=network.target

    [Service]
    User=your_user
    Group=www-data
    WorkingDirectory=/path/to/your/project
    Environment="PATH=/path/to/your/project/venv/bin"
    ExecStart=/path/to/your/project/venv/bin/gunicorn --workers 4 --bind 0.0.0.0:5000 app:app

    [Install]
    WantedBy=multi-user.target
    ```

  - Start and enable the Gunicorn service (see the quick checks after this list):

    ```bash
    sudo systemctl start llama-api
    sudo systemctl enable llama-api
    ```
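After starting and enabling the unit, standard systemd commands can confirm it is healthy (the unit name below matches the `llama-api` service created above; run `daemon-reload` first if systemd does not see the new unit file):

```bash
# Pick up the new unit file, then check service status and follow its logs
sudo systemctl daemon-reload
sudo systemctl status llama-api
sudo journalctl -u llama-api -f
```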
## API Usage with cURL

Example cURL request:

```bash
curl -X POST http://0.0.0.0:5000/complete \
  -H "Content-Type: application/json" \
  -d '{"text": "Once upon a time,", "top_p": 0.9, "top_k": 50, "temperature": 0.8, "length": 30}'
```

Example response:

```json
{
  "completion": {
    "generation_time": "0.8679995536804199s",
    "text": ["Once upon a time, the kingdom was ruled by a wise and just king..."]
  }
}
```
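The same request can be sent from Python; this sketch assumes the `requests` package is installed (`pip install requests`) and that the API is running locally:

```python
import requests

# Send the same completion request shown in the cURL example above.
resp = requests.post(
    "http://0.0.0.0:5000/complete",
    json={
        "text": "Once upon a time,",
        "top_p": 0.9,
        "top_k": 50,
        "temperature": 0.8,
        "length": 30,
    },
    timeout=120,
)
resp.raise_for_status()

completion = resp.json()["completion"]
print(completion["text"][0])          # generated text
print(completion["generation_time"])  # e.g. "0.8679995536804199s"
```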
## Request/Response Objects

- **Request:**
  - `text`: The input text (string).
  - `top_p`: Cumulative probability threshold for nucleus sampling (float).
  - `top_k`: The number of most probable tokens to consider (integer).
  - `temperature`: Controls the randomness of the sampling process (float).
  - `length`: The number of new tokens to generate (integer).
- **Response:**
  - `text`: The generated text based on the input (string).
  - `generation_time`: Time taken to generate the text (string, formatted in seconds).
## Using Postman

- **Set Up Postman:** Download and install Postman from Postman's official site.
- **Send Request:**
  - Set the request type to `POST`.
  - Enter the request URL: `http://0.0.0.0:5000/complete`.
  - Navigate to the "Body" tab, select "raw" and "JSON (application/json)".
  - Enter the JSON payload:

    ```json
    {
      "text": "Once upon a time,",
      "top_p": 0.9,
      "top_k": 50,
      "temperature": 0.8,
      "length": 30
    }
    ```

  - Click "Send" and view the API's response in the section below.
And that concludes our README guide! Feel free to adapt it to any additional requirements your API may have.