Welcome to the LLaMA Text Generation API! This API is implemented in Python using Flask and uses a pre-trained LLaMA model to generate text from user input.
- Setting Up Virtual Environment
- Installing Requirements
- Choosing a GPU
- Running API with Torch Command
- Running API using Gunicorn
- API Usage with cURL
- Request/Response Objects
- Using Postman
## Setting Up Virtual Environment

- **Install Virtual Environment:** Ensure that Python 3 and `pip` are installed, then run:

  ```bash
  pip install virtualenv
  ```

- **Create Virtual Environment:** Navigate to the project directory and run:

  ```bash
  virtualenv venv
  ```

- **Activate Virtual Environment:**
  - Windows: `.\venv\Scripts\activate`
  - Linux/Mac: `source venv/bin/activate`
## Installing Requirements

```bash
# Clone the repo
git clone https://github.com/Lightning-AI/lit-llama
cd lit-llama

# Install dependencies
pip install -r requirements.txt
```

You are all set! 🎉
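If you want a quick sanity check that the installation worked (this assumes `torch` is pulled in by `requirements.txt`, as it is for lit-llama), you can run:

```bash
# Optional sanity check: confirm torch imports and whether a CUDA GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```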
## Choosing a GPU

- **LLaMA 7B and 13B models:** run comfortably on a GPU with 24 GB of VRAM, such as an NVIDIA RTX A5000.
- **LLaMA 30B model:** requires a more powerful GPU, such as an NVIDIA A40 with 48 GB of VRAM.
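If you are not sure what your machine provides, a small PyTorch snippet (assuming `torch` is installed from the previous step) will list each visible GPU and its total VRAM so you can match it against the guidance above:

```python
import torch

# List every CUDA device PyTorch can see along with its total memory.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU visible to PyTorch.")
```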
## Running API with Torch Command

Navigate to the project directory and run:

```bash
python app.py
```

The API will be hosted at `http://0.0.0.0:5000/complete`.
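This README does not reproduce `app.py`; the sketch below is only an illustration of how a Flask app serving `/complete` on `0.0.0.0:5000` could be wired. The `generate_text` function and the default parameter values are placeholders, not the project's actual implementation:

```python
# Illustrative sketch only -- the real app.py loads the LLaMA checkpoint
# and runs actual inference instead of generate_text below.
import time
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_text(text, top_p, top_k, temperature, length):
    # Placeholder for the LLaMA generation call.
    return text + " ..."

@app.route("/complete", methods=["POST"])
def complete():
    payload = request.get_json()
    start = time.time()
    output = generate_text(
        payload["text"],
        payload.get("top_p", 0.9),
        payload.get("top_k", 50),
        payload.get("temperature", 0.8),
        payload.get("length", 30),
    )
    # Response shape matches the example documented below.
    return jsonify({
        "completion": {
            "generation_time": f"{time.time() - start}s",
            "text": [output],
        }
    })

if __name__ == "__main__":
    # Matches the advertised address: http://0.0.0.0:5000/complete
    app.run(host="0.0.0.0", port=5000)
```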
## Running API using Gunicorn

- **Install Gunicorn:** Ensure the virtual environment is activated, then run:

  ```bash
  pip install gunicorn
  ```

- **Run the API:** Use Gunicorn to serve the Flask app:

  ```bash
  gunicorn -w 4 -b 0.0.0.0:5000 app:app
  ```

- **Set Up Gunicorn as a Service:**
  - Create a Gunicorn systemd service file:

    ```bash
    sudo nano /etc/systemd/system/llama-api.service
    ```

  - Add the following content, adjusting the paths accordingly:

    ```ini
    [Unit]
    Description=Gunicorn instance to serve LLaMA API
    After=network.target

    [Service]
    User=your_user
    Group=www-data
    WorkingDirectory=/path/to/your/project
    Environment="PATH=/path/to/your/project/venv/bin"
    ExecStart=/path/to/your/project/venv/bin/gunicorn --workers 4 --bind 0.0.0.0:5000 app:app

    [Install]
    WantedBy=multi-user.target
    ```

  - Start and enable the Gunicorn service (see the quick checks after this list):

    ```bash
    sudo systemctl start llama-api
    sudo systemctl enable llama-api
    ```
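After starting and enabling the unit, standard systemd commands can confirm it is healthy (the unit name below matches the `llama-api` service created above; run `daemon-reload` first if systemd does not see the new unit file):

```bash
# Pick up the new unit file, then check service status and follow its logs
sudo systemctl daemon-reload
sudo systemctl status llama-api
sudo journalctl -u llama-api -f
```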
## API Usage with cURL

Example cURL request:

```bash
curl -X POST http://0.0.0.0:5000/complete \
  -H "Content-Type: application/json" \
  -d '{"text": "Once upon a time,", "top_p": 0.9, "top_k": 50, "temperature": 0.8, "length": 30}'
```

Example response:

```json
{
  "completion": {
    "generation_time": "0.8679995536804199s",
    "text": ["Once upon a time, the kingdom was ruled by a wise and just king..."]
  }
}
```
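The same request can be sent from Python; this sketch assumes the `requests` package is installed (`pip install requests`) and that the API is running locally:

```python
import requests

# Send the same completion request shown in the cURL example above.
resp = requests.post(
    "http://0.0.0.0:5000/complete",
    json={
        "text": "Once upon a time,",
        "top_p": 0.9,
        "top_k": 50,
        "temperature": 0.8,
        "length": 30,
    },
    timeout=120,
)
resp.raise_for_status()

completion = resp.json()["completion"]
print(completion["text"][0])          # generated text
print(completion["generation_time"])  # e.g. "0.8679995536804199s"
```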
## Request/Response Objects

- **Request:**
  - `text`: The input text (string).
  - `top_p`: Cumulative probability threshold for nucleus sampling (float).
  - `top_k`: The number of most probable tokens to consider (integer).
  - `temperature`: Controls the randomness of the sampling process (float).
  - `length`: The number of new tokens to generate (integer).
- **Response:**
  - `text`: The generated text based on the input (string).
  - `generation_time`: Time taken to generate the text (string, formatted in seconds).
## Using Postman

- **Set Up Postman:** Download and install Postman from Postman's official site.
- **Send Request:**
  - Set the request type to `POST`.
  - Enter the request URL: `http://0.0.0.0:5000/complete`.
  - Navigate to the "Body" tab, select "raw" and "JSON (application/json)".
  - Enter the JSON payload:

    ```json
    {
      "text": "Once upon a time,",
      "top_p": 0.9,
      "top_k": 50,
      "temperature": 0.8,
      "length": 30
    }
    ```

  - Click "Send" and view the API's response in the section below.
And that concludes our README guide! Feel free to adapt it to any additional requirements your API may have.