- Large Language Models (LLMs): Utilize quantized versions of LLMs like llama2, open-llama, or falcon that can run on Google Colab.
- Fine-tuning/Prompt Engineering:
  - Weather Service:
    - Identify the location from user queries (e.g., "weather in Chennai").
    - Respond with "Service: Weather, Location: Chennai".
    - If the location is missing, prompt the user for it and then respond with "Location: Chennai".
    - Support stateful interaction with multiple query-response exchanges.
  - News Service:
    - Extract the location from queries about news (e.g., "news for India").
    - Respond with "Location: India" or "Location: USA" depending on the location found.
    - If the location is missing, prompt the user for it and then respond with "Service: News, Location: India".
- Generalization: Design the system to work with any service and identify the relevant service and location from user prompts. Specifically:
  - Develop a generic response format "Service: <service>, Location: <location>".
  - Identify weather-related prompts (e.g., "weather in Bangalore", "climate in Bangalore") and respond with "Service: Weather, Location: Bangalore".
  - Identify news-related prompts (e.g., "news for India") and respond with "Service: News, Location: India".
  - Handle missing location information with follow-up prompts.
Desired Outcome:
- A system that can dynamically identify services and locations from user prompts and provide appropriate responses.
- The system should be generalizable to work with any service and location.
- The system should be lightweight and run efficiently on Google Colab.
- We are using the GPTQ-quantized version of the openchat_3.5 model, published by "TheBloke".
- GPTQ is a post-training quantization (PTQ) method for 4-bit quantization.
- The original openchat_3.5 model is itself a fine-tuned Mistral model.
- 🔥 The first 7B model to achieve comparable results with ChatGPT (March)! 🔥
- 🤖 An open-source model scoring 7.81 on MT-Bench, outperforming 70B models 🤖
- The original openchat_3.5 model can run on a consumer GPU with 24 GB of RAM, whereas the quantized version consumes only ~6 GB of GPU VRAM.
- This is a 4-bit quantized, 7-billion-parameter model with a sequence length of 4096.
| Model | # Params | Average | MT-Bench | AGIEval | BBH MC | TruthfulQA | MMLU | HumanEval | BBH CoT | GSM8K |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenChat-3.5 | 7B | 61.6 | 7.81 | 47.4 | 47.6 | 59.1 | 64.3 | 55.5 | 63.5 | 77.3 |
| ChatGPT (March)* | ? | 61.5 | 7.94 | 47.1 | 47.6 | 57.7 | 67.3 | 48.1 | 70.1 | 74.9 |
| Mistral | 7B | - | 6.84 | 38.0 | 39.0 | - | 60.1 | 30.5 | - | 52.2 |
| Open-source SOTA** | 13B-70B | 61.4 | 7.71 | 41.7 | 49.7 | 62.3 | 63.7 | 73.2 | 41.4 | 82.3 |
| | | | WizardLM 70B | Orca 13B | Orca 13B | Platypus2 70B | WizardLM 70B | WizardCoder 34B | Flan-T5 11B | MetaMath 70B |
- System Requirements
  - ~6 GB of system RAM
  - ~6 GB of GPU VRAM

> Important: A GPU with 6 GB of VRAM is mandatory for running the inference. It works fine on Google Colab with a T4 GPU enabled.
- Software Requirements
  - python-version: 3.10
  - CUDA-version: 11.8
- python packages:
```bash
pip install accelerate==0.25.0
pip install auto-gptq --extra-index-url "https://huggingface.github.io/autogptq-index/whl/cu118/"
pip install bitsandbytes==0.41.3.post2
pip install einops==0.7.0
pip install langchain==0.0.349
pip install optimum==1.15.0
pip install tiktoken==0.5.2
pip install "torch @ https://download.pytorch.org/whl/cu118/torch-2.1.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=a81b554184492005543ddc32e96469f9369d778dedd195d73bda9bed407d6589"
pip install "torchaudio @ https://download.pytorch.org/whl/cu118/torchaudio-2.1.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=cdfd0a129406155eee595f408cafbb92589652da4090d1d2040f5453d4cae71f"
pip install "torchvision @ https://download.pytorch.org/whl/cu118/torchvision-0.16.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=033712f65d45afe806676c4129dfe601ad1321d9e092df62b15847c02d4061dc"
pip install transformers==4.35.2
```
- 🛠️ Tools used:
  - transformers: for importing and using LLM models from 🤗 Hugging Face.
  - auto-gptq: for working with quantized models.
  - langchain: 🦜🔗 for advanced prompting.
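As a minimal sketch of how these tools fit together (the Hugging Face repo id `TheBloke/openchat_3.5-GPTQ` and the generation settings below are assumptions, not taken from the repo's code):

```python
# Sketch: load the GPTQ-quantized openchat_3.5 model and expose it as a text-generation
# pipeline; langchain can then wrap this pipeline for prompting.
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/openchat_3.5-GPTQ"  # assumed repo id; adjust to the exact model used

tokenizer = AutoTokenizer.from_pretrained(model_id)
# With optimum + auto-gptq installed, transformers loads the 4-bit GPTQ weights directly.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=False,         # deterministic decoding helps keep the JSON output stable
    return_full_text=False,  # return only the newly generated tokens, not the prompt
)

llm = HuggingFacePipeline(pipeline=generator)  # optional langchain wrapper
```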
- 💬 Conversation-Flow Chart:
  - This section describes the process of handling user input and generating structured output.
I] User Input and Prompt:
- When the bot receives user input, it is combined with a specific prompt.
- This prompt instructs the LLM to generate a structured output.
- For example, if the user input is:
  USER: what is the weather in Chennai?
- Prompt template (OpenChat):
  GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:
- The {prompt} along with the user input will be:
  GPT4 Correct User: As an Named Entity Recognition and Intent Classification Expert, your task is to analyze questions like the following '###user_input': ###user_input: what is the weather in Chennai? You are required to perform two main tasks based on the '###user_input': #Task 1: Classify the '###user_input' into one of the predefined intents ['Weather', 'News']. IMPORTANT: If no clear match to given intents is found, categorize the intent as "Other". #Task 2: Extract any geographical or location-related entities present in the '###user_input'. IMPORTANT: If no specific location is mentioned, label the location as "Other". ###INSTRUCTIONS: Do NOT add any clarifying information. Output MUST follow the schema below. The output should be formatted as a JSON instance that conforms to the JSON schema below. As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted. Here is the output schema: ''' {"properties": {"service_intent": {"title": "Service Intent", "description": "This field stores the intent classified from the user input.", "type": "string"}, "location": {"title": "Location", "description": "This field stores the location value extracted from the user input.", "type": "string"}}, "required": ["service_intent", "location"]} ''' <|end_of_turn|>GPT4 Correct Assistant:
- This prompt is generated by combining 🦜🔗langchain's format instructions with some custom instructions (see the sketch below).
- We can dynamically pass the required services; here it is ['Weather', 'News'].
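A minimal sketch of how such a prompt could be assembled with 🦜🔗langchain's `PydanticOutputParser` (field names follow the schema above; the exact template wording and helper names in the repo may differ):

```python
# Sketch: build the structured-output prompt with a dynamic list of services.
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain.pydantic_v1 import BaseModel, Field


class ServiceQuery(BaseModel):
    service_intent: str = Field(description="This field stores the intent classified from the user input.")
    location: str = Field(description="This field stores the location value extracted from the user input.")


parser = PydanticOutputParser(pydantic_object=ServiceQuery)

services = ["Weather", "News"]  # any list of services can be passed here dynamically

template = (
    "GPT4 Correct User: As a Named Entity Recognition and Intent Classification Expert, "
    "analyze the '###user_input'.\n"
    "###user_input: {user_input}\n"
    "#Task 1: Classify the '###user_input' into one of the predefined intents {services}. "
    "If no clear match is found, categorize the intent as \"Other\".\n"
    "#Task 2: Extract any location-related entity. If none is mentioned, label it \"Other\".\n"
    "{format_instructions}\n"
    "<|end_of_turn|>GPT4 Correct Assistant:"
)

prompt = PromptTemplate(
    template=template,
    input_variables=["user_input"],
    partial_variables={
        "services": str(services),
        "format_instructions": parser.get_format_instructions(),
    },
)

print(prompt.format(user_input="what is the weather in Chennai?"))
```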
II] LLM Output Generation: 🤖
- The combined user input and prompt are passed to the LLM for processing.
- The LLM attempts to generate structured output, aiming for JSON format.
III] Output Validation:
- Checking for structured output format:
  - If the output is not in JSON format, an iterative process begins.
  - The user input is passed again to the LLM with the same prompt.
  - This loop continues for a pre-defined number of iterations (N).
  - This iterative process helps filter out rare glitches in the LLM output; a sketch of the loop follows this list.
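A rough sketch of that retry loop, with a placeholder `call_llm()` standing in for the real model call (the names and the retry budget `N` are illustrative, not taken from the repo):

```python
# Sketch: retry until the model returns parseable JSON, up to N attempts.
import json

N = 3  # pre-defined number of iterations


def call_llm(prompt: str) -> str:
    """Placeholder for the actual LLM call; returns the model's raw text output."""
    return '{"service_intent": "Weather", "location": "Chennai"}'


def get_structured_output(prompt: str) -> dict | None:
    for _ in range(N):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)   # well-formed JSON: stop retrying
        except json.JSONDecodeError:
            continue                 # rare glitch: ask the LLM again with the same prompt
    return None                      # still malformed after N attempts
```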
- Checking for missing key values:
  - After the LLM generates well-structured JSON data, an additional validation step is performed. It ensures that the essential keys, "service_intent" and "location", contain valid non-null values (see the sketch below).
  - Service key: the value of the "service_intent" key is checked for null or emptiness.
  - Location key: similarly, the value of the "location" key is checked for null or emptiness.
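A small sketch of that check; treating "Other" as missing is an assumption based on the fallback label used in the prompt:

```python
# Sketch: flag essential keys whose values are null, empty, or the fallback "Other".
def find_missing_keys(result: dict) -> list[str]:
    missing = []
    for key in ("service_intent", "location"):
        value = (result.get(key) or "").strip()
        if not value or value.lower() == "other":
            missing.append(key)
    return missing


print(find_missing_keys({"service_intent": "Weather", "location": ""}))  # ['location']
```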
- Follow-up Questioning:
  If either key (service or location) lacks a valid value:
  - Service
    - The bot initiates a follow-up questioning flow specifically designed to elicit the missing service information from the user.
    - This may involve asking the user directly what service they are seeking information about.
  - Location
    - A similar follow-up questioning flow is initiated if the "location" key lacks a valid value.
    - The bot prompts the user for the desired location information.
- Resuming the Process:
  - Once the user provides the missing information, the original process resumes.
  - The combined user input (including service and location information) is again passed to the LLM with the specific prompt for structured output generation.
  - The validation and follow-up questioning steps repeat as needed until all essential key-value pairs are obtained and a valid structured output is generated; a sketch of this loop is given below.
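Putting the pieces together, the follow-up/resume loop could look roughly like this. The three helpers at the top are stand-ins for the earlier sketches and for repo internals (in particular, `build_prompt()` is a hypothetical prompt builder, not a function from the repo):

```python
# Sketch: ask for whichever field is missing, fold the answer back into the user
# input, and re-run the structured-output step until both keys are filled.

def build_prompt(user_input: str) -> str:               # hypothetical prompt builder
    return user_input

def get_structured_output(prompt: str) -> dict | None:  # stub; see the retry-loop sketch above
    return {"service_intent": "Weather", "location": "Chennai"}

def find_missing_keys(result: dict) -> list[str]:       # stub; see the key-check sketch above
    return [k for k in ("service_intent", "location") if not (result.get(k) or "").strip()]

FOLLOW_UPS = {
    "service_intent": "Which service are you looking for (e.g. Weather or News)? ",
    "location": "Which location should I look that up for? ",
}

def run_turn(user_input: str) -> dict | None:
    while True:
        result = get_structured_output(build_prompt(user_input))
        if result is None:
            return None                        # gave up after N malformed outputs
        missing = find_missing_keys(result)
        if not missing:
            return result                      # both keys filled: done
        for key in missing:
            answer = input(FOLLOW_UPS[key])    # follow-up question to the user
            user_input = f"{user_input} {answer}"  # resume with the enriched input

print(run_turn("what is the weather in Chennai?"))
```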
- 🔥 This iterative approach helps ensure that the bot generates a complete and accurate JSON response, even if the user initially forgets to provide all the necessary information.
- 🔥 We also pass the previous chat exchange along with the user input in order to steer the LLM toward better-quality results (a sketch follows).
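A minimal sketch of folding the previous exchange into OpenChat's multi-turn template (the repo may store and inject history differently):

```python
# Sketch: prepend the last user/assistant exchange so the model sees recent context.
def with_history(prompt: str, last_user: str, last_assistant: str) -> str:
    return (
        f"GPT4 Correct User: {last_user}<|end_of_turn|>"
        f"GPT4 Correct Assistant: {last_assistant}<|end_of_turn|>"
        f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
    )


print(with_history(
    "what about the news?",
    "what is the weather in Chennai?",
    '{"service_intent": "Weather", "location": "Chennai"}',
))
```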
- ⚡ User interactions ⚡
  - Direct query:
  - Conversation with follow-up questioning when service information is missing:
  - Conversation with follow-up questioning when location information is missing:
  - Conversation with follow-up questioning when both location & service information are missing:
- clone repo: `git clone https://github.com/Ribin-Baby/RAG-json-responderV1.git`
- change directory: `cd ./RAG-json-responderV1`
- install requirements: `pip install -r requirements.txt`
- 🔥 Run
  > Note: a GPU with 6 GB of VRAM and CUDA 11.8 installed is needed. It is better to run this on Google Colab with a T4 GPU.
  `python bot.py --s '["News", "Weather"]'`
- We are not limited to ["News", "Weather"] as services; any services we need can be passed dynamically, e.g. ["Game", "Law"] or ["Economics", "Law", "Weather"], as sketched below.
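As an illustration only (how `bot.py` actually parses the flag may differ), the `--s` flag could be read like this:

```python
# Sketch: read a dynamic JSON list of services from the --s command-line flag.
import argparse
import json

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument("--s", type=str, default='["News", "Weather"]',
                        help="JSON list of services the bot should recognise")
args = arg_parser.parse_args()

services = json.loads(args.s)  # e.g. ["Game", "Law"] or ["Economics", "Law", "Weather"]
print(services)
```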
- Open the `bot_notebook.ipynb` file in the Google Colab environment, change the runtime to GPU, and run it cell-by-cell.
- It may be necessary to restart Colab after installing the packages. To do that, run the cell below.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
- Upon running the last cell, you will get an interactive UI to chat with the bot.