Performance Evaluation of LLMs on Text-to-SQL
- Mistral-7B
- LLaMA 2-7B
- WizardLM-7B
- Flan-T5-11B
- PaLM-540B
All of the open-source models were run locally using the llama-cpp-python module. Their GGUF files were downloaded from Hugging Face repositories. For the PaLM model, the PaLM API was used to send requests and receive the query results.
Once the GGUF files are downloaded, place them in a directory named "models". The test set used here is the dev set from the Spider dataset, located at "spider/dev.json". The "spider" directory must also contain the databases, which are queried for the schema when it is provided along with the user question.
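As a rough sketch of what running one of these GGUF models through llama-cpp-python looks like (the model filename and prompt wording below are illustrative, not necessarily what main.py uses):

```python
def build_prompt(question, schema=None):
    """Assemble a Text-to-SQL prompt; the schema is optional extra context."""
    parts = ["Translate the following question into a SQL query."]
    if schema:
        parts.append("Database schema:\n" + schema)
    parts.append("Question: " + question)
    parts.append("SQL:")
    return "\n".join(parts)

if __name__ == "__main__":
    # Requires `pip install llama-cpp-python` and a GGUF file in models/.
    from llama_cpp import Llama

    llm = Llama(model_path="models/mistral-7b.Q4_K_M.gguf", n_ctx=2048)
    out = llm(build_prompt("How many singers do we have?"), max_tokens=128)
    print(out["choices"][0]["text"])
```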
For the PaLM testing, run the following command:
- python main_Palm.py --test "test_file" --schema 1
The --schema 1 flag queries the database directly for its schema and appends it to the model prompt as additional information. Set it to 0 to omit this information.
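Spider ships its databases as SQLite files, so the schema lookup performed for the prompt can be sketched roughly like this (the function name and details are illustrative):

```python
import sqlite3

def get_schema(db_path):
    """Return the CREATE TABLE statements stored in a SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT sql FROM sqlite_master "
            "WHERE type = 'table' AND sql IS NOT NULL"
        ).fetchall()
    finally:
        con.close()
    return "\n".join(sql for (sql,) in rows)
```

The returned string is what would be appended to the model prompt as additional context.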
For the other models, the Python script uses llama-cpp-python internally to run the inference locally:
- python main.py --model_name "" --test "" --schema ""
Create a folder named "results" to store the results of the inference. Each result file is in CSV format and contains:
- The question queried
- The gold query
- The predicted query
The file name will be "model_name/with_schema" if the schema bit is 1; otherwise it will be "model_name/without_schema".
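Writing such a file with the standard csv module could look like this (the column headers here are assumptions; check them against what main.py actually emits):

```python
import csv

def write_results(path, rows):
    """Write (question, gold_query, predicted_query) tuples to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "gold_query", "predicted_query"])
        writer.writerows(rows)
```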
The Spider dataset provides an evaluator to test the accuracy of the predictions. Run it with the following command:
- python3 evaluators/evaluation.py --gold "Gold_file" --pred "Pred_file" --etype all
The Gold_file must contain only the gold queries, one query per line. The Pred_file must contain only the predicted queries, one query per line.
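Assuming the results CSV holds question/gold/predicted columns as described above, splitting it into the two newline-separated files the evaluator expects can be sketched as:

```python
import csv

def split_for_evaluator(results_csv, gold_path, pred_path):
    """Write gold and predicted queries into separate files, one per line."""
    with open(results_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))[1:]  # drop the header row
    with open(gold_path, "w", encoding="utf-8") as g, \
         open(pred_path, "w", encoding="utf-8") as p:
        for question, gold, pred in rows:
            g.write(gold + "\n")
            p.write(pred + "\n")
```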
MISTRAL-7B
HIGHLIGHTS
- Uses grouped-query attention
  - Speeds up model inference
  - Reduces memory requirements during decoding
- Uses sliding-window attention
  - Handles longer sequences at a reduced computational cost
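Grouped-query attention lets several query heads share one key/value head, which is what shrinks the KV cache during decoding; a toy NumPy illustration of the head sharing (not Mistral's actual implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads attends to the same
    shared key/value head, so only n_kv_heads KV heads need to be cached."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v
```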
CAPABILITIES
- Code generation
- Reasoning
- Mathematics
LIMITATIONS
- Prone to hallucination
- Vulnerable to prompt injection
- Limited knowledge due to its small parameter count
WIZARDLM-7B
HIGHLIGHTS
- A LLaMA model fine-tuned using the Evol-Instruct method
- Trained on fully evolved instructions
- Optimized to follow highly complex instructions
- Outperforms Vicuna and Alpaca
CAPABILITIES
- Instruction following
- Code Generation
LIMITATIONS
- Prone to hallucination
- Limited knowledge due to its small parameter count
LLAMA 2-7B
HIGHLIGHTS
- Llama 2 was pre-trained on 2 trillion tokens of publicly available online data.
- Iteratively refined using reinforcement learning from human feedback (RLHF), including rejection sampling and proximal policy optimization (PPO)
- Among the few open-source models on par with ChatGPT, Anthropic's Claude, and PaLM on general NLP tasks
CAPABILITIES
- Applicable to many different use cases, for example:
  - Code generation
  - Sentence completion
  - Summarization
  - Sentiment analysis
LIMITATIONS
- Prone to hallucination
- Inappropriate content (if not used responsibly)
- Potential for bias
FLAN-T5-11B
HIGHLIGHTS
- Enhanced T5: builds on the T5 model with additional instruction fine-tuning
- Multi-task learning: Trained on diverse tasks, making it versatile for various NLP applications.
- Five sizes: small, base, large, XL, and XXL for different performance and resource requirements.
- Open-sourced: Accessible through Hugging Face and can be fine-tuned for specific tasks.
CAPABILITIES
- Text summarization
- Question answering
- Text generation
- Language Translation
LIMITATIONS
- Potential for bias
- Inappropriate content (if not used responsibly)
- Requires significant computational resources for training and inference
PALM-540B
HIGHLIGHTS
- Massive parameter count: enables advanced reasoning and understanding capabilities
- Multi-task learning: Trained on a diverse set of tasks
- Improved zero-shot and few-shot learning.
- Handles multiple languages with fluency and accuracy.
CAPABILITIES
- Advanced reasoning tasks: Solves complex problems, comprehends riddles
- Question answering
- Natural language generation: creative text formats like poems, scripts, emails
- Code understanding and generation: Analyzes existing code, generates new code snippets, and helps with code completion.
LIMITATIONS
- Potential for bias: Trained on a massive dataset that may contain inherent biases, reflected in its outputs.
- Ethical considerations: Can generate inappropriate content if not used responsibly.
- Demands significant computational resources for training and inference.