FinRAG is team November's winning solution for the 4th UNIST-KAIST-POSTECH AI & Data-Science Competition. The project implements a Retrieval-Augmented Generation (RAG) system specialized for financial documents: for each query, it retrieves the 10 most relevant corpus entries. Performance was evaluated with the NDCG metric.
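As a rough illustration of how a top-10 ranking can be scored with NDCG, here is a minimal sketch. It assumes a cutoff of 10 (matching the top-10 retrieval above), linear gains, and a `qrels` dictionary mapping corpus IDs to relevance grades; the function names and data layout are hypothetical and are not taken from the FinRAG code or the competition's official scorer.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query.

    ranked_ids: corpus IDs in the order the system returned them.
    qrels: hypothetical mapping of corpus ID -> relevance grade (missing IDs count as 0).
    """
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant entries for this query
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    qrels = {"d1": 1.0}
    # The single relevant entry is returned at rank 2 instead of rank 1.
    print(ndcg_at_k(["d2", "d1", "d3"], qrels, k=10))
```

A system-level score would typically be the mean of this value over all queries.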
FinRAG/
├── data/ # Storage for financial datasets and their query-corpus pairs
├── embed/ # Directory for ChromaDB document embeddings
├── output/ # Evaluation results, prediction files, and performance metrics
├── pipeline/
│ ├── common/ # Global constants, config files, and shared helper functions
│ ├── preprocessing/ # Document cleaning, normalization, and table extraction
│ ├── chunking/ # Text segmentation and document splitting logic
│ ├── embedding/ # Embedding model implementations and vector generation scripts
│ ├── retrieval/ # Core retrieval logic, ranking algorithms, and evaluation
│ ├── generate/ # Generation pipeline and LLM integration
│ ├── chat/ # Chatbot implementation
│ └── util/ # Individual utility functions for data handling and processing
└── main.py # Entry point of pipeline
The system is designed to work with multiple specialized financial datasets:
| Dataset | Description | Focus Area |
|---|---|---|
| FinDER | 10-K reports | Jargon and abbreviation handling |
| FinQABench | 10-K reports | Hallucination detection and factuality |
| FinanceBench | 10-K reports | Real-world financial queries |
| TATQA | Mixed formats | Numerical reasoning with text and tables |
| FinQA | Earnings reports | Multi-step reasoning |
| ConvFinQA | Earnings reports | Conversational query processing |
| MultiHiertt | Annual reports | Complex hierarchical table reasoning |
- Clone the repository:
git clone https://github.com/MinjaeKimmm/FinRAG.git
cd FinRAG
- Install dependencies:
pip install -r requirements.txt
- Extract the query and corpus datasets provided by this HuggingFace Dataset into the data/ directory.
Important: Run the preprocessing scripts separately, before the main pipeline, so that the data is already processed. Then adjust the parameters in the main function for best performance.
Execute the main script to run experiments:
python main.py
To run the interactive chatbot interface:
- Add the project root to your Python path:
export PYTHONPATH=/path/to/FinRAG:$PYTHONPATH
- Run the Streamlit app:
streamlit run pipeline/chat/app.py
The pipeline supports several advanced techniques for RAG optimization:
- Data preprocessing and cleaning
- Automated table extraction
- Query rewriting and expansion
- Corpus refinement strategies
- Multi-step reranking
- LLM integration for enhanced processing
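
To make two of the techniques above more concrete, the following is a minimal, standard-library sketch of query expansion followed by two-stage reranking. Everything in it is an assumption for illustration: the abbreviation map, the function names, and both scoring functions (token overlap, then IDF-weighted term frequency) are hypothetical stand-ins, not the embedding models or rerankers that FinRAG actually uses.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical abbreviation map used for query expansion (not from the FinRAG code).
ABBREVIATIONS = {"eps": "earnings per share", "capex": "capital expenditure"}


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query: str) -> str:
    """Append the long form of every known abbreviation found in the query."""
    extra = [ABBREVIATIONS[t] for t in tokenize(query) if t in ABBREVIATIONS]
    return " ".join([query] + extra)


def idf_weights(docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every token in the corpus."""
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    return {t: math.log(1 + len(docs) / c) for t, c in doc_freq.items()}


def retrieve(query: str, corpus: Dict[str, str],
             first_k: int = 50, top_k: int = 10) -> List[Tuple[str, float]]:
    """Expand the query, select candidates cheaply, then rerank the candidates."""
    q_tokens = set(tokenize(expand_query(query)))
    docs = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

    # Stage 1: recall-oriented score = number of distinct tokens shared with the query.
    candidates = sorted(
        docs, key=lambda d: len(q_tokens & set(docs[d])), reverse=True
    )[:first_k]

    # Stage 2: rerank only the candidates with an IDF-weighted term-frequency score.
    idf = idf_weights(docs)

    def rerank_score(doc_id: str) -> float:
        tf = Counter(docs[doc_id])
        return sum(idf.get(t, 0.0) * tf[t] for t in q_tokens)

    reranked = sorted(
        ((d, rerank_score(d)) for d in candidates), key=lambda x: x[1], reverse=True
    )
    return reranked[:top_k]


if __name__ == "__main__":
    toy_corpus = {
        "a": "Earnings per share rose to 2.10 in fiscal 2023.",
        "b": "The company opened new offices in Texas.",
        "c": "Capital expenditure was flat; share buybacks continued.",
    }
    for doc_id, score in retrieve("What was EPS in 2023?", toy_corpus):
        print(doc_id, round(score, 3))
```

In a real pipeline the first stage would more plausibly be a vector search over the stored embeddings and the second stage a stronger reranking model; the structure of narrowing to `first_k` candidates and then reordering them down to `top_k` is the part this sketch is meant to show.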