FinRAG

Overview

FinRAG is Team November's winning solution for the 4th UNIST-KAIST-POSTECH AI & Data-Science Competition. The project implements a Retrieval-Augmented Generation (RAG) system specialized for financial documents: for each query, it retrieves the 10 most relevant corpus entries. Retrieval performance was evaluated with NDCG (Normalized Discounted Cumulative Gain).
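For readers unfamiliar with the metric, here is a minimal sketch of NDCG at a cutoff of 10 for a single query. It is not the competition's official scorer: the function names, the corpus IDs, and the binary relevance labels are all illustrative assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: corpus IDs in predicted order (best first).
    relevance:  dict mapping corpus ID -> relevance grade; missing IDs count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels: two relevant corpus entries for this query.
relevance = {"c1": 1, "c7": 1}

perfect = ndcg_at_k(["c1", "c7", "c3"], relevance)  # relevant entries ranked first
swapped = ndcg_at_k(["c3", "c1", "c7"], relevance)  # an irrelevant entry ranked first
```

A ranking that places every relevant entry at the top scores 1.0; pushing relevant entries down the list lowers the score, which is why ordering within the top 10 matters and not only membership.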

Repository Structure

FinRAG/
├── data/                   # Storage for financial datasets and their query-corpus pairs
├── embed/                  # Directory for ChromaDB document embeddings
├── output/                 # Evaluation results, prediction files, and performance metrics
├── pipeline/
│   ├── common/             # Global constants, config files, and shared helper functions
│   ├── preprocessing/      # Document cleaning, normalization, and table extraction
│   ├── chunking/           # Text segmentation and document splitting logic
│   ├── embedding/          # Embedding model implementations and vector generation scripts
│   ├── retrieval/          # Core retrieval logic, ranking algorithms, and evaluation
│   ├── generate/           # Generation pipeline and LLM integration
│   ├── chat/               # Chatbot implementation
│   └── util/               # Individual utility functions for data handling and processing
└── main.py                 # Entry point of pipeline
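
As an illustration of what a text segmentation step like the one in pipeline/chunking/ might do, here is a minimal sketch of fixed-size chunking with overlap. The function name, the word-based splitting, and the default sizes are assumptions made for illustration; the repository's actual chunking logic may differ.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example: 10 "words", windows of 4 with an overlap of 2.
sample = " ".join(str(i) for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=2)
```

Larger overlaps reduce the chance of splitting a relevant passage across chunks but increase the number of chunks to embed and store.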

Supported Datasets

The system is designed to work with multiple specialized financial datasets:

Dataset        Description        Focus Area
FinDER         10-K reports       Jargon and abbreviation handling
FinQABench     10-K reports       Hallucination detection and factuality
FinanceBench   10-K reports       Real-world financial queries
TATQA          Mixed formats      Numerical reasoning with text and tables
FinQA          Earnings reports   Multi-step reasoning
ConvFinQA      Earnings reports   Conversational query processing
MultiHiertt    Annual reports     Complex hierarchical table reasoning

Installation

Clone the repository:

git clone https://github.com/MinjaeKimmm/FinRAG.git
cd FinRAG

Install dependencies:

pip install -r requirements.txt

Extract the queries and corpora datasets provided by this HuggingFace Dataset into the data/ directory.

Usage

Important: Run the preprocessing scripts separately before running the pipeline so that the data is processed in advance, and adjust the parameters in the main function for best performance.

Execute the main script to run experiments:

python main.py

Streamlit Chatbot

To run the interactive chatbot interface:

Add the project root to your Python path:

export PYTHONPATH=/path/to/FinRAG:$PYTHONPATH

Run the Streamlit app:

streamlit run pipeline/chat/app.py

Advanced Features

The pipeline supports several advanced techniques for RAG optimization:

Data preprocessing and cleaning
Automated table extraction
Query rewriting and expansion
Corpus refinement strategies
Multi-step reranking
LLM integration for enhanced processing
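
To make the multi-step reranking idea concrete, here is a minimal two-stage sketch: a cheap first pass shortlists candidates, and a costlier second pass reorders only the shortlist. Both scoring functions are toy stand-ins chosen so the example runs without external dependencies; they are assumptions, not the models or rerankers this repository uses, and the corpus and query are made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def overlap_score(query, doc):
    """Stage 1 (cheap): number of distinct query tokens that appear in the doc."""
    doc_tokens = set(tokenize(doc))
    return sum(1 for tok in set(tokenize(query)) if tok in doc_tokens)


def cosine_score(query, doc):
    """Stage 2 (costlier): cosine similarity of term-frequency vectors.

    Stand-in for a stronger reranker such as a cross-encoder or an LLM.
    """
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query, corpus, candidates=50, top_k=10):
    """corpus: dict mapping corpus ID -> text. Returns the top_k corpus IDs."""
    shortlist = sorted(
        corpus, key=lambda cid: overlap_score(query, corpus[cid]), reverse=True
    )[:candidates]
    reranked = sorted(
        shortlist, key=lambda cid: cosine_score(query, corpus[cid]), reverse=True
    )
    return reranked[:top_k]


# Hypothetical three-entry corpus.
corpus = {
    "a": "net revenue increased due to higher sales volume",
    "b": "the company reported revenue",
    "c": "weather was mild this quarter",
}
top = retrieve_then_rerank("company revenue", corpus, candidates=2, top_k=1)
```

The design choice is a cost trade-off: the expensive scorer runs on only `candidates` entries per query instead of the whole corpus, so the shortlist size bounds both the latency and the best achievable recall.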
