Team November's solution for the 4th UNIST-KAIST-POSTECH AI & Data-Science Competition

MinjaeKimmm/FinRAG

FinRAG

Overview

FinRAG is Team November's winning solution for the 4th UNIST-KAIST-POSTECH AI & Data-Science Competition. It implements a Retrieval-Augmented Generation (RAG) system specialized for financial documents: for each query, it retrieves the 10 most relevant documents from the corpus. Performance was evaluated using the NDCG (Normalized Discounted Cumulative Gain) metric.
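As a rough illustration of the evaluation metric, the sketch below computes NDCG over a ranked list of retrieved document IDs. It is not taken from the repository: the function names, the cutoff of 10 (chosen to match the top-10 retrieval described above), and the linear-gain form of DCG are all assumptions made for this example, and the competition's exact NDCG variant may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain (assumed variant)."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: document IDs in the order the retriever returned them.
    relevance:  dict mapping document ID -> graded relevance (missing = 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # A query with no relevant documents scores 0 by convention here.
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query: the only relevant document is returned at rank 2.
score = ndcg_at_k(["doc_x", "doc_a"], {"doc_a": 1})
```

A perfect ranking scores 1.0, and the score drops as relevant documents appear lower in the list, which is why the metric rewards both retrieval and ordering.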

First Place

Repository Structure

FinRAG/
├── data/                   # Storage for financial datasets and their query-corpus pairs
├── embed/                  # Directory for ChromaDB document embeddings
├── output/                 # Evaluation results, prediction files, and performance metrics
├── pipeline/
│   ├── common/             # Global constants, config files, and shared helper functions
│   ├── preprocessing/      # Document cleaning, normalization, and table extraction
│   ├── chunking/           # Text segmentation and document splitting logic
│   ├── embedding/          # Embedding model implementations and vector generation scripts
│   ├── retrieval/          # Core retrieval logic, ranking algorithms, and evaluation
│   ├── generate/           # Generation pipeline and LLM integration
│   ├── chat/               # Chatbot implementation
│   └── util/               # Individual utility functions for data handling and processing
└── main.py                 # Entry point of pipeline

Supported Datasets

The system is designed to work with multiple specialized financial datasets:

| Dataset | Description | Focus Area |
| --- | --- | --- |
| FinDER | 10-K reports | Jargon and abbreviation handling |
| FinQABench | 10-K reports | Hallucination detection and factuality |
| FinanceBench | 10-K reports | Real-world financial queries |
| TATQA | Mixed formats | Numerical reasoning with text and tables |
| FinQA | Earnings reports | Multi-step reasoning |
| ConvFinQA | Earnings reports | Conversational query processing |
| MultiHiertt | Annual reports | Complex hierarchical table reasoning |

Installation

  1. Clone the repository:
git clone https://github.com/MinjaeKimmm/FinRAG.git
cd FinRAG
  2. Install dependencies:
pip install -r requirements.txt
  3. Extract the queries and corpora datasets provided by this HuggingFace Dataset into the data/ directory.

Usage

Important: run the preprocessing scripts on their own first so the data is processed before the pipeline runs, and adjust the parameters in the main function for best performance.

Execute the main script to run experiments:

python main.py

Streamlit Chatbot

To run the interactive chatbot interface:

  1. Add the project root to your Python path:
export PYTHONPATH=/path/to/FinRAG:$PYTHONPATH
  2. Run the Streamlit app:
streamlit run pipeline/chat/app.py

Advanced Features

The pipeline supports several advanced techniques for RAG optimization:

  • Data preprocessing and cleaning
  • Automated table extraction
  • Query rewriting and expansion
  • Corpus refinement strategies
  • Multi-step reranking
  • LLM integration for enhanced processing
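To make the idea of multi-step reranking concrete, here is a minimal, self-contained sketch. It does not reproduce the pipeline's actual rerankers: the function names and the two toy scoring functions are hypothetical stand-ins, where a real system would more likely use an embedding similarity for the first stage and a stronger model for the later ones. The point it shows is the shape of the technique: each stage rescores the surviving candidates and keeps a smaller set for the next stage.

```python
def multi_step_rerank(query, candidates, stages):
    """Apply (score_fn, keep) stages in order, narrowing the pool each time."""
    pool = list(candidates)
    for score_fn, keep in stages:
        # Rescore the survivors of the previous stage and keep the best `keep`.
        pool = sorted(pool, key=lambda doc: score_fn(query, doc), reverse=True)[:keep]
    return pool


def term_overlap(query, doc):
    """Cheap first-stage score: number of distinct query terms found in the doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def overlap_density(query, doc):
    """Toy second-stage score: fraction of doc words that are query terms."""
    words = doc.lower().split()
    terms = set(query.lower().split())
    return sum(w in terms for w in words) / len(words) if words else 0.0


# Hypothetical corpus, for illustration only.
docs = [
    "the company reported net revenue and many other unrelated items this year",
    "employee headcount fell",
    "net revenue rose in 2022",
]
top = multi_step_rerank(
    "net revenue",
    docs,
    stages=[(term_overlap, 2), (overlap_density, 1)],
)
```

The first stage drops the document with no matching terms, and the second stage prefers the shorter, more focused passage among the two that remain. Ordering the stages from cheapest to most expensive keeps the costly scorer off most of the corpus.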
