A Streamlit-based chatbot that uses RAG (Retrieval-Augmented Generation) to answer questions about uploaded PDF documents.
- PDF Upload & Processing: Upload PDF files and automatically process them for RAG
- Vector Storage: Uses FAISS for efficient document retrieval
- RAG Integration: Leverages OpenAI's GPT models for intelligent question answering
- PDF Visualization: View PDF pages as images alongside responses
- Context Display: See the source documents used to generate answers
chatbot/
├── src/
│ ├── config/
│ │ ├── __init__.py
│ │ └── settings.py # Configuration and constants
│ ├── core/
│ │ ├── __init__.py
│ │ ├── document_processor.py # PDF processing and vector storage
│ │ └── rag_chain.py # RAG chain implementation
│ ├── utils/
│ │ ├── __init__.py
│ │ └── pdf_converter.py # PDF to image conversion
│ ├── ui/
│ │ ├── __init__.py
│ │ └── streamlit_app.py # Streamlit UI implementation
│ └── __init__.py
├── data/
│ ├── temp_pdfs/ # Temporary PDF storage
│ ├── vector_store/ # FAISS vector database
│ └── pdf_images/ # Converted PDF images
├── tests/ # Test files
├── docs/ # Documentation
├── main.py # Application entry point
├── requirements.txt # Python dependencies
└── README.md # This file
- Clone the repository:
git clone https://github.com/PythonToGo/rag_chatbot.git
cd rag_chatbot
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
Create a .env file in the root directory with your OpenAI API key:
OPENAI_API_KEY=your_openai_api_key_here
- Run the application:
streamlit run main.py
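If the app fails at startup, a missing or misnamed API key is a common cause. The sketch below is a standalone check, not part of the repository: it parses simple `KEY=VALUE` lines with the standard library only. The helper names `load_env_file` and `has_openai_key` are hypothetical, and the application itself may load `.env` differently (for example through a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file (illustrative helper)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path: str = ".env") -> bool:
    """Return True if OPENAI_API_KEY is set in the environment or in the file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    return bool(key)


if __name__ == "__main__":
    print("OPENAI_API_KEY found" if has_openai_key() else "OPENAI_API_KEY missing")
```

Run it from the project root so that the relative `.env` path resolves to the file created in the previous step.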
- Open your browser and navigate to the provided URL (usually http://localhost:8501)
- Upload a PDF file using the file uploader
- Ask questions about the uploaded PDF in the text input field
- View the generated responses and related document context
You can modify the application settings in src/config/settings.py:
- Model Settings: Change the embedding and chat models
- Document Processing: Adjust chunk size and overlap
- Retrieval Settings: Modify the number of retrieved documents
- Image Conversion: Change DPI settings for PDF to image conversion
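To make the effect of these settings concrete, here is a rough sketch of how chunk size and overlap interact. The `Settings` fields, their defaults, and `split_text` are all hypothetical names chosen for illustration. They are not copied from `src/config/settings.py`, and the real project may split text by other rules (for example on paragraph or sentence boundaries).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical names and defaults, not the values used in the repository.
    chunk_size: int = 1000     # characters per chunk
    chunk_overlap: int = 200   # characters shared by consecutive chunks
    retrieval_k: int = 4       # number of chunks retrieved per question
    image_dpi: int = 150       # resolution for PDF-to-image conversion


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    settings = Settings()
    sample = "x" * 2500
    pieces = split_text(sample, settings.chunk_size, settings.chunk_overlap)
    print(len(pieces), [len(p) for p in pieces])
```

A larger overlap keeps more shared context between neighboring chunks but produces more chunks to embed and store. A larger chunk size gives each retrieved passage more context but makes retrieval less precise.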
- Streamlit: Web application framework
- LangChain: RAG framework and document processing
- OpenAI: Language models and embeddings
- FAISS: Vector similarity search
- PyMuPDF: PDF processing and image conversion
# Add test files to the tests/ directory
python -m pytest tests/
The application follows a modular architecture:
- DocumentProcessor: Handles PDF loading, chunking, and vector storage
- RAGChain: Manages the RAG pipeline and question processing
- PDFConverter: Converts PDF pages to images for display
- StreamlitApp: Main UI application with clean separation of concerns
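The flow between the retrieval and answering components can be sketched as a toy pipeline: rank chunks by similarity to the question, then place the best ones into a prompt. This is an illustration of the general RAG pattern only. It substitutes word-count vectors and a linear scan for the OpenAI embeddings and FAISS index that the project uses, and the function names `embed`, `retrieve`, and `build_prompt` are assumptions, not the repository's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (linear scan)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the chat model."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    chunks = [
        "FAISS stores vectors for similarity search",
        "Streamlit renders the web interface",
        "PyMuPDF converts pages to images",
    ]
    question = "how are vectors stored for search"
    print(build_prompt(question, retrieve(question, chunks, k=1)))
```

Returning the retrieved chunks separately from the prompt is what allows a UI to show the source context next to the generated answer.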
- Create new modules in the appropriate directory (core/, utils/, ui/)
- Update configuration in src/config/settings.py if needed
- Add tests in the tests/ directory
- Update this README with new features
MIT License, Copyright PythonToGo 2025.