[FSTORE-1408] LLM PDFs README (#267)
* Readme for pdf LLMs
Showing 2 changed files with 36 additions and 0 deletions.
@@ -0,0 +1,36 @@
# ⚙️ Index Private PDFs for RAG and Create Fine-Tuning Datasets from Them

This project reads the PDF files in a Google Drive folder that you provide, indexes them as vector embeddings in Hopsworks for retrieval-augmented generation (RAG), and creates an instruction dataset for fine-tuning using a teacher model (GPT).

![Hopsworks Architecture for Private PDFs Indexed for LLMs](../..//images/llm-pdfs-architecture.gif)

## 📖 Feature Pipeline
The Feature Pipeline does the following:

* Downloads any new PDFs from the Google Drive folder.
* Extracts chunks of text from the PDFs and stores them in a Feature Group in Hopsworks.
* Uses GPT to generate an instruction dataset for fine-tuning a foundation LLM and stores it as a Feature Group in Hopsworks.
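
The chunk-extraction step above can be sketched as follows. This is a minimal illustration only, not the code used in this project: `chunk_text`, `to_rows`, the character-based window, and the row fields (`file_name`, `chunk_id`, `text`) are all hypothetical names chosen for the example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def to_rows(file_name: str, text: str, **kwargs) -> list[dict]:
    """Turn one document's text into rows with assumed column names."""
    return [
        {"file_name": file_name, "chunk_id": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(text, **kwargs))
    ]


if __name__ == "__main__":
    for row in to_rows("example.pdf", "lorem ipsum " * 40, chunk_size=120, overlap=20):
        print(row["chunk_id"], len(row["text"]))
```

Overlapping windows are one common choice because a sentence cut at a chunk boundary still appears whole in a neighbouring chunk. A real pipeline might split on sentences or tokens instead.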

## 🏃🏻‍♂️ Training Pipeline
The Training Pipeline does the following:

* Uses the instruction dataset and LoRA to fine-tune an open-source LLM (Mistral-7B-Instruct-v0.2 by default).
* Saves the fine-tuned model to the Hopsworks Model Registry.
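
To give a feel for why LoRA keeps fine-tuning cheap, the sketch below compares the trainable parameters of a full weight matrix with those of its low-rank update. It is an illustration under stated assumptions, not this project's training code: the 4096 x 4096 shape and rank 8 are example values only, and the helper names are hypothetical.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated when a d_out x d_in weight matrix is fine-tuned directly."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters trained by LoRA: B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out = d_in = 4096  # illustrative size, not read from any model config
    rank = 8             # illustrative LoRA rank

    full = full_param_count(d_out, d_in)
    lora = lora_param_count(d_out, d_in, rank)
    print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

The base weights stay frozen and only the two small matrices are trained, so the adapter is a small fraction of the size of the layer it adapts.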

## 🚀 Inference Pipeline
* A chatbot written in Streamlit that answers questions about the PDFs you uploaded, using RAG and an embedded LLM.
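
The retrieval step behind RAG can be sketched as below. This is a simplified stand-in, not the chatbot's code: it searches a small in-memory list with cosine similarity rather than querying Hopsworks, and `retrieve`, `build_prompt`, the toy vectors, and the prompt wording are all assumptions made for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the text of the k chunks whose embeddings are closest to the query."""
    ranked = sorted(indexed, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that grounds the LLM's answer in the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings come from an embedding model.
    index = [("chunk about invoices", [1.0, 0.0]),
             ("chunk about holidays", [0.0, 1.0]),
             ("chunk about billing", [0.9, 0.1])]
    print(build_prompt("How do I pay?", retrieve([1.0, 0.0], index, k=2)))
```

In the application, the query would be embedded with the same model used to index the PDF chunks, and the resulting prompt would be passed to the fine-tuned LLM.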

## 🕵🏻‍♂️ Google Drive Credentials Creation

To create your Google Drive credentials, follow the steps in this guide: [Google Drive API Quickstart with Python](https://developers.google.com/drive/api/quickstart/python). It walks you through setting up your Google Cloud project and downloading the necessary credentials files.

After completing the setup, you will have two files: `credentials.json` and `client_secret.json`. These are the authentication files for your Google Cloud account.

Next, integrate these files into your project:

1. Create a directory named `credentials` at the root of your forked repository.

2. Place both `credentials.json` and `client_secret.json` inside this `credentials` directory.

You are now ready to download your PDFs from Google Drive!
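
Before running the pipeline, it can help to confirm that both files are where the steps above put them. The check below is a small sketch using only the standard library; `missing_credentials` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = ("credentials.json", "client_secret.json")


def missing_credentials(repo_root: str = ".") -> list[str]:
    """Return the required credential files not found in <repo_root>/credentials."""
    cred_dir = Path(repo_root) / "credentials"
    return [name for name in REQUIRED_FILES if not (cred_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credential files:", ", ".join(missing))
    else:
        print("All credential files found.")
```

Run it from the root of your forked repository, or pass the path to the repository root as `repo_root`.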