
ai-rag-crawler



An AI RAG pipeline built with the HONC stack and Workers AI. This project uses vector embeddings to enable semantic search, orchestrates web scraping and data processing with Cloudflare Workflows (beta), and generates context-aware responses using AI.


🧐 About

The goal is a system that can effortlessly ingest documentation from any website, understand its content semantically, and provide accurate, context-aware answers to user questions, so that large amounts of documentation can be explored and used without manual effort.

This project provides an automated RAG (Retrieval-Augmented Generation) pipeline built on serverless technologies. It takes the base URL of a documentation website, recursively scrapes the site and all linked pages, generates vector embeddings of the content, stores them in a database, and uses those embeddings to produce accurate, context-aware answers to questions. The pipeline is orchestrated with Cloudflare Workflows (beta), which makes it more resilient and scalable.
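To make the orchestration concrete, here is a minimal sketch of how such a pipeline could be expressed as a Cloudflare Workflow. This is not the project's actual source: the step names, the Params shape, and the crawlPage/embed/storeEmbedding helpers are placeholders; only the WorkflowEntrypoint and step.do API comes from Cloudflare Workflows (beta).

```ts
// Minimal sketch of a scrape-and-embed Workflow (illustrative, not this repo's code).
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";

type Params = { url: string; strict: boolean; type: "browser" | "fetch" };
type Env = unknown; // Worker bindings (database, Workers AI, etc.) would go here

// Placeholder helpers, assumed to exist elsewhere in the project:
declare function crawlPage(url: string, opts: Params): Promise<{ url: string; text: string }[]>;
declare function embed(text: string): Promise<number[]>;
declare function storeEmbedding(url: string, vectors: number[]): Promise<void>;

export class ScrapeWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Each step.do() call is durable and retried independently by Workflows.
    const pages = await step.do("crawl site", () =>
      crawlPage(event.payload.url, event.payload)
    );

    for (const page of pages) {
      await step.do(`embed ${page.url}`, async () => {
        const vectors = await embed(page.text); // e.g. a Workers AI embedding model
        await storeEmbedding(page.url, vectors); // e.g. insert into the database
      });
    }
  }
}
```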

API Flow

(API flow diagram)

How to use the live frontend

Currently, the database has one fully processed site:

https://fiberplane.com/docs/get-started/

To use the frontend to ask questions:

  1. Head over to https://ai-docs-rag.cjjdxhdjd.workers.dev/
  2. Enter the following URL exactly as shown in the Enter URL input field:
    https://fiberplane.com/docs/get-started/

  3. Click the Submit URL button
  4. Enter the question you want to ask in the Ask about the URL.. field and hit Send
  5. You will get a streaming response from Workers AI.

Alternatively, you can access all the APIs through Fiberplane Studio to interact with the application. See Usage.

Demo

https://www.loom.com/share/2aae4caf3dd148cca8e4ef178cfaf851?sid=b19b308e-b514-4610-8866-aae063b164d8

⛏️ Built With HONC🪿

🏁 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

What you need installed to run the project:

  • Node.js (v18 or higher)
  • npm or pnpm
  • Wrangler CLI (Cloudflare's CLI tool)

Installing

Clone the repository:

git clone https://github.com/dead8309/ai-rag-crawler
cd ai-rag-crawler

Install the dependencies:

pnpm install

Create a .dev.vars file and add the following, making sure to populate the database URL:

DATABASE_URL=YOUR_DATABASE_URL_HERE
MAX_NO_OF_PAGES_TO_SCRAPE=20

Set up your database by running the migrations:

pnpm db:setup

Finally, start the development server:

pnpm dev

Start Fiberplane Studio:

pnpm fiberplane

🎈 Usage

To interact with the API locally, follow these steps:

  1. Access Fiberplane Studio: Open your web browser and navigate to http://localhost:8788.

  2. Submit a URL for Scraping:

    • In Fiberplane Studio, make a POST request to /api/scrape/workflow.
    • Include a JSON payload in the request body with the following structure:
    {
      "url": "YOUR_TARGET_URL_HERE",
      "strict": "false",
      "type": "browser"
    }
    • Replace YOUR_TARGET_URL_HERE with the URL of the documentation website you want to process.
    • Example: https://fiberplane.com/docs/get-started/
  3. Monitor the Workflow Status:

    • After submitting the URL, the API will return a JSON response that includes a message, an instanceId, and a details object representing the current status.
    • Copy the returned instanceId.
    • To check the progress of the workflow, send a GET request to /api/scrape/workflow/{instanceId}.
    • Replace {instanceId} with the id you copied in the previous step.
    • The response will provide real-time details on the workflow status (e.g. queued, running, complete, errored, or terminated).
  4. Ask Questions Using the RAG Pipeline:

    • Once the workflow status is complete, make a POST request to /api/sites/ask (or /api/sites/ask/stream for a streaming response).

    • Include a JSON payload in the request body with the following structure:

      {
        "site": "YOUR_TARGET_URL_HERE",
        "question": "YOUR_QUESTION_HERE"
      }
      • Replace "YOUR_TARGET_URL_HERE" with the same URL you used for scraping and "YOUR_QUESTION_HERE" with your specific question. For example, "question": "How do i setup fiberplane application?"
    • The API will return a JSON object with the answer to your question, generated from the scraped documentation. (A scripted version of this whole flow is sketched below.)
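The same flow can also be scripted against the API. The snippet below is only a sketch: the BASE_URL, the polling interval, and the exact shape of the status response are assumptions; the endpoints and payloads are the ones described above.

```ts
// Sketch: drive the scrape -> poll -> ask flow with fetch (Node 18+).
const BASE_URL = "http://localhost:8787"; // adjust to wherever your dev server runs

async function main() {
  // 1. Start the scraping workflow.
  const start = (await fetch(`${BASE_URL}/api/scrape/workflow`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://fiberplane.com/docs/get-started/",
      strict: "false",
      type: "browser",
    }),
  }).then((r) => r.json())) as { instanceId: string };

  // 2. Poll until the workflow reports completion
  //    (the exact field name and status strings are assumptions).
  let status = "";
  while (!status.startsWith("complete")) {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const res = await fetch(`${BASE_URL}/api/scrape/workflow/${start.instanceId}`);
    status = ((await res.json()) as { status?: string }).status ?? "";
    console.log("workflow status:", status);
    if (status === "errored" || status === "terminated") throw new Error(`workflow ${status}`);
  }

  // 3. Ask a question about the scraped site.
  const answer = await fetch(`${BASE_URL}/api/sites/ask`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      site: "https://fiberplane.com/docs/get-started/",
      question: "How do I set up a Fiberplane application?",
    }),
  }).then((r) => r.json());

  console.log(answer);
}

main();
```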

Explanation of Payloads and Endpoints:

  • /api/scrape/workflow (POST): Starts a new Workflow for scraping and processing a documentation website.

    • url: (string, required) The URL of the documentation website.
    • strict: (string, optional, default "false") Set to "true" for strict link following.
    • type: (string, optional, default "browser") Set to "fetch" to use a plain fetch, or "browser" to render client-side JavaScript pages.
  • /api/scrape/workflow/{instanceId} (GET): Fetches the status of a specific workflow instance. Replace {instanceId} with the ID you received when creating the workflow.

  • /api/sites/ask (POST): Asks a question about a specific documentation site and returns a complete response.

    • site: (string, required) The URL of the documentation website that was previously scraped.
    • question: (string, required) The question you want to ask.
  • /api/sites/ask/stream (POST): Asks a question about a specific documentation site and returns a streamed response, so the answer appears piece by piece. (A minimal client sketch for reading the stream follows this list.)

    • site: (string, required) The URL of the documentation website that was previously scraped.
    • question: (string, required) The question you want to ask.
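Because /api/sites/ask/stream returns the answer incrementally, a client has to read the response body as a stream. A minimal sketch, assuming the same local BASE_URL placeholder as above and a plain-text stream (the project's actual framing may differ):

```ts
// Sketch: read the streamed answer chunk by chunk (Node 18+, ES module for top-level await).
const BASE_URL = "http://localhost:8787"; // adjust to wherever your dev server runs

const res = await fetch(`${BASE_URL}/api/sites/ask/stream`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    site: "https://fiberplane.com/docs/get-started/",
    question: "How do I set up a Fiberplane application?",
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  process.stdout.write(decoder.decode(value, { stream: true })); // print each chunk as it arrives
}
```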