Scrape and Tokenize whole website for LLMs

This project extracts the URLs from a website's sitemap, scrapes the text from each URL, cleans it, and tokenizes it for use with a Large Language Model (LLM).
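
As an illustration of the first step, here is a minimal sketch of pulling URLs out of a sitemap with the standard library. It is not the project's own `SitemapParser`; the function name `extract_urls` and the inlined sample sitemap are assumptions made so the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every non-empty <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Hypothetical sitemap content, inlined for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>"""

urls = extract_urls(SAMPLE_SITEMAP)
print(urls)
```

A real sitemap may also be a sitemap index that points at further sitemaps; this sketch does not follow those.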

Features

Retry Mechanism: Retries failed requests to handle transient network issues.
Concurrent Scraping: Efficiently scrape multiple URLs concurrently.
Text Cleaning: Clean and preprocess the extracted text.
Logging: Detailed logging for monitoring the scraping process.

Installation

Clone the repository:

```
git clone https://github.com/faisal-fida/web-text-scraper.git
cd web-text-scraper
```

Install the required dependencies using Poetry:
```
pip install poetry
poetry install
```
Activate the virtual environment:
```
poetry shell
```

Usage

Update the sitemap_url in main.py with the URL of the sitemap you want to scrape.
Run the script:
```
python main.py
```

Project Structure

utils.py: Contains the RetrySession class with a custom retry decorator.
parser.py: Contains the SitemapParser class for parsing sitemap URLs.
scraper.py: Contains the Scraper class for extracting and cleaning text from URLs.
main.py: Main script to orchestrate the scraping process and save the extracted text to files.
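
For a sense of what the cleaning and tokenizing stages involve, the following sketch strips markup from a page and splits the result into tokens. It uses only the standard library and is not the project's `Scraper` class. `TextExtractor`, `clean_text`, and `naive_tokenize` are hypothetical names, and the regex tokenizer is a placeholder for whichever LLM tokenizer is actually used.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_text(html):
    """Extract text from HTML and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def naive_tokenize(text):
    """Placeholder tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Hello,  world!</h1>\n<script>var x = 1;</script>"
    "<p>Scraped   text.</p></body></html>"
)
cleaned = clean_text(sample)
tokens = naive_tokenize(cleaned)
print(cleaned)
print(tokens)
```

Model tokenizers typically split text into subword units, so token counts from this placeholder will differ from what an LLM sees.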
