This project extracts the URLs from a website's sitemap, scrapes the text from each URL, cleans it, and prepares it for use with a Large Language Model (LLM) by tokenizing it.
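The cleaning and tokenization steps are not detailed in this README; the snippet below is a minimal sketch only, assuming a regex-based cleanup and the `tiktoken` library, neither of which is confirmed as the project's actual approach.

```python
# Illustrative cleaning and tokenization, not the repository's actual code.
# The regex-based cleanup and the tiktoken encoding are assumptions.
import re

import tiktoken


def clean_text(raw: str) -> str:
    """Collapse whitespace and trim the extracted page text."""
    return re.sub(r"\s+", " ", raw).strip()


def tokenize(text: str) -> list[int]:
    """Encode cleaned text into token IDs for an LLM."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return encoding.encode(text)


tokens = tokenize(clean_text("  Example   page\ttext\n"))
print(len(tokens))
```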
- Retry Mechanism: Automatically retries requests that fail due to transient network issues (see the sketch after this list).
- Concurrent Scraping: Scrapes multiple URLs concurrently for efficiency.
- Text Cleaning: Clean and preprocess the extracted text.
- Logging: Detailed logging for monitoring the scraping process.
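The retry and concurrency features could look roughly like the sketch below. The function names, decorator arguments, and use of `requests` and `concurrent.futures` are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of the retry and concurrency features described above.
# Names and parameters here are assumptions, not the repository's code.
import time
from concurrent.futures import ThreadPoolExecutor
from functools import wraps

import requests


def retry(attempts: int = 3, backoff: float = 1.0):
    """Retry a callable on request failures with a simple linear backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except requests.RequestException:
                    if attempt == attempts:
                        raise
                    time.sleep(backoff * attempt)
        return wrapper
    return decorator


@retry(attempts=3)
def fetch(url: str) -> str:
    """Download a page, raising on HTTP errors so failed requests are retried."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_all(urls: list[str], max_workers: int = 8) -> list[str]:
    """Fetch several URLs concurrently using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))
```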
- Clone the repository:

  ```bash
  git clone https://github.com/faisal-fida/web-text-scraper.git
  cd web-text-scraper
  ```
- Install the required dependencies using Poetry:

  ```bash
  pip install poetry
  poetry install
  ```
- Activate the virtual environment:

  ```bash
  poetry shell
  ```

- Update the `sitemap_url` in `main.py` with the URL of the sitemap you want to scrape (see the example after this list).
- Run the script:

  ```bash
  python main.py
  ```
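The contents of `main.py` are not shown in this README, but the value to change is presumably a plain assignment along these lines; only the name `sitemap_url` comes from the instructions above, and the example URL is a placeholder.

```python
# In main.py: point the pipeline at the sitemap to process.
# Only the name `sitemap_url` comes from this README; the URL is a placeholder.
sitemap_url = "https://example.com/sitemap.xml"
```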
- `utils.py`: Contains the `RetrySession` class with a custom retry decorator.
- `parser.py`: Contains the `SitemapParser` class for parsing sitemap URLs.
- `scraper.py`: Contains the `Scraper` class for extracting and cleaning text from URLs.
- `main.py`: Main script that orchestrates the scraping process and saves the extracted text to files (see the sketch below).
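The class names above come from this README, but their constructors and methods are not documented here. The following is a rough sketch of how `main.py` might wire the pieces together; the method names (`get_urls`, `extract_text`), constructor signatures, and output layout are all assumptions.

```python
# Hypothetical orchestration flow based on the module descriptions above.
# SitemapParser and Scraper are named in this README; their constructor
# signatures, the get_urls/extract_text methods, and the output layout
# are assumptions.
from pathlib import Path

from parser import SitemapParser
from scraper import Scraper

sitemap_url = "https://example.com/sitemap.xml"  # placeholder sitemap
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

sitemap = SitemapParser(sitemap_url)
scraper = Scraper()

for index, url in enumerate(sitemap.get_urls()):
    text = scraper.extract_text(url)  # fetch and clean the page text
    (output_dir / f"page_{index}.txt").write_text(text, encoding="utf-8")
```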