
autoscrape

Scrape a list of URLs with 10 different technologies, automatically, and parse the results with your own custom plugin.

Join us on Discord!

Disclaimer

About this project

This project was made using the infamous method of vibe coding, using Claude 3.7. I understand most of it, but the JavaScript using Ulixee Hero is not something I'm comfortable with. I highly encourage anyone who wants to modify it to do so. If anyone wants to fork it and update it regularly, be my guest, and I'll reference you here.

About testing environment

All testing was performed on Windows 11 with Python 3.12.4 and the latest versions of supported browsers. Performance and compatibility with other operating systems cannot be guaranteed. Users on Linux or macOS may need to modify certain components to achieve functionality.

About liability

This software is provided for personal use in a protected environment only. I cannot and will not be held responsible for any misuse, illegal use, or any damages that may occur from using this software. Users are solely responsible for ensuring they comply with all applicable laws, terms of service, and policies when using this tool. By downloading or using this software, you acknowledge that you assume all risks associated with its use.

Software View

(Screenshot of the AutoScrape interface)

Example

Comparison: Headless vs Not Headless Mode

  • Hero + stealth, Not Headless: demohead.mp4
  • Standard Selenium, Headless: demoheadless.mp4

Setup

Run the script setup.bat to:

  1. Install Python 3 if not installed
  2. Install all the Python dependencies
  3. Install Node.js / npm if not installed
  4. Install the npm dependencies

Usage

To launch AutoScrape, you may either run autoscrape.bat or go into the Backend folder and run python autoscrape.py.

Simple plug and play

  1. Enter a URL in the URL list textbox
  2. Choose your technology. Selenium Standard is fast but not very discreet; Ulixee Hero Stealth is very slow but very hard to detect. Everything has upsides and downsides.
  3. Choose whether to run it headless (in the background) or not (a browser window will open). Headless mode is easier for anti-bot technologies to detect.
  4. Click Run
  5. If the page was accessed and no Cloudflare page was detected, the HTML is saved in Backend/scraped_html/

Advanced Usage

Input

A list of URLs can be used instead of a single URL, either by pasting it into the textbox or by loading a .txt file with the Load URLs button. All URLs are consumed by the execution: each time one is scraped, it is removed from the list.
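
For reference, a loadable URL list is plain text. A minimal hypothetical urls.txt (assuming one URL per line, matching how the textbox is filled):

https://example.com/products/1
https://example.com/products/2
https://example.com/products/3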

Scraping technologies

Selenium is a browser automation framework built around the WebDriver protocol.

  • Selenium Standard is the default. It is very fast but not very stealthy; it is easily detectable, but works for basic usage or unprotected websites.
  • Selenium Stealth is like Selenium Standard but uses the selenium_stealth module, which makes it a bit harder to detect. selenium_stealth has not been updated in four years, though.
  • Selenium Undetected uses a different chromedriver, undetected-chromedriver, made for stealth. It has not been updated in a year.
  • Selenium Base uses a different Selenium distribution, SeleniumBase, built for scraping. It is much better than plain Selenium. AutoScrape does not use SeleniumBase to its fullest; I might update this in the future. (A minimal setup sketch of these variants follows this list.)
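
As illustration only, not AutoScrape's exact code, here is a minimal sketch of how the stealthier Selenium variants are typically initialized:

from selenium import webdriver
from selenium_stealth import stealth

# Selenium Standard: plain chromedriver, fast but easy to fingerprint.
driver = webdriver.Chrome()

# Selenium Stealth: the same driver, patched by the selenium_stealth module
# to mask common automation fingerprints (navigator.webdriver, WebGL vendor, ...).
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://example.com")
html = driver.page_source
driver.quit()

# Selenium Undetected swaps the driver itself for a patched one instead:
#   import undetected_chromedriver as uc
#   driver = uc.Chrome()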

Ulixee Hero is a browser built for scraping. It fills the same role here as Selenium, but is far less detectable.

  • Hero Standard is the default. It runs with normal settings.

  • Hero Puppeteer runs Hero alongside Puppeteer, an API made to control Chrome and Firefox that is good for scraping.

  • Hero Extra runs Hero alongside puppeteer-extra, which enables plugin usage. This option automatically uses the puppeteer-extra-plugin-stealth plugin for basic undetectability.

  • Hero Stealth is an enhanced version that uses multiple undetection plugins, including:

    • puppeteer-extra-plugin-stealth
    • puppeteer-extra-plugin-anonymize-ua
    • puppeteer-extra-plugin-block-resources
    • puppeteer-extra-plugin-user-preferences
    • puppeteer-extra-plugin-user-data-dir
    • puppeteer-extra-plugin-font-size
    • puppeteer-extra-plugin-click-and-wait
    • puppeteer-extra-plugin-proxy (if configured)
    • puppeteer-extra-plugin-random-user-agent

Playwright is a framework made by Microsoft for web testing and automation.

  • Playwright Standard is the basic experience.
  • Playwright Puppeteer+Stealth is similar to Hero Extra, but uses playwright-extra instead of puppeteer-extra.
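
For orientation, a minimal sketch of a plain Playwright fetch using its Python API (AutoScrape's own wiring may differ, for example by driving Playwright from Node.js):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # see the Headless section below
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()  # raw HTML, analogous to what gets saved in scraped_html/
    browser.close()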

Human Behavior

Human Behavior is a set of tweaks that adds scrolling, clicking, and similar actions to appear more human, with a low to high setting. I have not tested this much; I advise against using it, and it is useless in headless mode.

Headless

Headless mode runs the browser without showing a window.

  • Headless (True): Way faster and lets you use your computer while scraping happens in the background. Easier for websites to detect as a bot, though.

  • Not Headless (False): Browser window opens and takes over your screen, making it unusable while scraping. Slower but way harder to detect. Use this for heavily protected websites.
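
As a point of reference, and not AutoScrape's exact code, toggling headless mode in plain Selenium is a single Chrome option:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # omit this line to open a visible browser window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()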

Using Plugins

By default, AutoScrape only saves the raw HTML of scraped pages to the Backend/scraped_html/ directory. To extract structured data:

  1. Select a plugin from the dropdown menu in the interface
  2. When you run the scraper, it will process the HTML with your selected plugin
  3. Extracted data is saved as CSV files in Backend/scraped_data/

This lets you automatically extract specific information like prices, product details, or other structured data from the scraped websites.
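
The CSV columns depend on the fields your plugin returns. A minimal sketch for inspecting an output file (the filename below is hypothetical):

import csv

# Hypothetical filename; actual names depend on the plugin and the run.
with open("Backend/scraped_data/output.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)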

Creating Custom Plugins

AutoScrape supports custom parser plugins that can extract specific data from the scraped HTML. These plugins process websites you scrape and output structured data.

Plugin Structure

Plugins are Python classes with a specific interface. Each plugin must:

  1. Be placed in the Backend/plugins directory
  2. Import the necessary modules (ScrapedField and DataType from templated_plugin)
  3. Implement all required interface methods

Basic Plugin Template

from dataclasses import dataclass
from typing import Any, List, Optional, Type, Union
from bs4 import BeautifulSoup
from templated_plugin import ScrapedField, DataType

class MyCustomPlugin:
    """Plugin that extracts specific data from a website."""
    
    def get_name(self) -> str:
        """Return the name of the plugin."""
        return "My Custom Plugin"
    
    def get_description(self) -> str:
        """Return a description of what the plugin extracts."""
        return "Extracts important data from my favorite website"
    
    def get_version(self) -> str:
        """Return the version of the plugin."""
        return "1.0.0"
    
    def get_available_fields(self) -> List[ScrapedField]:
        """
        Returns all possible fields this plugin can extract, with default values.
        """
        return [
            ScrapedField(
                name="title",
                value="Example Title",
                field_type=DataType.STRING,
                description="The title of the page",
                accumulate=True
            ),
            ScrapedField(
                name="price",
                value="$19.99",
                field_type=DataType.STRING,
                description="The price of the item",
                accumulate=True
            )
        ]

    def parse(self, html: str) -> List[ScrapedField]:
        """
        Parse HTML content and extract data.
        """
        soup = BeautifulSoup(html, 'html.parser')
        results = []
        
        # Extract title
        title_element = soup.select_one('h1.product-title')
        if title_element:
            results.append(ScrapedField(
                name="title",
                value=title_element.get_text().strip(),
                field_type=DataType.STRING,
                description="The title of the page",
                accumulate=True
            ))
        
        # Extract price
        price_element = soup.select_one('span.price')
        if price_element:
            results.append(ScrapedField(
                name="price",
                value=price_element.get_text().strip(),
                field_type=DataType.STRING,
                description="The price of the item",
                accumulate=True
            ))
        
        return results

The ScrapedField Class

The ScrapedField class defines the data fields your plugin extracts:

  • name: Identifier for the field
  • value: The extracted value
  • field_type: Data type (STRING, INTEGER, FLOAT, BOOLEAN, etc.)
  • description: Human-readable description of the field
  • accumulate: Whether to collect multiple values for this field across scrapes

Data Types

Available data types from the DataType enum:

  • DataType.STRING: For text values
  • DataType.INTEGER: For whole numbers
  • DataType.FLOAT: For decimal numbers
  • DataType.BOOLEAN: For true/false values
  • DataType.JSON: For structured data
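
Based on the fields and enum members listed above, the definitions in templated_plugin are roughly equivalent to the following sketch (the actual module may differ):

from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class DataType(Enum):
    # Members named in the Data Types section; the concrete values are an assumption.
    STRING = auto()
    INTEGER = auto()
    FLOAT = auto()
    BOOLEAN = auto()
    JSON = auto()

@dataclass
class ScrapedField:
    name: str                 # identifier for the field
    value: Any                # the extracted value
    field_type: DataType      # one of the DataType members above
    description: str          # human-readable description of the field
    accumulate: bool = False  # collect multiple values for this field across scrapes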

Advanced Plugin Example

from typing import List
from bs4 import BeautifulSoup
from templated_plugin import ScrapedField, DataType

class CardmarketPricePlugin:
    """Plugin that extracts price information from Cardmarket pages."""
    
    # Global configuration flag to control whether prices are stored as floats or formatted strings
    STORE_PRICES_AS_FLOAT = False  # Set to True to store prices as float values without currency symbols
    
    def get_name(self) -> str:
        """Return the name of the plugin."""
        return "Cardmarket Price Plugin"
    
    def get_description(self) -> str:
        """Return a description of what the plugin extracts."""
        return "Extracts price information from Cardmarket product pages across different games and languages"
    
    def get_version(self) -> str:
        """Return the version of the plugin."""
        return "1.0.0"
    
    def get_available_fields(self) -> List[ScrapedField]:
        """
        Returns all possible fields this plugin can extract, with default values.
        """
        return [
            ScrapedField(
                name="card_name",
                value="Example Card",
                field_type=DataType.STRING,
                description="Name of the card",
                accumulate=True
            ),
            ScrapedField(
                name="card_set",
                value="Example Set",
                field_type=DataType.STRING,
                description="Set/expansion the card belongs to",
                accumulate=True
            ),
            ScrapedField(
                name="available_items",
                value=500,
                field_type=DataType.INTEGER,
                description="Number of available items for sale",
                accumulate=True
            ),
            ScrapedField(
                name="lowest_price",
                value=1.00 if self.STORE_PRICES_AS_FLOAT else "1,00 €",
                field_type=DataType.FLOAT if self.STORE_PRICES_AS_FLOAT else DataType.STRING,
                description="Lowest price available for the card",
                accumulate=True
            ),
            ScrapedField(
                name="card_rarity",
                value="Uncommon",
                field_type=DataType.STRING,
                description="Rarity of the card",
                accumulate=True
            )
        ]
        
    def _clean_price_string(self, price_string: str) -> str:
        """Clean and fix encoding issues in price strings."""
        if not price_string:
            return ""
            
        # Handle common mojibake (UTF-8 read as Latin-1/CP1252) in currency symbols
        cleaned = price_string.replace("â‚¬", "€")
        cleaned = cleaned.replace("Â£", "£")
        cleaned = cleaned.replace("Â$", "$")
        
        # Remove any extra whitespace
        cleaned = cleaned.strip()
        
        return cleaned
    
    def _parse_price_to_float(self, price_string: str) -> float:
        """Parse a price string into a float value, removing currency symbols."""
        if not price_string:
            return 0.0
            
        try:
            # Remove currency symbols and other non-numeric characters
            cleaned = ''.join(c for c in price_string if c.isdigit() or c in ',.').strip()
            
            # Handle European number format (comma as decimal separator)
            if ',' in cleaned and '.' in cleaned:
                # If both are present, assume European format with thousand separators
                cleaned = cleaned.replace('.', '')  # Remove thousand separators
                cleaned = cleaned.replace(',', '.')  # Convert decimal separator
            elif ',' in cleaned:
                # Only comma present, assume it's a decimal separator
                cleaned = cleaned.replace(',', '.')
                
            return float(cleaned)
        except ValueError:
            return 0.0
    
    def parse(self, html: str) -> List[ScrapedField]:
        """Parse HTML content and extract Cardmarket price information."""
        soup = BeautifulSoup(html, 'html.parser')
        results = []
        
        # Extract card name and set
        try:
            title_container = soup.select_one('.page-title-container')
            if title_container:
                h1 = title_container.select_one('h1')
                if h1:
                    # Extract main card name (text before the span)
                    card_name = h1.get_text().strip()
                    set_span = h1.select_one('span')
                    if set_span:
                        card_name = card_name.replace(set_span.get_text(), '').strip()
                        card_set = set_span.get_text().strip()
                        
                        results.append(ScrapedField(
                            name="card_name",
                            value=card_name,
                            field_type=DataType.STRING,
                            description="Name of the card",
                            accumulate=True
                        ))
                        
                        results.append(ScrapedField(
                            name="card_set",
                            value=card_set,
                            field_type=DataType.STRING,
                            description="Set/expansion the card belongs to",
                            accumulate=True
                        ))
        except Exception:
            # Continue even if card name extraction fails
            pass
        
        # Find the info container
        container = soup.select_one('.info-list-container')
        if not container:
            return results
            
        # Process prices, rarity, etc.
        # ... (additional extraction code)
        
        return results

Using Your Plugin

Once you've created your plugin:

  1. Place the Python file in the Backend/plugins directory
  2. Restart AutoScrape
  3. Your plugin will be automatically loaded and available for use
  4. When scraping a website, your plugin will process the HTML and save structured data

The extracted data from plugins is saved in the Backend/scraped_data/ directory in CSV format.
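
Putting it together, the relevant layout looks roughly like this (the plugin and CSV filenames are hypothetical):

Backend/
  plugins/
    my_custom_plugin.py    (your parser plugin)
  scraped_html/            (raw HTML from each scrape)
  scraped_data/
    my_custom_plugin.csv   (structured data extracted by the plugin)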

Why the cat

Isn't she adorable?
