Scraper and Flask Web Server

This project demonstrates the use of a multi-stage Docker build to scrape data from a specified URL using Node.js with Puppeteer and Chromium, and serve the scraped data via a simple Python Flask web server.

Project Structure

/scraper-flask-app
β”‚
β”œβ”€β”€ app.py                  # Flask web server for serving scraped data
β”œβ”€β”€ scraper.js              # Node.js script to scrape the provided URL
β”œβ”€β”€ Dockerfile              # Multi-stage Dockerfile for building the image
β”œβ”€β”€ scraped_data.json       # Output file containing the scraped data (generated by the scraper)
└── README.md               # Project documentation

Requirements

Before you begin, ensure that you have the following installed:

  • Docker (used to build and run the multi-stage image)
  • Git (used to clone this repository)

Project Description

The project consists of two main parts:

  1. Scraper (Node.js with Puppeteer): A Node.js script (scraper.js) that uses Puppeteer to scrape content from a specified URL and stores the output in a JSON file.
  2. Web Server (Flask): A simple Flask web server (app.py) that reads the scraped JSON data and serves it via an HTTP endpoint.

Scraping Flow:

  • The scraper script reads the target URL from the SCRAPE_URL environment variable.
  • It uses Puppeteer to load the page and scrape content (e.g., the title of the page).
  • The scraped data is stored as a JSON file (scraped_data.json); a sketch of what such a script might look like follows this list.
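
The repository's actual script may differ, but a minimal scraper.js consistent with this flow might look like the following (the --no-sandbox flags are a common requirement when Chromium runs as root inside a container):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  // The target URL is injected at build time via the SCRAPE_URL build arg
  const url = process.env.SCRAPE_URL;
  if (!url) {
    console.error('SCRAPE_URL environment variable is not set');
    process.exit(1);
  }

  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Scrape a simple piece of content: the page title
  const data = { url, title: await page.title() };

  // Persist the result for the server stage to copy out
  fs.writeFileSync('scraped_data.json', JSON.stringify(data, null, 2));
  await browser.close();
})();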

Web Server Flow:

  • The Flask web server reads the scraped_data.json file.
  • It serves the data through an endpoint (/scraped_data) that returns the content as JSON when accessed; a matching sketch of app.py follows this list.
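
app.py is not reproduced in this README; a minimal version consistent with this flow could look like:

import json

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/scraped_data')
def scraped_data():
    # Serve the JSON file produced by the scraper stage
    with open('scraped_data.json') as f:
        return jsonify(json.load(f))

if __name__ == '__main__':
    # Bind to 0.0.0.0 so the server is reachable from outside the container
    app.run(host='0.0.0.0', port=5000)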

Docker Setup:

The Dockerfile includes two stages:

  1. Scraper Stage: Uses a Node.js image to install Puppeteer and Chromium, and then runs the scraper script.
  2. Server Stage: Uses a Python image with Flask to serve the scraped content.

Setup and Usage

Step 1: Clone the Repository

Start by cloning this repository to your local machine:

git clone https://github.com/sanjaykadavarath/puppeteer-scraper-flask-app.git
cd puppeteer-scraper-flask-app

Step 2: Create Docker Image

Next, build the Docker image. You will need to specify the URL you want to scrape via a build argument.

docker build --build-arg SCRAPE_URL=http://example.com -t scraper-flask-app .
  • Replace http://example.com with the URL you want to scrape.

Step 3: Run the Docker Container

Once the image is built, run the container on your local machine or a server:

docker run -p 5000:5000 scraper-flask-app
  • This command runs the container and maps port 5000 on the host machine to port 5000 inside the container.

Step 4: Access the Web Server

After the container starts, you can access the Flask web server by opening a browser and navigating to:

http://localhost:5000/scraped_data

If you're running it on a remote server, replace localhost with the server's IP address.
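
You can also check the endpoint from a terminal:

curl http://localhost:5000/scraped_data

Assuming the scraper captured the page title (as in the sketch above), the response for http://example.com would look something like:

{
  "url": "http://example.com",
  "title": "Example Domain"
}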

Step 5: Pushing to Docker Hub

To push the Docker image to Docker Hub, follow these steps:

  1. Tag the image with your Docker Hub username and repository name:

    docker tag scraper-flask-app sanjaykadavarath/scraper-flask-app:latest
  2. Push the image to Docker Hub:

    docker push sanjaykadavarath/scraper-flask-app:latest

Step 6: Running on Another Machine

To run this project on another machine, follow these steps:

  1. Install Docker on the other machine.

  2. Login to Docker Hub on the new machine:

    docker login
  3. Pull the image from Docker Hub:

    docker pull sanjaykadavarath/scraper-flask-app:latest
  4. Run the container:

    docker run -p 5000:5000 sanjaykadavarath/scraper-flask-app:latest
  5. Access the Flask server at http://<machine-ip>:5000/scraped_data.

Dockerfile Breakdown

Scraper Stage

FROM node:16 AS scraper

# Install dependencies
RUN apt-get update && apt-get install -y \
    wget \
    ca-certificates \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# Install Puppeteer and Chromium
RUN npm install puppeteer --save

# Set working directory
WORKDIR /app

# Copy the scraper script
COPY scraper.js .

# Set the environment variable for the URL to scrape
ARG SCRAPE_URL
ENV SCRAPE_URL=$SCRAPE_URL

# Run the scraper
RUN node scraper.js
  • This stage installs the system dependencies, installs Puppeteer (which downloads a compatible Chromium at install time), and runs the scraper.js script. Note that the scrape happens at image build time (RUN node scraper.js), which is why the URL must be supplied as a build argument rather than when the container starts.

Server Stage

FROM python:3.9-slim AS server

# Install Flask
RUN pip install flask

# Set working directory
WORKDIR /app

# Copy the scraped data and Flask app
COPY --from=scraper /app/scraped_data.json .
COPY app.py .

# Expose the port
EXPOSE 5000

# Run Flask app
CMD ["python", "app.py"]
  • This stage copies the scraped_data.json file from the first stage and sets up the Flask web server.

Notes

  • Environment Variable: The scraper script uses the SCRAPE_URL environment variable to specify the URL to scrape. You must pass this as a build argument when building the Docker image.
  • Dynamic Scraping: The scraper can easily be adapted to capture different data by modifying the scraper.js script; a small example follows this list.
  • Flask Web Server: The Flask app serves the scraped data as a JSON response at the /scraped_data endpoint.
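
For example, the data object in the scraper.js sketch above could be extended to capture additional fields; the h1 selector here is purely illustrative:

const data = {
  url,
  title: await page.title(),
  // First <h1> on the page, or null if nothing matches
  // (page.$eval rejects when the selector finds no element, hence the .catch)
  heading: await page.$eval('h1', (el) => el.textContent.trim()).catch(() => null),
};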

License

This project is licensed under the MIT License - see the LICENSE file for details.
