This project demonstrates the use of a multi-stage Docker build to scrape data from a specified URL using Node.js with Puppeteer and Chromium, and serve the scraped data via a simple Python Flask web server.
/scraper-flask-app
│
├── app.py              # Flask web server for serving scraped data
├── scraper.js          # Node.js script to scrape the provided URL
├── Dockerfile          # Multi-stage Dockerfile for building the image
├── scraped_data.json   # Output file containing the scraped data (generated by the scraper)
└── README.md           # Project documentation
Before you begin, ensure that you have the following installed:
- Docker: Installation Guide
- A Docker Hub account: Sign up here
The project consists of two main parts:
- Scraper (Node.js with Puppeteer): A Node.js script (`scraper.js`) that uses Puppeteer to scrape content from a specified URL and stores the output in a JSON file.
- Web Server (Flask): A simple Flask web server (`app.py`) that reads the scraped JSON data and serves it via an HTTP endpoint.
- The scraper script will accept a URL as an environment variable.
- It will use Puppeteer to load the page and scrape content (e.g., the title of the page).
- The scraped data will be stored as a JSON file (`scraped_data.json`).
- The Flask web server will read the `scraped_data.json` file.
- It will serve the data through an endpoint (`/scraped_data`) that returns the content as JSON when accessed.
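For concreteness, here is a minimal sketch of what `app.py` could look like. The route and file name follow this README; the JSON structure and error handling are assumptions, not necessarily the author's exact implementation:

```python
# app.py -- minimal sketch (assumed implementation; the real script may differ).
# Reads scraped_data.json (produced by scraper.js in the build stage) and
# serves it as JSON at /scraped_data.
import json

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/scraped_data")
def scraped_data():
    # Load the file on each request so the handler stays stateless
    with open("scraped_data.json") as f:
        return jsonify(json.load(f))


# The container's CMD ["python", "app.py"] would start the server with:
#   app.run(host="0.0.0.0", port=5000)
# (binding to 0.0.0.0 makes the -p 5000:5000 port mapping reachable from the host)
```

Note that the server must bind to `0.0.0.0`, not the default `127.0.0.1`, or the published port will not be reachable from outside the container.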
The Dockerfile includes two stages:
- Scraper Stage: Uses a Node.js image to install Puppeteer and Chromium, and then runs the scraper script.
- Server Stage: Uses a Python image with Flask to serve the scraped content.
Start by cloning this repository to your local machine:
git clone https://github.com/sanjaykadavarath/puppeteer-scraper-flask-app.git
cd puppeteer-scraper-flask-app
The first step in setting up the project is to build the Docker image. You will need to specify the URL you want to scrape via a build argument.
docker build --build-arg SCRAPE_URL=http://example.com -t scraper-flask-app .
- Replace `http://example.com` with the URL you want to scrape.
Once the image is built, run the container on your local machine or a server:
docker run -p 5000:5000 scraper-flask-app
- This command runs the container and maps port `5000` on the host machine to port `5000` inside the container.
After the container starts, you can access the Flask web server by opening a browser and navigating to:
http://localhost:5000/scraped_data
If you're running it on a remote server, replace `localhost` with the server's IP address.
To push the Docker image to Docker Hub, follow these steps:
- Tag the image with your Docker Hub username and repository name:
docker tag scraper-flask-app sanjaykadavarath/scraper-flask-app:latest
- Push the image to Docker Hub (log in first with `docker login` if you haven't):
docker push sanjaykadavarath/scraper-flask-app:latest
To run this project on another machine, follow these steps:
- Install Docker on the other machine.
- Log in to Docker Hub on the new machine:
docker login
- Pull the image from Docker Hub:
docker pull sanjaykadavarath/scraper-flask-app:latest
- Run the container:
docker run -p 5000:5000 sanjaykadavarath/scraper-flask-app:latest
- Access the Flask server at http://<machine-ip>:5000/scraped_data.
FROM node:16 AS scraper
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y wget ca-certificates --no-install-recommends && rm -rf /var/lib/apt/lists/*
# Install Puppeteer (which downloads a bundled Chromium)
RUN npm install puppeteer
# Copy the scraper script
COPY scraper.js .
# Pass the URL to scrape as a build argument and expose it as an environment variable
ARG SCRAPE_URL
ENV SCRAPE_URL=$SCRAPE_URL
# Run the scraper at build time to produce scraped_data.json
RUN node scraper.js
- This stage installs the necessary dependencies, installs Puppeteer, and runs the `scraper.js` script at build time.
FROM python:3.9-slim AS server
# Install Flask
RUN pip install flask
# Set working directory
WORKDIR /app
# Copy the scraped data and Flask app
COPY --from=scraper /app/scraped_data.json .
COPY app.py .
# Expose the port
EXPOSE 5000
# Run Flask app
CMD ["python", "app.py"]
- This stage copies the `scraped_data.json` file from the first stage and sets up the Flask web server.
- Environment Variable: The scraper script uses the `SCRAPE_URL` environment variable to specify the URL to scrape. You must pass this as a build argument when building the Docker image.
- Dynamic Scraping: The scraper can be easily adapted to scrape different data by modifying the `scraper.js` script.
- Flask Web Server: The Flask app serves the scraped data as a JSON response at the `/scraped_data` endpoint.
This project is licensed under the MIT License - see the LICENSE file for details.