Scraper and Flask Web Server

This project demonstrates the use of a multi-stage Docker build to scrape data from a specified URL using Node.js with Puppeteer and Chromium, and serve the scraped data via a simple Python Flask web server.

Project Structure

/scraper-flask-app
β”‚
β”œβ”€β”€ app.py                  # Flask web server for serving scraped data
β”œβ”€β”€ scraper.js              # Node.js script to scrape the provided URL
β”œβ”€β”€ Dockerfile              # Multi-stage Dockerfile for building the image
β”œβ”€β”€ scraped_data.json       # Output file containing the scraped data (generated by the scraper)
└── README.md               # Project documentation

Requirements

Before you begin, ensure that you have the following installed:

  • Docker (used to build and run the multi-stage image)
  • Git (used to clone this repository)

Project Description

The project consists of two main parts:

  1. Scraper (Node.js with Puppeteer): A Node.js script (scraper.js) that uses Puppeteer to scrape content from a specified URL and stores the output in a JSON file.
  2. Web Server (Flask): A simple Flask web server (app.py) that reads the scraped JSON data and serves it via an HTTP endpoint.

Scraping Flow:

  • The scraper script reads the target URL from the SCRAPE_URL environment variable.
  • It uses Puppeteer to load the page and scrape content (e.g., the title of the page).
  • The scraped data is stored as a JSON file (scraped_data.json); a sketch of what such a script might look like follows this list.
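
The repository's actual script may differ, but a minimal scraper.js consistent with this flow might look like the following (the --no-sandbox flags are a common requirement when Chromium runs as root inside a container):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  // The target URL is injected at build time via the SCRAPE_URL build arg
  const url = process.env.SCRAPE_URL;
  if (!url) {
    console.error('SCRAPE_URL environment variable is not set');
    process.exit(1);
  }

  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Scrape a simple piece of content: the page title
  const data = { url, title: await page.title() };

  // Persist the result for the server stage to copy out
  fs.writeFileSync('scraped_data.json', JSON.stringify(data, null, 2));
  await browser.close();
})();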

Web Server Flow:

  • The Flask web server reads the scraped_data.json file.
  • It serves the data through an endpoint (/scraped_data) that returns the content as JSON when accessed; a matching sketch of app.py follows this list.
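
app.py is not reproduced in this README; a minimal version consistent with this flow could look like:

import json

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/scraped_data')
def scraped_data():
    # Serve the JSON file produced by the scraper stage
    with open('scraped_data.json') as f:
        return jsonify(json.load(f))

if __name__ == '__main__':
    # Bind to 0.0.0.0 so the server is reachable from outside the container
    app.run(host='0.0.0.0', port=5000)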

Docker Setup:

The Dockerfile includes two stages:

  1. Scraper Stage: Uses a Node.js image to install Puppeteer and Chromium, and then runs the scraper script.
  2. Server Stage: Uses a Python image with Flask to serve the scraped content.

Setup and Usage

Step 1: Clone the Repository

Start by cloning this repository to your local machine:

git clone https://github.com/sanjaykadavarath/puppeteer-scraper-flask-app.git
cd puppeteer-scraper-flask-app

Step 2: Create Docker Image

Next, build the Docker image. You will need to specify the URL you want to scrape via a build argument.

docker build --build-arg SCRAPE_URL=http://example.com -t scraper-flask-app .
  • Replace http://example.com with the URL you want to scrape.

Step 3: Run the Docker Container

Once the image is built, run the container on your local machine or a server:

docker run -p 5000:5000 scraper-flask-app
  • This command runs the container and maps port 5000 on the host machine to port 5000 inside the container.

Step 4: Access the Web Server

After the container starts, you can access the Flask web server by opening a browser and navigating to:

http://localhost:5000/scraped_data

If you're running it on a remote server, replace localhost with the server's IP address.
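
You can also check the endpoint from a terminal:

curl http://localhost:5000/scraped_data

Assuming the scraper captured the page title (as in the sketch above), the response for http://example.com would look something like:

{
  "url": "http://example.com",
  "title": "Example Domain"
}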

Step 5: Pushing to Docker Hub

To push the Docker image to Docker Hub, follow these steps:

  1. Tag the image with your Docker Hub username and repository name:

    docker tag scraper-flask-app sanjaykadavarath/scraper-flask-app:latest
  2. Push the image to Docker Hub:

    docker push sanjaykadavarath/scraper-flask-app:latest

Step 6: Running on Another Machine

To run this project on another machine, follow these steps:

  1. Install Docker on the other machine.

  2. Login to Docker Hub on the new machine:

    docker login
  3. Pull the image from Docker Hub:

    docker pull sanjaykadavarath/scraper-flask-app:latest
  4. Run the container:

    docker run -p 5000:5000 sanjaykadavarath/scraper-flask-app:latest
  5. Access the Flask server at http://<machine-ip>:5000/scraped_data.

Dockerfile Breakdown

Scraper Stage

FROM node:16 AS scraper

# Install dependencies
RUN apt-get update && apt-get install -y \
    wget \
    ca-certificates \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# Install Puppeteer and Chromium
RUN npm install puppeteer --save

# Set working directory
WORKDIR /app

# Copy the scraper script
COPY scraper.js .

# Set the environment variable for the URL to scrape
ARG SCRAPE_URL
ENV SCRAPE_URL=$SCRAPE_URL

# Run the scraper
RUN node scraper.js
  • This stage installs the system dependencies, installs Puppeteer (which downloads a compatible Chromium at install time), and runs the scraper.js script. Note that the scrape happens at image build time (RUN node scraper.js), which is why the URL must be supplied as a build argument rather than when the container starts.

Server Stage

FROM python:3.9-slim AS server

# Install Flask
RUN pip install flask

# Set working directory
WORKDIR /app

# Copy the scraped data and Flask app
COPY --from=scraper /app/scraped_data.json .
COPY app.py .

# Expose the port
EXPOSE 5000

# Run Flask app
CMD ["python", "app.py"]
  • This stage copies the scraped_data.json file from the first stage and sets up the Flask web server.

Notes

  • Environment Variable: The scraper script uses the SCRAPE_URL environment variable to specify the URL to scrape. You must pass this as a build argument when building the Docker image.
  • Dynamic Scraping: The scraper can easily be adapted to capture different data by modifying the scraper.js script; a small example follows this list.
  • Flask Web Server: The Flask app serves the scraped data as a JSON response at the /scraped_data endpoint.
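
For example, the data object in the scraper.js sketch above could be extended to capture additional fields; the h1 selector here is purely illustrative:

const data = {
  url,
  title: await page.title(),
  // First <h1> on the page, or null if nothing matches
  // (page.$eval rejects when the selector finds no element, hence the .catch)
  heading: await page.$eval('h1', (el) => el.textContent.trim()).catch(() => null),
};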

License

This project is licensed under the MIT License - see the LICENSE file for details.
