
Web Scraper API

Technologies Used

  • Python: Core programming language for backend logic.
  • FastAPI: For building a RESTful web API.
  • BeautifulSoup: For parsing and extracting data from static HTML.
  • Requests: For making HTTP requests to fetch web pages.
  • Swagger UI and Postman: For testing and validating API requests and responses.

Table of Contents

  1. Project Description
  2. Key Features
  3. Types of Websites We Can Scrape
  4. Installation
  5. How to Use
  6. License

Project Description

The Web Scraper API is a lightweight tool built with FastAPI for scraping web content. It allows you to extract various HTML elements (such as paragraphs, titles, links, etc.) from static web pages by making HTTP requests and parsing the HTML content with BeautifulSoup.

This tool is ideal for scraping websites where the content is directly available in the HTML source, making it easy to extract information such as articles, product descriptions, and other static content.
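Under the hood the idea is straightforward: fetch the page with Requests and hand the HTML to BeautifulSoup. Below is a minimal sketch of that core step; the exact tags extracted here are illustrative and not necessarily the set returned by main.py:

    # Minimal sketch of the core scraping step: fetch static HTML with Requests
    # and parse it with BeautifulSoup. The tags extracted here are illustrative.
    import requests
    from bs4 import BeautifulSoup

    def scrape_page(url: str) -> dict:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail early on HTTP errors
        soup = BeautifulSoup(response.text, "html.parser")
        return {
            "title": soup.title.string if soup.title else None,
            "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
            "links": [a.get("href") for a in soup.find_all("a", href=True)],
        }

    if __name__ == "__main__":
        print(scrape_page("https://example.com"))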


Key Features

  • Web Scraping for Static Content: Scrapes content from static HTML elements like paragraphs (<p>), links (<a>), divs (<div>), and spans (<span>).
  • API-based: Users send a URL via a POST request and receive the extracted data on demand (a minimal sketch of such an endpoint follows this list).
  • Error Handling: Provides detailed error messages when issues arise while fetching or parsing content.
  • Tested with Swagger UI and Postman: Easily validate scraping results via Swagger UI and Postman.
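For reference, a POST endpoint of this shape takes only a few lines of FastAPI. This is a hedged sketch assuming a single "url" field in the request body, as in the usage examples below; the repository's actual main.py may structure the route and response differently:

    # Hedged sketch of a FastAPI /scrape route accepting {"url": "..."}.
    # Illustrative only - the repository's main.py may differ.
    import requests
    from bs4 import BeautifulSoup
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI(title="Web Scraper API")

    class ScrapeRequest(BaseModel):
        url: str

    @app.post("/scrape/")
    def scrape(request: ScrapeRequest):
        try:
            response = requests.get(request.url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            # Error handling: surface fetch failures as a clear API error
            raise HTTPException(status_code=502, detail=f"Failed to fetch {request.url}: {exc}")
        soup = BeautifulSoup(response.text, "html.parser")
        return {"paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")]}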

Types of Websites We Can Scrape

The Web Scraper API is designed for scraping content from static websites where data is readily available in the HTML structure. Here are examples of websites it works well with:

  • Blogs and News Sites: Extract article content, headlines, and publication dates.
  • Documentation Websites: Scrape text from user manuals, API docs, and help pages.
  • Product Pages: Extract product names, descriptions, and prices from static e-commerce websites.
  • Research Papers or Journals: Extract titles, abstracts, and references from academic papers that are pre-rendered in HTML.
  • Company Websites: Scrape static information from company pages like about sections, team details, and contact information.

The tool is ideal for websites where the content is delivered without requiring JavaScript to render dynamically. For dynamic sites that require JavaScript execution, other scraping methods would be needed.
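A quick way to tell whether a page falls into the static category is to fetch it with Requests and check whether the text you care about already appears in the raw HTML; if it does not, the content is probably rendered by JavaScript and out of reach for this tool. A small illustrative check (the URL and phrase below are placeholders):

    # Rough static-content check: is the text already in the raw HTML?
    # URL and expected phrase are placeholders for illustration.
    import requests

    url = "https://example.com"
    expected_phrase = "Example Domain"

    html = requests.get(url, timeout=10).text
    if expected_phrase in html:
        print("Content is in the static HTML - this scraper can extract it.")
    else:
        print("Content is likely rendered by JavaScript - this scraper will not see it.")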


Installation

  1. Clone the repository:
    git clone https://github.com/Abhimanyu-Gaurav/Web-Scraper-API.git
    
  2. Navigate to the project directory:
    cd Web-Scraper-API
    
  3. Set up a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    
  4. Install the required dependencies:
    pip install -r requirements.txt
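    The repository's requirements.txt pins the exact packages; if you ever need to recreate it, the stack described above boils down to roughly the following (exact versions may differ):

        fastapi
        uvicorn
        requests
        beautifulsoup4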
    

How to Use

  1. Run the FastAPI server:

    uvicorn main:app --reload
    
    
  2. Open your browser (e.g., Chrome, Safari, or Brave) and go to:

    http://localhost:8000/docs
    
  3. This should display the Swagger UI for your FastAPI application.

  4. Test the /scrape endpoint directly from the Swagger UI to ensure it is working and accessible.

  5. Expand the POST /scrape endpoint and provide a JSON body:

    • Click the "Try it out" button.
    • Provide the following JSON in the request body:
    {
        "url": "https://timesofindia.indiatimes.com/"
    }
    
  6. Click on the "Execute" button to send the request.

  7. You should see the scraped data in the response section if the request is successful.

Alternatively, send the POST request from an API client:

  1. Using Postman:

    • Open Postman and click the "New" button to create a new request.
    • Set the request type to POST from the dropdown menu next to the URL field.
    • Enter the URL in the URL field:
      http://localhost:8000/scrape/
      
    • Go to the "Body" tab.
      • Select the "raw" option and choose "JSON" from the dropdown.
      • Paste the following JSON into the body:
      {
          "url": "https://timesofindia.indiatimes.com/"
      }
      
    • Click the "Send" button to execute the request.
    • You should see the scraped data in the response section if the request is successful.
  2. Using cURL:

    • Open your terminal and run:
      curl -X POST "http://localhost:8000/scrape/" -H "Content-Type: application/json" -d '{"url": "https://timesofindia.indiatimes.com/"}'
    
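You can also call the endpoint from a short Python script (equivalent to the cURL example above, assuming the server is running locally on port 8000):

    # Call the running API from Python - equivalent to the cURL request above.
    import requests

    resp = requests.post(
        "http://localhost:8000/scrape/",
        json={"url": "https://timesofindia.indiatimes.com/"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())  # the scraped data returned by the API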

License

  • This project is licensed under the MIT License - see the License file for details.
