A Python-based web scraping tool using Scrapy to extract email addresses from websites and save them in a CSV file.
Installation
Use Anaconda or Miniconda to manage your Python environment.
- Create and Activate the Conda Environment
Run the following commands:
conda create -n email_scraper_env python=3.8
conda activate email_scraper_env
- Install Dependencies
With the environment activated, install Scrapy:
pip install scrapy
How to Run the Spider
- Navigate to the Project Folder:
cd email_scraper
- Start the Spider with the Target URL:
scrapy crawl email_spider -a url="https://example.com"
Output
Extracted emails are saved to a CSV file in the project directory.
For example, if the URL is https://example.com, the output file will be named: example_com_emails.csv
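The naming rule in the example above can be sketched in a few lines. This is a minimal illustration, not the spider's actual code: `output_filename` is a hypothetical helper, and how the real spider treats ports, subdomains, or a `www.` prefix is not stated here.

```python
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a CSV filename from the host part of a URL (hypothetical helper)."""
    host = urlparse(url).netloc  # "https://example.com" -> "example.com"
    # Assumption: dots in the host become underscores, as in the example above.
    return host.replace(".", "_") + "_emails.csv"


print(output_filename("https://example.com"))  # example_com_emails.csv
```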
How It Works
Code Overview
The main spider code (email_spider.py) performs the following steps:
- Crawls a website: Starting from the given URL, it extracts and follows internal links.
- Extracts email addresses: Uses regex to identify valid email patterns.
- Filters duplicates and invalid emails: Ensures only unique, valid email addresses are stored.
- Saves results: Appends each extracted email to a CSV file.
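The steps above can be sketched with the standard library alone, which may help when reading email_spider.py. Everything here is an assumption made for illustration: the regex, the lower-casing, and the helper names `is_internal`, `extract_emails`, and `append_new_emails` are not taken from the project, and the real spider does its crawling through Scrapy rather than through these functions.

```python
import csv
import re
from urllib.parse import urljoin, urlparse

# Deliberately simple pattern; an assumption, not the project's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_internal(base_url: str, href: str) -> bool:
    """True if href, resolved against base_url, stays on the same host."""
    return urlparse(urljoin(base_url, href)).netloc == urlparse(base_url).netloc


def extract_emails(text: str) -> list:
    """Return unique, lower-cased email-like strings in order of first appearance."""
    found = (match.lower() for match in EMAIL_RE.findall(text))
    return list(dict.fromkeys(found))


def append_new_emails(csv_path: str, emails, seen: set) -> int:
    """Append emails not yet in `seen` to csv_path; return how many were written."""
    written = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for email in emails:
            if email not in seen:
                seen.add(email)
                writer.writerow([email])
                written += 1
    return written
```

Keeping the `seen` set in memory for the whole crawl is what prevents the same address from being appended twice when it appears on several pages.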
Dependencies
- Python 3.8 or higher
- Scrapy framework
Install dependencies using the instructions in the Installation section.
Example Output
A sample output in the CSV file (example_com_emails.csv):
Notes
- The project respects robots.txt rules by default (ROBOTSTXT_OBEY=True).
- Use this tool responsibly and avoid scraping websites without permission.
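To illustrate what obeying robots.txt means in practice, the sketch below checks two URLs against a set of rules. It uses Python's standard-library `urllib.robotparser` purely for demonstration, not Scrapy's own robots.txt handling, and the robots.txt content and the `allowed` helper are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# In a Scrapy project this behaviour is switched on in settings.py with:
#   ROBOTSTXT_OBEY = True

# Made-up robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "*") -> bool:
    """True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/contact"))        # True
print(allowed("https://example.com/private/staff"))  # False
```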
Troubleshooting
- scrapy: command not found
Ensure Scrapy is installed in your active environment. If the scrapy executable is still not found, run it through Python instead:
python -m scrapy crawl email_spider -a url="https://example.com"
- Missing Dependencies
Ensure all required dependencies are installed in the Conda environment, and confirm that the environment runs Python 3.8 or higher:
python --version
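Both troubleshooting checks can also be run from inside the interpreter, which confirms that the Python you are actually running is the one from the Conda environment. This is a small sketch; `check_environment` is a hypothetical helper and not part of the project.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 8)):
    """Return (python_ok, scrapy_installed) for the current interpreter."""
    python_ok = sys.version_info >= min_version
    # find_spec returns None when the package cannot be imported.
    scrapy_installed = importlib.util.find_spec("scrapy") is not None
    return python_ok, scrapy_installed


python_ok, scrapy_installed = check_environment()
print(f"Python {sys.version.split()[0]}: {'OK' if python_ok else 'too old, need 3.8+'}")
print(f"Scrapy: {'installed' if scrapy_installed else 'missing, run: pip install scrapy'}")
```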
License
This project is licensed under the MIT License.