# News ETL Pipeline

This repository contains a simple ETL (Extract, Transform, Load) pipeline that fetches top headlines from the News API and uploads the data to an AWS S3 bucket. It uses Python and Apache Airflow for orchestration.
## Files

- `news_etl.py`: Contains the ETL process for fetching news data and uploading it to S3 (a minimal sketch follows below).
- `news_etl_dag.py`: Defines the Airflow DAG for scheduling and running the ETL process.
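As a rough orientation, the ETL script likely follows the shape below. This is a hedged sketch, not the repository's exact code: the `run_etl` function name and the request parameters are assumptions, while the endpoint, bucket name, and file name come from this README.

```python
# Hypothetical sketch of news_etl.py -- run_etl and the request
# parameters are assumptions; bucket and file names follow this README.
import boto3
import pandas as pd
import requests

api_key = 'your_news_api_key'  # replace with your News API key

def run_etl():
    # Extract: fetch top headlines from the News API
    response = requests.get(
        'https://newsapi.org/v2/top-headlines',
        params={'country': 'us', 'apiKey': api_key},
    )
    response.raise_for_status()
    articles = response.json().get('articles', [])

    # Transform: flatten the article list into a DataFrame
    df = pd.DataFrame(articles)

    # Load: write a local CSV, then upload it to the S3 bucket
    df.to_csv('news.csv', index=False)
    boto3.client('s3').upload_file('news.csv', 'reddits-data', 'news.csv')

if __name__ == '__main__':
    run_etl()
```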
## Prerequisites

- Python 3.7 or higher
- Apache Airflow
- `requests` library
- `pandas` library
- `boto3` library (for S3 operations)
- AWS credentials configured for `boto3`
## Setup

- **Clone the Repository:**

  ```bash
  git clone https://github.com/yourusername/news_etl.git
  cd news_etl
  ```
- **Install Dependencies:**

  It is recommended to use a virtual environment. Install the required Python libraries with:

  ```bash
  pip install requests pandas boto3 apache-airflow
  ```
- **Configure Airflow:**

  - Set up Apache Airflow by following the Airflow documentation.
  - Place the `news_etl_dag.py` file in your Airflow DAGs folder (usually located at `~/airflow/dags`). A sketch of what that DAG might look like is shown below.
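  For reference, a minimal daily DAG could look like the following. This is a sketch under assumptions: the `dag_id` and the daily schedule come from this README, while the task ID and the import of `run_etl` from `news_etl.py` are hypothetical.

  ```python
  # Hypothetical sketch of news_etl_dag.py -- task_id and the run_etl
  # import are assumptions; dag_id and the daily schedule follow this README.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.python import PythonOperator

  from news_etl import run_etl  # assumes news_etl.py sits next to the DAG file

  with DAG(
      dag_id='news_etl_dag',
      start_date=datetime(2024, 1, 1),
      schedule_interval='@daily',  # the README describes a daily run
      catchup=False,
  ) as dag:
      PythonOperator(
          task_id='run_news_etl',
          python_callable=run_etl,
      )
  ```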
- **Configure AWS Credentials:**

  Ensure that your AWS credentials are configured. You can set them up using the AWS CLI:

  ```bash
  aws configure
  ```

  Alternatively, set environment variables:

  ```bash
  export AWS_ACCESS_KEY_ID=your_access_key_id
  export AWS_SECRET_ACCESS_KEY=your_secret_access_key
  ```
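  If you want to confirm that `boto3` can actually see those credentials before running the pipeline, a quick optional sanity check (not part of the repository) is:

  ```python
  # Optional sanity check that boto3 picks up your AWS credentials.
  import boto3

  sts = boto3.client('sts')
  print(sts.get_caller_identity()['Account'])  # prints your AWS account ID
  ```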
- **Modify API Key:**

  Replace the placeholder API key in `news_etl.py` with your actual News API key:

  ```python
  api_key = 'your_news_api_key'
  ```
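  To avoid committing the key, you could instead read it from an environment variable. `NEWS_API_KEY` is a name chosen for this example, not something `news_etl.py` defines:

  ```python
  # Optional: load the key from the environment instead of hard-coding it.
  # NEWS_API_KEY is an example name, not defined by news_etl.py.
  import os

  api_key = os.environ['NEWS_API_KEY']
  ```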
## Usage

- **Running the ETL Process Manually:**

  You can run the ETL process directly with Python:

  ```bash
  python news_etl.py
  ```
- **Scheduling with Airflow:**

  - Start the Airflow web server and scheduler:

    ```bash
    airflow webserver --port 8080
    airflow scheduler
    ```

  - Access the Airflow web interface at `http://localhost:8080`.
  - Trigger the `news_etl_dag` manually or wait for it to run according to its schedule (daily).
## Notes

- Ensure you have an S3 bucket named `reddits-data` (or adjust the bucket name in `news_etl.py` accordingly). A quick way to check the bucket is shown below.
- The pipeline saves the news data as `news.csv` in the S3 bucket.
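To verify the bucket exists and is reachable before the first run, a small optional check (not part of the repository) is:

```python
# Optional: confirm the target bucket is reachable with your credentials.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
try:
    s3.head_bucket(Bucket='reddits-data')  # bucket name from the note above
    print('Bucket is reachable.')
except ClientError as err:
    print(f'Bucket check failed: {err}')
```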
## Troubleshooting

- **API Errors:** Ensure your News API key is valid and not expired. Check the HTTP response codes in the logs for detailed error messages (see the sketch below).
- **S3 Upload Errors:** Verify that your AWS credentials have the necessary permissions to upload files to the S3 bucket.
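If the logs are not detailed enough, the following sketch shows one way to surface the HTTP status code and the AWS error code explicitly. The URL, key placeholder, bucket, and file names repeat this README's examples; the logging setup is illustrative:

```python
# Illustrative error surfacing -- names repeat this README's examples.
import logging

import boto3
import requests
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO)

# API errors: log the HTTP status code and response body.
response = requests.get(
    'https://newsapi.org/v2/top-headlines',
    params={'country': 'us', 'apiKey': 'your_news_api_key'},
)
if not response.ok:
    logging.error('News API returned HTTP %s: %s',
                  response.status_code, response.text)

# S3 upload errors: ClientError carries the AWS error code (e.g. AccessDenied).
try:
    boto3.client('s3').upload_file('news.csv', 'reddits-data', 'news.csv')
except ClientError as err:
    logging.error('S3 upload failed: %s', err.response['Error']['Code'])
```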