This project showcases a simple data pipeline for extracting, transforming, and loading data from the NASA APOD (Astronomy Picture of the Day) API using PySpark and AWS S3. Daily data is collected from the APOD API, transformed with PySpark, and stored in AWS S3 for further analysis or consumption.
The pipeline consists of the following steps (sketched in code below):
- Data extraction: Daily data is collected from the NASA APOD API using the provided PySpark code.
- Data transformation: The extracted data is reshaped with PySpark into the desired format or structure.
- Data loading: The transformed data is written to AWS S3 for future access and analysis.
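In code, the three steps boil down to something like the sketch below. This is a minimal illustration rather than the project's actual scripts: the endpoint and response fields are those of the public APOD API, but the `NASA_API_KEY` variable name, the selected columns, and the `my-apod-bucket` output path are assumptions.

```python
import os

import requests
from pyspark.sql import Row, SparkSession

# Extract: fetch today's record from the NASA APOD API.
# NASA_API_KEY is an assumed variable name; see the prerequisites below.
api_key = os.environ["NASA_API_KEY"]
resp = requests.get(
    "https://api.nasa.gov/planetary/apod",
    params={"api_key": api_key},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

# Transform: keep a stable subset of the APOD fields in a one-row DataFrame.
spark = SparkSession.builder.appName("apod-pipeline").getOrCreate()
df = spark.createDataFrame([Row(**record)]).select(
    "date", "title", "media_type", "url", "explanation"
)

# Load: append to S3 as Parquet, partitioned by date.
# "my-apod-bucket" is a placeholder -- use the bucket your Terraform config creates.
df.write.mode("append").partitionBy("date").parquet("s3a://my-apod-bucket/apod/")
```

Writing through the `s3a://` scheme assumes the Spark session has the hadoop-aws connector and AWS credentials available.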
The project also includes the necessary setup and configuration files for deploying the data pipeline using Terraform and automating the workflow using GitHub Actions.
Before running the data pipeline, ensure that you have the following prerequisites in place:
- Python installed
- Poetry installed (to manage project dependencies)
- AWS account and access credentials
- Terraform installed (for infrastructure provisioning)
- A configured `.env` file or environment variables with the required credentials (see the sketch below)
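The snippet below sketches one way to validate that configuration at startup, assuming python-dotenv is among the Poetry dependencies; the variable names are illustrative, not mandated by this project, so match them to whatever the scripts actually read.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is a project dependency

load_dotenv()  # reads a .env file from the project root, if present

# Illustrative variable names -- adjust to the keys the scripts expect.
REQUIRED = ["NASA_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
```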
To get started with the APOD Data Pipeline, follow these steps:

- Clone this repository:

  ```bash
  git clone <repository-url>
  ```

- Install project dependencies using Poetry:

  ```bash
  poetry install
  ```

- Configure the necessary environment variables or `.env` file with your AWS credentials and your NASA APOD API key.
- Modify the configuration files (`terraform.tfvars` and `main.tf`) to customize the AWS resources and settings, if needed.
- Provision the required AWS resources using Terraform:

  ```bash
  terraform init
  terraform apply
  ```

- Execute the data pipeline by running the PySpark scripts:

  ```bash
  poetry run python extract.py    # data extraction
  poetry run python transform.py  # data transformation
  ```

- Check the output and verify that the data has been successfully loaded into AWS S3 (a verification sketch follows below).
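One way to verify is a quick object listing with boto3 (the AWS SDK for Python); the bucket name and key prefix below are placeholders for whatever your Terraform configuration provisions.

```python
import boto3

# Placeholders -- substitute the bucket and prefix your Terraform config creates.
BUCKET = "my-apod-bucket"
PREFIX = "apod/"

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

# Print each stored object's key and size in bytes.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```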
Contributions to the APOD Data Pipeline project are welcome! If you encounter any issues, have suggestions, or want to contribute improvements, please create an issue or submit a pull request.
The APOD Data Pipeline project is licensed under the MIT License. Feel free to use, modify, and distribute the code for your own projects.
This project was created as a sample project to showcase the capabilities of PySpark and AWS S3 for building data pipelines. Special thanks to the contributors and the open-source community for their valuable contributions and support.