This project showcases a simple data pipeline for extracting, transforming, and loading data from the NASA APOD (Astronomy Picture of the Day) API using PySpark and AWS S3. Daily data is collected from the APOD API, transformed with PySpark, and stored in AWS S3 for further analysis or consumption.
The pipeline consists of the following steps (sketched in code below):
- Data extraction: Daily data is collected from the NASA APOD API using the provided PySpark code.
- Data transformation: The extracted data is reshaped with PySpark into the desired format or structure.
- Data loading: The transformed data is written to AWS S3 for future access and analysis.
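In code, the three steps boil down to something like the sketch below. This is a minimal illustration rather than the project's actual scripts: the endpoint and response fields are those of the public APOD API, but the `NASA_API_KEY` variable name, the selected columns, and the `my-apod-bucket` output path are assumptions.

```python
import os

import requests
from pyspark.sql import Row, SparkSession

# Extract: fetch today's record from the NASA APOD API.
# NASA_API_KEY is an assumed variable name; see the prerequisites below.
api_key = os.environ["NASA_API_KEY"]
resp = requests.get(
    "https://api.nasa.gov/planetary/apod",
    params={"api_key": api_key},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

# Transform: keep a stable subset of the APOD fields in a one-row DataFrame.
spark = SparkSession.builder.appName("apod-pipeline").getOrCreate()
df = spark.createDataFrame([Row(**record)]).select(
    "date", "title", "media_type", "url", "explanation"
)

# Load: append to S3 as Parquet, partitioned by date.
# "my-apod-bucket" is a placeholder -- use the bucket your Terraform config creates.
df.write.mode("append").partitionBy("date").parquet("s3a://my-apod-bucket/apod/")
```

Writing through the `s3a://` scheme assumes the Spark session has the hadoop-aws connector and AWS credentials available.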
The project also includes the necessary setup and configuration files for deploying the data pipeline using Terraform and automating the workflow using GitHub Actions.
Before running the data pipeline, ensure that you have the following prerequisites in place:
- Python installed
- Poetry installed (to manage project dependencies)
- AWS account and access credentials
- Terraform installed (for infrastructure provisioning)
- A configured `.env` file or environment variables with the required credentials (see the sketch below)
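The snippet below sketches one way to validate that configuration at startup, assuming python-dotenv is among the Poetry dependencies; the variable names are illustrative, not mandated by this project, so match them to whatever the scripts actually read.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is a project dependency

load_dotenv()  # reads a .env file from the project root, if present

# Illustrative variable names -- adjust to the keys the scripts expect.
REQUIRED = ["NASA_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
```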
To get started with the APOD Data Pipeline, follow these steps:

- Clone this repository:

  ```bash
  git clone <repository-url>
  ```

- Install project dependencies using Poetry:

  ```bash
  poetry install
  ```

- Configure the necessary environment variables or `.env` file with your AWS credentials and your NASA APOD API key.
- Modify the configuration files (`terraform.tfvars` and `main.tf`) to customize the AWS resources and settings, if needed.
- Provision the required AWS resources using Terraform:

  ```bash
  terraform init
  terraform apply
  ```

- Execute the data pipeline by running the PySpark scripts:

  ```bash
  poetry run python extract.py    # data extraction
  poetry run python transform.py  # data transformation
  ```

- Check the output and verify that the data has been successfully loaded into AWS S3 (a verification sketch follows below).
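One way to verify is a quick object listing with boto3 (the AWS SDK for Python); the bucket name and key prefix below are placeholders for whatever your Terraform configuration provisions.

```python
import boto3

# Placeholders -- substitute the bucket and prefix your Terraform config creates.
BUCKET = "my-apod-bucket"
PREFIX = "apod/"

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

# Print each stored object's key and size in bytes.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```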
Contributions to the APOD Data Pipeline project are welcome! If you encounter any issues, have suggestions, or want to contribute improvements, please create an issue or submit a pull request.
The APOD Data Pipeline project is licensed under the MIT License. Feel free to use, modify, and distribute the code for your own projects.
This project was created as a sample project to showcase the capabilities of PySpark and AWS S3 for building data pipelines. Special thanks to the contributors and the open-source community for their valuable contributions and support.