
DevOps: integrate Great Expectations for data QC #850

Open
5 tasks
nlebovits opened this issue Jul 30, 2024 · 1 comment

Comments

@nlebovits
Collaborator

Describe the task

The objective is to integrate Great Expectations into our Python ETL pipeline to ensure data quality. The task involves researching various integration methods, documenting the findings, and implementing the best solution. This integration aims to validate the data processed by our ETL pipeline, ensuring its accuracy and consistency while catching problems before production.

Acceptance Criteria

  • Research different options for integrating Great Expectations with a Python ETL pipeline and, if necessary, our Postgres database.
  • Summarize the pros and cons of each option.
  • Select the most suitable integration method and explain why it was chosen.
  • Implement the chosen method to validate the outputs of the ETL pipeline.
  • Demonstrate the successful integration by showing validation results from the ETL pipeline.

Additional context

Currently, I can imagine either using Great Expectations directly in our Python pipeline or connecting it to our Postgres database for data QC. Here are two resources that might help:

  • Great Expectations Postgres Tutorial:

    • This tutorial provides a step-by-step guide on setting up a local PostgreSQL database, creating tables, inserting data, and performing data validations using Great Expectations. It serves as a practical example of integrating Great Expectations with a database.
  • Connect to in-memory Data Assets:

    • This guide explains how to connect Great Expectations to in-memory pandas or Spark DataFrames. It introduces the concepts of Data Sources (which tell Great Expectations where data lives and how to connect to it) and Data Assets (collections of records within a Data Source), which are crucial for configuring data validation.

github-actions bot commented Oct 1, 2024

This issue has been marked as stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Oct 1, 2024
Projects
Status: Backlog

1 participant