
DevOps: integrate Great Expectations for data QC #850

Open
5 tasks
nlebovits opened this issue Jul 30, 2024 · 1 comment

Comments

@nlebovits
Collaborator

Describe the task

The objective is to integrate Great Expectations into our Python ETL pipeline to ensure data quality. The task involves researching various integration methods, documenting the findings, and implementing the best solution. This integration aims to validate the data processed by our ETL pipeline, ensuring its accuracy and consistency while catching problems before production.

Acceptance Criteria

  • Research different options for integrating Great Expectations with a Python ETL pipeline and, if necessary, our Postgres database.
  • Summarize the pros and cons of each option.
  • Select the most suitable integration method and explain why it was chosen.
  • Implement the chosen method to validate the outputs of the ETL pipeline.
  • Demonstrate the successful integration by showing validation results from the ETL pipeline.

Additional context

Currently, I can imagine either using Great Expectations directly in our Python pipeline or connecting it to our Postgres database for data QC. Here are two resources that might help:

  • Great Expectations Postgres Tutorial:

    • This tutorial provides a step-by-step guide on setting up a local PostgreSQL database, creating tables, inserting data, and performing data validations using Great Expectations. It serves as a practical example of integrating Great Expectations with a database.
  • Connect to in-memory Data Assets:

    • This guide explains how to connect Great Expectations to in-memory pandas or Spark DataFrames. It introduces the concepts of Data Sources (which tell Great Expectations where data lives and how to connect to it) and Data Assets (collections of records within a Data Source), which are crucial for configuring data validation.

github-actions bot commented Oct 1, 2024

This issue has been marked as stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Oct 1, 2024
Projects
Status: Backlog

1 participant