Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Dagster Openmetadata Ingestion Pipeline #1484

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

quazi-h
Copy link
Contributor

@quazi-h quazi-h commented Feb 21, 2025

What are the relevant tickets?

#1355

Description (What does it do?)

First pass at building out the scaffolding for our openmetadata ingestion pipeline.
This PR adds pipeline code to ingest Trino table metadata into our openmetadata instance.

Updated dependencies in toml and lock files.
Added the new openmetadata definition to workspaces.
Added the trino configuration details as a dictionary object under resources.
Added the new pipeline definition for openmetadata ingestion using Trino as the initial data sources as an asset and using dated partitions.

This is still WIP.

How can this be tested?

Update the credentials in lines 12 and 34 in src/ol_orchestrate/resources/openmetadata/trino.py by using the referenced urls: Trino bot_metadata password from vault
openmetadata JWT token
Export your github token and launch docker compose.
export GITHUB_TOKEN=<YOUR_GITHUB_TOKEN> docker compose up

Added the new openmetadata definition to workspaces.
Added the trino configuration details as a dictionary object under resources.
Added the new pipeline definition for openmetadata ingestion using Trino as the initial data sources as an asset and using dated partitions.
Removing the yaml load logic.
Removing unnecessary asset_helper functions.
@blarghmatey
Copy link
Member

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces the initial scaffolding for an OpenMetadata ingestion pipeline, which is a valuable addition to the platform. The code is generally well-structured, but the hardcoded credentials in trino.py must be addressed before merging. Additionally, the dependency version changes in uv.lock should be carefully reviewed to ensure they are intentional and don't introduce any issues.

Summary of Findings

  • Security Risk: Hardcoded Credentials: The trino.py file contains hardcoded credentials, which is a significant security vulnerability. These should be managed securely using Vault or a similar secrets management system.
  • Dependency Version Changes: Several dependency versions in uv.lock have been downgraded or changed. It's important to verify that these changes are intentional and don't introduce compatibility issues or regressions.

Assessment

The pull request introduces a new Dagster pipeline for ingesting Trino table metadata into OpenMetadata. The scaffolding appears well-structured, incorporating necessary dependencies, configurations, and asset definitions. However, there are some areas that require attention before merging. Specifically, the hardcoded credentials pose a security risk and should be addressed. Additionally, some dependency version changes in uv.lock seem unusual and warrant further investigation. Users should have others review and approve this code before merging.

Comment on lines +12 to +13
"password": "<retrieve this from https://vault-production.odl.mit.edu/ui/vault/secrets/platform-secrets/kv/starburst-galaxy/details?version=4>"
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Storing credentials directly in the code is a security risk. Consider using a more secure method, such as retrieving the password from Vault or another secrets management system. This is a critical security concern that needs to be addressed before merging.

                    "password": vault.read_secret("path/to/secret")["password"] # Retrieve from Vault

Comment on lines +39 to +40
"jwtToken": "<retrieve this from https://open-metadata-ci.ol.mit.edu/bots/ingestion-bot>"
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Similar to the password, the JWT token should be retrieved from a secure source like Vault, rather than being hardcoded in the configuration. This is a critical security concern that needs to be addressed before merging.

                "jwtToken": vault.read_secret("path/to/jwt")["jwtToken"] # Retrieve from Vault

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants