-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest): add preset source #10954
Conversation
WalkthroughThe recent changes introduce a new data ingestion source for the Preset.io platform, enhancing the ingestion framework's capabilities. This includes updates to configuration classes for connecting to the Preset API and ensuring secure authentication. Additionally, the new integration and unit tests validate the ingestion processes, improving the system's reliability and robustness. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant PresetAPI
participant PresetSource
User->>PresetSource: Initialize with PresetConfig
PresetSource->>PresetAPI: POST login request
PresetAPI-->>PresetSource: Return access token
PresetSource->>User: Ingestion complete
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Add preset tests
remove preset specific config keys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Outside diff range, codebase verification and nitpick comments (6)
metadata-ingestion/tests/integration/preset/golden_test_ingest.json (6)
1-50
: Ensure consistency in custom properties keys.The keys "Status" and "IsPublished" in customProperties might be redundant. Consider using consistent naming conventions.
- "Status": "published", - "IsPublished": "true", + "status": "published", + "isPublished": "true",
51-97
: Ensure consistency in custom properties keys.The keys "Status" and "IsPublished" in customProperties might be redundant. Consider using consistent naming conventions.
- "Status": "draft", - "IsPublished": "false", + "status": "draft", + "isPublished": "false",
98-144
: Ensure consistency in custom properties keys.The keys "Metrics", "Filters", and "Dimensions" in customProperties might be redundant. Consider using consistent naming conventions.
- "Metrics": "", - "Filters": "", - "Dimensions": "" + "metrics": "", + "filters": "", + "dimensions": ""
145-191
: Ensure consistency in custom properties keys.The keys "Metrics", "Filters", and "Dimensions" in customProperties might be redundant. Consider using consistent naming conventions.
- "Metrics": "", - "Filters": "", - "Dimensions": "" + "metrics": "", + "filters": "", + "dimensions": ""
192-238
: Ensure consistency in custom properties keys.The keys "Metrics", "Filters", and "Dimensions" in customProperties might be redundant. Consider using consistent naming conventions.
- "Metrics": "", - "Filters": "", - "Dimensions": "" + "metrics": "", + "filters": "", + "dimensions": ""
239-285
: Ensure consistency in custom properties keys.The keys "Metrics", "Filters", and "Dimensions" in customProperties might be redundant. Consider using consistent naming conventions.
- "Metrics": "", - "Filters": "", - "Dimensions": "" + "metrics": "", + "filters": "", + "dimensions": ""
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (7)
- metadata-ingestion/setup.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/preset.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/superset.py (3 hunks)
- metadata-ingestion/tests/integration/preset/golden_test_ingest.json (1 hunks)
- metadata-ingestion/tests/integration/preset/golden_test_stateful_ingest.json (1 hunks)
- metadata-ingestion/tests/integration/preset/test_preset.py (1 hunks)
- metadata-ingestion/tests/unit/test_preset_source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/tests/integration/preset/test_preset.py
20-20: Do not use mutable data structures for argument defaults
Replace with
None
; initialize within function(B006)
171-171: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
metadata-ingestion/src/datahub/ingestion/source/superset.py
11-11:
datahub.configuration.ConfigModel
imported but unusedRemove unused import:
datahub.configuration.ConfigModel
(F401)
26-26:
datahub.emitter.mce_builder.DEFAULT_ENV
imported but unusedRemove unused import:
datahub.emitter.mce_builder.DEFAULT_ENV
(F401)
Additional comments not posted (11)
metadata-ingestion/tests/unit/test_preset_source.py (2)
3-11
: LGTM!The test function correctly verifies the default values of the
PresetConfig
class.
14-21
: LGTM!The test function correctly verifies the
display_uri
field of thePresetConfig
class.metadata-ingestion/src/datahub/ingestion/source/preset.py (3)
34-59
: LGTM!The fields of the
PresetConfig
class are well-defined and provide necessary configuration options for the Preset source.
61-70
: LGTM!The methods
remove_trailing_slash
anddefault_display_uri_to_connect_uri
correctly handle the configuration values.
73-94
: LGTM!The constructor correctly initializes the
PresetSource
class and calls thelogin
method.metadata-ingestion/tests/integration/preset/golden_test_stateful_ingest.json (1)
1-260
: LGTM!The structure and content of the JSON data are correct and align with the expected format for integration tests.
metadata-ingestion/tests/integration/preset/test_preset.py (2)
180-217
: Ensure that the test checks for expected outcomes.The test function should include assertions to verify the expected outcomes.
# Add assertions to verify the expected outcomes. assert pipeline is not None assert golden_file is not None
222-349
: Ensure that the test checks for expected outcomes.The test function should include assertions to verify the expected outcomes.
# Add assertions to verify the expected outcomes. assert pipeline_run1 is not None assert pipeline_run2 is not Nonemetadata-ingestion/src/datahub/ingestion/source/superset.py (2)
185-185
: Ensure proper error handling for the login method.Add error handling in case the login method fails.
try: self.login() except Exception as e: logger.error(f"Login failed: {e}") raise
187-187
: Ensure proper error handling for the login request.Add error handling in case the login request fails.
login_response = requests.post( f"{self.config.connect_uri}/api/v1/security/login", json={ "username": self.config.username, "password": self.config.password, "refresh": True, "provider": self.config.provider, }, ) if login_response.status_code != 200: logger.error(f"Login request failed: {login_response.text}") login_response.raise_for_status()metadata-ingestion/setup.py (1)
703-703
: LGTM!The new entry point for the
preset
ingestion source is correctly formatted and integrated.
def login(self): | ||
logger.info(f"self.config is {self.config}") | ||
|
||
login_response = requests.post( | ||
f"{self.config.manager_uri}/v1/auth/", | ||
json={"name": self.config.api_key, "secret": self.config.api_secret}, | ||
) | ||
|
||
self.access_token = login_response.json()["payload"]["access_token"] | ||
logger.debug("Got access token from Preset") | ||
|
||
self.session = requests.Session() | ||
self.session.headers.update( | ||
{ | ||
"Authorization": f"Bearer {self.access_token}", | ||
"Content-Type": "application/json", | ||
"Accept": "*/*", | ||
} | ||
) | ||
|
||
# Test the connection | ||
test_response = self.session.get(f"{self.config.connect_uri}/version") | ||
if not test_response.ok: | ||
logger.error("Unable to connect to workspace") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure proper error handling for the login process.
The login
method should handle potential errors during the authentication process more robustly.
- login_response = requests.post(
+ try:
+ login_response = requests.post(
- self.access_token = login_response.json()["payload"]["access_token"]
+ login_response.raise_for_status()
+ self.access_token = login_response.json()["payload"]["access_token"]
+ except requests.exceptions.RequestException as e:
+ logger.error(f"Failed to authenticate with Preset: {e}")
+ raise
Committable suggestion was skipped due to low confidence.
|
||
api_vs_response.update(override_data) | ||
|
||
for url in api_vs_response.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use key in dict
instead of key in dict.keys()
.
Remove .keys()
for better readability and performance.
- for url in api_vs_response.keys():
+ for url in api_vs_response:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for url in api_vs_response.keys(): | |
for url in api_vs_response: |
Tools
Ruff
171-171: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
GMS_SERVER = f"http://localhost:{GMS_PORT}" | ||
|
||
|
||
def register_mock_api(request_mock: Any, override_data: dict = {}) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using mutable data structures for argument defaults.
Replace with None
and initialize within the function.
- def register_mock_api(request_mock: Any, override_data: dict = {}) -> None:
+ def register_mock_api(request_mock: Any, override_data: Optional[dict] = None) -> None:
+ if override_data is None:
+ override_data = {}
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def register_mock_api(request_mock: Any, override_data: dict = {}) -> None: | |
def register_mock_api(request_mock: Any, override_data: Optional[dict] = None) -> None: | |
if override_data is None: | |
override_data = {} |
Tools
Ruff
20-20: Do not use mutable data structures for argument defaults
Replace with
None
; initialize within function(B006)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there might be some lint issues
@@ -178,7 +182,9 @@ def __init__(self, ctx: PipelineContext, config: SupersetConfig): | |||
super().__init__(config, ctx) | |||
self.config = config | |||
self.report = StaleEntityRemovalSourceReport() | |||
self.login() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be cleaner to make this self.session = self.login()
also the domain_registry logic should not be in the login method
super().__init__(ctx, config) | ||
self.config = config | ||
self.report = StaleEntityRemovalSourceReport() | ||
self.login() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if super() already calls login, we don't need to explicitly call it again, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're actually looking to overwrite the login in superset since the authentication method is different (Superset is username/password, Preset is API token based name/secret and also has a different URL)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/setup.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/superset.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/setup.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/superset.py
11-11:
datahub.configuration.ConfigModel
imported but unusedRemove unused import:
datahub.configuration.ConfigModel
(F401)
26-26:
datahub.emitter.mce_builder.DEFAULT_ENV
imported but unusedRemove unused import:
datahub.emitter.mce_builder.DEFAULT_ENV
(F401)
Additional comments not posted (2)
metadata-ingestion/src/datahub/ingestion/source/superset.py (2)
186-186
: Verify error handling for login failures.The
login
method is called during initialization. Ensure that proper error handling is in place if the login fails, to prevent the object from being in an invalid state.
188-188
: Verify error handling for POST request.Ensure that the POST request to the login endpoint handles errors properly, such as invalid credentials or network issues.
|
||
def login(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder: Address the TODO comment.
There is a TODO comment regarding how to message about connection errors. Ensure that appropriate error messaging is implemented.
Do you want me to help implement the error messaging or open a GitHub issue to track this task?
@llance there's a few minor comments pending on this one, but it's a great contribution so would be good to get it in. Let me know if you have any questions or need some help |
@llance there's still a few pending review comments and lint/test issues here Let me know if you need help making sense of any of the comments |
Hey @hsheth2, I'm on the same team as Shuixi, we've built out the remaining ingestion pipeline for Preset (dataset sync, upstream/finegrained lineage) but need this PR merged in first (before we submit our other contributions for review) Looking to take this PR across the finish line, in addition to fixing lintting issue, is the expectation to address all of |
@@ -0,0 +1,21 @@ | |||
from src.datahub.ingestion.source.preset import PresetConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from src.datahub.ingestion.source.preset import PresetConfig | |
from datahub.ingestion.source.preset import PresetConfig |
super().__init__(ctx, config) | ||
self.config = config | ||
self.report = StaleEntityRemovalSourceReport() | ||
self.session = self.login() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because login()
is called from superset's init, and we call super().__init__(...)
here, I don't think we actually need this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall I don't want to make the login API call twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey Harshal, the login here is called twice since the authentication method is different (For Preset, which is why we redefined the method below & called it). Do you have any suggestions on a better way to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the self.login in the superset class will call login() from the preset method already - that's the behavior of method overrides with python
So I believe things should "just work" without the explicit calls
Fix Linting Issues
Looks like there's still some lint/test failures |
Update Preset Tests
@hsheth2 You're right, great catch. I've made the changes and the ingestion works without issues. I've also updated the tests so they should pass now. |
Co-authored-by: MARK CHENG <[email protected]> Co-authored-by: hwmarkcheng <[email protected]>
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Tests