Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Preset.io source (Superset SaaS) #8188

Closed

Conversation

betodealmeida
Copy link

Preset.io is a commercial SaaS provider for Apache Superset. This PR introduces a new ingestion source for Preset "workspaces", which are essentially Superset instances. The only difference is the authentication mechanism that needs to be done against an API gateway using a key/secret pair.

I implemented the source as a derived class of SupersetSource in the same file; let me know if it would be better to have it in a separate file.

I added a new integration test, but I haven't tested it end-to-end yet. I'm in the process of setting up a local development to test it with a real workspace. But I thought it would be good to get early feedback in the PR since this is my first time contributing to DataHub.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@vercel
Copy link

vercel bot commented Jun 7, 2023

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 7, 2023
@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jun 23, 2023
@hsheth2
Copy link
Collaborator

hsheth2 commented Jul 5, 2023

@mayurinehate could you take a look at this PR?

Copy link
Collaborator

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @betodealmeida , thanks for the PR.

I'd like to share this guide on adding a metadata ingestion source - https://datahubproject.io/docs/metadata-ingestion/adding-source/ that'll help you through the process. Especially please refer steps 4,5 (making the new plugin discoverable) and 9 (adding new platform).

Also, I have a couple of initial comments on the PR -

  1. Let's move new source to separate file preset.py. (you said it !)
  2. For lint, I believe you'd need to inherit PresetConfig from SupersetConfig. If Preset Config does not need username/password - you can refractor common parts from existing SupersetConfig to BaseSupersetConfig and make SupersetConfig and PresetConfig inherit BaseSupersetConfig.

Lastly, excited to hear results from your end-to-end preset testing when you have them.

self.config = config
self.report = StaleEntityRemovalSourceReport()

login_response = requests.post(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refractor get token piece into separate get_access_token method in SupersetSource and only override create and get_access_token method in PresetSource ?


pipeline.run()
pipeline.raise_from_status()
golden_file = "golden_test_ingest.json"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need separate preset golden file as it will have platform preset in entity urns.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida
Copy link
Author

@mayurinehate could you take a look at this PR?

Let me know if you need help testing end-to-end, I can get you a Preset workspace to run tests. Thank you!

@betodealmeida
Copy link
Author

Thanks for the review, @mayurinehate! I'll address the comments in the next couple days!

@kuhnen
Copy link

kuhnen commented Aug 6, 2023

Hello @betodealmeida, would you have any idea when you might have time to work on it? thanks a lot

@mayurinehate mayurinehate added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Aug 17, 2023
@byndcivilization
Copy link

byndcivilization commented Jun 13, 2024

@betodealmeida @mayurinehate, @hsheth2 is there still work to do on this? We just ran up against this requirement and would be willing to do the last mile work to get this in.

@byndcivilization
Copy link

I think i have a rebased version of this more or less sussed out. I'm hitting a snag testing it out in the local env. the ingest job fails to pull install the preset subpackage becuase its calling to the acryl pip package. It feels like I missed a step or im operating off of old docs. I can push that PR once I get the functionality ready. I've incorporated the suggestions on this PR.

@byndcivilization
Copy link

byndcivilization commented Jun 15, 2024

The class creation more or less mirrors this PR. I did also add the logo and all the registration stuff in the typescript files.

I am using the quickstartDebug method for gradle. on the first connection run im getting this error:

+ pip install 'acryl-datahub[datahub-rest,datahub-kafka,preset]==c6a1571'
ERROR: Ignored the following yanked versions: 0.8.18, 0.8.29.1, 0.8.33.2, 0.8.34, 0.8.44rc0, 0.8.44rc1, 0.8.44.3rc4, 0.9.4, 0.10.5.3
ERROR: Ignored the following versions that require a different python version: 0.8.24.1 Requires-Python >=3.6, <=3.9.9; 0.8.24.2 Requires-Python >=3.6, <=3.9.9; 0.8.24.3 Requires-Python >=3.6, <=3.9.9; 0.8.25 Requires-Python >=3.6, <=3.9.9; 0.8.25.0 Requires-Python >=3.6, <=3.9.9; 0.8.25.1 Requires-Python >=3.6, <=3.9.9; 0.8.25.2 Requires-Python >=3.6, <=3.9.9; 0.8.26.0 Requires-Python >=3.6, <=3.9.9; 0.8.26.1 Requires-Python >=3.6, <=3.9.9; 0.8.26.2 Requires-Python >=3.6, <=3.9.9; 0.8.26.3 Requires-Python >=3.6, <=3.9.9; 0.8.26.4 Requires-Python >=3.6, <=3.9.9; 0.8.26.5 Requires-Python >=3.6, <=3.9.9; 0.8.26.6 Requires-Python >=3.6, <=3.9.9; 0.8.26.7 Requires-Python >=3.6, <=3.9.9; 0.8.26.7rc1 Requires-Python >=3.6, <=3.9.9; 0.8.26.7rc2 Requires-Python >=3.6, <=3.9.9; 0.8.26.8 Requires-Python >=3.6, <=3.9.9; 0.8.26.8rc1 Requires-Python >=3.6, <=3.9.9; 0.8.27 Requires-Python >=3.6, <=3.9.9; 0.8.27.1 Requires-Python >=3.6, <=3.9.9; 0.8.27.1rc1 Requires-Python >=3.6, <=3.9.9; 0.8.27.2 Requires-Python >=3.6, <=3.9.9; 0.8.27.2rc1 Requires-Python >=3.6, <=3.9.9; 0.8.27.2rc2 Requires-Python >=3.6, <=3.9.9; 0.8.27.2rc3 Requires-Python >=3.6, <=3.9.9
ERROR: Could not find a version that satisfies the requirement acryl-datahub==c6a1571 (from versions: 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.4.0, 0.8.1.0, 0.8.1.1, 0.8.1.2, 0.8.3.0, 0.8.3.1, 0.8.3.2, 0.8.3.3, 0.8.4.0, 0.8.5.0, 0.8.5.1, 0.8.5.2, 0.8.6.0, 0.8.6.1, 0.8.6.2, 0.8.6.3, 0.8.6.4, 0.8.6.5, 0.8.7.0, 0.8.8.0, 0.8.8.1, 0.8.8.2, 0.8.8.3, 0.8.8.4, 0.8.9.0, 0.8.10.0, 0.8.10.1, 0.8.10.2, 0.8.11.0, 0.8.11.1, 0.8.12.0, 0.8.13.0, 0.8.13.1, 0.8.14.0, 0.8.14.1, 0.8.14.2, 0.8.15.0, 0.8.15.1, 0.8.15.2, 0.8.15.3, 0.8.15.4, 0.8.15.5, 0.8.15.6, 0.8.15.7, 0.8.15.8, 0.8.15.9, 0.8.15.10, 0.8.16.0, 0.8.16.1, 0.8.16.2, 0.8.16.3, 0.8.16.4, 0.8.16.5, 0.8.16.6, 0.8.16.7, 0.8.16.8, 0.8.16.9, 0.8.16.11, 0.8.16.12, 0.8.17.0, 0.8.17.1, 0.8.17.2, 0.8.17.3, 0.8.17.4, 0.8.17.5, 0.8.17.6, 0.8.17.7, 0.8.18.1, 0.8.19.0, 0.8.19.1, 0.8.20.0, 0.8.21.0, 0.8.22.1, 0.8.23.0, 0.8.23.1, 0.8.24.0, 0.8.28.0rc1, 0.8.28.0, 0.8.28.1, 0.8.29, 0.8.29.2, 0.8.30.0, 0.8.31, 0.8.31.1rc1, 0.8.31.1, 0.8.31.2, 0.8.31.3rc1, 0.8.31.3, 0.8.31.4rc1, 0.8.31.4, 0.8.31.5rc1, 0.8.31.5, 0.8.31.6rc1, 0.8.31.6rc2, 0.8.31.6, 0.8.32rc1, 0.8.32rc2, 0.8.32rc3, 0.8.32rc4, 0.8.32, 0.8.32.1, 0.8.32.2rc1, 0.8.32.2, 0.8.32.3rc1, 0.8.32.3, 0.8.32.4rc1, 0.8.32.4rc2, 0.8.32.4, 0.8.32.5rc1, 0.8.32.5, 0.8.32.6rc2, 0.8.32.6rc3, 0.8.32.6, 0.8.32.7, 0.8.33rc1, 0.8.33, 0.8.33.1, 0.8.33.2rc1, 0.8.33.2rc2, 0.8.33.3rc2, 0.8.33.3rc3, 0.8.33.3, 0.8.34.1rc1, 0.8.34.1rc2, 0.8.34.1rc3, 0.8.34.1, 0.8.34.2rc1, 0.8.34.2rc2, 0.8.34.2rc3, 0.8.34.2rc4, 0.8.34.2, 0.8.34.3rc1, 0.8.35.0rc2, 0.8.35, 0.8.35.1rc1, 0.8.35.1, 0.8.35.2rc1, 0.8.35.2, 0.8.35.3rc1, 0.8.35.3, 0.8.35.4rc1, 0.8.35.4, 0.8.35.5rc1, 0.8.35.5, 0.8.35.6rc1, 0.8.35.6rc2, 0.8.35.6, 0.8.35.7rc1, 0.8.35.7, 0.8.35.8rc1, 0.8.35.8rc2, 0.8.35.8rc3, 0.8.36.0rc0, 0.8.36rc1, 0.8.36, 0.8.36.1rc1, 0.8.36.1rc2, 0.8.36.1rc6, 0.8.36.1rc7, 0.8.36.1rc8, 0.8.36.1rc9, 0.8.36.1rc10, 0.8.37rc0, 0.8.37, 0.8.38, 0.8.38.1rc0, 0.8.38.1rc1, 0.8.38.1, 0.8.38.2rc1, 0.8.38.2, 0.8.38.3rc1, 0.8.38.3, 0.8.38.4rc0, 0.8.38.4rc2, 0.8.38.4rc3, 0.8.38.4, 0.8.38.5rc0, 0.8.38.5, 0.8.39rc0, 0.8.39, 0.8.39.1rc1, 0.8.39.1rc2, 0.8.39.1rc3, 0.8.39.1rc4, 0.8.39.1rc5, 0.8.39.1rc6, 0.8.39.1rc7, 0.8.39.1rc8, 0.8.40rc1, 0.8.40, 0.8.40.1, 0.8.40.2rc0, 0.8.40.2, 0.8.40.3rc0, 0.8.40.3rc1, 0.8.40.3rc2, 0.8.40.3rc3, 0.8.40.3, 0.8.40.4rc1, 0.8.40.4rc2, 0.8.41rc2, 0.8.41, 0.8.41.1rc0, 0.8.41.1rc1, 0.8.41.1rc2, 0.8.41.1rc3, 0.8.41.1rc4, 0.8.41.1, 0.8.41.2rc0, 0.8.41.2rc1, 0.8.41.2, 0.8.41.3rc1, 0.8.41.3rc2, 0.8.41.3rc3, 0.8.42rc1, 0.8.42rc2, 0.8.42, 0.8.43rc2, 0.8.43rc3, 0.8.43rc4, 0.8.43, 0.8.43.1rc0, 0.8.43.1rc1, 0.8.43.1, 0.8.43.2rc0, 0.8.43.2rc1, 0.8.43.2, 0.8.43.3rc0, 0.8.43.3rc1, 0.8.43.3rc2, 0.8.43.3rc3, 0.8.43.3rc5, 0.8.43.3, 0.8.43.4rc1, 0.8.43.4rc2, 0.8.43.4, 0.8.43.5rc1, 0.8.43.5rc2, 0.8.43.5rc3, 0.8.43.5, 0.8.43.6rc0, 0.8.43.6rc1, 0.8.43.6, 0.8.44rc3, 0.8.44rc4, 0.8.44rc5, 0.8.44, 0.8.44.1rc0, 0.8.44.1rc1, 0.8.44.1rc2, 0.8.44.1rc3, 0.8.44.1rc4, 0.8.44.1, 0.8.44.2rc0, 0.8.44.2rc1, 0.8.44.2rc2, 0.8.44.2, 0.8.44.3rc0, 0.8.44.3rc1, 0.8.44.3rc2, 0.8.44.3rc3, 0.8.44.3, 0.8.44.4rc0, 0.8.44.4rc1, 0.8.44.4, 0.8.44.5rc0, 0.8.44.5rc1, 0.8.44.5rc2, 0.8.44.5rc3, 0.8.44.5, 0.8.44.6rc0, 0.8.45rc1, 0.8.45, 0.8.45.1rc0, 0.8.45.1rc2, 0.8.45.1rc3, 0.8.45.1rc4, 0.8.45.1rc5, 0.8.45.1, 0.8.45.2rc0, 0.8.45.2rc1, 0.8.45.2rc2, 0.8.45.2, 0.8.45.3rc0, 0.8.45.3rc1, 0.8.45.3rc2, 0.8.45.3rc3, 0.8.45.3rc4, 0.8.45.3rc5, 0.9.0rc4, 0.9.0rc5, 0.9.0rc6, 0.9.0, 0.9.0.1rc0, 0.9.0.1, 0.9.0.2rc0, 0.9.0.2rc1, 0.9.0.2rc2, 0.9.0.2rc3, 0.9.0.2rc4, 0.9.0.2, 0.9.0.3rc0, 0.9.0.3, 0.9.0.4rc0, 0.9.0.4, 0.9.0.5rc0, 0.9.0.5rc1, 0.9.0.5rc2, 0.9.0.5, 0.9.1rc0, 0.9.1, 0.9.1.1rc0, 0.9.1.1rc1, 0.9.1.1rc2, 0.9.2, 0.9.2.1rc0, 0.9.2.1rc1, 0.9.2.1rc2, 0.9.2.1, 0.9.2.2rc0, 0.9.2.2rc1, 0.9.2.2rc2, 0.9.2.2rc3, 0.9.2.2, 0.9.2.3rc1, 0.9.2.3rc2, 0.9.2.3rc3, 0.9.2.3rc4, 0.9.2.3, 0.9.2.4rc1, 0.9.2.4rc2, 0.9.2.4, 0.9.2.5rc1, 0.9.2.5rc3, 0.9.2.5rc4, 0.9.2.5rc5, 0.9.2.5rc8, 0.9.2.5, 0.9.3, 0.9.3.1rc0, 0.9.3.1rc1, 0.9.3.1, 0.9.3.2rc0, 0.9.3.2rc2, 0.9.3.2rc3, 0.9.3.2, 0.9.3.3rc0, 0.9 [...truncated]
ERROR: No matching distribution found for acryl-datahub==c6a1571

I'm guessing there's maybe an env var or something im supposed to set that overrides this?

subsequent runs faile with:

[2024-06-15 15:29:57,523] ERROR    {datahub.entrypoints:201} - Command failed: Failed to find a registered source for type preset: 'Did not find a registered class for preset'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
    yield
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 223, in __init__
    source_class = source_registry.get(source_type)
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 172, in get
    raise KeyError(f"Did not find a registered class for {key}")
KeyError: 'Did not find a registered class for preset'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 448, in wrapper
    raise e
  File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 397, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
    return func(ctx, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 197, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 167, in run_ingestion_and_check_upgrade
    pipeline = Pipeline.create(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 336, in create
    return cls(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 220, in __init__
    with _add_init_error_context(
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 122, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type preset: 'Did not find a registered class for preset'

@hsheth2
Copy link
Collaborator

hsheth2 commented Jun 21, 2024

@byndcivilization when developing on sources, you'll need to use ingestion from the CLI e.g. datahub ingest -c <recipe.yml> from within the metadata-ingestion venv (see https://datahubproject.io/docs/metadata-ingestion/developing/#set-up-your-python-environment)

@shirshanka
Copy link
Contributor

@coderabbitai full review

@shirshanka shirshanka added the on-deck PR or Issue that will be reviewed and/or addressed by the DataHub Maintainers in future cycles label Jun 28, 2024
@betodealmeida
Copy link
Author

#10954

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata on-deck PR or Issue that will be reviewed and/or addressed by the DataHub Maintainers in future cycles pending-submitter-response Issue/request has been reviewed but requires a response from the submitter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants