Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest): Couchbase ingestion source #12345

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

mminichino
Copy link

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX community-contribution PR or Issue raised by member(s) of DataHub Community labels Jan 15, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Jan 15, 2025
Copy link

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 82.59304% with 145 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../ingestion/source/couchbase/couchbase_profiling.py 77.09% 41 Missing ⚠️
...ngestion/source/couchbase/couchbase_data_reader.py 0.00% 35 Missing ⚠️
...on/src/datahub/ingestion/source/couchbase/retry.py 69.84% 19 Missing ⚠️
...ub/ingestion/source/couchbase/couchbase_connect.py 87.68% 17 Missing ⚠️
...tahub/ingestion/source/couchbase/couchbase_main.py 89.47% 16 Missing ⚠️
.../ingestion/source/couchbase/couchbase_aggregate.py 82.92% 14 Missing ⚠️
.../ingestion/source/couchbase/couchbase_kv_schema.py 97.95% 2 Missing ⚠️
...estion/source/couchbase/couchbase_schema_reader.py 95.00% 1 Missing ⚠️
Files with missing lines Coverage Δ
...b-react/src/app/ingest/source/builder/constants.ts 100.00% <100.00%> (ø)
...hub/ingestion/source/couchbase/couchbase_common.py 100.00% <100.00%> (ø)
...atahub/ingestion/source/couchbase/couchbase_sql.py 100.00% <100.00%> (ø)
...estion/source/couchbase/couchbase_schema_reader.py 95.00% <95.00%> (ø)
.../ingestion/source/couchbase/couchbase_kv_schema.py 97.95% <97.95%> (ø)
.../ingestion/source/couchbase/couchbase_aggregate.py 82.92% <82.92%> (ø)
...tahub/ingestion/source/couchbase/couchbase_main.py 89.47% <89.47%> (ø)
...ub/ingestion/source/couchbase/couchbase_connect.py 87.68% <87.68%> (ø)
...on/src/datahub/ingestion/source/couchbase/retry.py 69.84% <69.84%> (ø)
...ngestion/source/couchbase/couchbase_data_reader.py 0.00% <0.00%> (ø)
... and 1 more

... and 3 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5309ae0...2cc5ba5. Read the comment docs.

@mminichino mminichino marked this pull request as ready for review January 22, 2025 23:42
@mminichino mminichino requested a review from hsheth2 January 22, 2025 23:42
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some initial comments on the code

It feels like some features here were added just because they exist for other sources e.g. profiling/classification/domain mapping. Are those strictly necessary for what we're trying to accomplish with this source? Are the implementations sufficiently performant to work at scale?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just use tenacity or another library for this?

samples: List[Any] = []


def json_schema(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have a number of utilities for doing "schema inference" from objects - e.g.

Does this stuff still make sense?

@@ -147,6 +150,7 @@ export const PLATFORM_URN_TO_LOGO = {
[BIGQUERY_URN]: bigqueryLogo,
[CLICKHOUSE_URN]: clickhouseLogo,
[COCKROACHDB_URN]: cockroachdbLogo,
[COUCHBASE_URN]: couchbaseLogo,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, I prefer to have the UI forms get added in a separate PR.

Small PRs are easier to review - and in this case, I like to have the UI changes reviewed by someone who does more frontend dev than I do

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this docker compose and entrypoint file seem more complex than I would have expected

Usually docker containers come with good defaults, which means our docker setups are very simple

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like duplicate files here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first source that really uses async

In general, we should almost never be using "bare" asyncio. instead, the anyio library is preferred. We also should not really have references to the event loop all over the place - that tends to be an anti-pattern. Methods that need an event loop should be async. For cpu-bound operations, we can use the asyncer library to bridge between asyncio tasks and worker threads.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants