
Updated Readme and improved support for DLT #45

Merged
merged 5 commits into main from build on Aug 20, 2024

Conversation

@mwojtyczka (Contributor) commented on Jul 26, 2024:

Changes

  • Updated Readme
  • Minor refactor of the profiler
  • Added functions for filtering data sets

Linked issues

#17
#19

Tests

  • Manually tested
  • Added unit tests
  • Added integration tests

@mwojtyczka requested review from a team and nfx as code owners on July 26, 2024
@mwojtyczka requested review from nehamilak-db and removed the request for a team on July 26, 2024
@mwojtyczka marked this pull request as draft on July 26, 2024
@mwojtyczka (Contributor, Author) commented: Initial commit; the design/how-it-works section is still missing.

@mwojtyczka marked this pull request as ready for review on July 29, 2024
@mwojtyczka changed the title from "Updated Readme" to "Updated Readme and improved support for DLT" on Jul 29, 2024
@mwojtyczka requested a review from alexott on August 8, 2024
@gergo-databricks left a comment: Minor comments on the documentation part below; please extend it!


    return good_df, bad_df


def get_invalid(df: DataFrame) -> DataFrame:
    """
    Get invalid records only (errors and warnings).
    """

Please document what "valid" and "invalid" mean!

@mwojtyczka (Contributor, Author): Added.


def get_valid(df: DataFrame) -> DataFrame:
    """
    Get valid records only (errors only).
    """

Please document that you drop columns!

@mwojtyczka (Contributor, Author): Added.
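For context, a minimal sketch of the semantics settled on above, assuming the reporting columns added by the checks are named `_errors` and `_warnings` (the actual column names in DQX may differ):

from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def get_invalid(df: DataFrame) -> DataFrame:
    # Invalid = rows flagged with errors or warnings; the reporting columns
    # are kept so callers can see why each row was flagged.
    return df.where(F.col("_errors").isNotNull() | F.col("_warnings").isNotNull())


def get_valid(df: DataFrame) -> DataFrame:
    # Valid = rows with no errors (warnings are tolerated); the reporting
    # columns are dropped, per the review comment above.
    return df.where(F.col("_errors").isNull()).drop("_errors", "_warnings")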

Question: Should it not use the standard naming of bronze-silver-gold rather than raw-curated-final?

@mwojtyczka (Contributor, Author) replied on Aug 20, 2024: Agreed, better to use our standard naming convention. Corrected.

README.md Outdated
Fields:
- "criticality": either "error" (data goes only into the "bad/quarantine" dataframe) or "warn" (data goes into both dataframes).
- "check": column expression containing "function" (the check function to apply), "arguments" (the check function's arguments), and "col_name" (the column name, as a string, to apply the check to) or "col_names" (the column names, as an array, to apply the check to).
If "col_names" is provided, the "name" for the check is autogenerated.

Please clarify: Is it not autogenerated otherwise?

@mwojtyczka (Contributor, Author) replied on Aug 20, 2024: Good catch. Removed that note; it is the check "name" that is autogenerated, not the list of columns.
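For illustration, rules written against the fields described above might look like this (a sketch only; the dict-based form and the example column names are assumptions, not taken from the PR):

checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "col1"}},
    },
    {
        # with "col_names", the check "name" is autogenerated
        "criticality": "warn",
        "check": {"function": "is_not_null", "arguments": {"col_names": ["col2", "col3"]}},
    },
]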

README.md Outdated
### Quality rules in DLT (Delta Live Tables)

You can use [expectations](https://docs.databricks.com/en/delta-live-tables/expectations.html) in DLT to define data quality
constraints. However, if you need detailed information about why certain checks failed, you may prefer to use DQX.

Please clarify how this integration works: does it use DLT expectations or not?

@mwojtyczka (Contributor, Author): Added. It does not use DLT expectations; it applies the DQX checks directly.
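A rough sketch of what that can look like inside a DLT pipeline; `apply_checks` and `checks` stand in for the DQX entry point and rule list described in this README, and the table names follow the bronze/silver convention agreed above (import paths are omitted because they are not shown in this PR):

import dlt  # provided by the Delta Live Tables runtime

# apply_checks and checks are placeholders for the DQX checking function and
# rule list described in the README; their import path is not shown in this PR.

@dlt.table
def silver():
    df = dlt.read("bronze")
    # DQX appends its error/warning reporting columns to the DataFrame
    # directly; no DLT expectations are involved.
    return apply_checks(df, checks)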

If a check that you need does not exist yet, it may be sufficient to define it via an SQL expression rule (`sql_expression`).
Alternatively, you can define your own checks: just create a function available from 'globals' and make sure
it returns a `pyspark.sql.Column`. Feel free to submit a PR to the repo so that others can benefit from it as well (see the [contribution guide](#contribution)).

Add a code example here or link the right source file for examples!

@mwojtyczka (Contributor, Author): Added.
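As an illustration of that convention, a custom check could look like the sketch below (hypothetical: the name `not_in_future` is made up, and the assumption that a failing row is signalled by a non-null message in the returned Column follows the pattern of the built-in checks discussed here):

import pyspark.sql.functions as F
from pyspark.sql import Column


def not_in_future(col_name: str) -> Column:
    # Hypothetical custom check: returns a failure message for rows whose
    # timestamp lies in the future, and null for rows that pass.
    return F.when(
        F.col(col_name) > F.current_timestamp(),
        F.lit(f"Column {col_name} is in the future"),
    )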


checks = [
    {
        "check": is_not_null("col1"),
        "criticality": "error",
    },
]

Please refer to the part of the README that defines what is expected in the "function" field!

@mwojtyczka (Contributor, Author): Added.

checks = DQRuleColSet(  # define a rule for multiple columns at once
    columns=["col1", "col2"],
    criticality="error",
    check_func=is_not_null).get_rules() + [
    # ... plus any individually defined rules
]

Please refer to the part of the README that defines what is expected in the "function" field!

@mwojtyczka (Contributor, Author): Added.

Optional: I'm not sure it is a good idea to have two "engine" files in the same project, even if one is under a different dir. It is confusing.

@mwojtyczka (Contributor, Author): Renamed back.

@gergo-databricks left a comment: LGTM

@mwojtyczka merged commit 4944a32 into main on Aug 20, 2024
6 checks passed
@mwojtyczka deleted the build branch on August 20, 2024
Successfully merging this pull request may close these issues:
- [FEATURE]: Improve support for DLT
- [FEATURE]: Update project documentation