Updated Readme and improved support for DLT #45
Conversation
Initial commit, missing design/how it works
There are minor comments about the documentation part. Please extend it!
src/databricks/labs/dqx/engine.py (Outdated)

```python
    return good_df, bad_df


def get_invalid(df: DataFrame) -> DataFrame:
    """
    Get invalid records only (errors and warnings).
```
Please document what "valid" and "invalid" mean!
added
src/databricks/labs/dqx/engine.py (Outdated)

```python
def get_valid(df: DataFrame) -> DataFrame:
    """
    Get valid records only (errors only)
```
Please document that you drop columns!
added
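A hedged pure-Python analogue of the semantics discussed in these two threads (an illustration only, not the actual DQX implementation, which operates on pyspark DataFrames): `get_invalid` keeps rows that carry errors or warnings, while `get_valid` keeps rows without errors and drops the result columns. The `_errors`/`_warnings` column names are assumptions for this sketch.

```python
# Illustrative analogue only; DQX's get_valid/get_invalid work on pyspark
# DataFrames with dedicated result columns. The "_errors"/"_warnings"
# names below are assumed for the sake of a self-contained sketch.
def get_invalid(rows):
    # Invalid records: anything flagged with errors or warnings.
    return [r for r in rows if r.get("_errors") or r.get("_warnings")]


def get_valid(rows):
    # Valid records: no errors (warnings allowed); result columns are dropped.
    return [
        {k: v for k, v in r.items() if k not in ("_errors", "_warnings")}
        for r in rows
        if not r.get("_errors")
    ]


rows = [
    {"id": 1, "_errors": None, "_warnings": None},
    {"id": 2, "_errors": "col1 is null", "_warnings": None},
    {"id": 3, "_errors": None, "_warnings": "col2 looks odd"},
]
```

This also makes the "drop columns" behavior explicit: the valid output no longer carries the check-result columns.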
docs/dqx_lakehouse.png (Outdated)
Question: Should it not use the standard naming of bronze-silver-gold rather than raw-curated-final?
agree, better to use our standard naming convention
corrected
README.md (Outdated)

Fields:
- "criticality": either "error" (data going only into the "bad/quarantine" dataframe) or "warn" (data going into both dataframes).
- "check": column expression containing "function" (check function to apply), "arguments" (check function arguments), and "col_name" (column name as str to apply the check for) or "col_names" (column names as array to apply the check for).
  If "col_names" is provided, the "name" for the check is autogenerated.
Please clarify: Is it not autogenerated otherwise?
good catch, removed the note about the col_names being autogenerated. The name is autogenerated, not the list of columns.
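For illustration, a check definition following the fields described above might look like the sketch below. This is a hedged example: the field names come from the README excerpt in this PR, while `is_not_null` is written here as a stand-in (in DQX, check functions return a `pyspark.sql.Column`; a plain dict keeps the sketch self-contained).

```python
# Illustrative only: mirrors the fields described in the README excerpt.
def is_not_null(col_name):
    # Stand-in for the DQX check function; real check functions return
    # a pyspark.sql.Column expression.
    return {"function": "is_not_null", "arguments": {"col_name": col_name}}


checks = [
    {
        "criticality": "error",  # failing rows go only to the bad/quarantine dataframe
        "check": is_not_null("col1"),
        "name": "col1_is_not_null",  # explicit name; autogenerated when omitted
    },
    {
        "criticality": "warn",  # failing rows go into both dataframes
        "check": is_not_null("col2"),
    },
]
```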
README.md (Outdated)

### Quality rules in DLT (Delta Live Tables)

You can use [expectations](https://docs.databricks.com/en/delta-live-tables/expectations.html) in DLT to define data quality
constraints. However, if you require detailed info on why certain checks failed, you may prefer to use DQX.
Please clarify: how does this integration work? Is it using DLT expectations or not?
Added. It does not use expectations but the DQX checks directly.
If a check that you need does not exist yet, it may be sufficient to define it via a SQL expression rule (`sql_expression`).
Alternatively, you can define your own checks. Just create a function available from 'globals', and make sure
it returns `pyspark.sql.Column`. Feel free to submit a PR to the repo so that others can benefit from it as well (see [contribution guide](#contribution)).
Add a code example here or link the right source file for examples!
Added
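As a hedged analogue of the custom-check idea above (runnable without a Spark session): a custom check is just a function that, given a column name, yields a per-row condition. In DQX the function must return a `pyspark.sql.Column`; here a plain predicate stands in, and `value_is_positive` is a hypothetical check invented for this sketch.

```python
# Illustrative pure-Python analogue of a custom check. In DQX, the
# function would build and return a pyspark.sql.Column (e.g. using
# pyspark.sql.functions); a predicate over dict rows stands in here.
def value_is_positive(col_name):
    # Hypothetical custom check: passes rows where col_name is a value > 0.
    def condition(row):
        value = row.get(col_name)
        return value is not None and value > 0

    return condition


rows = [{"amount": 5}, {"amount": -1}, {"amount": None}]
check = value_is_positive("amount")
valid = [r for r in rows if check(r)]
invalid = [r for r in rows if not check(r)]
```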
```python
checks = [
    {
        "check": is_not_null("col1"),
```
Please refer to the section of the README that defines what is expected in the "function" part!
Added
```python
checks = DQRuleColSet(  # define rule for multiple columns at once
    columns=["col1", "col2"],
    criticality="error",
    check_func=is_not_null).get_rules() + [
```
Please refer to the section of the README that defines what is expected in the "function" part!
Added
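The `DQRuleColSet` snippet above expands one rule over several columns at once. A hedged sketch of that expansion follows; the class and method names come from the diff excerpt, while the internals (and the `expand_rule_set` helper name) are assumptions for illustration.

```python
# Illustrative analogue of expanding one rule over many columns, as
# DQRuleColSet(...).get_rules() appears to do in this PR; the internals
# below are assumed, not the actual implementation.
def is_not_null(col_name):
    # Stand-in for the DQX check function referenced in the diff.
    return {"function": "is_not_null", "arguments": {"col_name": col_name}}


def expand_rule_set(columns, criticality, check_func):
    # Produce one rule per column, mirroring the single-column rule shape,
    # with an autogenerated name per column.
    return [
        {
            "criticality": criticality,
            "check": check_func(col),
            "name": f"{col}_{check_func.__name__}",
        }
        for col in columns
    ]


rules = expand_rule_set(["col1", "col2"], "error", is_not_null)
```

The resulting list can then be concatenated with hand-written single-column rules, as the snippet above does with `+ [...]`.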
Optional: I'm not sure if it is a good idea to have two "engine" files in the same project, even if it is under a different dir. It is confusing.
renamed back
LGTM
Changes
Linked issues
#17
#19
Tests