
Updated Readme and improved support for DLT #45

Merged
merged 5 commits into main from build on Aug 20, 2024

Conversation

@mwojtyczka (Contributor) commented on Jul 26, 2024:

Changes

  • Updated Readme
  • Minor refactor of the profiler
  • Added functions for filtering data sets

Linked issues

#17
#19

Tests

  • Manually tested
  • Added unit tests
  • Added integration tests

@mwojtyczka requested review from a team and nfx as code owners on July 26, 2024
@mwojtyczka requested review from nehamilak-db and removed the request for a team on July 26, 2024
@mwojtyczka marked this pull request as draft on July 26, 2024
@mwojtyczka (Contributor, Author) commented: Initial commit; the design/how-it-works section is still missing.

@mwojtyczka marked this pull request as ready for review on July 29, 2024
@mwojtyczka changed the title from "Updated Readme" to "Updated Readme and improved support for DLT" on Jul 29, 2024
@mwojtyczka requested a review from alexott on August 8, 2024
@gergo-databricks left a comment: Minor comments on the documentation part below; please extend it!


    return good_df, bad_df


def get_invalid(df: DataFrame) -> DataFrame:
    """
    Get invalid records only (errors and warnings).
    """

Please document what "valid" and "invalid" mean!

@mwojtyczka (Contributor, Author): Added.


def get_valid(df: DataFrame) -> DataFrame:
    """
    Get valid records only (errors only).
    """

Please document that you drop columns!

@mwojtyczka (Contributor, Author): Added.
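For context, a minimal sketch of the semantics settled on above, assuming the reporting columns added by the checks are named `_errors` and `_warnings` (the actual column names in DQX may differ):

from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def get_invalid(df: DataFrame) -> DataFrame:
    # Invalid = rows flagged with errors or warnings; the reporting columns
    # are kept so callers can see why each row was flagged.
    return df.where(F.col("_errors").isNotNull() | F.col("_warnings").isNotNull())


def get_valid(df: DataFrame) -> DataFrame:
    # Valid = rows with no errors (warnings are tolerated); the reporting
    # columns are dropped, per the review comment above.
    return df.where(F.col("_errors").isNull()).drop("_errors", "_warnings")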

Question: Should it not use the standard naming of bronze-silver-gold rather than raw-curated-final?

@mwojtyczka (Contributor, Author) replied on Aug 20, 2024: Agreed, better to use our standard naming convention. Corrected.

README.md Outdated
Fields:
- "criticality": either "error" (data goes only into the "bad/quarantine" dataframe) or "warn" (data goes into both dataframes).
- "check": column expression containing "function" (the check function to apply), "arguments" (the check function's arguments), and "col_name" (the column name, as a string, to apply the check to) or "col_names" (the column names, as an array, to apply the check to).
If "col_names" is provided, the "name" for the check is autogenerated.

Please clarify: Is it not autogenerated otherwise?

@mwojtyczka (Contributor, Author) replied on Aug 20, 2024: Good catch. Removed that note; it is the check "name" that is autogenerated, not the list of columns.
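For illustration, rules written against the fields described above might look like this (a sketch only; the dict-based form and the example column names are assumptions, not taken from the PR):

checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "col1"}},
    },
    {
        # with "col_names", the check "name" is autogenerated
        "criticality": "warn",
        "check": {"function": "is_not_null", "arguments": {"col_names": ["col2", "col3"]}},
    },
]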

README.md Outdated
### Quality rules in DLT (Delta Live Tables)

You can use [expectations](https://docs.databricks.com/en/delta-live-tables/expectations.html) in DLT to define data quality
constraints. However, if you need detailed information about why certain checks failed, you may prefer to use DQX.

Please clarify how this integration works: does it use DLT expectations or not?

@mwojtyczka (Contributor, Author): Added. It does not use DLT expectations; it applies the DQX checks directly.
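A rough sketch of what that can look like inside a DLT pipeline; `apply_checks` and `checks` stand in for the DQX entry point and rule list described in this README, and the table names follow the bronze/silver convention agreed above (import paths are omitted because they are not shown in this PR):

import dlt  # provided by the Delta Live Tables runtime

# apply_checks and checks are placeholders for the DQX checking function and
# rule list described in the README; their import path is not shown in this PR.

@dlt.table
def silver():
    df = dlt.read("bronze")
    # DQX appends its error/warning reporting columns to the DataFrame
    # directly; no DLT expectations are involved.
    return apply_checks(df, checks)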

If a check that you need does not exist yet, it may be sufficient to define it via an SQL expression rule (`sql_expression`).
Alternatively, you can define your own checks: just create a function available from 'globals' and make sure
it returns a `pyspark.sql.Column`. Feel free to submit a PR to the repo so that others can benefit from it as well (see the [contribution guide](#contribution)).

Add a code example here or link the right source file for examples!

@mwojtyczka (Contributor, Author): Added.
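As an illustration of that convention, a custom check could look like the sketch below (hypothetical: the name `not_in_future` is made up, and the assumption that a failing row is signalled by a non-null message in the returned Column follows the pattern of the built-in checks discussed here):

import pyspark.sql.functions as F
from pyspark.sql import Column


def not_in_future(col_name: str) -> Column:
    # Hypothetical custom check: returns a failure message for rows whose
    # timestamp lies in the future, and null for rows that pass.
    return F.when(
        F.col(col_name) > F.current_timestamp(),
        F.lit(f"Column {col_name} is in the future"),
    )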


checks = [
    {
        "check": is_not_null("col1"),
        "criticality": "error",
    },
]

Please refer to the part of the README that defines what is expected in the "function" field!

@mwojtyczka (Contributor, Author): Added.

checks = DQRuleColSet(  # define a rule for multiple columns at once
    columns=["col1", "col2"],
    criticality="error",
    check_func=is_not_null).get_rules() + [
    # ... plus any individually defined rules
]

Please refer to the part of the README that defines what is expected in the "function" field!

@mwojtyczka (Contributor, Author): Added.

Optional: I'm not sure it is a good idea to have two "engine" files in the same project, even if one is under a different dir. It is confusing.

@mwojtyczka (Contributor, Author): Renamed back.

@gergo-databricks left a comment: LGTM

@mwojtyczka merged commit 4944a32 into main on Aug 20, 2024
6 checks passed
@mwojtyczka deleted the build branch on August 20, 2024
Successfully merging this pull request may close these issues:
- [FEATURE]: Improve support for DLT
- [FEATURE]: Update project documentation