Updated Readme and improved support for DLT #45
@@ -169,12 +169,30 @@ def apply_checks_and_split(df: DataFrame, checks: list[DQRule]) -> tuple[DataFra

     checked_df = apply_checks(df, checks)

-    good_df = checked_df.where(F.col(Columns.ERRORS.value).isNull()).drop(Columns.ERRORS.value, Columns.WARNINGS.value)
-    bad_df = checked_df.where(F.col(Columns.ERRORS.value).isNotNull() | F.col(Columns.WARNINGS.value).isNotNull())
+    good_df = get_valid(checked_df)
+    bad_df = get_invalid(checked_df)

     return good_df, bad_df

+
+def get_invalid(df: DataFrame) -> DataFrame:
+    """
+    Get invalid records only, i.e. records that have at least one error or warning.

Review comment: Please document what "valid" and "invalid" mean!
Reply: added

+
+    @param df: input DataFrame
+    @return: DataFrame with the invalid records; the error and warning columns are kept
+    """
+    return df.where(F.col(Columns.ERRORS.value).isNotNull() | F.col(Columns.WARNINGS.value).isNotNull())
+
+
+def get_valid(df: DataFrame) -> DataFrame:
+    """
+    Get valid records only, i.e. records that have no errors (records with warnings only are included).
+    The error and warning columns are dropped from the result.

Review comment: Please document that you drop columns!
Reply: added

+
+    @param df: input DataFrame
+    @return: DataFrame with the valid records, without the error and warning columns
+    """
+    return df.where(F.col(Columns.ERRORS.value).isNull()).drop(Columns.ERRORS.value, Columns.WARNINGS.value)
+
+
 def build_checks_by_metadata(checks: list[dict], glbs: dict[str, Any] | None = None) -> list[DQRule]:
     """Build checks based on check specification, i.e. function name plus arguments.
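To make the "valid" and "invalid" semantics discussed in the review concrete, here is a minimal sketch that models the two filters in plain Python instead of Spark. It is an illustration, not the project's implementation: rows are modelled as dicts, and the column names `_errors` and `_warnings` are hypothetical stand-ins for `Columns.ERRORS.value` and `Columns.WARNINGS.value`, whose actual values are not shown in this diff.

```python
# Plain-Python model of the split; the real functions operate on Spark DataFrames.
# "_errors" and "_warnings" are hypothetical column names used for illustration only.
ERRORS = "_errors"
WARNINGS = "_warnings"


def get_invalid(rows):
    """Rows with at least one error or warning; the result columns are kept."""
    return [r for r in rows if r[ERRORS] is not None or r[WARNINGS] is not None]


def get_valid(rows):
    """Rows without errors; the error and warning columns are dropped."""
    return [
        {k: v for k, v in r.items() if k not in (ERRORS, WARNINGS)}
        for r in rows
        if r[ERRORS] is None
    ]


rows = [
    {"id": 1, ERRORS: None, WARNINGS: None},                # clean row
    {"id": 2, ERRORS: None, WARNINGS: {"col_a": "warn"}},   # warning only
    {"id": 3, ERRORS: {"col_a": "err"}, WARNINGS: None},    # error
]

valid_ids = [r["id"] for r in get_valid(rows)]
invalid_ids = [r["id"] for r in get_invalid(rows)]
print(valid_ids)    # [1, 2]
print(invalid_ids)  # [2, 3]
```

Note that the row with only a warning appears in both outputs: it passes the "no errors" filter and also matches the "errors or warnings" filter. The two results are therefore not a strict partition of the input.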
Review comment: Optional: I'm not sure if it is a good idea to have two "engine" files in the same project, even if it is under a different dir. It is confusing.
Reply: renamed back
Review comment: Question: Should it not use the standard naming of bronze-silver-gold rather than raw-curated-final?
Reply: agree, better to use our standard naming convention
Reply: corrected