[Feature] Update is_(not)_in_range (#87) to support max/min limits from col #153

Open
wants to merge 10 commits into
base: main
40 changes: 20 additions & 20 deletions docs/dqx/docs/reference.mdx
@@ -13,26 +13,26 @@ This page provides a reference for the quality rule functions (checks) available

The following quality rules / functions are currently available:

| Check | Description | Arguments |
| -------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| is_not_null | Check if input column is not null | col_name: column name to check |
| is_not_empty | Check if input column is not empty | col_name: column name to check |
| is_not_null_and_not_empty | Check if input column is not null or empty | col_name: column name to check; trim_strings: boolean flag to trim spaces from strings |
| value_is_in_list | Check if the provided value is present in the input column. | col_name: column name to check; allowed: list of allowed values |
| value_is_not_null_and_is_in_list | Check if provided value is present if the input column is not null | col_name: column name to check; allowed: list of allowed values |
| is_not_null_and_not_empty_array | Check if input array column is not null or empty | col_name: column name to check |
| is_in_range | Check if input column is in the provided range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit; max_limit: max limit |
| is_not_in_range | Check if input column is not within defined range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value |
| not_less_than | Check if input column is not less than the provided limit | col_name: column name to check; limit: limit value |
| not_greater_than | Check if input column is not greater than the provided limit | col_name: column name to check; limit: limit value |
| is_valid_date | Check if input column is a valid date | col_name: column name to check; date_format: date format (e.g. 'yyyy-MM-dd') |
| is_valid_timestamp | Check if input column is a valid timestamp | col_name: column name to check; timestamp_format: timestamp format (e.g. 'yyyy-MM-dd HH:mm:ss') |
| not_in_future | Check if input column defined as date is not in the future (future defined as current_timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
| not_in_near_future | Check if input column defined as date is not in the near future (near future defined as greater than current timestamp but less than current timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
| is_older_than_n_days | Check if input column is older than n number of days | col_name: column name to check; days: number of days; curr_date: current date, if not provided current_date() function is used |
| is_older_than_col2_for_n_days | Check if one column is not older than another column by n number of days | col_name1: first column name to check; col_name2: second column name to check; days: number of days |
| regex_match | Check if input column matches a given regex | col_name: column name to check; regex: regex to check; negate: if the condition should be negated (true) or not |
| sql_expression | Check if input column matches the provided SQL expression, e.g. a = 'str1', a > b | expression: sql expression to check; msg: optional message to output; name: optional name of the resulting column; negate: if the condition should be negated |
| Check | Description | Arguments |
| -------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| is_not_null | Check if input column is not null | col_name: column name to check |
| is_not_empty | Check if input column is not empty | col_name: column name to check |
| is_not_null_and_not_empty | Check if input column is not null or empty | col_name: column name to check; trim_strings: boolean flag to trim spaces from strings |
| value_is_in_list | Check if the provided value is present in the input column. | col_name: column name to check; allowed: list of allowed values |
| value_is_not_null_and_is_in_list | Check if provided value is present if the input column is not null | col_name: column name to check; allowed: list of allowed values |
| is_not_null_and_not_empty_array | Check if input array column is not null or empty | col_name: column name to check |
| is_in_range | Check if input column is in the provided range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value; min_limit_col_expr: min limit column name or expr; max_limit_col_expr: max limit column name or expr |
| is_not_in_range | Check if input column is not within defined range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value; min_limit_col_expr: min limit column name or expr; max_limit_col_expr: max limit column name or expr |
| not_less_than | Check if input column is not less than the provided limit | col_name: column name to check; limit: limit value |
| not_greater_than | Check if input column is not greater than the provided limit | col_name: column name to check; limit: limit value |
| is_valid_date | Check if input column is a valid date | col_name: column name to check; date_format: date format (e.g. 'yyyy-MM-dd') |
| is_valid_timestamp | Check if input column is a valid timestamp | col_name: column name to check; timestamp_format: timestamp format (e.g. 'yyyy-MM-dd HH:mm:ss') |
| not_in_future | Check if input column defined as date is not in the future (future defined as current_timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
| not_in_near_future | Check if input column defined as date is not in the near future (near future defined as greater than current timestamp but less than current timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
| is_older_than_n_days | Check if input column is older than n number of days | col_name: column name to check; days: number of days; curr_date: current date, if not provided current_date() function is used |
| is_older_than_col2_for_n_days | Check if one column is not older than another column by n number of days | col_name1: first column name to check; col_name2: second column name to check; days: number of days |
| regex_match | Check if input column matches a given regex | col_name: column name to check; regex: regex to check; negate: if the condition should be negated (true) or not |
| sql_expression | Check if input column matches the provided SQL expression, e.g. a = 'str1', a > b | expression: sql expression to check; msg: optional message to output; name: optional name of the resulting column; negate: if the condition should be negated |

You can check implementation details of the rules [here](https://github.com/databrickslabs/dqx/blob/main/src/databricks/labs/dqx/col_functions.py).
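
As a rough illustration of the two new arguments, the checks below are written as plain Python dictionaries. The `criticality` / `check` / `function` / `arguments` layout, the criticality values, and the column names (`price`, `min_price`, `max_price`, `discount`) are assumptions made for this sketch; only the argument names come from the table above. The hypothetical `bound_sources` helper just reports where each bound would come from.

```python
# Hypothetical check definitions: the dictionary layout and the column names
# are assumptions for illustration; only the argument names come from the table.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_in_range",
            "arguments": {
                "col_name": "price",
                # Both bounds are read from other columns of the same row.
                "min_limit_col_expr": "min_price",
                "max_limit_col_expr": "max_price",
            },
        },
    },
    {
        "criticality": "warn",
        "check": {
            "function": "is_in_range",
            "arguments": {
                "col_name": "discount",
                # A literal lower bound combined with a column upper bound.
                "min_limit": 0,
                "max_limit_col_expr": "price",
            },
        },
    },
]


def bound_sources(arguments: dict) -> dict:
    """Report whether each bound comes from a literal value or a column/expression."""
    sources = {}
    for bound in ("min", "max"):
        if arguments.get(f"{bound}_limit_col_expr") is not None:
            sources[bound] = "column"
        elif arguments.get(f"{bound}_limit") is not None:
            sources[bound] = "literal"
        else:
            sources[bound] = "missing"
    return sources


for item in checks:
    check = item["check"]
    print(check["function"], bound_sources(check["arguments"]))
```

Each bound needs at least one source; a bound reported as `missing` here corresponds to the case the helper in `col_functions.py` rejects with a `ValueError`.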

70 changes: 54 additions & 16 deletions src/databricks/labs/dqx/col_functions.py
@@ -281,20 +281,53 @@ def not_greater_than(col_name: str, limit: int | datetime.date | datetime.dateti
    )


def _get_min_max_column_expr(
    min_limit: int | datetime.date | datetime.datetime | str | None = None,
    max_limit: int | datetime.date | datetime.datetime | str | None = None,
    min_limit_col_expr: str | Column | None = None,
    max_limit_col_expr: str | Column | None = None,
) -> tuple[Column, Column]:
    """Helper function to create a condition for the is_(not)_in_range functions.

    :param min_limit: min limit value
    :param max_limit: max limit value
    :param min_limit_col_expr: min limit column name or expr
    :param max_limit_col_expr: max limit column name or expr
    :return: tuple containing min_limit_expr and max_limit_expr
    :raises: ValueError when both min_limit/min_limit_col_expr or max_limit/max_limit_col_expr are null
    """
    if (min_limit is None and min_limit_col_expr is None) or (max_limit is None and max_limit_col_expr is None):
        raise ValueError('Either min_limit / min_limit_col_expr or max_limit / max_limit_col_expr is empty')
    if min_limit_col_expr is None:
        min_limit_expr = F.lit(min_limit)
    else:
        min_limit_expr = F.col(min_limit_col_expr) if isinstance(min_limit_col_expr, str) else min_limit_col_expr
    if max_limit_col_expr is None:
        max_limit_expr = F.lit(max_limit)
    else:
        max_limit_expr = F.col(max_limit_col_expr) if isinstance(max_limit_col_expr, str) else max_limit_col_expr
Comment on lines +301 to +308
Contributor

We need to raise an error if both `min_limit` and `min_limit_col_expr` are None (same for `max_limit` and `max_limit_col_expr`).

Otherwise we'll get conditions like `(F.col(col_name) < F.lit(None)) | (F.col(col_name) > F.lit(None))`
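
The concern can be sketched in plain Python, with `None` standing in for SQL NULL. The function names below are hypothetical and only model SQL's three-valued logic; this is not the DQX or PySpark API. With both limits missing, the condition evaluates to NULL rather than TRUE for every value, so a check that reports rows only when the condition is TRUE would presumably flag nothing.

```python
from typing import Optional


def sql_lt(a, b) -> Optional[bool]:
    """a < b under SQL semantics: NULL (None) if either operand is NULL."""
    return None if a is None or b is None else a < b


def sql_gt(a, b) -> Optional[bool]:
    """a > b under SQL semantics: NULL (None) if either operand is NULL."""
    return None if a is None or b is None else a > b


def sql_or(p: Optional[bool], q: Optional[bool]) -> Optional[bool]:
    """SQL OR: TRUE if either side is TRUE, otherwise NULL if either side is NULL."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False


def out_of_range(value, min_limit, max_limit) -> Optional[bool]:
    """Models (col < min) | (col > max) for a single value."""
    return sql_or(sql_lt(value, min_limit), sql_gt(value, max_limit))


print(out_of_range(5, None, None))  # None: both comparisons are NULL
print(out_of_range(5, 0, None))     # None: FALSE OR NULL is NULL
print(out_of_range(-1, 0, None))    # True: the lower bound alone decides
print(out_of_range(5, 0, 10))       # False: value is inside the range
```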

    return (min_limit_expr, max_limit_expr)
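
The resolution order in the helper above can be modelled per row in plain Python. This is a minimal sketch under assumptions: `resolve_limit` and the sample row are hypothetical, a dictionary stands in for a DataFrame row, and only plain column names are modelled (not `Column` expressions). It shows that a column reference takes precedence over a literal when both are given, and that a bound with neither source is rejected.

```python
def resolve_limit(row: dict, limit=None, limit_col=None):
    """Return the bound for one row.

    Mirrors the helper's branching: use the literal only when no column
    is given; otherwise read the bound from the named column.
    """
    if limit is None and limit_col is None:
        raise ValueError("either a literal limit or a limit column is required")
    return limit if limit_col is None else row[limit_col]


# Hypothetical sample row.
row = {"price": 15, "min_price": 10, "max_price": 20}

print(resolve_limit(row, limit=0))                         # 0: literal bound
print(resolve_limit(row, limit_col="max_price"))           # 20: bound read from the row
print(resolve_limit(row, limit=0, limit_col="min_price"))  # 10: the column wins over the literal

try:
    resolve_limit(row)
except ValueError as err:
    print("rejected:", err)
```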


def is_in_range(
    col_name: str,
    min_limit: int | datetime.date | datetime.datetime,
    max_limit: int | datetime.date | datetime.datetime,
    min_limit: int | datetime.date | datetime.datetime | str | None = None,
    max_limit: int | datetime.date | datetime.datetime | str | None = None,
    min_limit_col_expr: str | Column | None = None,
    max_limit_col_expr: str | Column | None = None,
) -> Column:
    """Creates a condition column that checks if a value is smaller than min limit or greater than max limit.

    :param col_name: column name
    :param min_limit: min limit
    :param max_limit: max limit
    :param min_limit: min limit value
    :param max_limit: max limit value
    :param min_limit_col_expr: min limit column name or expr
    :param max_limit_col_expr: max limit column name or expr
    :return: new Column
    """
    min_limit_expr = F.lit(min_limit)
    max_limit_expr = F.lit(max_limit)
    min_limit_expr, max_limit_expr = _get_min_max_column_expr(
        min_limit, max_limit, min_limit_col_expr, max_limit_col_expr
    )
    condition = (F.col(col_name) < min_limit_expr) | (F.col(col_name) > max_limit_expr)

    return make_condition(
@@ -304,9 +337,9 @@ def is_in_range(
            F.lit("Value"),
            F.col(col_name),
            F.lit("not in range: ["),
            F.lit(min_limit).cast("string"),
            min_limit_expr.cast("string"),
            F.lit(","),
            F.lit(max_limit).cast("string"),
            max_limit_expr.cast("string"),
            F.lit("]"),
        ),
        f"{col_name}_not_in_range",
@@ -315,18 +348,23 @@ def is_in_range(

def is_not_in_range(
    col_name: str,
    min_limit: int | datetime.date | datetime.datetime,
    max_limit: int | datetime.date | datetime.datetime,
    min_limit: int | datetime.date | datetime.datetime | str | None = None,
    max_limit: int | datetime.date | datetime.datetime | str | None = None,
    min_limit_col_expr: str | Column | None = None,
    max_limit_col_expr: str | Column | None = None,
) -> Column:
    """Creates a condition column that checks if a value is within min and max limits.

    :param col_name: column name
    :param min_limit: min limit
    :param max_limit: max limit
    :param min_limit: min limit value
    :param max_limit: max limit value
    :param min_limit_col_expr: min limit column name or expr
    :param max_limit_col_expr: max limit column name or expr
    :return: new Column
    """
    min_limit_expr = F.lit(min_limit)
    max_limit_expr = F.lit(max_limit)
    min_limit_expr, max_limit_expr = _get_min_max_column_expr(
        min_limit, max_limit, min_limit_col_expr, max_limit_col_expr
    )
    condition = (F.col(col_name) > min_limit_expr) & (F.col(col_name) < max_limit_expr)

    return make_condition(
@@ -336,9 +374,9 @@ def is_not_in_range(
            F.lit("Value"),
            F.col(col_name),
            F.lit("in range: ["),
            F.lit(min_limit).cast("string"),
            min_limit_expr.cast("string"),
            F.lit(","),
            F.lit(max_limit).cast("string"),
            max_limit_expr.cast("string"),
            F.lit("]"),
        ),
        f"{col_name}_in_range",
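
To see how the two conditions in this diff treat boundary values, here is a plain-Python model of the comparisons exactly as written above. The function names are hypothetical, NULL handling is ignored, and the limits 10 and 20 are arbitrary sample values. Each function returns True when the corresponding check would report the value.

```python
def fails_is_in_range(value, min_limit, max_limit) -> bool:
    """Mirrors (col < min) | (col > max): reported when outside the range."""
    return value < min_limit or value > max_limit


def fails_is_not_in_range(value, min_limit, max_limit) -> bool:
    """Mirrors (col > min) & (col < max) as written in the diff (strict comparisons)."""
    return value > min_limit and value < max_limit


for value in (9, 10, 15, 20, 21):
    print(
        value,
        fails_is_in_range(value, 10, 20),
        fails_is_not_in_range(value, 10, 20),
    )
# 9 True False
# 10 False False
# 15 False True
# 20 False False
# 21 True False
```

Under this model the boundary values 10 and 20 are reported by neither check: `is_in_range` accepts them because its range is inclusive, and `is_not_in_range` accepts them because its comparisons are strict. The reference table describes both checks as inclusive of both boundaries, so it may be worth confirming the intended behaviour of `is_not_in_range` at the limits.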