Issue: Soda Checks Cache Issue When Running Programmatically
I need to programmatically run Soda checks over different Soda check YAML files to identify the source data type from a finite list of possible data sources. I don't want to execute all of them simultaneously because I am interested in finding which Soda check will pass. The passing YAML file will indicate which data source I am dealing with.
Example YAML Files:
datasource1_definition_check.yml
datasource2_definition_check.yml
Problem Description:
When I try to run YAML checks in a loop, an issue arises. For example, if I am trying to identify datasource2, my logic dictates that when I execute the Soda check with datasource2_definition_check.yml, it should pass. However, when I first execute a scan with datasource1_definition_check.yml (which fails as expected) and then follow it with datasource2_definition_check.yml, it appears that there is a cached state. My logs indicate that datasource2_definition_check.yml has also failed. Upon closer inspection, the logs show that datasource2_definition_check.yml has used the previous checks.
Steps to Reproduce:
1. Prepare the YAML check files (datasource1_definition_check.yml and datasource2_definition_check.yml).
2. Run the provided script to execute the checks sequentially.
3. Observe that the second check seems to use a cached state from the first check.
Expected Behavior:
The second check (datasource2_definition_check.yml) should pass independently of the first check's result.
Actual Behavior:
The second check fails, indicating that it uses the cached state from the first check. The first file contains 4 checks in total; the second contains 5 schema checks (4 fail and 1 warn). After the second loop iteration I expect to see 4 pass and 1 warning, but the logs show this:
...
INFO | Oops! 4 errors. 0 failure. 1 warning. 4 pass.
Has check fails: False
Failed check for Data source 2 with file:
This is clearly incorrect, as the second execution should not include the outcome of the first check.
Additional Context:
This issue might be related to the caching mechanism within the Soda scan library. Clearing or isolating the state between scans could resolve the issue.
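One way to isolate state between scans is to build a brand-new Scan object for every YAML file instead of reusing one. The sketch below assumes the original script shared a single Scan instance across loop iterations, which the issue does not confirm. The helper names make_soda_scan and find_passing_check_file, and the data source name and configuration path passed to them, are hypothetical; the Scan methods used are the ones soda-core exposes for programmatic scans.

```python
from typing import Callable, Iterable, Optional


def make_soda_scan(data_source_name: str, config_file: str):
    """Build a fresh soda-core Scan (hypothetical helper).

    The import is done lazily so this sketch can be loaded without soda-core.
    """
    from soda.scan import Scan  # requires soda-core to be installed

    scan = Scan()
    scan.set_data_source_name(data_source_name)
    scan.add_configuration_yaml_file(file_path=config_file)
    return scan


def find_passing_check_file(
    check_files: Iterable[str],
    new_scan: Callable[[], object],
) -> Optional[str]:
    """Return the first check file whose scan has no failures and no errors."""
    for path in check_files:
        # A new Scan per file: checks added in an earlier iteration
        # cannot leak into this one.
        scan = new_scan()
        scan.add_sodacl_yaml_file(path)
        scan.execute()
        if not scan.has_check_fails() and not scan.has_error_logs():
            return path
    return None
```

With soda-core installed, the factory could be passed as `lambda: make_soda_scan("my_datasource", "configuration.yml")`, where both arguments are placeholders for your own setup. Checking has_error_logs() alongside has_check_fails() is a design choice here, so that scans whose checks could not be evaluated are not counted as passing.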
jancakst changed the title from "When iterating over multiple soda check files soda Scan remembers old executions and makes checks fail despite che" to "Soda Checks Caches results for independent checks When Running Programmatically when it shouldnt" on Aug 8, 2024
I want to check whether any check has failed when I perform a schema validation that verifies that particular columns exist. For some checks, the tables may not even be present. In the logs this produces the following:
[13:24:47] Metrics 'schema' were not computed for check 'schema'
Given that the table wasn't present, I would expect the scan.has_check_fails() method to return True, but it does not. Even though those checks failed because the table was not present, it still returns False. On closer inspection, I noticed that if the table name is not present, the following
for check in scan._checks:
    print(check.outcome)
produces None.
When you look at the implementation of the has_check_fails() method, it checks only for the presence of CheckOutcome.FAIL and does not account for an outcome of None.
Because of this, has_check_fails() incorrectly reports that the checks passed, although they were never evaluated due to the missing tables. The logs clearly state that errors occurred (see below), but this method states the opposite:
INFO | 4 checks not evaluated.
INFO | 4 errors.
INFO | Oops! 4 errors. 0 failures. 0 warnings. 0 pass.
This error can be easily fixed by modifying the method to something like this:
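A minimal sketch of that kind of change, written as a standalone helper so it runs without soda-core installed. The CheckOutcome enum and Check dataclass here are simplified stand-ins rather than the library's own classes, and has_failed_or_unevaluated_checks is a hypothetical name; the only idea taken from the description above is to treat an outcome of None the same way as CheckOutcome.FAIL.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class CheckOutcome(Enum):
    """Stand-in for soda's CheckOutcome, defined here so the sketch is self-contained."""
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class Check:
    """Hypothetical minimal check: outcome stays None when it was never evaluated."""
    name: str
    outcome: Optional[CheckOutcome] = None


def has_failed_or_unevaluated_checks(checks: Iterable[Check]) -> bool:
    """Return True if any check failed or was not evaluated at all."""
    return any(
        check.outcome is None or check.outcome == CheckOutcome.FAIL
        for check in checks
    )
```

Against a real scan, the same predicate could be applied to the check objects the scan holds (the issue iterates over scan._checks, which is a private attribute and may change between versions).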