Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dry run fails on models whose fields explicitly reference dependency models #11

Open
alvfermar-mp opened this issue Jul 25, 2022 · 3 comments · May be fixed by #52
Open

Dry run fails on models whose fields explicitly reference dependency models #11

alvfermar-mp opened this issue Jul 25, 2022 · 3 comments · May be fixed by #52
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@alvfermar-mp
Copy link
Contributor

Take the model

SELECT 
    field1
    , table1.field2
    , field3
FROM table1

Since the reference to table1 is being replaced by (SELECT "some string" as field2, 42 as field2 and True as field3), if we don't alias that select literal with the name of the reference (ie. AS table1), the query will fail to execute. This could be solved by simple adding that bit to the select literal

@ghost ghost added the question Further information is requested label Jul 26, 2022
@ghost
Copy link

ghost commented Jul 27, 2022

Following on from the discussions in PR #13

There are several cases we need to be able to cover:

  1. No aliasing, implicit selection of columns
SELECT
    first_name
FROM `at-data-platform-preprod`.`training`.`employee`;

-- Converts to

SELECT
    first_name
FROM (SELECT 'some-string' as first_name)
  1. Aliasing with AS
SELECT
    emp.first_name
FROM `at-data-platform-preprod`.`training`.`employee` as emp;

-- Converts to

SELECT
    emp.first_name
FROM (SELECT 'some-string' as first_name) as emp;
  1. Aliasing without AS:
SELECT
    emp.first_name
FROM `at-data-platform-preprod`.`training`.`employee` emp;

-- Converts to

SELECT
    emp.first_name
FROM (SELECT 'some-string' as first_name) emp;
  1. Using the table name explicitly
SELECT
    employee.first_name
FROM `at-data-platform-preprod`.`training`.`employee`;

-- Converts to

SELECT
    employee.first_name
FROM (SELECT 'some-string' as first_name) employee; -- Problem: Currently we don't include this 'employee' alias so the dry run fails

The current implementation works for cases 1 - 3 but fails for case 4. We need to detect for each model if after the table reference there is an alias. If there is it should be preserved if there is not it should put the name of the table as the alias. So for case 1 we now get:

SELECT
    first_name
FROM `at-data-platform-preprod`.`training`.`employee`;

-- Converts to

SELECT
    first_name
FROM (SELECT 'some-string' as first_name) employee -- This is added but doesn't change anything

I think modifying the regex that the select literal matches could help: https://regexr.com/6qllk

((?:from|join)(?:\s*--.*)?[\r\n\s]*)(`my-project`.`my-dataset`.`my-table`)(?:;?\s*--.*)?[\r\n\s]*((?:(?:AS)?(?:\s*--.*)?[\r\n\s]*.+))?

There are 3 examples here. If capturing group 3 is empty then we need to add my-table as an alias if capturing group 3 is not empty then do nothing and keep the old behaviour. I am happy to pick this up if you like as I can imagine that regex is quite opaque.

The first part ((?:from|join)(?:\s*--.*)?[\r\n\s]*)(my-project.my-dataset.my-table) is saying:

  • Look for 'from' or 'join' and then optionally some new lines or SQL comments (-- followed by some words)
  • Then look for the table reference

The second part (?:;?\s*--.*)?[\r\n\s]*((?:(?:AS)?(?:\s*--.*)?[\r\n\s]*.+))? can be broken up into:

  • Optionally look for either the end of a SQL statement, whitespace or a comment
  • Then look for optionally the word 'AS'
  • Then optionally followed by more whitespace/new line/comment
  • Followed by some words (The table alias)

But it would take a lot of convincing/testing to show that this works in all cases within valid SQL syntax. Maybe moving away from a pure regex based implementation and trying to parse the SQL a bit more will help solve this issue

@ghost ghost added bug Something isn't working and removed question Further information is requested labels Jul 27, 2022
@alvfermar-mp
Copy link
Contributor Author

Hey @connor-charles, I think you are right regarding the cases and best solution - we need to detect when the upstream model is not aliased and alias it.

Regarding the regex, thanks for the detailed and clear explanation but as you mentioned in my case it would need more convincing / testing than in yours (as you have already made it).

If you can pick this up would be great, but a comment regarding your previous suggestion If capturing group 3 is empty then we need to add my-table as an alias if capturing group 3 is not empty then do nothing and keep the old behaviour: looks like in the regex example you sent before the third case would need to be aliased but capturing group 3 is not empty. Maybe that could be sorted amending the still a bit more or maybe as you mentioned is a better idea to parse the SQL a bit more

@ghost
Copy link

ghost commented Aug 12, 2022

I have pushed a commit to improve the test coverage for the literal insertion as there are lot of cases to cover in the SQL parsing logic. There is a library called sqlparse which I have tried to use before with the dry runner but I found it a bit overkill for the small amount of parsing the dry runner needs to do.

I think a decent algorithm for this is to iterate backwards from the table reference you are trying to replace to determine if it is a correct candidate for replacing it with a select literal. If you find a FROM or JOIN before then it is a real table reference. If you find a quote then its probably just a string of the table reference inside the query. Need to disregard comments as well. If we take the case in the test test_replace_upstream_sql_does_not_replace_alias_in_string:

    SELECT bar.foo,
           '{node.to_table_ref_literal()}' as the_table
    FROM {node.to_table_ref_literal()} as bar

If we split the query by whitespace characters the array should look like this:

['SELECT', 'bar.foo,', "'`my_db`.`my_schema`.`A`'", 'as', 'the_table', 'FROM', '`my_db`.`my_schema`.`A`', 'as', 'bar']

Then if we search for `my_db`.`my_schema`.`A` in these tokens we find element 6. Working backwards we hit FROM immediately so we can replace element 6 with the select literal. Element 2 is in quotes so we won't replace it.

To handle comments its a bit more complicated this test we previously had to xfail because the regex couldn't match it properly:

    SELECT foo
    FROM -- test
         -- test2
        `my_db`.`my_schema`.`A`

The split SQL query is:

['SELECT', 'foo', 'FROM', '--', 'test', '--', 'test2', '`my_db`.`my_schema`.`A`']

We find the table reference in the last element. Working backwards I think it's possible to detect that the elements between then and the FROM are comments and then we can replace the table reference. Will try and implement this soon and see how it goes, would be good to get rid of the regex.

@ghost ghost added the help wanted Extra attention is needed label Oct 21, 2022
@ccharlesgb ccharlesgb linked a pull request Dec 27, 2023 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant