Inconsistent order of lineage tuples #652

Open
rubytobi opened this issue Sep 25, 2024 · 0 comments · May be fixed by #661
Labels
bug Something isn't working

rubytobi commented Sep 25, 2024

Hi, first of all thanks for this amazing library!

I came across an edge case I wanted to highlight. The example below passes or fails inconsistently because the column lineages are ordered differently from run to run.

To Reproduce

import pytest

from sqllineage.core.metadata_provider import MetaDataProvider
from sqllineage.utils.entities import ColumnQualifierTuple
from ...helpers import assert_column_lineage_equal, generate_metadata_providers


providers = generate_metadata_providers(
    {
        "database_a.table_a": ["col_a", "col_b", "col_c"],
    }
)


@pytest.mark.parametrize("provider", providers)
def test_output_consistency(provider: MetaDataProvider):
    sql = """CREATE TABLE database_b.table_c
    AS (
      SELECT
        *,
        1 AS event_time
      FROM (
        SELECT
          table_b.col_b AS col_a
        FROM database_b.table_b AS table_b
        JOIN database_a.table_a AS table_d
      ) AS base
    )
    """
    assert_column_lineage_equal(
        sql,
        [
            (
                ColumnQualifierTuple("col_b", "database_a.table_a"),
                ColumnQualifierTuple("col_b", "database_b.table_c"),
            ),
            (
                ColumnQualifierTuple("col_c", "database_a.table_a"),
                ColumnQualifierTuple("col_c", "database_b.table_c"),
            ),
            (
                ColumnQualifierTuple("col_b", "database_b.table_b"),
                ColumnQualifierTuple("col_a", "database_b.table_c"),
            ),
        ],
        dialect="athena",
        test_sqlparse=False,
        test_sqlfluff=True,
        metadata_provider=provider,
    )

Sometimes the test fails with the following:

E       	Expected Lineage: {(Column: database_b.table_b.col_b, Column: database_b.table_c.col_a), (Column: database_a.table_a.col_b, Column: database_b.table_c.col_b), (Column: database_a.table_a.col_c, Column: database_b.table_c.col_c)}
E       	Actual Lineage: {(Column: database_a.table_a.col_a, Column: database_b.table_c.col_a), (Column: database_a.table_a.col_b, Column: database_b.table_c.col_b), (Column: database_a.table_a.col_c, Column: database_b.table_c.col_c)}

Sometimes with this:

E       	Expected Lineage: {(Column: database_a.table_a.col_c, Column: database_b.table_c.col_c), (Column: database_b.table_b.col_b, Column: database_b.table_c.col_a), (Column: database_a.table_a.col_b, Column: database_b.table_c.col_b)}
E       	Actual Lineage: {(Column: database_a.table_a.col_a, Column: database_b.table_c.col_a), (Column: database_a.table_a.col_c, Column: database_b.table_c.col_c), (Column: database_a.table_a.col_b, Column: database_b.table_c.col_b)}

And sometimes it actually succeeds.

Expected behavior
I would expect the column lineages to be consistent in the results.

If I understand the codebase correctly, this is because the results are sorted only by the first and last elements of each lineage path, not by the whole path:

@lazy_method
def get_column_lineage(
    self, exclude_path_ending_in_subquery=True, exclude_subquery_columns=False
) -> List[Tuple[Column, Column]]:
    """
    a list of column tuple :class:`sqllineage.models.Column`
    """
    # sort by target column, and then source column
    return sorted(
        self._sql_holder.get_column_lineage(
            exclude_path_ending_in_subquery, exclude_subquery_columns
        ),
        key=lambda x: (str(x[-1]), str(x[0])),
    )

Something like the below would take the whole lineage into account for ordering:

key=lambda x: "".join([str(i) for i in reversed(x)])
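
To illustrate the difference between the two keys, here is a minimal standalone sketch. It does not use sqllineage at all: the paths are plain tuples of strings, and the column names and the helper functions `sort_by_endpoints` and `sort_by_full_path` are made up for the example. It assumes two lineage paths that share the same source and target but pass through different intermediate columns.

```python
from typing import List, Tuple

# Hypothetical stand-in for a lineage path: (source, ..., target) as strings.
Path = Tuple[str, ...]


def sort_by_endpoints(paths: List[Path]) -> List[Path]:
    # Same key as the current code: target first, then source.
    return sorted(paths, key=lambda x: (str(x[-1]), str(x[0])))


def sort_by_full_path(paths: List[Path]) -> List[Path]:
    # Proposed key: every element of the path, from target back to source.
    return sorted(paths, key=lambda x: "".join(str(i) for i in reversed(x)))


# Two made-up paths with identical endpoints but different middle elements.
path_via_x: Path = ("src.col", "x.col", "tgt.col")
path_via_y: Path = ("src.col", "y.col", "tgt.col")

# The same paths arriving in two different orders, as could happen when the
# upstream collection has no guaranteed iteration order.
order_1 = [path_via_x, path_via_y]
order_2 = [path_via_y, path_via_x]

# With the endpoint-only key both paths compare equal, and because sorted()
# is stable the output simply mirrors the input order.
endpoint_results_differ = sort_by_endpoints(order_1) != sort_by_endpoints(order_2)

# With the full-path key the tie is broken by the intermediate element.
full_results_match = sort_by_full_path(order_1) == sort_by_full_path(order_2)

print(endpoint_results_differ, full_results_match)
```

One possible refinement, also just a suggestion: using `tuple(str(i) for i in reversed(x))` as the key instead of a joined string would compare paths element by element and avoid accidental collisions between concatenated names.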
@rubytobi rubytobi added the bug Something isn't working label Sep 25, 2024
@rubytobi rubytobi linked a pull request Nov 15, 2024 that will close this issue