Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Diff found when comparing float columns between Postgresql & Snowflake #379

Closed
yonieilon opened this issue Feb 5, 2023 · 5 comments
Closed
Labels
bug Something isn't working stale_immune Immunity to stale bot

Comments

@yonieilon
Copy link

Describe the bug
When I compare a table between Postgresql & Snowflake, which has "double precision" columns in Postgresql and "FLOAT" columns in Snowflake, data-diff detects a diff, due to the way each db-engine rounds to the minimum precision, even though the actual value in the tables is equal.

  • Steps to reproduce:

    • In Postgresql:
      create table my_test_table(id bigint, float_test double precision);
      insert into my_test_table values (13695705,0.36640625);
    
    • In Snowflake:
     create table my_test_table(id bigint, float_test double precision);
     insert into my_test_table values (13695705,0.36640625);
    
  • The command or code you used

data-diff \
"postgresql://user:password@host:5432/dbname" \
my_test_table \
"snowflake://user:password@account/dbname/schemaname?warehouse=<warehouse>&role=<role>" \
MY_TEST_TABLE \
-k id \
-c float_test \
-w "id = 13695705" -v -d
  • The run output + error you're getting. (including tracestack)
[17:25:18] INFO - [PostgreSQL] Starting a threadpool, size=1.
[17:25:18] DEBUG - Running SQL (PostgreSQL): SET TIME ZONE 'UTC'
[17:25:19] DEBUG - Injecting ssl_wrap_socket_with_ocsp
/usr/local/lib/python3.9/site-packages/snowflake/connector/options.py:96: UserWarning: You have an incompatible version of 'pyarrow' installed (10.0.1), please install a version that adheres to: 'pyarrow<6.1.0,>=6.0.0; extra == "pandas"'
  warn_incompatible_dep(
[17:25:20] DEBUG - cache directory: /Users/yonieilon/Library/Caches/Snowflake
[17:25:21] DEBUG - Running SQL (Snowflake): ALTER SESSION SET TIMEZONE = 'UTC'
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT current_timestamp
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name = 'my_test_table' AND table_schema = 'public'
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name = 'MY_TEST_TABLE' AND table_schema = 'FRUIT'
[17:25:21] DEBUG - [PostgreSQL] Schema = {'id': ('id', 'bigint', None, 64, 0), 'float_test': ('float_test', 'double precision', None, 53, None)}
[17:25:21] DEBUG - [Snowflake] Schema = {'FLOAT_TEST': ('FLOAT_TEST', 'FLOAT', None, None, None), 'ID': ('ID', 'NUMBER', None, 38, 0)}
[17:25:21] DEBUG - Available mutual columns: {'id', 'float_test'}
[17:25:21] INFO - Diffing using columns: key=('id',) update=None extra=('float_test',).
[17:25:21] INFO - Using algorithm 'hashdiff'.
[17:25:21] DEBUG - [PostgreSQL] Schema = {'id': Integer(precision=0, python_type=<class 'int'>), 'float_test': Float(precision=13)}
[17:25:21] DEBUG - [Snowflake] Schema = {'FLOAT_TEST': Float(precision=7), 'ID': Decimal(precision=0)}
[17:25:21] WARNING - Using reduced precision Float(precision=7) for column 'float_test'. Types=Float(precision=13), Float(precision=7)
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT min("id")::varchar, max("id")::varchar FROM "my_test_table" WHERE id = 13695705
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT cast(cast(min("ID") as decimal(38, 0)) as string), cast(cast(max("ID") as decimal(38, 0)) as string) FROM "MY_TEST_TABLE" WHERE id = 13695705
[17:25:21] INFO - Diffing segments at key-range: 13695705..13695706. size: table1 <= 1, table2 <= 1
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT "id"::varchar, "float_test"::decimal(38, 7)::varchar FROM "my_test_table" WHERE ("id" >= 13695705) AND ("id" < 13695706) AND id = 13695705
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT cast(cast("ID" as decimal(38, 0)) as string), cast(cast("FLOAT_TEST" as decimal(38, 7)) as string) FROM "MY_TEST_TABLE" WHERE ("ID" >= 13695705) AND ("ID" < 13695706) AND id = 13695705
[17:25:22] INFO - Diff found 2 different rows.
- 13695705, 0.3664063
+ 13695705, 0.3664062
[17:25:22] INFO - Duration: 4.14 seconds.

I can see it's running these statement in Postgres & Snowflake:

Running SQL (PostgreSQL): SELECT "id"::varchar, "float_test"::decimal(38, 7)::varchar FROM "my_test_table" WHERE ("id" >= 13695705) AND ("id" < 13695706) AND id = 13695705
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT cast(cast("ID" as decimal(38, 0)) as string), cast(cast("FLOAT_TEST" as decimal(38, 7)) as string) FROM "MY_TEST_TABLE" WHERE ("ID" >= 13695705) AND ("ID" < 13695706) AND id = 13695705

And indeed it's producing a different result in each db, since each DB engine rounds differently when casting to decimal(38, 7)

Is there any workaround or way to configure data-diff to compare the value correctly???

Describe the environment

data-diff version: 0.3.1

@yonieilon
Copy link
Author

UPDATE: upgraded to the latest version (0.3.2), and I still get the same result.

@erezsh erezsh added the bug Something isn't working label Feb 6, 2023
@github-actions
Copy link
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label May 11, 2023
@github-actions
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.

@yonieilon
Copy link
Author

Hi, I think this issue is still relevant... Would be glad if it could be fixed...

@github-actions github-actions bot added triage and removed stale Issues/PRs that have gone stale labels May 24, 2023
@dlawin dlawin reopened this Aug 28, 2023
@dlawin dlawin added the stale_immune Immunity to stale bot label Aug 28, 2023
@dlawin dlawin removed the triage label Nov 1, 2023
@glebmezh
Copy link
Contributor

Hi @yonieilon,

I'm sorry for the delay in following up on this. Thank you for taking the time to raise this issue!

We made a hard decision to sunset the data-diff package and won't provide further development or support.

If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package, including better handling of numerical comparison.

-Gleb

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working stale_immune Immunity to stale bot
Projects
None yet
Development

No branches or pull requests

4 participants