Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.
This repository was archived by the owner on May 17, 2024. It is now read-only.

Diff found when comparing float columns between Postgresql & Snowflake #379

Closed
@yonieilon

Description

@yonieilon

Describe the bug
When I compare a table between Postgresql & Snowflake, which has "double precision" columns in Postgresql and "FLOAT" columns in Snowflake, data-diff detects a diff, due to the way each db-engine rounds to the minimum precision, even though the actual value in the tables is equal.

  • Steps to reproduce:

    • In Postgresql:
      create table my_test_table(id bigint, float_test double precision);
      insert into my_test_table values (13695705,0.36640625);
    
    • In Snowflake:
     create table my_test_table(id bigint, float_test double precision);
     insert into my_test_table values (13695705,0.36640625);
    
  • The command or code you used

data-diff \
"postgresql://user:password@host:5432/dbname" \
my_test_table \
"snowflake://user:password@account/dbname/schemaname?warehouse=<warehouse>&role=<role>" \
MY_TEST_TABLE \
-k id \
-c float_test \
-w "id = 13695705" -v -d
  • The run output + error you're getting. (including tracestack)
[17:25:18] INFO - [PostgreSQL] Starting a threadpool, size=1.
[17:25:18] DEBUG - Running SQL (PostgreSQL): SET TIME ZONE 'UTC'
[17:25:19] DEBUG - Injecting ssl_wrap_socket_with_ocsp
/usr/local/lib/python3.9/site-packages/snowflake/connector/options.py:96: UserWarning: You have an incompatible version of 'pyarrow' installed (10.0.1), please install a version that adheres to: 'pyarrow<6.1.0,>=6.0.0; extra == "pandas"'
  warn_incompatible_dep(
[17:25:20] DEBUG - cache directory: /Users/yonieilon/Library/Caches/Snowflake
[17:25:21] DEBUG - Running SQL (Snowflake): ALTER SESSION SET TIMEZONE = 'UTC'
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT current_timestamp
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name = 'my_test_table' AND table_schema = 'public'
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name = 'MY_TEST_TABLE' AND table_schema = 'FRUIT'
[17:25:21] DEBUG - [PostgreSQL] Schema = {'id': ('id', 'bigint', None, 64, 0), 'float_test': ('float_test', 'double precision', None, 53, None)}
[17:25:21] DEBUG - [Snowflake] Schema = {'FLOAT_TEST': ('FLOAT_TEST', 'FLOAT', None, None, None), 'ID': ('ID', 'NUMBER', None, 38, 0)}
[17:25:21] DEBUG - Available mutual columns: {'id', 'float_test'}
[17:25:21] INFO - Diffing using columns: key=('id',) update=None extra=('float_test',).
[17:25:21] INFO - Using algorithm 'hashdiff'.
[17:25:21] DEBUG - [PostgreSQL] Schema = {'id': Integer(precision=0, python_type=<class 'int'>), 'float_test': Float(precision=13)}
[17:25:21] DEBUG - [Snowflake] Schema = {'FLOAT_TEST': Float(precision=7), 'ID': Decimal(precision=0)}
[17:25:21] WARNING - Using reduced precision Float(precision=7) for column 'float_test'. Types=Float(precision=13), Float(precision=7)
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT min("id")::varchar, max("id")::varchar FROM "my_test_table" WHERE id = 13695705
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT cast(cast(min("ID") as decimal(38, 0)) as string), cast(cast(max("ID") as decimal(38, 0)) as string) FROM "MY_TEST_TABLE" WHERE id = 13695705
[17:25:21] INFO - Diffing segments at key-range: 13695705..13695706. size: table1 <= 1, table2 <= 1
[17:25:21] DEBUG - Running SQL (PostgreSQL): SELECT "id"::varchar, "float_test"::decimal(38, 7)::varchar FROM "my_test_table" WHERE ("id" >= 13695705) AND ("id" < 13695706) AND id = 13695705
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT cast(cast("ID" as decimal(38, 0)) as string), cast(cast("FLOAT_TEST" as decimal(38, 7)) as string) FROM "MY_TEST_TABLE" WHERE ("ID" >= 13695705) AND ("ID" < 13695706) AND id = 13695705
[17:25:22] INFO - Diff found 2 different rows.
- 13695705, 0.3664063
+ 13695705, 0.3664062
[17:25:22] INFO - Duration: 4.14 seconds.

I can see it's running these statement in Postgres & Snowflake:

Running SQL (PostgreSQL): SELECT "id"::varchar, "float_test"::decimal(38, 7)::varchar FROM "my_test_table" WHERE ("id" >= 13695705) AND ("id" < 13695706) AND id = 13695705
[17:25:21] DEBUG - Running SQL (Snowflake): SELECT cast(cast("ID" as decimal(38, 0)) as string), cast(cast("FLOAT_TEST" as decimal(38, 7)) as string) FROM "MY_TEST_TABLE" WHERE ("ID" >= 13695705) AND ("ID" < 13695706) AND id = 13695705

And indeed it's producing a different result in each db, since each DB engine rounds differently when casting to decimal(38, 7)

Is there any workaround or way to configure data-diff to compare the value correctly???

Describe the environment

data-diff version: 0.3.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingstale_immuneImmunity to stale bot

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions