Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default empty_as_null behavior for non-database sources #520

Open
nickolay opened this issue Mar 5, 2025 · 0 comments
Open

default empty_as_null behavior for non-database sources #520

nickolay opened this issue Mar 5, 2025 · 0 comments

Comments

@nickolay
Copy link

nickolay commented Mar 5, 2025

1) empty_as_null documentation is incomplete

https://docs.slingdata.io/concepts/replication/source-options#specification says

empty_as_null: Whether empty fields should be treated as NULL. Default is true.

In reality this depends on the "kind" of a connection: database connections default to false, while file connections default to true (source).

Perhaps the docs could be changed to:

Whether empty fields should be treated as NULL. Default depends on the kind of the source connection: false for database connections, true for storage connections.

2) The empty_as_null: true default does not make sense if a file format supports NULLs

Most reasonable formats (such as parquet) support missing values, just like databases.

For such formats it's unexpected that empty strings are silently loaded as NULLs.

Perhaps if empty_as_null is not specified explicitly in the configuration and defaults to true, sling could warn about this behavior (maybe after this transformation was applied at least once?) suggesting the user explicitly chooses either value:

Empty strings from file sources are loaded as NULL values by default. Set empty_as_null source option explicitly to silence this warning.

3) The empty_as_null: true default is surprising even for CSV

While makes some sense in general, consider this scenario:

  • First copy a table using sling from SRC_DB -> TGT_DB
    ** this preserves NULL / empty string distinction, as expected
  • Then (to take a snapshot to work with) change the pipeline to: SRC_DB -> CSV -> TGT_DB
    ** this exports both NULLs and empty strings as empty strings in CSV, and uploads them to TGT_DB as NULLs

This would work if there was a setting for a target acting as a reverse of nullif: instead of NULL, write the specified placeholder.

Then if the same non-empty placeholder would be specified for nullif and the new setting, both NULLs and empty strings would survive the load process.

For my use-cases I'll just use parquet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant