Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to specify string matching case sensitivity behavior via configurations #257

Open
knassre-bodo opened this issue Feb 11, 2025 · 0 comments
Labels
documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request extensibility Increasing situations in which PyDough works user feature Adding a new user-facing feature/functionality

Comments

@knassre-bodo
Copy link
Contributor

Currently, the default in sqlite is that string matching via LIKE is case insensitive, but users may want to make it case sensitive (and frankly, this makes sense as the default). To that end, the goal of this issue is to add a new config property to the PyDoughConfigs class: case_sensitive_string_matching (default=True). When True, LIKE and similar operators (STARTSWITH, ENDSWITH, CONTAINS are all case sensitive. For sqlite, this means that during conversion, the 2nd argument to like should be wrapped in COLLATE BINARY. For example, if I have PyDough code with a predicate X LIKE 'A%", that should become SELECT ... FROM ... WHERE X LIKE 'A%' COLLATE BINARY, which is the following tree expression in sqlglot:

Select(
  expressions=[...],
  from=From(...),
  where=Where(
    this=Like(
      this=Column(
        this=Identifier(this=A, quoted=False)),
      expression=Collate(
        this=Literal(this=A%, is_string=True),
        expression=Var(this=BINARY)))))

When the config is False, like & similar functions should go back to being case-insensitive. We can keep the COLLATE but switch it from BINARY to NOCASE, which is the default. For other dialects, we can think about how to handle the translation, b ut if a dialect's default is case-sensitive then all we need to do is call LOWER on both arguments when case-insensitivity is requested According to ChatGPT, the dialects have the following behavior:

  • Case-Insensitive by Default: MySQL, MariaDB, SQL Server, SQLite (for ASCII), Snowflake, CockroachDB.
  • Case-Sensitive by Default: PostgreSQL, Oracle, Db2, Firebird, Informix, Sybase ASE, Teradata, Redshift, BigQuery, ClickHouse, DuckDB.
  • Configurable via Collations or Settings: MySQL (utf8_bin for case-sensitive), SQL Server (collations), SQLite (PRAGMA case_sensitive_like = ON;), Snowflake (COLLATE 'BINARY').

The behavior of the config should be well documented in the usage guide and various README files.

@knassre-bodo knassre-bodo added documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request extensibility Increasing situations in which PyDough works user feature Adding a new user-facing feature/functionality labels Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request extensibility Increasing situations in which PyDough works user feature Adding a new user-facing feature/functionality
Projects
None yet
Development

No branches or pull requests

1 participant