Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support wildcard tables (and filter on _TABLE_SUFFIX) in read_gbq / read_gbq_table #169

Closed
tswast opened this issue Nov 3, 2023 · 2 comments
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Copy link
Collaborator

tswast commented Nov 3, 2023

Is your feature request related to a problem? Please describe.

In https://cloud.google.com/bigquery/docs/create-machine-learning-model there is the following SQL:

SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

I'd like to be able to represent all of this in Python without any SQL code. There are two problems right now:

  1. bigquery-public-data.google_analytics_sample.ga_sessions_* isn't supported as a table ID in read_gbq. This doesn't refer to any single table, so API requests based on the table ID will fail.

  2. Even if (1) were supported, it would try to copy all the data into a temp table. It would be best to be able to specify a filter on _TABLE_SUFFIX at data read time.

Describe the solution you'd like

bpd.read_gbq("bigquery-public-data.google_analytics_sample.ga_sessions_*") should work.

Also, somewhat inspired by the BigQuery Storage API, accept a row_restriction parameter to filter rows.

bpd.read_gbq(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[("_TABLE_SUFFIX", ">=", '20160801'), ("_TABLE_SUFFIX", "<=", '20170630')],
)

See: https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions.FIELDS.string.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions.row_restriction

See also: "filters" parameter in pandas.read_parquet

Describe alternatives you've considered

SQL as an input to read_gbq works as an alternative right now.

Additional context

Related feature request on pandas-gbq:

@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Nov 3, 2023
@tswast
Copy link
Collaborator Author

tswast commented Jan 23, 2024

@Genesis929 This is affecting more BQML tutorials. I think we should try and find a solution for this sooner rather than later. I think it may require translation to SQL similar to your filters implementation.

@GarrettWu GarrettWu assigned GarrettWu and unassigned Genesis929 Feb 14, 2024
@GarrettWu
Copy link
Contributor

fixed by #377

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

3 participants