-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adds Schema class and modifies schema handling #2081
base: pangea-v1alpha
Are you sure you want to change the base?
Conversation
🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use -- conventional-commit-lint bot |
google/cloud/bigquery/client.py
Outdated
@@ -3680,7 +3680,7 @@ def insert_rows( | |||
if selected_fields is not None: | |||
schema = selected_fields | |||
|
|||
if len(schema) == 0: | |||
if not schema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, could you make len(schema_object)
work? I guess this is to allow None
as schema
now, too? If table.schema
could indeed return None
now, that might be considered a breaking change, depending on how strict we want to be about types.
Either way, folks might have to change their code to deal with missing schema, so something to consider. Maybe we return an object of len(...) == 0
for now and have a warning that it will be changed to None
in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike the idiom of schema-less tables, but it is indeed something we've seen people do intentionally in the wild. My recollection is that it was used historically as a signal for some workflow state (e.g. some kind of is-table-present check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that Table.schema
can currently be set to None
. It appears that we allow for that in the existing code:
@schema.setter
def schema(self, value):
api_field = self._PROPERTY_TO_API_FIELD["schema"]
if value is None:
self._properties[api_field] = None
else:
value = _to_schema_fields(value)
self._properties[api_field] = {"fields": _build_schema_resource(value)}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched it back to the old check.
google/cloud/bigquery/client.py
Outdated
@@ -4029,7 +4029,7 @@ def list_rows( | |||
|
|||
# No schema, but no selected_fields. Assume the developer wants all | |||
# columns, so get the table resource for them rather than failing. | |||
elif len(schema) == 0: | |||
elif not schema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Let's be careful about breaking changes regarding table.schema
allowing None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also switched this back to the old check.
google/cloud/bigquery/schema.py
Outdated
Exception: If ``schema`` is not a sequence, or if any item in the | ||
sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField` | ||
instance or a compatible mapping representation of the field. | ||
"""TODO docstring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flagging with a comment just so we don't lose track of this TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -796,8 +792,6 @@ def serde_info(self) -> Any: | |||
prop = _get_sub_prop(self._properties, ["serDeInfo"]) | |||
if prop is not None: | |||
prop = StorageDescriptor().from_api_repr(prop) | |||
print(f"DINOSAUR prop: {prop}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, oops! 😆 Could be worth a separate cleanup PR to have something small and easy to approve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
google/cloud/bigquery/schema.py
Outdated
self._properties["foreignTypeInfo"] = value | ||
|
||
@property | ||
def _fields(self) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why private? I guess because we're trying to emulate a list and want to support mutation that way?
Would be worth an explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, the idea behind having it be private was to try and emulate a list.
I can add a note to that effect when I add docstrings.
return _parse_schema_resource(prop) | ||
elif isinstance(prop, Schema): | ||
return prop | ||
return _parse_schema_resource(prop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we want to return a Schema
object now.
return _parse_schema_resource(prop) | |
return Schema.from_api_repr(prop) |
Aside: Is _parse_schema_resource
dead code? I assume it still returns a list of SchemaField
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_parse_schema_resource
still returns a list of SchemaFields
for backwards compatibility.
Where possible I -tried- to stay consist about what the user provides versus gets back: if someone handed the library a list
of SchemaField
s (i.e. cause that is what they are used to) then we maintained that state. I tried to avoid randomly converting lists
into Schema
objects to avoid user surprise.
|
||
@schema.setter | ||
def schema(self, value): | ||
api_field = self._PROPERTY_TO_API_FIELD["schema"] | ||
|
||
if value is None: | ||
self._properties[api_field] = None | ||
elif isinstance(value, Schema): | ||
self._properties[api_field] = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._properties[api_field] = value | |
self._properties[api_field] = value.to_api_repr() |
google/cloud/bigquery/schema.py
Outdated
Dict[str, Any]: | ||
A dictionary in the format used by the BigQuery API. | ||
""" | ||
return copy.deepcopy(self._properties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
google/cloud/bigquery/schema.py
Outdated
An instance of the class initialized with data from 'resource'. | ||
""" | ||
config = cls("") | ||
config._properties = copy.deepcopy(resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. May be better to omit the deepcopy.
@@ -452,17 +453,20 @@ def schema(self): | |||
instance or a compatible mapping representation of the field. | |||
""" | |||
prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"]) | |||
if not prop: | |||
if not prop: # if empty Schema, empty list, None | |||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we return an "empty" Schema here for consistency?
return [] | |
return Schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the following influence your thoughts on this comment?
Historically, we have returned an empty list
.
An existing user would still expect a list
.
A new user should probably not see an empty Schema
object, since they are only needed if the user is gonna try to access an external data source (where they have to identify the foreign_type_info
).
My general principal, focused on backwards compatibility is:
- avoid as many changes to what the code did before, unless necessary
- as much as possible, force the choice to move to Schema to be an explicit decision by any user (especially since external data sources are likely to be the exception, not the rule).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way on this one. One one hand, using Schema
everywhere would make the new Schema class more discoverable and fewer cases to consider when type checking. On the other hand, providing a real list in all cases where Schema
is not needed would provide the most backwards compatible option.
On further consideration, I agree with your assessment that avoiding breaking existing users is preferable. +1 on empty list in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a related concept.
Right now, if someone gives me a Schema
object, I have been storing that in _properties and I see your point about avoiding that.
That complicates getting a Schema
object back out, if/when you need it.
What are your thoughts on the following approach? (I seem to recall you mention keeping objects in a separate place and ensuring that _properties is a view into the object)
@schema.setter
def schema(self, value):
api_field = self._PROPERTY_TO_API_FIELD["schema"]
if value is None:
self._properties[api_field] = None
elif isinstance(value, Schema):
# store original as selt.<original_schema_object> # TODO find a better name
self.<original_schema_object> = value
self.properties[api_field] = self.<original_schema_object>.to_api_repr()
else:
value = _to_schema_fields(value)
self._properties[api_field] = {"fields": _build_schema_resource(value)}
@property
def schema(self):
# if user stored a Schema object, short circuit and return early.
<original_schema_object> = self.get(<original_schema_object>, None)
if isinstance(original_schema_object, Schema):
return self.<original_schema_object>
# if user did not store a Schema object, behave as normal
prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
if not prop: # either None OR empty list
return []
return _parse_schema_resource(prop)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My worry with self.<original_schema_object> = value
is that it can get out of sync with the values stored in _properties
, especially since Schema
supports in-place mutation.
Also, there are other cases to consider besides "someone gave us a Schema
object". Importantly, what if someone calls get_table("proj.dataset.some_table_with_pangea_features_enabled")
? IMO this case is one where we should return a Schema
object.
Alternatively, we introduce separate properties for the pangea fields to Table
and forget all this Schema
class business. Something like Table.schema_foreign_type_info
could work. I believe we already do similar for certain statistics on job, for example. The only gotcha here is that we'd need to make sure we merge these two properties (schema
and schema_foreign_type_info
) rather than replace the whole _properties["schema"]
when one sets either of these.
Thanks! Overall goals outlined in the description look good to me.
|
google/cloud/bigquery/schema.py
Outdated
@_fields.setter | ||
def _fields(self, value: list) -> None: | ||
value = _isinstance_or_raise(value, list, none_allowed=True) | ||
self._properties["_fields"] = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why _fields
. It's fields
in the REST API. Also, this should convert to the list to API representation. https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed _fields
to fields
. Changed tests to match.
* feat: add property for maxStaleness in table definitions Signed-off-by: Yu Ishikawa <[email protected]> * Update google/cloud/bigquery/table.py --------- Signed-off-by: Yu Ishikawa <[email protected]> Co-authored-by: Lingqing Gan <[email protected]>
* add type hints * Update client.py Moves import from being used solely during specific checks to being more universally available. * Update google/cloud/bigquery/client.py * Update client.py testing some minor changes to deal with mypy quirks * Update google/cloud/bigquery/client.py --------- Co-authored-by: Chalmer Lowe <[email protected]>
Source-Link: googleapis/synthtool@e808c98 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8e3e7e18255c22d1489258d0374c901c01f9c4fd77a12088670cd73d580aa737 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](pallets/jinja@3.1.4...3.1.5) --- updated-dependencies: - dependency-name: jinja2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…hemaField` (#2097) * feat: preserve unknown fields from the REST API representaton in `SchemaField` * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * remove unnecessary variable * remove unused private method * fix pytype --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
* fix: adds test of roundingmode as a str * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
schema = Schema( | ||
[ | ||
field | ||
if isinstance(field, SchemaField) | ||
else SchemaField.from_api_repr(field) | ||
for field in schema | ||
], | ||
foreign_type_info=schema.foreign_type_info, | ||
) | ||
return schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get a Schema
object, the Schema
object is responsible for all conversions. I don't think we need to make a copy of it or anything.
Returning the original object should be sufficient.
schema = Schema( | |
[ | |
field | |
if isinstance(field, SchemaField) | |
else SchemaField.from_api_repr(field) | |
for field in schema | |
], | |
foreign_type_info=schema.foreign_type_info, | |
) | |
return schema | |
return schema |
This also future-proofs us a bit. In case any additional properties are added to Schema
, we won't have to update this code.
Exception: If ``schema`` is not a sequence, or if any item in the | ||
sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField` | ||
instance or a compatible mapping representation of the field. | ||
ValueError: If the items in ``schema`` are not valid. | ||
""" | ||
|
||
for field in schema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this check if we get a Schema
object. Let's move that above.
for field in schema: | |
if isinstance(schema, Schema): | |
return schema | |
for field in schema: |
Note: I just return schema
if given a Schema
. See my comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could call return Schema.from_api_repr(schema.to_api_repr())
, but I don't see much reason to do so.
|
||
@property | ||
def _fields(self) -> Optional[List[Any]]: | ||
return self._properties.get("fields") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid storing non-REST API representation in _properties
. Given the usage of UserList
, I think just a regular instance variable would be all that's needed.
# TODO docstrings and type hints | ||
def __init__(self, fields=None, foreign_type_info=None): | ||
self._properties = {} | ||
self._fields = [] if fields is None else list(fields) # Internal List |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving. I don't think _properties
for "fields"
makes sense in the context of UserList
, because we want to support all sorts of mutations and such. For that, let's do something like this instead:
self._fields = [] if fields is None else list(fields) # Internal List | |
self.data = _to_schema_fields(fields) # Internal List |
And remove the data
and _fields
properties. Just a data
instance variable is needed. That also has the benefit of not documenting data
, which isn't really part of our public API.
# If this is a RECORD type, then sub-fields are also included, | ||
# add this to the serialized representation. | ||
answer = self._properties.copy() | ||
if self._fields == []: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would have to be updated if we remove "fields"
from _properties
, as I suggest above.
Also, per go/pystyle#truefalse-evaluations-decision it is preferred to use the falsey nature of empty list if possible.
if self._fields == []: | |
if not self.data: |
schemafields = any([isinstance(f, SchemaField) for f in self._fields]) | ||
if schemafields: | ||
answer["fields"] = [f.to_api_repr() for f in self._fields] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong. We should allow people to insert Mapping
/ dict
into their Schema / UserList objects. Even if no remaining field is a SchemaField
, we should still return something.
I would expect something more like this:
schemafields = any([isinstance(f, SchemaField) for f in self._fields]) | |
if schemafields: | |
answer["fields"] = [f.to_api_repr() for f in self._fields] | |
answer["fields"] = [ | |
# Users may inject the REST representation of schema fields into the list. | |
# If it's not a `SchemaField` object, then we can assume that's what they've | |
# done. | |
field.to_api_repr() if hasattr(field, "to_api_repr") else field | |
for field in self.data | |
] |
I'd also like to see some tests where the user replaces all fields with the REST representation of different fields. Something like:
def test_schema_to_api_repr_with_rest_entries():
schema = Schema(fields=[SchemaField(), ....])
del schema[:]
rest_fields = [{"name": "...", "type": "..."}, ...]
schema.extend(rest_fields)
got = schema.to_api_repr()
assert got = {"fields": rest_fields}
Edit: Might be good to have another test such as above but with mixing and matching REST and SchemaField objects. Perhaps by not deleting all entries but instead inserting some REST fields.
config = cls([]) | ||
config._properties = copy.copy(resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With using UserList
, you will need to pull out fields
and pass them into the constructor.
config = cls([]) | |
config._properties = copy.copy(resource) | |
resource = copy.copy(resource) | |
fields = resource.pop("fields", []) | |
config = cls(fields) | |
config._properties = resource |
This makes sure .data
has all SchemaField
objects while preserving the future-proof ability to add _properties
(other than fields
).
return [] | ||
else: | ||
return _parse_schema_resource(prop) | ||
elif isinstance(prop, Schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment still stands. We shouldn't have Schema
objects in _properties
. Instead, check for the existence of any keys besides "fields"
to determine if Schema
or list should be used.
Re: CLA failure, might need a merge with |
This PR adds a
Schema
class topython-bigquery
as part of the Google internal PANGEA work.Historically,
table.schema
has been alist
ofSchemaFields
.Schema
objects will now include not only alist
of fields, but will have an attribute calledforeign_type_info
.The aim of the work here is create, as much as possible, a drop-in replacement for
schema
.The end user should not need to know whether they are looking at a list-based
schema
OR a class Schema-based object unless they need access to theforeign_type_info
attr.They should still be able to:
Schema
just like you would alist
Schema
just like a listFor users who already have a
list
in their code, nothing should break.For future users who will need a
Schema
object, they should be able to instantiate one and it should just work.TODO: