
Pyarrow data type, default to small type and fix large type override #1859


Open
wants to merge 5 commits into main

Conversation

@kevinjqliu (Contributor) commented Mar 27, 2025

Rationale for this change

#1669 changed reads to infer the PyArrow data type instead of defaulting to the large type. Defaulting to the large type was originally introduced in #986.

I found a bug in #1669 where type promotion from string to binary defaults to large_binary (#1669 (comment)). This led me to find that we still use the large type in _ConvertToArrowSchema. Furthermore, I found that we did not respect PYARROW_USE_LARGE_TYPES_ON_READ=True when reading.
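A minimal sketch of the small-versus-large distinction behind the promotion bug; the column values are placeholders:

    import pyarrow as pa

    col = pa.array(["a", "b"], type=pa.string())

    # Buggy behavior: promoting string to binary widened it to large_binary
    assert col.cast(pa.large_binary()).type == pa.large_binary()

    # Expected behavior after this PR: string promotes to the small binary type
    assert col.cast(pa.binary()).type == pa.binary()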

This PR is a continuation of #1669.

  • Change the docs for pyarrow.use-large-types-on-read to reflect the default value of False
  • Change _ConvertToArrowSchema to use the small data types instead of the large ones
  • When PYARROW_USE_LARGE_TYPES_ON_READ is enabled (set to True), ArrowScan and ArrowProjectionVisitor should cast to the large types
  • Add back the test for setting PYARROW_USE_LARGE_TYPES_ON_READ to True

This PR lets us infer the data type when reading while keeping the PYARROW_USE_LARGE_TYPES_ON_READ override behavior until it is deprecated, as sketched below.
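A sketch of the preserved override behavior; the property name is from this PR, while the catalog and table names are placeholders:

    from pyiceberg.catalog import load_catalog

    # Opt back into large types via the property this PR documents as False by default
    catalog = load_catalog("default", **{"pyarrow.use-large-types-on-read": "true"})
    table = catalog.load_table("db.events")

    # With the override enabled, variable-width columns are cast to
    # large_string/large_binary/large_list instead of the inferred small types.
    arrow_table = table.scan().to_arrow()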

Are these changes tested?

Yes

Are there any user-facing changes?

No

@kevinjqliu requested review from Fokko and sungwy on Mar 27, 2025 at 21:05
Co-authored-by: Fokko Driesprong <[email protected]>
@@ -626,7 +626,7 @@ def field(self, field: NestedField, field_result: pa.DataType) -> pa.Field:

     def list(self, list_type: ListType, element_result: pa.DataType) -> pa.DataType:
         element_field = self.field(list_type.element_field, element_result)
-        return pa.large_list(value_type=element_field)
+        return pa.list_(value_type=element_field)
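For reference, the two constructors differ only in offset width; a quick illustration:

    import pyarrow as pa

    element_field = pa.field("element", pa.int32(), nullable=False)

    small = pa.list_(value_type=element_field)       # 32-bit offsets
    large = pa.large_list(value_type=element_field)  # 64-bit offsets

    # Distinct Arrow types: mixing them requires an explicit cast
    assert small != large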
Contributor review comment:

I'm not convinced that we need to change this. We use schema_to_pyarrow in many places:

  • Schema.as_arrow(): this can be problematic when people have already allocated buffers that are larger than what fits in the small types.
  • _ConvertToArrowExpression.{visit_in,visit_not_in}: I checked manually, and it looks like we can mix large and normal types here :)
  • ArrowProjectionVisitor has an issue similar to what you've described in Arrow: Infer the types when reading #1669 (comment). I think the other way around is also an issue: if you promote a large_string, it would now produce a binary and not a large_binary.
  • ArrowScan.to_table() will return the schema when there is no data; both small and large are okay.
  • DataScan.to_arrow_batch_reader(): I think we should always cast to the large type. Since this is streaming, we don't know upfront whether the small buffers are big enough, so it is safe to go with the large ones (see the sketch after this list).
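A sketch of the streaming concern, with hypothetical batch contents: a RecordBatchReader fixes its schema upfront, so the large types are the safe choice when the total buffer size is unknown.

    import pyarrow as pa

    schema = pa.schema([pa.field("name", pa.large_string())])

    def batches():
        # Every batch must match the schema declared upfront;
        # 64-bit offsets cannot overflow, however much data streams through.
        yield pa.record_batch([pa.array(["a", "b"], type=pa.large_string())], schema=schema)

    reader = pa.RecordBatchReader.from_batches(schema, batches())
    print(reader.read_all())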
