[Data] Types mismatch error while mapping nullable column #49338

Open
bveeramani opened this issue Dec 18, 2024 · 0 comments · May be fixed by #49405
Labels: bug (something that is supposed to be working, but isn't); data (Ray Data-related issues); P1 (issue that should be fixed within a few weeks)

What happened + What you expected to happen

I was trying to load a Parquet version of ImageNet that contains a "label" column. The value of the column is a string for the train split and null for the test and val splits. While loading the dataset, I got an assertion error:

AssertionError: Types mismatch: null != string

Here's the traceback for the repro below:

    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/balaji/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 398, in __call__
    yield output_buffer.next()
  File "/Users/balaji/ray/python/ray/data/_internal/output_buffer.py", line 73, in next
    block_to_yield = self._buffer.build()
  File "/Users/balaji/ray/python/ray/data/_internal/delegating_block_builder.py", line 68, in build
    return self._builder.build()
  File "/Users/balaji/ray/python/ray/data/_internal/table_block.py", line 133, in build
    return self._concat_tables(tables)
  File "/Users/balaji/ray/python/ray/data/_internal/arrow_block.py", line 149, in _concat_tables
    return transform_pyarrow.concat(tables)
  File "/Users/balaji/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 308, in concat
    col = _concatenate_chunked_arrays(col_chunked_arrays)
  File "/Users/balaji/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 201, in _concatenate_chunked_arrays
    assert type_ == arr.type, f"Types mismatch: {type_} != {arr.type}"
AssertionError: Types mismatch: null != string

The reason this happens is that our type check is overly strict: it requires every chunk of a column to have exactly the same Arrow type, but we should be able to combine null and string data.

Versions / Dependencies

3362ef4

Reproduction script

import ray
import numpy as np


def f(batch):
    yield {"string": [None], "array": np.zeros((1, 2, 2))}
    yield {"string": ["spam"], "array": np.zeros((1, 2, 2))}


ray.data.range(1, override_num_blocks=1).map_batches(f).materialize()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@bveeramani bveeramani added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks data Ray Data-related issues labels Dec 18, 2024
@bveeramani bveeramani self-assigned this Dec 18, 2024
@bveeramani bveeramani linked a pull request Dec 23, 2024 that will close this issue