Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(import-pipeline): dict and bytes mixup #28525

Merged
merged 4 commits into from
Feb 11, 2025
Merged

Conversation

EDsCODE
Copy link
Member

@EDsCODE EDsCODE commented Feb 11, 2025

Problem

  • errors: Expected bytes, got a 'dict' object
  • issue is that there's columns that have string and dict types

Changes

  • even if it's a string or bytes, make sure to serialize

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

@EDsCODE EDsCODE requested a review from Gilbert09 February 11, 2025 04:26
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR addresses type conversion issues in the SQL database import pipeline, focusing on handling mixed data types in string columns.

  • Added JSON serialization for dictionary values in string/bytes columns in arrow_helpers.py to prevent "Expected bytes, got a dict" errors
  • Changed default backend from 'sqlalchemy' to 'pyarrow' in __init__.py for more stable schema handling
  • Added test case in test_arrow_helpers.py to verify proper dictionary serialization in string columns
  • Implemented type checking in row_tuples_to_arrow to ensure consistent handling of mixed types
  • Modified column handling to properly serialize dict values while preserving other string/bytes data

3 file(s) reviewed, 2 comment(s)
Edit PR Review Bot Settings | Greptile

Comment on lines 124 to 129
if issubclass(py_type, bytes) or issubclass(py_type, str):
# For bytes/str columns, ensure any dict values are serialized to JSON strings
columnar_known_types[field.name] = [
None if x is None else json_dumps(x) if isinstance(x, dict) else x
for x in columnar_known_types[field.name]
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: this serialization could potentially lose data if the dict contains types that can't be serialized to JSON. Consider adding a warning log similar to line 113 to alert users about the conversion.

@EDsCODE EDsCODE requested a review from a team February 11, 2025 04:29
EDsCODE and others added 3 commits February 11, 2025 08:52
…pers.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@EDsCODE EDsCODE merged commit 8e2f6fc into master Feb 11, 2025
96 checks passed
@EDsCODE EDsCODE deleted the import-bytes-dict-error branch February 11, 2025 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants