S3 ingestion fails for parquet files when assumeRoleArn is used #19619

Open
trina242 opened this issue Jan 31, 2025 · 0 comments · May be fixed by #19620

Affected module
Ingestion Framework

Describe the bug
The S3 ingestion task throws an unhandled exception when reading a Parquet file if assumeRoleArn is provided instead of access keys:

[2025-01-31, 08:00:44 UTC] {datalake_utils.py:69} ERROR - Error fetching file [olxgroup-reservoir-ares/local/odyn/jobs/poc_search_impressions_with_interactions/20230223_133234_04454_w9k7y_93212f86-8e23-4b69-99bf-4201991dbc52] using [S3Config] due to: [Error reading dataframe due to [Forbidden]]
[2025-01-31, 08:00:44 UTC] {status.py:91} WARNING - Wild error while creating Container from bucket details - 'NoneType' object has no attribute 'columns'
[2025-01-31, 08:00:44 UTC] {status.py:92} DEBUG - Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/ingestion/source/storage/s3/metadata.py", line 156, in get_containers
    yield from self._generate_structured_containers(
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/ingestion/source/storage/s3/metadata.py", line 382, in _generate_structured_containers
    ] = self._generate_container_details(
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/ingestion/source/storage/s3/metadata.py", line 297, in _generate_container_details
    columns = self._get_columns(
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/ingestion/source/storage/storage_service.py", line 337, in _get_columns
    extracted_cols = self.extract_column_definitions(
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/ingestion/source/storage/storage_service.py", line 320, in extract_column_definitions
    column_parser = DataFrameColumnParser.create(
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/utils/datalake/datalake_utils.py", line 135, in create
    parser = ParquetDataFrameColumnParser(data_frame)
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/utils/datalake/datalake_utils.py", line 427, in __init__
    self._arrow_table = pa.Table.from_pandas(self.data_frame)
  File "pyarrow/table.pxi", line 4525, in pyarrow.lib.Table.from_pandas
  File "/home/airflow/.local/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 570, in dataframe_to_arrays
    convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
  File "/home/airflow/.local/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 349, in _get_columns_to_convert
    columns = _resolve_columns_of_interest(df, schema, columns)
  File "/home/airflow/.local/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 523, in _resolve_columns_of_interest
    columns = df.columns
AttributeError: 'NoneType' object has no attribute 'columns'

Note that a Forbidden error is raised even though the provided role has access to the file.
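
Reading the log and traceback together, the read error appears to be logged and swallowed, so the column parser later receives None and fails with the AttributeError. The sketch below illustrates that pattern and one possible guard. It is not OpenMetadata's code: `fetch_dataframe`, `columns_or_raise` and `DataFrameReadError` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Optional


class DataFrameReadError(RuntimeError):
    """Hypothetical error type for a file that could not be read."""


def fetch_dataframe(reader: Callable[[str], Any], key: str) -> Optional[Any]:
    """Mimic a fetch helper that logs a read failure and returns None."""
    try:
        return reader(key)
    except Exception as exc:  # for example, a Forbidden response from S3
        print(f"Error fetching file [{key}] due to: [{exc}]")
        return None


def columns_or_raise(data_frame: Optional[Any], key: str) -> list:
    """Fail with a clear message instead of an AttributeError on None."""
    if data_frame is None:
        raise DataFrameReadError(f"No dataframe was read for [{key}]")
    return list(data_frame.columns)
```

With a guard like this, the failure would point at the unreadable file rather than at pyarrow internals. It would not fix the underlying Forbidden error.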

To Reproduce

  • Upload a Parquet file to S3
  • Create or obtain an IAM role with access to the bucket and file
  • Run S3 ingestion with the assumeRoleArn configuration
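
Before running the ingestion, it can help to confirm outside OpenMetadata that the role really can read the object. The following is a minimal sketch using boto3, assuming the caller's default credentials are allowed to assume the role. The function names, the session name `om-s3-check`, and the bucket, key and role ARN in the usage note are placeholders, not values from this report.

```python
from typing import Any, Dict


def s3_client_kwargs_from_role(
    sts_client: Any, role_arn: str, session_name: str = "om-s3-check"
) -> Dict[str, str]:
    """Assume the role via STS and return keyword arguments for an S3 client."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        # Temporary credentials are only valid together with the session token.
        "aws_session_token": creds["SessionToken"],
    }


def check_object_readable(bucket: str, key: str, role_arn: str) -> None:
    """Raise botocore's ClientError if the assumed role cannot read the object."""
    import boto3  # imported lazily so the helper above needs no AWS SDK

    kwargs = s3_client_kwargs_from_role(boto3.client("sts"), role_arn)
    s3 = boto3.client("s3", **kwargs)
    s3.head_object(Bucket=bucket, Key=key)
```

For example, `check_object_readable("my-bucket", "path/to/file.parquet", "arn:aws:iam::123456789012:role/example")` returns silently when the object is readable. If that succeeds while the ingestion still logs Forbidden, the problem is more likely in how the ingestion passes credentials than in the role's policy.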

Expected behavior
The Parquet file should be read and its structure ingested into OpenMetadata.

Version:

  • OS: macOS Sequoia, Debian Bookworm
  • Python version: 3.11, 3.12
  • OpenMetadata version: 1.6.3
  • OpenMetadata Ingestion package version: openmetadata-ingestion==1.6.3


@trina242 trina242 linked a pull request Jan 31, 2025 that will close this issue