DRILL-8507, DRILL-8508 Better handling of partially missing parquet columns #2937

Open: wants to merge 3 commits into base: master

Commits on Aug 28, 2024

  1. DRILL-8507 Missing parquet columns quoted with backticks conflict with existing ones
    
    1. In ParquetSchema#createMissingColumn, replaced col.toExpr() with col.getAsUnescapedPath() so that the missing column name is not quoted with backticks
    2. Fixed a typo in UnionAllRecordBatch ("counthas" -> "counts")
    3. In TestParquetFilterPushDown, worked around a NumberFormatException with CONVERT_TO
    4. Removed testCoalesceWithUntypedNullValues* test methods from TestCaseNullableTypes
    5. Moved testCoalesceOnNotExistentColumns* test methods from TestUntypedNull to a separate TestParquetMissingColumns and made them expect Nullable Int instead of Untyped Null
    6. Created new TestParquetPartiallyMissingColumns test class with test cases for "backticks problem"
    ychernysh committed Aug 28, 2024
    c385b6e
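The "backticks problem" from this commit can be illustrated with a minimal sketch. It does not use Drill's classes: `quoted`, `register`, and the string-keyed map are hypothetical stand-ins, assumed only to show why a missing column registered under a backtick-quoted name fails to match the column of the same name read from another file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BacktickNameSketch {
  // Hypothetical stand-in for an escaped rendering of a column name.
  static String quoted(String name) {
    return "`" + name + "`";
  }

  // Registers a column unless the name is already present;
  // returns the number of distinct columns afterwards.
  static int register(Map<String, String> columns, String name, String type) {
    columns.putIfAbsent(name, type);
    return columns.size();
  }

  public static void main(String[] args) {
    // Quoted name: the missing column is treated as a separate column.
    Map<String, String> withQuotes = new LinkedHashMap<>();
    register(withQuotes, "age", "BIGINT");
    System.out.println(register(withQuotes, quoted("age"), "INT")); // prints 2

    // Unescaped name: the missing column matches the existing one.
    Map<String, String> unescaped = new LinkedHashMap<>();
    register(unescaped, "age", "BIGINT");
    System.out.println(register(unescaped, "age", "INT")); // prints 1
  }
}
```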
  2. DRILL-8508 Choosing the best suitable major type for a partially missing parquet column (minor type solution)
    
    1. Passed the overall table schema from AbstractParquetRowGroupScan to ParquetSchema
    2. In ParquetSchema#createMissingColumn, used the minor type from that schema instead of hardcoding INT
    ychernysh committed Aug 28, 2024
    5a775e3
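The minor type lookup described in this commit might be sketched as follows. The `MinorType` enum and `typeForMissingColumn` here are simplified, hypothetical stand-ins rather than Drill's actual types; the sketch assumes the table schema is a plain map and that INT remains the fallback for a column that no file contains.

```java
import java.util.Map;

class MissingColumnTypeSketch {
  // Simplified, hypothetical subset of minor types.
  enum MinorType { INT, BIGINT, VARCHAR, FLOAT8 }

  // Picks the type for a column missing from the current file:
  // the type recorded in the overall table schema if there is one,
  // otherwise INT as the fallback.
  static MinorType typeForMissingColumn(Map<String, MinorType> tableSchema, String column) {
    return tableSchema.getOrDefault(column, MinorType.INT);
  }

  public static void main(String[] args) {
    Map<String, MinorType> tableSchema = Map.of("name", MinorType.VARCHAR);
    System.out.println(typeForMissingColumn(tableSchema, "name"));    // prints VARCHAR
    System.out.println(typeForMissingColumn(tableSchema, "unknown")); // prints INT
  }
}
```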
  3. DRILL-8508 Choosing the best suitable major type for a partially missing parquet column (data mode solution)
    
    1. Added TypeCastRules#getLeastRestrictiveMajorType method for convenience
    2. In Metadata, added data mode resolution (so that it always prefers the less restrictive one) when collecting file schemas and merging them into a single table schema. Synchronized the merging to accomplish that
    3. In ParquetTableMetadataUtils, made a column OPTIONAL in the overall table schema if it is OPTIONAL or missing in any of the files
    4. For such cases, added enforcement of the OPTIONAL data mode in ParquetSchema, ParquetColumnMetadata and ColumnReaderFactory. Now, even if a file has the column as REQUIRED but it is needed as OPTIONAL, the nullable column reader and nullable value vector are created
    5. Added "() -> 1" initialization for definitionLevels in PageReader so that the nullable column reader is able to read REQUIRED columns
    6. Added testEnforcingOptional* test cases in TestParquetPartiallyMissingColumns
    ychernysh committed Aug 28, 2024
    c3f7a72
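The data mode rules in this commit can be sketched roughly as below. This is not Drill's code: the two-value `DataMode` enum, `resolve`, and `readAsNullable` are hypothetical simplifications. The sketch assumes a flat column, where a definition level of 1 means "value present", so a supplier that always returns 1 lets a nullable reader consume a REQUIRED column that stores no definition levels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.PrimitiveIterator;
import java.util.function.IntSupplier;

class OptionalEnforcementSketch {
  // Simplified, hypothetical data modes (only the two discussed here).
  enum DataMode { REQUIRED, OPTIONAL }

  // OPTIONAL is less restrictive than REQUIRED, so it wins any merge.
  static DataMode leastRestrictive(DataMode a, DataMode b) {
    return (a == DataMode.OPTIONAL || b == DataMode.OPTIONAL)
        ? DataMode.OPTIONAL : DataMode.REQUIRED;
  }

  // A column is OPTIONAL in the table schema if any file has it as
  // OPTIONAL or does not have it at all.
  static DataMode resolve(List<Map<String, DataMode>> fileSchemas, String column) {
    DataMode result = DataMode.REQUIRED;
    for (Map<String, DataMode> schema : fileSchemas) {
      DataMode mode = schema.getOrDefault(column, DataMode.OPTIONAL);
      result = leastRestrictive(result, mode);
    }
    return result;
  }

  // Nullable read: one definition level per slot; level 1 consumes the
  // next stored value, any other level yields null.
  static List<Integer> readAsNullable(int slots, IntSupplier definitionLevels, int[] stored) {
    PrimitiveIterator.OfInt values = Arrays.stream(stored).iterator();
    List<Integer> out = new ArrayList<>();
    for (int i = 0; i < slots; i++) {
      out.add(definitionLevels.getAsInt() == 1 ? Integer.valueOf(values.nextInt()) : null);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map<String, DataMode>> files = List.of(
        Map.of("id", DataMode.REQUIRED, "name", DataMode.REQUIRED),
        Map.of("id", DataMode.REQUIRED)); // "name" is missing in this file
    System.out.println(resolve(files, "id"));   // prints REQUIRED
    System.out.println(resolve(files, "name")); // prints OPTIONAL
    // REQUIRED column read through the nullable path with constant level 1.
    System.out.println(readAsNullable(3, () -> 1, new int[] {7, 8, 9})); // prints [7, 8, 9]
  }
}
```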