
Schema check fails on over 100 columns #2129

Open
TomBurdge opened this issue Jul 8, 2024 · 4 comments
@TomBurdge

For tables with more than 100 columns, Soda only checks the first 100 columns.
This happens with the latest version of soda-spark-core.

Having so many columns is, of course, uncommon and not exactly best practice, but it does occur. In this case we receive the data in this form from upstream and wish to check only the 3 or 4 columns that we actually use. Selecting only the relevant columns before scanning would pre-empt the check itself.

To recreate:

```
# requirements
polars
pyspark
faux_lars
soda-spark-core
```

```python
# replicating the soda error
import polars as pl
# faux_lars is a polars fake-data-generation library I wrote;
# something like numpy/faker would probably work just as well
from faux_lars import generate_lazyframe
from pyspark.sql import SparkSession
from soda.scan import Scan

# a local Spark session (the original snippet assumed `spark` already existed)
spark = SparkSession.builder.getOrCreate()

rows = 500
cols = {"col_" + str(i): "str" for i in range(200)}
df = spark.createDataFrame(
    generate_lazyframe(cols, rows, "en")
    .collect()
    .to_pandas(use_pyarrow_extension_array=False)
)

yaml_str = f"""
checks for example_table:
    - row_count > 0
    - schema:
        name: Confirm that required columns are present
        fail:
            when required column missing: {list(cols.keys())}
"""

scan = Scan()
check_name = "example_scan"
scan.set_data_source_name(check_name)
scan.add_spark_session(spark, check_name)
scan.add_sodacl_yaml_str(yaml_str)
df.createOrReplaceTempView("example_table")
scan.execute()

result = scan.build_scan_results()
print(result["hasFailures"])
result["logs"]
```

Soda records the schema measured as:

```
schema_measured = [col_0 string, col_1 string, col_2 string, col_3 string, col_4 string, col_5 string, col_6 string, col_7 string, col_8 string, col_9 string, col_10 string, col_11 string, col_12 string, col_13 string, col_14 string, col_15 string, col_16 string, col_17 string, col_18 string, col_19 string, col_20 string, col_21 string, col_22 string, col_23 string, col_24 string, col_25 string, col_26 string, col_27 string, col_28 string, col_29 string, col_30 string, col_31 string, col_32 string, col_33 string, col_34 string, col_35 string, col_36 string, col_37 string, col_38 string, col_39 string, col_40 string, col_41 string, col_42 string, col_43 string, col_44 string, col_45 string, col_46 string, col_47 string, col_48 string, col_49 string, col_50 string, col_51 string, col_52 string, col_53 string, col_54 string, col_55 string, col_56 string, col_57 string, col_58 string, col_59 string, col_60 string, col_61 string, col_62 string, col_63 string, col_64 string, col_65 string, col_66 string, col_67 string, col_68 string, col_69 string, col_70 string, col_71 string, col_72 string, col_73 string, col_74 string, col_75 string, col_76 string, col_77 string, col_78 string, col_79 string, col_80 string, col_81 string, col_82 string, col_83 string, col_84 string, col_85 string, col_86 string, col_87 string, col_88 string, col_89 string, col_90 string, col_91 string, col_92 string, col_93 string, col_94 string, col_95 string, col_96 string, col_97 string, col_98 string, col_99 string]
```
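Note that the measured schema above runs from `col_0` through `col_99`: exactly 100 of the 200 columns. A quick sketch (reconstructing those entries by pattern, not reading Soda's internals) that makes the cutoff explicit:

```python
# The measured-schema dump follows the pattern "col_<i> string" for
# i = 0..99 only, i.e. exactly 100 of the 200 created columns survive.
schema_measured = [f"col_{i} string" for i in range(100)]

print(len(schema_measured))   # 100
print(schema_measured[-1])    # col_99 string
```

This is consistent with a hard cap of 100 on the number of columns Soda measures.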

And the DQ failure is:

```
fail_missing_column_names = [col_135, col_170, col_127, col_186, col_122, col_192, col_142, col_154, col_164, col_184, col_157, col_163, col_180, col_173, col_144, col_195, col_110, col_197, col_105, col_156, col_111, col_169, col_150, col_102, col_114, col_171, col_117, col_129, col_136, col_162, col_134, col_158, col_107, col_199, col_108, col_101, col_143, col_179, col_119, col_140, col_113, col_185, col_174, col_130, col_167, col_175, col_149, col_155, col_148, col_196, col_132, col_137, col_172, col_198, col_178, col_159, col_151, col_187, col_194, col_133, col_165, col_106, col_109, col_191, col_183, col_193, col_104, col_118, col_147, col_146, col_181, col_115, col_152, col_176, col_131, col_121, col_161, col_126, col_166, col_124, col_128, col_189, col_160, col_141, col_125, col_138, col_112, col_123, col_145, col_100, col_120, col_153, col_168, col_116, col_182, col_139, col_188, col_103, col_190, col_177]
```
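Until this is fixed, one possible workaround (a sketch, not part of the Soda API) is to verify the required columns manually against the DataFrame's own schema, where no 100-column cap applies. The lists below are hypothetical stand-ins; in the repro above, `schema_columns` would be `df.columns`:

```python
def missing_required_columns(schema_columns, required):
    """Return the required columns absent from the measured schema."""
    present = set(schema_columns)
    return [c for c in required if c not in present]

# 200-column schema, as in the repro (stand-in for df.columns)
schema_columns = ["col_" + str(i) for i in range(200)]

# only the handful of columns actually used downstream,
# plus one genuinely missing column to show a real failure
required = ["col_2", "col_57", "col_150", "col_250"]

print(missing_required_columns(schema_columns, required))  # ['col_250']
```

Unlike the SodaCL schema check above, this correctly sees `col_150`, which sits past the 100-column cutoff.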
@tools-soda

CLOUD-8038

@stampthecoder

@tools-soda does this mean it's going to be fixed in the cloud version and not in the core version?

@nellekes

nellekes commented Aug 6, 2024

Hi, we run into this issue in our use case as well. Is there any update on whether this will be fixed?

@carmigno

[image attachment]
