
Schema check fails on over 100 columns #2129

Open
TomBurdge opened this issue Jul 8, 2024 · 4 comments
@TomBurdge

For tables with more than 100 columns, Soda only checks the first 100 columns.
This happens with the latest version of soda-spark-core.

Having so many columns is, of course, uncommon and not exactly best practice, but it does occur. In this case we receive the data in this form from upstream and wish to check only the 3 or 4 columns that we actually use. Selecting only the relevant columns before scanning would pre-empt the check itself.

To recreate:

```
# requirements
polars
pyspark
faux_lars
soda-spark-core
```

```python
# replicating the soda error
import polars as pl
# faux_lars is a polars fake-data-generation library I wrote;
# something like numpy/faker would probably work just as well
from faux_lars import generate_lazyframe
from pyspark.sql import SparkSession
from soda.scan import Scan

# a local Spark session (the original snippet assumed `spark` already existed)
spark = SparkSession.builder.getOrCreate()

rows = 500
cols = {"col_" + str(i): "str" for i in range(200)}
df = spark.createDataFrame(
    generate_lazyframe(cols, rows, "en")
    .collect()
    .to_pandas(use_pyarrow_extension_array=False)
)

yaml_str = f"""
checks for example_table:
    - row_count > 0
    - schema:
        name: Confirm that required columns are present
        fail:
            when required column missing: {list(cols.keys())}
"""

scan = Scan()
check_name = "example_scan"
scan.set_data_source_name(check_name)
scan.add_spark_session(spark, check_name)
scan.add_sodacl_yaml_str(yaml_str)
df.createOrReplaceTempView("example_table")
scan.execute()

result = scan.build_scan_results()
print(result["hasFailures"])
result["logs"]
```

Soda records the schema measured as:

```
schema_measured = [col_0 string, col_1 string, col_2 string, col_3 string, col_4 string, col_5 string, col_6 string, col_7 string, col_8 string, col_9 string, col_10 string, col_11 string, col_12 string, col_13 string, col_14 string, col_15 string, col_16 string, col_17 string, col_18 string, col_19 string, col_20 string, col_21 string, col_22 string, col_23 string, col_24 string, col_25 string, col_26 string, col_27 string, col_28 string, col_29 string, col_30 string, col_31 string, col_32 string, col_33 string, col_34 string, col_35 string, col_36 string, col_37 string, col_38 string, col_39 string, col_40 string, col_41 string, col_42 string, col_43 string, col_44 string, col_45 string, col_46 string, col_47 string, col_48 string, col_49 string, col_50 string, col_51 string, col_52 string, col_53 string, col_54 string, col_55 string, col_56 string, col_57 string, col_58 string, col_59 string, col_60 string, col_61 string, col_62 string, col_63 string, col_64 string, col_65 string, col_66 string, col_67 string, col_68 string, col_69 string, col_70 string, col_71 string, col_72 string, col_73 string, col_74 string, col_75 string, col_76 string, col_77 string, col_78 string, col_79 string, col_80 string, col_81 string, col_82 string, col_83 string, col_84 string, col_85 string, col_86 string, col_87 string, col_88 string, col_89 string, col_90 string, col_91 string, col_92 string, col_93 string, col_94 string, col_95 string, col_96 string, col_97 string, col_98 string, col_99 string]
```
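Note that the measured schema above runs from `col_0` through `col_99`: exactly 100 of the 200 columns. A quick sketch (reconstructing those entries by pattern, not reading Soda's internals) that makes the cutoff explicit:

```python
# The measured-schema dump follows the pattern "col_<i> string" for
# i = 0..99 only, i.e. exactly 100 of the 200 created columns survive.
schema_measured = [f"col_{i} string" for i in range(100)]

print(len(schema_measured))   # 100
print(schema_measured[-1])    # col_99 string
```

This is consistent with a hard cap of 100 on the number of columns Soda measures.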

And the DQ failure is:

```
fail_missing_column_names = [col_135, col_170, col_127, col_186, col_122, col_192, col_142, col_154, col_164, col_184, col_157, col_163, col_180, col_173, col_144, col_195, col_110, col_197, col_105, col_156, col_111, col_169, col_150, col_102, col_114, col_171, col_117, col_129, col_136, col_162, col_134, col_158, col_107, col_199, col_108, col_101, col_143, col_179, col_119, col_140, col_113, col_185, col_174, col_130, col_167, col_175, col_149, col_155, col_148, col_196, col_132, col_137, col_172, col_198, col_178, col_159, col_151, col_187, col_194, col_133, col_165, col_106, col_109, col_191, col_183, col_193, col_104, col_118, col_147, col_146, col_181, col_115, col_152, col_176, col_131, col_121, col_161, col_126, col_166, col_124, col_128, col_189, col_160, col_141, col_125, col_138, col_112, col_123, col_145, col_100, col_120, col_153, col_168, col_116, col_182, col_139, col_188, col_103, col_190, col_177]
```
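Until this is fixed, one possible workaround (a sketch, not part of the Soda API) is to verify the required columns manually against the DataFrame's own schema, where no 100-column cap applies. The lists below are hypothetical stand-ins; in the repro above, `schema_columns` would be `df.columns`:

```python
def missing_required_columns(schema_columns, required):
    """Return the required columns absent from the measured schema."""
    present = set(schema_columns)
    return [c for c in required if c not in present]

# 200-column schema, as in the repro (stand-in for df.columns)
schema_columns = ["col_" + str(i) for i in range(200)]

# only the handful of columns actually used downstream,
# plus one genuinely missing column to show a real failure
required = ["col_2", "col_57", "col_150", "col_250"]

print(missing_required_columns(schema_columns, required))  # ['col_250']
```

Unlike the SodaCL schema check above, this correctly sees `col_150`, which sits past the 100-column cutoff.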
@tools-soda

CLOUD-8038

@stampthecoder

@tools-soda does this mean it's going to be fixed in the cloud version and not in the core version?

@nellekes

nellekes commented Aug 6, 2024

Hi, we run into this issue in our use case as well. Is there any update on whether this will be fixed?

@carmigno

[image attachment]
