Replace infer_schema_length by infer_schema #972

josevalim · 2024-08-27T13:29:04Z

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:

infer_schema: true | false | non_neg_integer()

Where true enables, false disables, and the integer configures the length. The default can be the same as today.


cigrainger · 2024-08-27T14:46:21Z

I like this, but what would we use for all rows? IIUC true -> default (1000 rows).

josevalim · 2024-08-27T14:57:56Z

true means all rows.
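
To make the proposed semantics concrete, here is a minimal sketch of how the three forms of infer_schema could map onto today's infer_schema_length convention (nil for all rows, 0 for disabled). It is written in Rust only to match the native snippet quoted later in the thread; the `InferSchema` enum and `to_legacy_length` function are hypothetical names for illustration, not part of Explorer or Polars.

```rust
/// Hypothetical model of the proposed `infer_schema` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InferSchema {
    /// `true`: infer the schema from all rows.
    All,
    /// `false`: do not infer the schema.
    Disabled,
    /// integer: infer the schema from at most this many rows.
    Rows(usize),
}

/// Translate the proposed option into the current `infer_schema_length`
/// convention, where `None` (nil) means "all rows" and `Some(0)` disables
/// inference. Note that `Rows(0)` therefore also disables inference.
fn to_legacy_length(opt: InferSchema) -> Option<usize> {
    match opt {
        InferSchema::All => None,
        InferSchema::Disabled => Some(0),
        InferSchema::Rows(n) => Some(n),
    }
}

fn main() {
    for opt in [InferSchema::All, InferSchema::Disabled, InferSchema::Rows(1000)] {
        println!("{:?} -> {:?}", opt, to_legacy_length(opt));
    }
}
```

The point of the sketch is that the new option names each case explicitly, while the old integer-or-nil encoding overloads two sentinel values.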

lei0zhou · 2024-08-27T18:01:59Z

Thanks for improving this! Just sharing how DuckDB does it.
It has two parameters:

auto_detect: true | false
sample_size: BIGINT (-1 means all rows; default 20480)

ref:
CSV Import – DuckDB
CSV Auto Detection – DuckDB

I am more than happy to take a stab at this

ceyhunkerti · 2024-09-25T20:37:12Z

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:
infer_schema: true | false | non_neg_integer()
Where true enables, false disables, and the integer configures the length. The default can be the same as today.

Is this only for CSV, or should we also change it for load_ndjson?
Also, one strange thing I didn't get: the Polars side doesn't seem to have an option to disable schema inference for NDJSON.

👉🏼 Given an Option<NonZeroUsize> to infer the schema, what I understand is:

if it's None, it will use the entire file
otherwise, it will use the given number of rows
it will fail at compile time if you give 0

    /// Set the JSON reader to infer the schema of the file. Currently, this is only used when reading from
    /// [`JsonFormat::JsonLines`], as [`JsonFormat::Json`] reads in the entire array anyway.
    ///
    /// When using [`JsonFormat::JsonLines`], `max_records = None` will read the entire buffer in order to infer the
    /// schema, `Some(1)` would look only at the first record, `Some(2)` the first two records, etc.
    ///
    /// It is an error to pass `max_records = Some(0)`, as a schema cannot be inferred from 0 records when deserializing
    /// from JSON (unlike CSVs, there is no header row to inspect for column names).
    pub fn infer_schema_len(mut self, max_records: Option<NonZeroUsize>) -> Self {
        self.infer_schema_len = max_records;
        self
    }
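
One nuance on the 0 case: a `NonZeroUsize` cannot hold 0 by construction, but a value built at runtime goes through `NonZeroUsize::new`, which returns `None` for 0 rather than failing to compile. Passed straight through, that `None` would read as "use the entire buffer". The sketch below guards against that; `ndjson_infer_len` is a hypothetical helper, not an existing Explorer or Polars function.

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper: convert a plain row count into the
/// `Option<NonZeroUsize>` shape that `infer_schema_len` accepts.
/// Returns `Err` for 0 instead of silently producing `None`,
/// which the reader would treat as "scan the entire buffer".
fn ndjson_infer_len(rows: Option<usize>) -> Result<Option<NonZeroUsize>, String> {
    match rows {
        // No limit requested: infer from everything.
        None => Ok(None),
        // A limit was requested: reject 0, wrap anything else.
        Some(n) => NonZeroUsize::new(n)
            .map(Some)
            .ok_or_else(|| "cannot infer an NDJSON schema from 0 rows".to_string()),
    }
}

fn main() {
    // `NonZeroUsize::new` is a runtime check that yields `None` for 0.
    assert!(NonZeroUsize::new(0).is_none());

    println!("{:?}", ndjson_infer_len(None));
    println!("{:?}", ndjson_infer_len(Some(2)));
    println!("{:?}", ndjson_infer_len(Some(0)));
}
```

Under this reading, infer_schema: false (or 0) has no direct NDJSON equivalent, so a binding would need to either raise or require an explicit schema in that case.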
