Replace infer_schema_length by infer_schema #972

Open
josevalim opened this issue Aug 27, 2024 · 4 comments

Comments

@josevalim
Member

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:

infer_schema: true | false | non_neg_integer()

Where true enables, false disables, and the integer configures the length. The default can be the same as today.

@cigrainger
Member

I like this, but what would we use for all rows? IIUC true -> default (1000 rows).

@josevalim
Member Author

true means all rows.
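
For illustration only, the three-way option settled on above (true = all rows, false = disabled, integer = row count) can be modeled as a small type and translated back to today's infer_schema_length convention, where nil means all rows and 0 disables inference. This is a sketch in Rust, not the project's implementation; the names InferSchema and to_legacy_length are hypothetical.

```rust
/// Hypothetical model of the proposed `infer_schema` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InferSchema {
    /// `infer_schema: true` -> infer from all rows.
    AllRows,
    /// `infer_schema: false` -> inference disabled.
    Disabled,
    /// `infer_schema: n` -> infer from the first `n` rows.
    Rows(usize),
}

/// Translate to the legacy `infer_schema_length` convention described in
/// the issue: `None` (nil) means all rows, `Some(0)` means disabled.
fn to_legacy_length(opt: InferSchema) -> Option<usize> {
    match opt {
        InferSchema::AllRows => None,
        InferSchema::Disabled => Some(0),
        InferSchema::Rows(n) => Some(n),
    }
}
```

Under this sketch, InferSchema::Rows(0) and InferSchema::Disabled map to the same legacy value, which is the overlap the proposal's false variant makes explicit.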

@lei0zhou

lei0zhou commented Aug 27, 2024

Thanks for improving this! Just sharing how DuckDB does it.
It has two parameters:

  • auto_detect: true | false
  • sample_size: BIGINT (-1 means all rows; default 20480)

ref:
CSV Import – DuckDB
CSV Auto Detection – DuckDB

I am more than happy to take a stab at this
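
As a rough sketch of the two-parameter style described in this comment, the pair (auto_detect, sample_size) can be resolved into a single sampling decision. The -1 sentinel and the 20480 default are taken from the comment; the names Sampling and resolve_sampling, and the choice to reject values below -1, are assumptions made for this example and not DuckDB's API.

```rust
/// How many rows a reader would sample for type detection.
#[derive(Debug, PartialEq, Eq)]
enum Sampling {
    Disabled,
    AllRows,
    FirstRows(u64),
}

/// Default sample size quoted in the comment above.
const DEFAULT_SAMPLE_SIZE: i64 = 20480;

/// Combine the two parameters into one decision.
/// Assumption: any negative value other than -1 is rejected.
fn resolve_sampling(auto_detect: bool, sample_size: Option<i64>) -> Result<Sampling, String> {
    if !auto_detect {
        // Detection is off, so the sample size is irrelevant.
        return Ok(Sampling::Disabled);
    }
    match sample_size.unwrap_or(DEFAULT_SAMPLE_SIZE) {
        -1 => Ok(Sampling::AllRows),
        n if n >= 0 => Ok(Sampling::FirstRows(n as u64)),
        n => Err(format!("invalid sample_size: {}", n)),
    }
}
```

Compared with a single infer_schema option, this style needs a sentinel (-1) for "all rows" and leaves sample_size meaningless when auto_detect is false.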

@ceyhunkerti
Contributor

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:

infer_schema: true | false | non_neg_integer()

Where true enables, false disables, and the integer configures the length. The default can be the same as today.

👉🏼 Given that schema inference takes an Option<NonZeroUsize>, my understanding is:

  • if it is None, the entire file is used
  • otherwise, the given number of rows is used
  • it fails at compile time if you give 0
    /// Set the JSON reader to infer the schema of the file. Currently, this is only used when reading from
    /// [`JsonFormat::JsonLines`], as [`JsonFormat::Json`] reads in the entire array anyway.
    ///
    /// When using [`JsonFormat::JsonLines`], `max_records = None` will read the entire buffer in order to infer the
    /// schema, `Some(1)` would look only at the first record, `Some(2)` the first two records, etc.
    ///
    /// It is an error to pass `max_records = Some(0)`, as a schema cannot be inferred from 0 records when deserializing
    /// from JSON (unlike CSVs, there is no header row to inspect for column names).
    pub fn infer_schema_len(mut self, max_records: Option<NonZeroUsize>) -> Self {
        self.infer_schema_len = max_records;
        self
    }
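
A minimal sketch of how a caller-supplied row count could be converted into the Option<NonZeroUsize> shape used by the signature above. The helper name to_max_records and its error message are hypothetical; only std::num::NonZeroUsize is a real API here. Zero is unrepresentable in NonZeroUsize, so for a value only known at runtime the rejection happens when NonZeroUsize::new returns None.

```rust
use std::num::NonZeroUsize;

/// Convert an optional row count into `Option<NonZeroUsize>`.
/// `None` means "use the entire input"; `Some(0)` is rejected because
/// a schema cannot be inferred from zero records.
fn to_max_records(rows: Option<usize>) -> Result<Option<NonZeroUsize>, String> {
    match rows {
        None => Ok(None),
        Some(n) => NonZeroUsize::new(n)
            .map(Some)
            .ok_or_else(|| "cannot infer a schema from 0 records".to_string()),
    }
}
```

With this shape there is no separate "disabled" state: the type only distinguishes "all records" from "the first n records, n >= 1", so a false variant as proposed in the issue would have to be handled before reaching such a reader.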
