Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tracing option for reading strings as (Large)Utf8 #246

Merged
merged 3 commits into from
Oct 13, 2024

Conversation

jkylling
Copy link
Contributor

@jkylling jkylling commented Oct 9, 2024

Should be on top of #245

Is there something missing in coerce_primitive_type? There seems to be an asymmetry between Utf8 and LargeUtf8 there.

Could some instructions on how to run tests be added to the README. To run the full test suite the appropriate --features and --cfg flags must be set, and I was a bit confused about which flags to set to make the tests run.

@chmp
Copy link
Owner

chmp commented Oct 9, 2024

I would prefer to keep both changes, large lists and Utf8, in a single PR as they are so closely related. Probably this one.

Re how to execute tests the x.py script bundles all common commands and is documented in Development.md. It could be advertised a bit more prominently, though. Running python x.py test will set everything up.

Re asymetry: as only LargeUtf8 can be encountered during from_samples, only this case is handled in the coercion.

Copy link
Owner

@chmp chmp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. Looks great, but you're right that coerce_primitives should be updated.

Please also expand the coercion tests. I will look them up for you.

@chmp
Copy link
Owner

chmp commented Oct 9, 2024

I would prefer to keep both changes, large lists and Utf8, in a single PR as they are so closely related. Probably this one.

Okay. Actually the tests for the coercion could be a bit more tricky. Separate PRs is the way to go :)

@chmp
Copy link
Owner

chmp commented Oct 9, 2024

As for the tests, I would include the coercion tests here:

E.g., by changing the macro to:

Potential test impl
    macro_rules! test {
        ($name:ident, $($data:tt)*) => {
            mod $name {
                use super::super::*;
                
                #[test]
                fn large_utf8() -> PanicOnError<()> {
                    let expected = SerdeArrowSchema::from_value(&json!([
                       {
                          "name": "date",
                          "data_type": "LargeUtf8",
                          "nullable": true,
                      },
                  ]))?;

                  let data = json!($($data)*);
                  let actual = SerdeArrowSchema::from_samples(&data, TracingOptions::default().guess_dates(true))?;
                  assert_eq!(actual, expected);
                  Ok(())
              }
              #[test]
                fn utf8() -> PanicOnError<()> {
                    let expected = SerdeArrowSchema::from_value(&json!([
                       {
                          "name": "date",
                          "data_type": "Utf8",
                          "nullable": true,
                      },
                  ]))?;

                  let data = json!($($data)*);
                  let actual = SerdeArrowSchema::from_samples(&data, TracingOptions::default().guess_dates(true).strings_as_large_utf8(false))?;
                  assert_eq!(actual, expected);
                  Ok(())
              }
            }
        };
    }

@jkylling jkylling force-pushed the add-strings_as_large_utf8-tracing-option branch from 68b6160 to 0ed54e5 Compare October 10, 2024 07:14
@jkylling jkylling requested a review from chmp October 10, 2024 07:14
Copy link
Owner

@chmp chmp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkylling Thanks a lot for your changes!

I took the opportunity to slightly touch up the TracingOptions docs and make it more consistent between the different items.

@chmp chmp enabled auto-merge October 13, 2024 09:22
@chmp chmp merged commit 54562a5 into chmp:main Oct 13, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants