-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add FormatOptions
to Config
#15793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FormatOptions
to Config
#15793
Conversation
|
||
#[derive(Debug, Clone, PartialEq)] | ||
#[allow(clippy::large_enum_variant)] | ||
pub enum FormatOptions { | ||
pub enum OutputFormat { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to avoid confusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @blaginin, this PR is really nice. I've made a quick look, and think that perhaps we should rename this config in even all places as "output format" ? Directly saying "datafusion.format" etc. could make people think something related with objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thanks for checking this out! 🙏 I feel like "format" can be better than "output format" because:
- Below we've discussed that it would be good to add support for these params in UDF functions - so in a way that will be related to the objects.
- Output format IMO implies that it is something related to the way the file is written - but in reality, it has nothing to do with the CSV /Parquet saving
|
||
pub const DEFAULT_CLI_FORMAT_OPTIONS: FormatOptions<'static> = FormatOptions::new() | ||
.with_duration_format(DurationFormat::Pretty) | ||
.with_null("NULL"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it would be good to add NULL
to DEFAULT_FORMAT_OPTIONS
-- but because it requires updating a lot of tests, will do it separately
FYI @Omega359 @jayzhan211 |
Interesting. I'm still processing what the options imply. For example, why are there no defaults for the date/time formats? Or if there are defaults, the comments should point to where there are set and what they are. Once we are able to access the config from udf's it would be interesting to be able to use these options as defaults for things like to_char |
That is a great point! Format wasn't hardcoded anywhere - in fact, we've relied on a debug implementation of that in chrono - even though it is pretty stable, it feels like a much better idea to set them here. |
💯 |
Are we happy with this PR? Shall we merge it in? |
Asked Bruce to re-review, and then I'm ready to merge 🤗 |
@@ -372,6 +381,15 @@ datafusion.explain.physical_plan_only false When set to true, the explain statem | |||
datafusion.explain.show_schema false When set to true, the explain statement will print schema information | |||
datafusion.explain.show_sizes true When set to true, the explain statement will print the partition sizes | |||
datafusion.explain.show_statistics false When set to true, the explain statement will print operator statistics for physical plans | |||
datafusion.format.date_format %Y-%m-%d Date format for date arrays | |||
datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f Format for DateTime arrays | |||
datafusion.format.duration_format pretty Duration format. Can be either `"pretty"` or `"ISO8601"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think case may matter here for ISO8601 vs iso8601 given the impl in config.rs. I would suggest that config would use .to_lowercase() to make it not matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch! fixed in 8374883
Beyond the config try_into looking like it's case sensitive for duration_format I think this is good. There is a breaking change with the rename of FormatOptions to OutputFormat that should be noted in the upgrade guide for the release this goes into. |
Thank for the reviews!!! |
Which issue does this PR close?
Rationale for this change
Right now, there's no way to change formatting params in CLI or when printing batches in datafusion. Completing this should allow us to complete #15395
Now it'll work like this:
So later we can do
What changes are included in this PR?
Add formatting options to datafusion params
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, but backward compatible