Skip to content

Add FormatOptions to Config #15793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 2, 2025
Merged

Add FormatOptions to Config #15793

merged 19 commits into from
May 2, 2025

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Apr 21, 2025

Which issue does this PR close?

Rationale for this change

Right now, there's no way to change formatting params in CLI or when printing batches in datafusion. Completing this should allow us to complete #15395

Now it'll work like this:

cargo run -p datafusion-cli
> select 'abc', 4;
+-------------+----------+
| Utf8("abc") | Int64(4) |
| Utf8        | Int64    |
+-------------+----------+
| abc         | 4        |
+-------------+----------+
1 row(s) fetched. 
Elapsed 0.004 seconds.

So later we can do

> select 'abc', 4;
+-------------+----------+
|'abc'        | 4        |
| Utf8        | Int64    |
+-------------+----------+
| abc         | 4        |
+-------------+----------+
1 row(s) fetched. 
Elapsed 0.004 seconds.

What changes are included in this PR?

Add formatting options to datafusion params

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, but backward compatible

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate functions Changes to functions implementation labels Apr 21, 2025

#[derive(Debug, Clone, PartialEq)]
#[allow(clippy::large_enum_variant)]
pub enum FormatOptions {
pub enum OutputFormat {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to avoid confusion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @blaginin, this PR is really nice. I've made a quick look, and think that perhaps we should rename this config in even all places as "output format" ? Directly saying "datafusion.format" etc. could make people think something related with objects?

Copy link
Contributor Author

@blaginin blaginin Apr 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for checking this out! 🙏 I feel like "format" can be better than "output format" because:

  • Below we've discussed that it would be good to add support for these params in UDF functions - so in a way that will be related to the objects.
  • Output format IMO implies that it is something related to the way the file is written - but in reality, it has nothing to do with the CSV /Parquet saving

@github-actions github-actions bot removed the functions Changes to functions implementation label Apr 22, 2025

pub const DEFAULT_CLI_FORMAT_OPTIONS: FormatOptions<'static> = FormatOptions::new()
.with_duration_format(DurationFormat::Pretty)
.with_null("NULL");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it would be good to add NULL to DEFAULT_FORMAT_OPTIONS -- but because it requires updating a lot of tests, will do it separately

@blaginin blaginin marked this pull request as ready for review April 24, 2025 21:03
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 24, 2025
@blaginin
Copy link
Contributor Author

FYI @Omega359 @jayzhan211

@Omega359
Copy link
Contributor

Interesting. I'm still processing what the options imply. For example, why are there no defaults for the date/time formats? Or if there are defaults, the comments should point to where there are set and what they are.

Once we are able to access the config from udf's it would be interesting to be able to use these options as defaults for things like to_char

@blaginin
Copy link
Contributor Author

For example, why are there no defaults for the date/time formats? Or if there are defaults, the comments should point to where there are set and what they are.

That is a great point! Format wasn't hardcoded anywhere - in fact, we've relied on a debug implementation of that in chrono - even though it is pretty stable, it feels like a much better idea to set them here.

@blaginin
Copy link
Contributor Author

Once we are able to access the config from udf's it would be interesting to be able to use these options as defaults for things like to_char

💯

@blaginin blaginin self-assigned this Apr 30, 2025
@alamb
Copy link
Contributor

alamb commented May 1, 2025

Are we happy with this PR? Shall we merge it in?

@blaginin
Copy link
Contributor Author

blaginin commented May 1, 2025

Asked Bruce to re-review, and then I'm ready to merge 🤗

@@ -372,6 +381,15 @@ datafusion.explain.physical_plan_only false When set to true, the explain statem
datafusion.explain.show_schema false When set to true, the explain statement will print schema information
datafusion.explain.show_sizes true When set to true, the explain statement will print the partition sizes
datafusion.explain.show_statistics false When set to true, the explain statement will print operator statistics for physical plans
datafusion.format.date_format %Y-%m-%d Date format for date arrays
datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f Format for DateTime arrays
datafusion.format.duration_format pretty Duration format. Can be either `"pretty"` or `"ISO8601"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think case may matter here for ISO8601 vs iso8601 given the impl in config.rs. I would suggest that config would use .to_lowercase() to make it not matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch! fixed in 8374883

@Omega359
Copy link
Contributor

Omega359 commented May 1, 2025

Beyond the config try_into looking like it's case sensitive for duration_format I think this is good. There is a breaking change with the rename of FormatOptions to OutputFormat that should be noted in the upgrade guide for the release this goes into.

@blaginin
Copy link
Contributor Author

blaginin commented May 2, 2025

Thank for the reviews!!!

@blaginin blaginin merged commit f40e0db into apache:main May 2, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an option to display column types in the table
5 participants