Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: a new formatting API #1127

Open
djc opened this issue Jun 5, 2023 · 13 comments
Open

RFC: a new formatting API #1127

djc opened this issue Jun 5, 2023 · 13 comments

Comments

@djc
Copy link
Member

djc commented Jun 5, 2023

I don't think #1121 is the right approach. The fallibility here is with (a) parsing the format string, which might be invalid, and (b) whether the type we're formatting from contains enough information to fill out the fields the format string has.

Making something work on 0.4.x is "trivial" in a sense because we can just invent new API that does whatever we want. Postponing the failures of parsing from the parsing stage to the formatting stage doesn't sound like a great option, in particular a reasonable approach is to use a long-lived formatter (parsed from a string) for many values. For getting enough data out of the input type, that seems like something we could assert at compile time using generics.

As such, I think the high-level structure here should be:

  1. A fallible function/method that converts a format string to Vec of items (reusing the existing item type if we can)
  2. A fallible function/method that binds a Vec or slice (for the non-alloc case) to a particular input type, failing if the type does not contain enough data for the input type
  3. The outcome of step 2 would be something like a struct Widget<'a, I> { items: Cow<'a, [Items]>, input: PhantomData<I> } (replacing the Cow with just a slice if alloc is not enabled, I think?) which implements Display and will only fail (as you explained in Make StrftimeItems::new return Result #902 (comment)) if the inner Formatter fails
  4. Probably some high level API that wraps over step 2 and 3 above, particularly for the allocating case

To the extent that we can't implement this (efficiently) within the current API (from what I've seen, this will just get ugly with the approach chosen in #902), we should invent a new entry point. The new entry point should be able to satisfy all the use cases for the existing entry points so that we can, at some point in the near future, start deprecating the existing entry points. In particular, there should be affordances for the non-alloc use case and there should be a clean way (that is, with minimal code duplication) to factor in the localization stuff.

cc @jaggededgejustice @pitdicker

@pitdicker
Copy link
Collaborator

Postponing the failures of parsing from the parsing stage to the formatting stage doesn't sound like a great option, in particular a reasonable approach is to use a long-lived formatter (parsed from a string) for many values.

That is a good usecase that is not yet covered by the solution in #902.

For getting enough data out of the input type, that seems like something we could assert at compile time using generics.

While far less common, I think there is some use to keeping the formatting string dynamic. But we can get pretty much the same with const fns?

  1. A fallible function/method that binds a Vec or slice (for the non-alloc case) to a particular input type, failing if the type does not contain enough data for the input type

If we want to allow for an API that doesn't need allocations (in theory) such as we have now it seems to me we have to stick with iterators. Because where can you store a slice of unknown length?

  1. The outcome of step 2 would be something like a struct Widget<'a, I> { items: Cow<'a, [Items]>, input: PhantomData<I> } (…) which implements Display

The type that implements Display has to contain the value it wants to format (or a reference to it). It would basically be the current DelayedFormat.


I do think the idea a method that binds an iterator of formatting items to an input type is good. And also to at some point make going through the resulting type the only way to create a DelayedFormat.

struct Formatter<I, T> {
    items_iter: I,
    value: PhantomData<T>,
}

@djc
Copy link
Member Author

djc commented Jun 5, 2023

For getting enough data out of the input type, that seems like something we could assert at compile time using generics.

While far less common, I think there is some use to keeping the formatting string dynamic. But we can get pretty much the same with const fns?

I'm not sure what you mean by "keeping the formatting string dynamic"?

  1. A fallible function/method that binds a Vec or slice (for the non-alloc case) to a particular input type, failing if the type does not contain enough data for the input type

If we want to allow for an API that doesn't need allocations (in theory) such as we have now it seems to me we have to stick with iterators. Because where can you store a slice of unknown length?

Yeah, eventually the point is iterating over it, and we could have a version that abstracts over roughly something like I: IntoIterator<Item = &Item>. But I think we'll want to provide that types that store a slice or a Vec or some combination of both (through Cow) for storage, since naming iterators can be unergonomic (and part of the point was this to be able to store the value somewhere for repeated formatting use).

  1. The outcome of step 2 would be something like a struct Widget<'a, I> { items: Cow<'a, [Items]>, input: PhantomData<I> } (…) which implements Display

The type that implements Display has to contain the value it wants to format (or a reference to it). It would basically be the current DelayedFormat.

Right, I guess what I'm saying is that widget.format(dt) will yield (something like) a DelayedFormat.

I do think the idea a method that binds an iterator of formatting items to an input type is good. And also to at some point make going through the resulting type the only way to create a DelayedFormat.

struct Formatter<I, T> {
    items_iter: I,
    value: PhantomData<T>,
}

So yes, one idea is binding the allowed input type into a value. But I think the other core aspect is really crisply separating the parsing and formatting stages at least in the internal API.

@djc
Copy link
Member Author

djc commented Jun 12, 2023

So I guess there is a third failure mode here which is that some values are outside the range of some (interpretation of) some format items -- in particular, RFC 3339 does not allow years outside the range 0000-9999.

@pitdicker
Copy link
Collaborator

pitdicker commented Jun 29, 2023

I have arrived at an API that seems to do everything I want.

Part 1: format string parsing

Add two new fallible methods to StrftimeItems: parse and parse_into_slice.

#[derive(Clone, Debug)]
pub struct StrftimeItems<'a> { /* ... */ }

impl<'a> StrftimeItems<'a> {
    /// Creates a new parsing iterator from the `strftime`-like format string.
    /// The iterator will return `Item::Error` when the format string is invalid.
    pub const fn new(s: &'a str) -> StrftimeItems<'a>;

    /// Creates a new parsing iterator from the `strftime`-like format string.
    #[cfg(feature = "unstable-locales")]
    pub const fn new_with_locale(s: &'a str, locale: Locale) -> StrftimeItems<'a>;

    /// Parse into a `Vec` with formatting items.
    /// Returns `Error` if the format string is invalid.
    pub fn parse(self) -> Result<Vec<Item<'a>>, ParseError>;

    /// Parse formatting items into an existing `[Item]` slice.
    /// Returns `Error` if the format string is invalid or if the slice can't fit all elements.
    pub fn parse_into_slice<'b>(self, buf: &'b mut [Item<'a>]) -> Result<&'b [Item<'a>], ParseError>;
}
impl<'a> Iterator for StrftimeItems<'a> {
    type Item = Item<'a>;
    fn next(&mut self) -> Option<Item<'a>>;
}

Part 2: bind to input type

Add a new FormattingSpec type that can bind an iterator yielding Items to an input type.

#[derive(Clone, Debug)]
pub struct FormattingSpec<I, T> {
    items: I,
    date_time_type: PhantomData<T>,
}

The methods to construct it live on the input types, because from there they can be called without type annotations.

Example for NaiveDate and DateTime (the generic timezone parameter is ignored, we supply a default Utc so it can be called without type annotations):

impl NaiveDate {
    /// Create a new `FormattingSpec` that can be use for multiple `NaiveDate`s.
    pub fn formatter<'a, I, J, B>(items: I) -> Result<FormattingSpec<J, Self>, ParseError>
    where
        I: IntoIterator<Item = B, IntoIter = J> + Clone,
        J: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>;
}
impl DateTime<Utc> {
    /// Create a new `FormattingSpec` that can be use for multiple `DateTime`s.
    pub fn formatter<'a, I, J, B>(items: I) -> Result<FormattingSpec<'a, J, Self>, ParseError>
    where
        I: IntoIterator<Item = B, IntoIter = J>,
        J: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>;
}

Part 3: new formatter

I took this opportunity to add a new Formatter<I, Off>, because DelayedFormat<I> has performance overhead we can't reduce.

I see no harm in making the methods public like they are on DelayedFormat.

#[derive(Debug)]
pub struct Formatter<I, Off> { /* ... */ }

impl<'a, I, B, Off> Formatter<I, Off>
where
    I: Iterator<Item = B> + Clone,
    B: Borrow<Item<'a>>,
    Off: Offset + fmt::Display,
{
    /// Makes a new `Formatter` value out of local date and time and UTC offset.
    pub fn new(
        date: Option<NaiveDate>,
        time: Option<NaiveTime>,
        offset: Option<Off>,
        items: I,
    ) -> Formatter<I, Off>;

    /// Makes a new `DelayedFormat` value out of local date and time, UTC offset and locale.
    #[cfg(feature = "unstable-locales")]
    pub fn new_with_locale(
        date: Option<NaiveDate>,
        time: Option<NaiveTime>,
        offset: Option<Off>,
        items: I,
        locale: Locale,
    ) -> Formatter<I, Off>;
}

impl<'a, I, B, Off> fmt::Display for Formatter<I, Off>
where
    I: Iterator<Item = B> + Clone,
    B: Borrow<Item<'a>>,
    Off: Offset + fmt::Display;

Part 4: new format methods

Add new format_with methods to the input types that only accept a FormattingSpec as input.

Example for DateTime (the most complex because of its generic Tz parameter):

impl<Tz: TimeZone> DateTime<Tz>
where
    Tz::Offset: fmt::Display,
{
    /// Format using a `FormattingSpec` created with `DateTime::formatter`.
    pub fn format_with<'a, I, B, Tz2>(
        &self,
        formatter: &FormattingSpec<I, DateTime<Tz2>>,
    ) -> Formatter<I, Tz::Offset>
    where
        I: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>,
        Tz2: TimeZone;

    /// Format using a `FormattingSpec` created with `DateTime::formatter` and a `locale`.
    #[cfg(feature = "unstable-locales")]
    pub fn format_locale_with<'a, I, B, Tz2>(
        &self,
        formatter: &FormattingSpec<'a, I, DateTime<Tz2>>,
        locale: Locale,
    ) -> Formatter<I, Tz::Offset>
    where
        I: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>,
        Tz2: TimeZone;
}

Part 5: convenience formatting method

Add new fallible try_format methods to the input types:

impl<Tz: TimeZone> DateTime<Tz>
where
    Tz::Offset: fmt::Display,
{
    pub fn try_format<'a>(&self, fmt: &'a str)
        -> Result<Formatter<StrftimeItems<'a>, Tz::Offset>, ParseError>
}

Owned vs borrowed arguments

The formatter methods accept any type that implements IntoIterator.
It is converted directly into an iterator, and after validation wrapped in a FormattingSpec.
It will create a new Formatter for every value that is to be formatted, and clone the iterator because it has to start from the beginning.

If you pass an owned Vec as input, FormattingSpec will contain an owned iterator that consumes the Vec.
Every Formatter created from it is also owned. But this means the Vec will be cloned, which is not great for performance.
It is of course still faster than reparsing the format string.

Examples

Example of using format_with:

fn test_format_with() -> Result<(), ParseError> {
    let fmt_str = "%a %Y-%m-%d";
    let dt1 = NaiveDate::from_ymd_opt(2023, 4, 18).unwrap();
    let dt2 = NaiveDate::from_ymd_opt(2023, 9, 2).unwrap();

    // Parses format string once, allocates
    let fmt_items = StrftimeItems::new(&fmt_str).parse()?;
    let formatter = NaiveDate::formatter(&fmt_items)?;
    assert_eq!(dt1.format_with(&formatter).to_string(), "Tue 2023-04-18");
    assert_eq!(dt2.format_with(&formatter).to_string(), "Sat 2023-09-02");

    // Reparses format string on creation and every use
    let fmt_items = StrftimeItems::new(&fmt_str);
    let formatter = NaiveDate::formatter(fmt_items)?;
    assert_eq!(dt1.format_with(&formatter).to_string(), "Tue 2023-04-18");
    assert_eq!(dt2.format_with(&formatter).to_string(), "Sat 2023-09-02");

    let mut buf = [
        Item::Error,
        Item::Error,
        Item::Error,
        Item::Error,
        Item::Error,
        Item::Error,
        Item::Error,
    ];
    // parses format string once, into existing slice
    let fmt_items = StrftimeItems::new(fmt_str).parse_into_slice(&mut buf)?;
    let formatter = NaiveDate::formatter(fmt_items)?;
    assert_eq!(dt1.format_with(&formatter).to_string(), "Tue 2023-04-18");
    assert_eq!(dt2.format_with(&formatter).to_string(), "Sat 2023-09-02");

    Ok(())
}

@djc What do you prefer? If this seems okay, shall I make one big PR or split it up into pieces (that may not make much sense by themselves)?

@pitdicker
Copy link
Collaborator

Some things I have looked at related to creating a new formatting API:

Are parsing errors in a format string and unavailable fields on the input type the only reasons for errors?

#1144 Dates with a year outside of 0..=9999 are difficult, as they are not supported by some specifications. In my opinion making sure a value is valid should be the concern of a higher-level API. Especially if you consider that both problemetic formatting items already have dual puposes (see #1144 for details).

Can we parse format strings at compile time?

→ This is possible but pretty hard with the current limitations on const methods.

We would need to add workarounds for the following limitations:

  • No iterators: manually iterate with a while-loop and some const_next method to fill a static array.
  • No mut references: the iterator workaround would have to take and return owned values.
  • Slices are difficult: (see question on Reddit), the more complex ones are maybe possible with very creative matches.
  • No clone: add a method like Item::shallow_clone().
  • No unicode methods from std: write an is_whitespace method that works on UTF-8 bytes.
  • No allocations.

Can we make StrftimeItems::new const and work without allocating?

→ Yes. #1152.

Can formatting be available without alloc?

→ Yes. #1126.

What is the reason formatting has about 40% overhead? (#94, #1155)

If DelayedFormat::new is given an offset, it first formats a timezone name into a String. In case of DateTime<FixedOffset> and DateTime<Local> this involves formatting the offset, as there is no name available. This is responsible for 20~25% of the ca. 40% overhead in the format_with_items benchmark.

The rest are various small inefficiencies that can be fixed without changing the API.

@djc
Copy link
Member Author

djc commented Jun 30, 2023

#1144 Dates with a year outside of 0..=9999 are difficult, as they are not supported by some specifications. In my opinion making sure a value is valid should be the concern of a higher-level API. Especially if you consider that both problemetic formatting items already have dual puposes (see #1144 for details).

I wonder if we should require a different Item (and % code) for non-0..=9999 years in the new API -- I feel like that could alleviate a bunch of these issues.

@pitdicker
Copy link
Collaborator

pitdicker commented Jul 3, 2023

Found one more interesting issue while reviewing #1058: not all locales have a 12-hour clock.

In that case the local format string for %r is "" instead of something like "%I:%M:%S %p". date formats this ambiguously, with a 12-hour clock without something like AM or PM.
I think it is best to fall back to formatting the time with the 24-hour format %X in such locales. That would make %r the way to format time as a localized 'informal' string.

The formatting specifiers %P and %p are more difficult, as AM and PM don't have an equivalent in many locales. We have nothing to fall back to. I think it is best to return an error in this case, instead of formatting an ambiguous time.


Even without deciding on these issues, it may be better to change the API a little so we can catch issues caused by a locale before formatting. I.e. supply a Locale when constructing a FormattingSpec, not when calling format_with:

impl DateTime<Utc> {
    /// Create a new `FormattingSpec` that can be used for multiple `DateTime`s.
    pub fn formatter<'a, I, J, B>(items: I) -> Result<FormattingSpec<'a, J, Self>, ParseError>
    where
        I: IntoIterator<Item = B, IntoIter = J>,
        J: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>;

    /// Create a new `FormattingSpec` that can be used for multiple `DateTime`s for a `locale`.
    #[cfg(feature = "unstable-locales")]
    pub fn formatter_with_locale<'a, I, J, B>(
        items: I,
        locale: Locale,
    ) -> Result<FormattingSpec<'a, J, Self>, ParseError>
    where
        I: IntoIterator<Item = B, IntoIter = J>,
        J: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>;
}
impl<Tz: TimeZone> DateTime<Tz>
where
    Tz::Offset: fmt::Display,
{
    /// Format using a `FormattingSpec` created with `DateTime::formatter`.
    pub fn format_with<'a, I, B, Tz2>(
        &self,
        formatter: &FormattingSpec<I, DateTime<Tz2>>,
    ) -> Formatter<I, Tz::Offset>
    where
        I: Iterator<Item = B> + Clone,
        B: Borrow<Item<'a>>,
        Tz2: TimeZone;
}

@kamadorueda
Copy link
Contributor

Can we parse format strings at compile time?

Yes, with a procedural macro

The procedural macro would take a string slice ("%Y-%m-%d"), and output source code for an array of Item (&'static [Year, Literal("-"), Month, Literal("-"), Day]).

With this approach, it would also be possible to support runtime formatting by making the format function have a signature of pub fn format<'a>(&self, fmt: impl Fmt) where Fmt can be built out of a &str (at runtime, fallible) or &[Item] (compile time, infallible). Or maybe just having two separate format functions, one const and fmt: &'static [Item], and one non-const and fmt: &'a str

@kamadorueda
Copy link
Contributor

I believe this RFC has the potential to make the formatting API much better.

After reading through the comments and a couple of issues and pull requests, it would be nice to step back for a second and look at the big picture. Once we agree on the problem and the design goals, finding an appropriate implementation should be easier, so I wrote an executive summary of the discussion and what I understood is the desired design. Please let me know if I missed something @pitdicker and @djc, happy to have your input

RFC: A new formatting API

Why?

Currently some methods for formatting timestamps panic or fail in undesirable ways (#956, #575, #419). While the current formatting API is inspired in the classic strftime/strptime found in languages like C and Python and works generally well, handling failures in an idiomatic way or preventing them altogether by using the type system would be preferred.

Failure modes

The existing formatting API has the following failure modes:

  1. Parsing invalid formatting directives
    (e.g.: In: %Y-%,-%d, %, is not supported)
  2. Some directives require a superset of what the timestamp provides
    (e.g. %Y for a NaiveTime).
  3. Some directives can only represent a subset of what the timestamp allows
    (e.g. RFC3339 year is 0..=9999, but NaiveDate's year is -262145..=262143)
  4. Some directives, depending on the locale, may not be supported
    (e.g. not all locales have a 12-hour clock, or AM/PM).

Prior work

Some solutions have been proposed so far, like returning a Result on formatting (#902) or silently skipping improper formatting directives (#614). None of them tries to split the problem into its fundamentals parts, which is what's proposed below

Summary of the discussion so far

Approximate design:

  1. Split the formatting process into stages
    (a high-level API may do all stages at once for convenience,
    while a low-level API could expose the building blocks ):

    1. Parsing of the formatting directive
      (e.g. Turning "%Y-%m" into an intermediate representation - what we call today an iterator/list of Item)
    2. Binding the parsed formatted directive to the timestamp types
      that could use the formatted directive
      (e.g. "%Y-%m", represented as [Year, Literal, Month] in the previous step, would be bound to NaiveDate, NaiveDateTime, DateTime, but not to NaiveTime).
    3. Parsing/formatting of the timestamp with such formatting directive
      (e.g. NaiveDate.format(..), NaiveDate.parse_from_str(..)).
      However, because of the previous step,
      formatting cannot fail, and parsing could only fail because of an invalid input string, so the failure modes are reduced to the minimum.

    Note: An alternative would be swapping the first two steps, so instead of binding a list of Item to a timestamp type, we could do NaiveTime.parse_fmt_directive("%Y-%m") -> Option<Vec<Item>>, which would fail at runtime given that a NaiveTime has no year. This could be done at compile time as well using a procedural macro (const fmt: &'static [Item] = format_naive_time!("%Y-%m"), failures are compiler errors, then NaiveDate.format(&fmt))

    Rationale:

    • This way, failures do not leak from one stage to the other.
    • Performant
      (reusing formatting directive across multiple format invocations reduces overhead).
    • Some failure modes can be detected at compile time
      which is a desirable property for correctness,
      while also allowing to do this at runtime if the programmer prefers so.
  2. Ease of migration: The new API should satisfy the use cases of the current API,
    so that we can start deprecating at some point.
    This implies maintaining #[no_std] support, and extensibility for the new localization features. This shouldn't be too hard once we agree on the approximate design

@pitdicker
Copy link
Collaborator

pitdicker commented Jul 11, 2023

@kamadorueda You write clearer then I do, good summary.

3. Some directives can only represent a subset of what the timestamp allows
(e.g. RFC3339 year is 0..=9999, but NaiveDate's year is -262145..=262143)

To add to that: Currently the only Item that has this problem is Rfc2822.
The Rfc3339 item is defined to also be an ISO 8602-format, and currently switches to that when dates are outside this range (adding a sign to the year).
The Year item currently follows the same format, for better or for worse, and adds a sign when outside this range.

In #1144 I argued that we can make the items work with any year. I.e. make Rfc2822 more permissive, and let the valid range be the concern of a higher-level API. Honestly I would be surprised if anyone uses the Rfc2822 item, it is basically an implementation detail of thew DateTime::to_rfc2822 method. It is deeply buried in the documentation and not available with the strftime syntax.

(e.g. "%Y-%m", represented as [Year, Literal, Month] in the previous step, would be bound to NaiveDate, NaiveDateTime, DateTime, but not to NaiveTime).

I don't see how we can bind to multiple types in the type system. I currently just bind to the one that is to be formatted. (Only the TimeZone type of DateTime does not have to match.)

This implies maintaining #[no_std] support, and extensibility for the new localization features. This shouldn't be too hard once we agree on the approximate design

#[no_std] support would be a new feature. The currently API reads like it should support no_std, but made a couple of small choices that require an allocation. #1163 gets rid of those, but requires a new type instead of DelayedFormat.

Note: An alternative would be swapping the first two steps, so instead of binding a list of Item to a timestamp type, we could do NaiveTime.parse_fmt_directive("%Y-%m") -> Option<Vec<Item>>, which would fail at runtime given that a NaiveTime has no year.

Just yesterday I was trying to write such a method 😆. I am just hitting one problem: With my current implementation, if you bind an owned value to an input type (FormattingSpec<Vec<_>, NaiveTime>), it will also make the final formatter have an owned value. This means cloning the Vec on every use.

I'll open a PR in a couple of days so you and @djc have something concrete to shoot at.

@djc
Copy link
Member Author

djc commented Jul 24, 2023

@kamadorueda great summary, I think you covered all the bases. Your proposal to have the types bind the format sequence makes sense to me but is also slightly less expressive, I think? For example, ideally the outcome of parsing %Y-%m could handle any of NaiveDateTime, DateTime<Utc> and NaiveDate, whereas (if I understand your proposal correctly) binding to NaiveDate eagerly would limit the value to accepting that single type.

(This might not be a big downside, but it is a trade-off to be aware of.)

@djc
Copy link
Member Author

djc commented Jul 24, 2023

I don't see how we can bind to multiple types in the type system. I currently just bind to the one that is to be formatted. (Only the TimeZone type of DateTime does not have to match.)

With an intermediate trait... Something like ProvidesDate or maybe even a Provides<Year>, Provides<Month>, Provides<Day> etc if we wanted to be very precise?

@pitdicker
Copy link
Collaborator

The solution I have gone with in #1196 was to add From impls: caf2b55, as that adds the smallest API surface. But I am not married to the details 😄.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants