Very large memory size for DateTime carrying Tz from chrono-tz #27

Open
breezewish opened this issue Jan 7, 2019 · 3 comments · May be fixed by #165

Comments

@breezewish

use chrono::prelude::*;
use chrono_tz::US::Pacific;
use std::mem::size_of_val;

fn main() {
    // DateTime<Tz>, with the time zone type from chrono-tz
    let pacific_time = Pacific.ymd(1990, 5, 6).and_hms(12, 30, 45);
    println!("size_of date time with tz = {}", size_of_val(&pacific_time));

    // DateTime<Utc>
    let dt = Utc.ymd(2014, 7, 8).and_hms(9, 10, 11);
    println!("size_of date time = {}", size_of_val(&dt));

    // DateTime<FixedOffset>
    let fixed_dt = FixedOffset::east(9 * 3600).ymd(2014, 7, 8).and_hms_milli(18, 10, 11, 12);
    println!("size_of date time with fixed offset = {}", size_of_val(&fixed_dt));
}

gives output:

size_of date time with tz = 48
size_of date time = 12
size_of date time with fixed offset = 16

As you can see, DateTime carrying chrono-tz's TimeZone is a very expensive structure, occupying 48 bytes. This is bad for cache utilization and makes the structure expensive to copy around.

@neoeinstein

I have an idea for fixing this up and potentially improving the speed of the timezone calculations. I'll see if I can put together a PR of some sort this week.

@pitdicker
Contributor

pitdicker commented Apr 6, 2024

Last year I did some calculations on the minimum size needed for a TzOffset.

It currently has the definition:

pub struct TzOffset {
    tz: Tz, // ~600 variants so at least 16-bit
    offset: FixedTimespan,
}
pub struct FixedTimespan {
    pub utc_offset: i32,
    pub dst_offset: i32,
    pub name: &'static str, // pointer + usize
}

Combined, this type needs 26 bytes with 8-byte alignment on 64-bit platforms. When added to the 12-byte DateTime it becomes 12 + 4 (padding) + 26 + 6 (padding) = 48 bytes.
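
To make the arithmetic above easy to check, here is a minimal sketch that mirrors the layout with stand-in types. `Tz(u16)` and the three-field `NaiveDateTime` are assumptions for illustration only, not the real chrono or chrono-tz definitions; they are chosen to have the same sizes as described above (2 bytes and 12 bytes).

```rust
use std::mem::{align_of, size_of};

/// Stand-in for chrono-tz's `Tz` enum (assumption): ~600 variants need 2 bytes.
#[allow(dead_code)]
struct Tz(u16);

#[allow(dead_code)]
struct FixedTimespan {
    utc_offset: i32,
    dst_offset: i32,
    name: &'static str, // fat pointer: address + length
}

#[allow(dead_code)]
struct TzOffset {
    tz: Tz,
    offset: FixedTimespan,
}

/// Stand-in for the 12-byte date-time part (assumption: three 4-byte fields).
#[allow(dead_code)]
struct NaiveDateTime {
    date: i32,
    secs: u32,
    frac: u32,
}

/// Stand-in for `DateTime<Tz>`: the date-time plus the time zone's offset type.
#[allow(dead_code)]
struct DateTime {
    datetime: NaiveDateTime,
    offset: TzOffset,
}

fn main() {
    println!(
        "FixedTimespan: {} bytes, align {}",
        size_of::<FixedTimespan>(),
        align_of::<FixedTimespan>()
    );
    println!("TzOffset:      {} bytes", size_of::<TzOffset>());
    println!("NaiveDateTime: {} bytes", size_of::<NaiveDateTime>());
    println!("DateTime:      {} bytes", size_of::<DateTime>());
}
```

On a 64-bit target the 26 bytes of `TzOffset` data round up to 32 because of the 8-byte alignment, and 12 + 32 rounds up to 48, which matches the total above.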

How many bits do we actually need?

  • Offset from UTC.
    The range for offsets in the time zone database is -14:00:00 to +12:00:00, or in seconds -50400 to 43200.
    With 17 bits we can encode an offset of -65536 to +65535 seconds, or over 18 hours in either direction.
  • Offset from standard time in that time zone.
    This information can be used to indicate DST but is not used much otherwise. Also, governments choose a sensible value for the difference between regular time and DST, of which there are only a handful. Over the last century only 7 were in use worldwide, so it fits in a 3-bit enum.
    • 0:00 (no DST active)
    • 1:00 (most common DST value)
    • -1:00 (reverse DST: Ireland, Namibia, Morocco)
    • 0:20 (Asia/Kuching, Ghana)
    • 0:30
    • 1:30
    • 2:00
    • if in the future more than 1 other value is added we could correlate with the value of the time zone.
  • Abbreviation.
    There are not that many abbreviations. Concatenated they could fit in a ~500 char string. Half that if we don't count numeric offsets as abbreviations. They are always between 3 and 5 characters long. In theory we could encode an index into that string in 9 or 10 bits, and the length in 2 bits.
  • Time zone enum variant.
    ~600 variants so 10 bits.

Combined (17 + 3 + 10 + 2 + 10), that would put the minimum size for TzOffset at 42 bits.
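
As a rough sketch of what such a 42-bit encoding could look like, the following packs the five fields into the low bits of a `u64`. The `Fields` struct, the bit order, and the numbering of the DST codes are all assumptions made for illustration; nothing here is taken from chrono-tz.

```rust
/// Hypothetical unpacked view of the fields discussed above.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fields {
    utc_offset: i32, // seconds, must fit in 17 bits (signed)
    dst: u8,         // 3-bit code, see `dst_seconds`
    abbr_index: u16, // 10 bits: index into the concatenated abbreviations
    abbr_len: u8,    // 3..=5, stored as `len - 3` in 2 bits
    tz: u16,         // 10 bits: time zone enum variant
}

/// Maps a 3-bit DST code to seconds (the code numbering is an assumption).
fn dst_seconds(code: u8) -> Option<i32> {
    match code {
        0 => Some(0),     // 0:00, no DST active
        1 => Some(3600),  // 1:00
        2 => Some(-3600), // -1:00
        3 => Some(1200),  // 0:20
        4 => Some(1800),  // 0:30
        5 => Some(5400),  // 1:30
        6 => Some(7200),  // 2:00
        _ => None,        // one spare code
    }
}

/// Packs the fields into the low 42 bits; returns `None` if a field is out of range.
fn pack(f: Fields) -> Option<u64> {
    let ok = (-65_536..=65_535).contains(&f.utc_offset)
        && f.dst < 8
        && f.abbr_index < 1024
        && (3..=5).contains(&f.abbr_len)
        && f.tz < 1024;
    if !ok {
        return None;
    }
    // Keep the low 17 bits of the two's complement representation.
    let utc = ((f.utc_offset as u32) & 0x1_FFFF) as u64;
    Some(
        utc | ((f.dst as u64) << 17)
            | ((f.abbr_index as u64) << 20)
            | (((f.abbr_len - 3) as u64) << 30)
            | ((f.tz as u64) << 32),
    )
}

fn unpack(p: u64) -> Fields {
    let raw = (p & 0x1_FFFF) as i32;
    Fields {
        // Shift up and back down to sign-extend the 17-bit value.
        utc_offset: (raw << 15) >> 15,
        dst: ((p >> 17) & 0x7) as u8,
        abbr_index: ((p >> 20) & 0x3FF) as u16,
        abbr_len: ((p >> 30) & 0x3) as u8 + 3,
        tz: ((p >> 32) & 0x3FF) as u16,
    }
}

fn main() {
    let f = Fields { utc_offset: -28_800, dst: 1, abbr_index: 123, abbr_len: 3, tz: 599 };
    let packed = pack(f).expect("all fields in range");
    assert!(packed < (1u64 << 42));
    assert_eq!(unpack(packed), f);
    println!("packed = {:#x}, dst = {:?} s", packed, dst_seconds(f.dst));
}
```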

Optimization with a medium-sized table

Currently we store the abbreviations as a large number of tiny slices, which adds quite some overhead to the binary. Concatenating them in a ~500 char string as I proposed above is one option.
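
A minimal sketch of the concatenated-string option, assuming a 10-bit start index and a 2-bit length as estimated earlier. The `ABBRS` constant is a tiny made-up sample, and `encode`/`decode` are hypothetical helper names.

```rust
// Hypothetical sample; the real string would hold every abbreviation in the database.
const ABBRS: &str = "PSTPDTPWTPPTCETCESTGMT";

/// Encodes (start, len) as 10 bits of index plus 2 bits of `len - 3`.
fn encode(start: u16, len: u8) -> Option<u16> {
    let end = start as usize + len as usize;
    if start >= 1024 || !(3..=5).contains(&len) || end > ABBRS.len() {
        return None;
    }
    Some(start | (((len - 3) as u16) << 10))
}

/// Recovers the abbreviation slice from a 12-bit code.
fn decode(code: u16) -> Option<&'static str> {
    let start = (code & 0x3FF) as usize;
    let len = (((code >> 10) & 0x3) + 3) as usize;
    ABBRS.get(start..start + len)
}

fn main() {
    let cest = encode(15, 4).expect("in range");
    println!("code {:#05x} -> {:?}", cest, decode(cest));
}
```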

Alternatively we could make a table of all TZ enum variants and abbreviation combinations. Each abbreviation would be stored in a [u8; 6] with the first byte being the length. I estimate the table to have ~1500 entries and be ca. 13 kB. Compared to how we currently store the data that might not even be an increase in binary size.
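
A sketch of what one table entry might look like. The `Entry` struct, the use of a plain `u16` for the time zone variant, and the three sample rows are assumptions for illustration.

```rust
use std::mem::size_of;

/// Hypothetical table entry: a time zone variant plus one of its abbreviations.
#[derive(Clone, Copy)]
struct Entry {
    tz: u16,       // stand-in for the Tz enum variant
    abbr: [u8; 6], // abbr[0] = length, abbr[1..] = ASCII bytes, zero padded
}

// Made-up sample rows; the real table would have ~1500 of them.
static TABLE: [Entry; 3] = [
    Entry { tz: 0, abbr: *b"\x03PST\0\0" },
    Entry { tz: 0, abbr: *b"\x03PDT\0\0" },
    Entry { tz: 1, abbr: *b"\x04CEST\0" },
];

/// Resolves a table index to (time zone variant, abbreviation).
fn lookup(index: u16) -> Option<(u16, &'static str)> {
    let e = TABLE.get(index as usize)?;
    let len = e.abbr[0] as usize;
    let bytes = e.abbr.get(1..1 + len)?;
    let s = std::str::from_utf8(bytes).ok()?;
    Some((e.tz, s))
}

fn main() {
    println!("entry size = {} bytes", size_of::<Entry>());
    println!("{:?}", lookup(2));
}
```

With this particular layout an entry takes 8 bytes, so ~1500 entries would come to roughly 12 kB, in the same ballpark as the estimate above.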

Just 11 bits is enough to index into the table and get the TZ enum variant and an abbreviation.
That would bring the bits needed in TzOffset down to 17 (offset from UTC) + 3 (dst) + 11 (table index) = 31 bits, i.e. 4 bytes.

DateTime<Tz> could then be 16 bytes, just like DateTime<FixedOffset>.
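
To illustrate the 16-byte claim, here is a sketch of a 4-byte offset type holding the 17 + 3 + 11 bits, next to a stand-in 12-byte date-time. `PackedTzOffset`, its bit order, and the `NaiveDateTime` and `DateTime` stand-ins are hypothetical and are not the types from chrono, chrono-tz, or the linked pull request.

```rust
use std::mem::size_of;

/// Hypothetical 32-bit offset: 17 bits UTC offset, 3 bits DST code, 11 bits table index.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PackedTzOffset(u32);

impl PackedTzOffset {
    fn new(utc_offset: i32, dst: u8, table_index: u16) -> Option<Self> {
        if !(-65_536..=65_535).contains(&utc_offset) || dst >= 8 || table_index >= 2048 {
            return None;
        }
        let utc = (utc_offset as u32) & 0x1_FFFF;
        Some(Self(utc | ((dst as u32) << 17) | ((table_index as u32) << 20)))
    }

    /// UTC offset in seconds, sign-extended from 17 bits.
    fn utc_offset(self) -> i32 {
        (((self.0 & 0x1_FFFF) as i32) << 15) >> 15
    }

    fn dst(self) -> u8 {
        ((self.0 >> 17) & 0x7) as u8
    }

    fn table_index(self) -> u16 {
        ((self.0 >> 20) & 0x7FF) as u16
    }
}

/// Stand-in for the 12-byte date-time part (assumption: three 4-byte fields).
#[allow(dead_code)]
struct NaiveDateTime {
    date: i32,
    secs: u32,
    frac: u32,
}

/// Stand-in for a `DateTime` carrying the packed offset.
#[allow(dead_code)]
struct DateTime {
    datetime: NaiveDateTime,
    offset: PackedTzOffset,
}

fn main() {
    let off = PackedTzOffset::new(-28_800, 1, 1499).expect("in range");
    println!(
        "utc = {}, dst code = {}, index = {}",
        off.utc_offset(),
        off.dst(),
        off.table_index()
    );
    println!("DateTime size = {} bytes", size_of::<DateTime>());
}
```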

@djc
Member

djc commented Apr 6, 2024

I just want to warn that I don't think there's overwhelming evidence that the size of these types is causing problems for lots of people, so optimizations here should be carefully balanced against the amount of complexity they introduce.

@pitdicker pitdicker linked a pull request Apr 12, 2024 that will close this issue