parse-zoneinfo: replace rule parser with simple state machine #172

Open: wants to merge 9 commits into base: main

djc
Member

@djc djc commented Apr 15, 2024

The raw diffstat of +4957/-371 doesn't look so attractive, but it includes new tests (and accompanying data) that account for about 4400 of the added lines, so all in all this doesn't add much more code than it deletes. The benchmark example suggests the new parser is about 10x faster, and it drops a pretty big dependency.

@djc djc requested a review from pitdicker April 15, 2024 21:28
Contributor

@pitdicker pitdicker left a comment

Impressive you could write this in little time!

I wonder if it would take less code to initialize Rule with default values and update its fields, instead of moving all fields along to the next variant in RuleState.
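A minimal sketch of that alternative, for illustration only. The `Rule` struct below is a simplified, hypothetical stand-in (string fields instead of the crate's parsed types), and `parse_rule` is an assumed helper name, not the PR's code. It starts from `Rule::default()` and fills one field per column, so no state enum has to carry partially parsed data:

```rust
/// Simplified, hypothetical stand-in for the crate's `Rule`;
/// every field is kept as a raw string slice to keep the sketch short.
#[derive(Debug, Default, PartialEq)]
struct Rule<'a> {
    name: &'a str,
    from_year: &'a str,
    to_year: &'a str,
    month: &'a str,
    day: &'a str,
    time: &'a str,
    save: &'a str,
    letters: Option<&'a str>,
}

/// Parses `Rule NAME FROM TO - IN ON AT SAVE LETTER/S` by updating a
/// default-initialized `Rule` column by column.
fn parse_rule(input: &str) -> Result<Rule<'_>, String> {
    let mut parts = input.split_ascii_whitespace();
    if parts.next() != Some("Rule") {
        return Err("line does not start with `Rule`".to_string());
    }

    let mut rule = Rule::default();
    let mut seen = 0;
    for (index, part) in parts.enumerate() {
        match index {
            0 => rule.name = part,
            1 => rule.from_year = part,
            2 => rule.to_year = part,
            3 => {} // the unused TYPE column, normally "-"
            4 => rule.month = part,
            5 => rule.day = part,
            6 => rule.time = part,
            7 => rule.save = part,
            8 => rule.letters = (part != "-").then_some(part),
            _ => return Err(format!("unexpected trailing field `{part}`")),
        }
        seen = index + 1;
    }

    if seen < 9 {
        return Err(format!("expected 9 fields after `Rule`, found {seen}"));
    }
    Ok(rule)
}
```

The trade-off is that a half-filled `Rule` is representable while parsing, which the state-machine variants rule out by construction; the final field count check is what guards against that here.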

Do you want to convert the zone, continuation and link line parsers in the same PR?

use parse_zoneinfo::line::{Line, LineParser};
use parse_zoneinfo::FILES;

#[ignore]
Member Author

I added #[ignore] here because this test will fail every time we update the tz data. Not sure how big of a pain in the ass that will be to update? Using the cargo-insta tooling it is pretty easy so we might decide to just include this.

parse-zoneinfo/src/line.rs (resolved thread)
@djc
Member Author

djc commented May 6, 2024

So the package test fails because I've made chrono-tz-build depend on the FILES list newly duplicated in parse-zoneinfo (to help with the snapshot test), but this doesn't work for packaging (which tests against the published version). I guess we can keep the FILES list duplicated in the repo for now and drop it once we release a new version of parse-zoneinfo?

@pitdicker
Contributor

I'll have a look tomorrow (also on the other PR).

Contributor

@pitdicker pitdicker left a comment

I have not reached the end yet 😄.

It seems at some point (2017c?) zic became case-insensitive for things like Rule, Zone, Link, weekdays, month names, last. Something we should eventually support?
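As a rough illustration of what case-insensitive keyword matching could look like, here is a sketch using the standard library's `str::eq_ignore_ascii_case`. `LineKind` and `line_kind` are hypothetical names, not part of parse-zoneinfo, and the sketch does not attempt the keyword abbreviations that zic also accepts:

```rust
/// Hypothetical classification of a line by its leading keyword.
#[derive(Debug, PartialEq)]
enum LineKind {
    Rule,
    Zone,
    Link,
}

/// Matches the leading keyword without regard to ASCII case.
fn line_kind(keyword: &str) -> Option<LineKind> {
    if keyword.eq_ignore_ascii_case("Rule") {
        Some(LineKind::Rule)
    } else if keyword.eq_ignore_ascii_case("Zone") {
        Some(LineKind::Zone)
    } else if keyword.eq_ignore_ascii_case("Link") {
        Some(LineKind::Link)
    } else {
        None
    }
}
```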

@@ -17,7 +17,7 @@ case-insensitive = ["uncased", "phf/uncased"]
regex = ["dep:regex"]

[dependencies]
-parse-zoneinfo = { version = "0.3" }
+parse-zoneinfo = { version = "0.3", path = "../parse-zoneinfo" }
Contributor

I was planning to make these changes in my next PR 👍.

@@ -38,3 +38,15 @@ pub mod line;
pub mod structure;
pub mod table;
pub mod transitions;

pub const FILES: &[&str] = &[
Contributor

I am not sure we want to hardcode this list in parse-zoneinfo.

For my personal experiments the past year I removed backward, included backzone, occasionally included factory, and filtered parts of etcetera.

Maybe move the change out of this PR so we can discuss it separately?

if input.chars().all(|c| c.is_ascii_digit()) {
    return Ok(DaySpec::Ordinal(input.parse().unwrap()));
}
// Check if it stars with ‘last’, and trim off the first four bytes if
Contributor

Suggested change
-// Check if it stars with ‘last’, and trim off the first four bytes if
+// Check if it starts with ‘last’, and trim off the first four bytes if

return Ok(DaySpec::Ordinal(input.parse().unwrap()));
}
// Check if it stars with ‘last’, and trim off the first four bytes if
// it does. (Luckily, the file is ASCII, so ‘last’ is four bytes)
Contributor

We don't care about ASCII with strip_prefix, right? This seems to be an old comment.

return Ok(DaySpec::Last(weekday));
}

let weekday = match input.get(..3) {
Contributor

Cool, didn't know this method!

zic.c has the following comment for parsing a day column:

	/*
	** Day work.
	** Accept things such as:
	**	1
	**	lastSunday
	**	last-Sunday (undocumented; warn about this)
	**	Sun<=20
	**	Sun>=7
	*/

I think we should support parsing full weekday names as zic does, like we did with the regex, but maybe skip the last-{weekday} case.

Can you add a test for DaySpec::from_str?
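One possible shape for such a parser and its test cases, as a hedged sketch. `Weekday`, `DaySpec`, `parse_weekday` and `parse_day_spec` are simplified stand-ins written for this example, not the crate's actual definitions. The sketch assumes a weekday is any prefix of the full English name that is at least three characters long, matches case-sensitively, and deliberately rejects the undocumented `last-Sunday` form:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Weekday {
    Sunday,
    Monday,
    Tuesday,
    Wednesday,
    Thursday,
    Friday,
    Saturday,
}

/// Simplified stand-in for the day column of a rule line.
#[derive(Debug, PartialEq)]
enum DaySpec {
    Ordinal(i8),
    Last(Weekday),
    LastOnOrBefore(Weekday, i8),
    FirstOnOrAfter(Weekday, i8),
}

/// Accepts `Sun`, `Sund`, ..., `Sunday`: any prefix of at least three
/// characters, which is always unambiguous for English weekday names.
fn parse_weekday(input: &str) -> Option<Weekday> {
    const NAMES: [(&str, Weekday); 7] = [
        ("Sunday", Weekday::Sunday),
        ("Monday", Weekday::Monday),
        ("Tuesday", Weekday::Tuesday),
        ("Wednesday", Weekday::Wednesday),
        ("Thursday", Weekday::Thursday),
        ("Friday", Weekday::Friday),
        ("Saturday", Weekday::Saturday),
    ];
    if input.len() < 3 {
        return None;
    }
    NAMES
        .iter()
        .find(|(name, _)| name.starts_with(input))
        .map(|&(_, day)| day)
}

fn parse_day_spec(input: &str) -> Option<DaySpec> {
    // A bare number such as `1`; the emptiness check avoids parsing "".
    if !input.is_empty() && input.bytes().all(|b| b.is_ascii_digit()) {
        return input.parse::<i8>().ok().map(DaySpec::Ordinal);
    }
    // `lastSun` or `lastSunday`; `last-Sunday` fails in `parse_weekday`.
    if let Some(rest) = input.strip_prefix("last") {
        return parse_weekday(rest).map(DaySpec::Last);
    }
    if let Some((day, num)) = input.split_once(">=") {
        return Some(DaySpec::FirstOnOrAfter(
            parse_weekday(day)?,
            num.parse::<i8>().ok()?,
        ));
    }
    if let Some((day, num)) = input.split_once("<=") {
        return Some(DaySpec::LastOnOrBefore(
            parse_weekday(day)?,
            num.parse::<i8>().ok()?,
        ));
    }
    None
}
```

A test could then walk the inputs from the zic.c comment: `1`, `lastSunday`, `Sun<=20` and `Sun>=7` should parse, and `last-Sunday` should be rejected.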

impl FromStr for TimeSpecAndType {
    type Err = Error;

    fn from_str(input: &str) -> Result<Self, Error> {
Contributor

Can you please split this method over the TimeSpec and TimeSpecAndType types? I am not sure yet whether anything but wall times is allowed in zone lines, or whether the existing code took a shortcut there that we want to fix.

from_year,
to_year,
},
"-" | "\u{2010}",
Contributor

Can you please add back the comment?

impl<'a> Rule<'a> {
    fn from_str(input: &'a str) -> Result<Self, Error> {
        let mut state = RuleState::Start;
        for part in input.split_ascii_whitespace() {
Contributor

@pitdicker pitdicker May 9, 2024

This no longer parses a rule with a comment?

zic.c has a getfields method (line 3722) that returns when it encounters a comment sign #.
It also supports quotation marks " surrounding each field, within which whitespace and # are allowed. Maybe we should make an iterator that works similarly instead of using split_ascii_whitespace?
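A rough sketch of such a field splitter, loosely modelled on the behaviour described above rather than ported from zic.c. `split_fields` is a hypothetical name, and returning owned `String`s (instead of borrowed slices from a lazy iterator) is an assumption made so that quotation marks can be dropped from the output. Fields end at whitespace or at an unquoted `#`, even when the `#` follows a field with no space in between:

```rust
/// Splits a line into fields. An unquoted `#` starts a comment that runs to
/// the end of the line; text inside double quotes may contain whitespace
/// and `#`. The quotation marks themselves are not part of the field.
fn split_fields(line: &str) -> Result<Vec<String>, String> {
    let mut fields = Vec::new();
    let mut chars = line.chars().peekable();

    loop {
        // Skip the whitespace separating fields.
        while chars.next_if(|c| c.is_ascii_whitespace()).is_some() {}

        // Stop at the end of the line or at the start of a comment.
        match chars.peek().copied() {
            None | Some('#') => break,
            Some(_) => {}
        }

        let mut field = String::new();
        // Consume characters until whitespace or an unquoted `#`.
        while let Some(c) = chars.next_if(|&c| c != '#' && !c.is_ascii_whitespace()) {
            if c != '"' {
                field.push(c);
                continue;
            }
            // Inside quotes: copy everything up to the closing quote.
            loop {
                match chars.next() {
                    Some('"') => break,
                    Some(inner) => field.push(inner),
                    None => return Err("odd number of quotation marks".to_string()),
                }
            }
        }
        fields.push(field);
    }

    Ok(fields)
}
```

With this shape, `Rule US 1967 1973 - Apr lastSun 2:00 1:00 D# comment` yields ten fields ending in `D`, and an unterminated quote is reported as an error instead of being silently accepted.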

let mut state = ZoneInfoState::Start;
for part in iter {
    state = match (state, part) {
        (st, _) if part.starts_with('#') => {
Contributor

In theory a comment is allowed to come straight after a field, without whitespace in between.

@@ -13,3 +13,6 @@ keywords = ["date", "time", "timezone", "zone", "calendar"]
version = "1.3.1"
default-features = false
features = ["std", "unicode-perl"]

[dev-dependencies]
insta = "1.38"
Contributor

I understand why you added this test. Not sure about it though.

Would it be better to add this test as a separate crate in the workspace?
