Time zone variant calculator: does it let us fully handle zoned datetime formatting? #5466

Open
sffc opened this issue Aug 29, 2024 · 32 comments
Labels
C-time-zone Component: Time Zones needs-approval One or more stakeholders need to approve proposal

Comments

@sffc
Member

sffc commented Aug 29, 2024

At its core, ICU4X time zones have 4 fields, which fully determine the strings to be selected for formatting:

pub struct CustomTimeZone {
    pub gmt_offset: Option<GmtOffset>,
    pub time_zone_id: Option<TimeZoneBcp47Id>,
    pub metazone_id: Option<MetazoneId>,
    pub zone_variant: Option<ZoneVariant>,
}

Let's say someone gives us an IXDTF string like: 2024-08-29T11:53:18-0700[America/Los_Angeles]

From this string, we can already populate two fields:

  • GMT Offset: -7 hours
  • Time Zone ID: "America/Los_Angeles" == "uslax"

We have MetazoneCalculator, which takes the time portion of the string and lets us calculate the metazone field:

  • Metazone ID: "America_Pacific" == "ampa"

However, how do we calculate the ZoneVariant field?

I learned today that tzif files, at least version 2 and 3 files, contain a footer that looks like this:

$ tail -n1 /usr/share/zoneinfo/America/Los_Angeles 
PST8PDT,M3.2.0,M11.1.0

The "8" in that footer means that this time zone has a standard offset of 8 hours behind UTC. (note that the offset is negated from what we normally see)

Does this mean that we could build a table with standard offsets and use that table to generate zone variants? For example, we could create a data file with the following data, which can all be generated from the TZDB:

Time Zone ID | Standard Offset
--- | ---
America/Los_Angeles | -8
America/Chicago | -6
Asia/Kabul | +4:30
Asia/Manila | +8
... | ...

Then, when reading the IXDTF string, we use the following algorithm to select the zone variant:

  1. Look up the Standard Offset from the IXDTF string's Time Zone Identifier.
  2. If the Standard Offset matches the IXDTF string's Offset: set zone_variant to Standard.
  3. Else, if the Standard Offset is 1 less than the IXDTF string's Offset: set zone_variant to Daylight.
  4. Else, leave the zone_variant undefined.
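For concreteness, here is a minimal Rust sketch of the four steps above. It is only an illustration of the proposal as stated: the `standard_offsets` table and the `zone_variant` function are hypothetical names, the table holds just the example rows listed earlier, and offsets are stored as minutes east of UTC rather than in the bitpacked form mentioned below.

```rust
use std::collections::HashMap;

/// The two variants discussed in this issue (a local stand-in, not the ICU4X type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneVariant {
    Standard,
    Daylight,
}

/// Hypothetical table: IANA time zone ID -> standard offset in minutes east of UTC.
/// Only the example rows from the table above are included.
pub fn standard_offsets() -> HashMap<&'static str, i32> {
    HashMap::from([
        ("America/Los_Angeles", -8 * 60),
        ("America/Chicago", -6 * 60),
        ("Asia/Kabul", 4 * 60 + 30),
        ("Asia/Manila", 8 * 60),
    ])
}

/// Steps 1-4 of the proposed algorithm.
/// `offset_minutes` is the offset parsed from the IXDTF string.
pub fn zone_variant(
    table: &HashMap<&'static str, i32>,
    zone_id: &str,
    offset_minutes: i32,
) -> Option<ZoneVariant> {
    // Step 1: look up the standard offset; unknown zones stay undefined.
    let standard = *table.get(zone_id)?;
    if offset_minutes == standard {
        // Step 2: the offsets match.
        Some(ZoneVariant::Standard)
    } else if offset_minutes == standard + 60 {
        // Step 3: the string's offset is one hour ahead of standard.
        Some(ZoneVariant::Daylight)
    } else {
        // Step 4: anything else is left undefined.
        None
    }
}
```

With this sketch, `zone_variant(&standard_offsets(), "America/Los_Angeles", -7 * 60)` yields `Some(ZoneVariant::Daylight)` for the IXDTF string above. The hard-coded `+ 60` in step 3 is exactly the assumption being asked about.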

Mechanically, we can generate this table by using a combination of our own tzif crate, which contains a struct ZoneVariantInfo with this information pre-parsed, and a tzif source, which could potentially be jiff_tzdb.

Note: the Time Zone ID would probably be stored in BCP-47 and Standard Offset would be bitpacked to an i8. It's possible we could stuff this data into one of our existing data structs to be more efficient.

Note: I assume that this mapping of time zone IDs to standard offsets is fairly stable over time, such that we do not need to worry about shipping updates at a cadence different than normal CLDR data updates.

Please help me understand: is the proposed algorithm correct and robust, or is it flawed in some edge cases?

@nekevss @leftmostcat @nordzilla @yumaoka @justingrant

@sffc sffc added needs-approval One or more stakeholders need to approve proposal C-time-zone Component: Time Zones labels Aug 29, 2024
@srl295
Member

srl295 commented Aug 29, 2024

Else, if the Standard Offset is 1 less than the IXDTF string's Offset: set zone_variant to Daylight.

Instead of '1 less' couldn't you query the tz data to look for a transition from that data and use it? In other words, couldn't your table have both a standard offset and a daylight offset?

Time Zone ID | Standard Offset | Daylight Offset
--- | --- | ---
America/Los_Angeles | 8 | 7

Actually, querying the offset table for that exact time 2024-08-29T11:53:18 for America/Los_Angeles should result in an offset of 0700 from GMT.

@sffc
Member Author

sffc commented Aug 29, 2024

My goal is, assuming that an IXDTF string is correct (has the correct offset for the given date, time, and time zone), format that data without relying directly on the TZDB at runtime.

I can store both the standard offset and daylight offset for each time zone. I guess my questions then would be:

  1. Does each IANA zone have a stable mapping of what offset is "standard" and which offset is "daylight"?
  2. Is the daylight offset ever not 1 hour more than the standard offset?

@srl295
Member

srl295 commented Aug 29, 2024

@sffc

  1. yes. In tzdb it's the SAVE column
  2. in modern zones I'm not sure, but it's not a good reason to hard code it.

@sffc
Member Author

sffc commented Aug 29, 2024

Actually I guess the counter example is when a city switches from one metazone to another metazone, not just changing its transition dates, such as what happened last year in Chihuahua, Mexico, which switched from Mountain Time to Central Time

https://www.timeanddate.com/time/zone/mexico/chihuahua

So maybe this mapping needs to be from metazones, not time zones, to what their standard and daylight offsets are?

@srl295
Member

srl295 commented Aug 29, 2024

Actually I guess the counter example is when a city switches from one metazone to another metazone, not just changing its transition dates, such as what happened last year in Chihuahua, Mexico, which switched from Mountain Time to Central Time

https://www.timeanddate.com/time/zone/mexico/chihuahua

So maybe this mapping needs to be from metazones, not time zones, to what their standard and daylight offsets are?

A metazone's offsets are valid for that zone only for a certain time period, so the Mexico_Pacific and America_Central offsets will be different.

https://github.com/eggert/tz/blob/main/northamerica#L2731-L2732

			<timezone type="America/Chihuahua">
				<usesMetazone to="1998-04-05 09:00" mzone="America_Central"/>
				<usesMetazone to="2022-10-30 08:00" from="1998-04-05 09:00" mzone="Mexico_Pacific"/>
				<usesMetazone from="2022-10-30 08:00" mzone="America_Central"/>
			</timezone>
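As a rough illustration of how these periods resolve, here is a small Rust sketch that picks the metazone in effect at a given time. The `MetazonePeriod` struct and `metazone_at` function are hypothetical. The sketch assumes the `from`/`to` timestamps are UTC, that each period is half-open (`from` inclusive, `to` exclusive), and it relies on the fixed-width "YYYY-MM-DD HH:MM" format so that plain string comparison orders timestamps chronologically.

```rust
/// One `usesMetazone` period. `None` means the period is open-ended on that side.
/// Hypothetical struct, not an ICU4X or CLDR type.
pub struct MetazonePeriod {
    pub from: Option<&'static str>,
    pub to: Option<&'static str>,
    pub mzone: &'static str,
}

/// The three America/Chihuahua periods from the XML above.
pub const CHIHUAHUA: [MetazonePeriod; 3] = [
    MetazonePeriod { from: None, to: Some("1998-04-05 09:00"), mzone: "America_Central" },
    MetazonePeriod {
        from: Some("1998-04-05 09:00"),
        to: Some("2022-10-30 08:00"),
        mzone: "Mexico_Pacific",
    },
    MetazonePeriod { from: Some("2022-10-30 08:00"), to: None, mzone: "America_Central" },
];

/// Returns the metazone whose period contains `utc` ("YYYY-MM-DD HH:MM").
/// Fixed-width timestamps compare correctly as plain strings.
pub fn metazone_at(periods: &[MetazonePeriod], utc: &str) -> Option<&'static str> {
    periods
        .iter()
        .find(|p| {
            // Assumed half-open interval: from <= utc < to.
            p.from.map_or(true, |from| from <= utc) && p.to.map_or(true, |to| utc < to)
        })
        .map(|p| p.mzone)
}
```

For example, `metazone_at(&CHIHUAHUA, "2010-06-01 12:00")` gives `Some("Mexico_Pacific")`, while any time from "2022-10-30 08:00" onward gives `Some("America_Central")`.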

@sffc
Member Author

sffc commented Aug 29, 2024

Does a particular metazone always have the same offsets corresponding to its standard and daylight variants?

@sffc
Member Author

sffc commented Aug 29, 2024

It seems that ICU4C determines the zone variant by reading "is the current datetime DST or not" from the TZDB.

That bit appears fetchable from tzif, and it is in the tzif crate:

https://unicode-org.github.io/icu4x/rustdoc/tzif/data/tzif/struct.LocalTimeTypeRecord.html

I think my previous question though is still a valid question to ask. Does a particular metazone always have the same offsets corresponding to its standard and daylight variants? That could perhaps be data that could be added to CLDR.

Also, regarding whether the DST shift should be fixed at 1 hour: it seems that the ICU4C code currently assumes this in multiple places, such as https://github.com/unicode-org/icu/blob/eda184e6af63d6eee1b3a59c61d1695eef44fcb4/icu4c/source/i18n/timezone.cpp#L1241

@BurntSushi

BurntSushi commented Aug 30, 2024

Also, regarding whether the DST shift should be fixed at 1 hour: it seems that the ICU4C code currently assumes this in multiple places

My favorite counter-example to this is Antarctica/Troll, which uses a DST shift of 2 hours:

$ tail -n1 /usr/share/zoneinfo/Antarctica/Troll
<+00>0<+02>-2,M3.5.0/1,M10.5.0/3

And then there is also the case of Ireland, whose DST shift is inverted from what's typical:

$ tail -n1 /usr/share/zoneinfo/Europe/Dublin
IST-1GMT0,M10.5.0,M3.5.0/1

As you noted, TZ strings invert the sign. So Europe/Dublin uses +0100 for standard time and +0000 for DST.
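To make the sign inversion concrete, here is a minimal Rust sketch that extracts the standard and daylight offsets from a footer like the ones above. It is not the `tzif` crate's parser: `PosixOffsets`, `skip_name`, `parse_offset`, and `parse_posix_tz` are hypothetical names, the transition rules after the first comma are ignored, and a missing DST offset falls back to the POSIX default of one hour ahead of standard time.

```rust
/// Offsets in seconds east of UTC (the conventional sign, not the POSIX one).
#[derive(Debug, PartialEq, Eq)]
pub struct PosixOffsets {
    pub standard: i32,
    pub daylight: Option<i32>,
}

/// Skips a zone abbreviation: either `<...>` quoted or a run of ASCII letters.
fn skip_name(s: &str) -> Option<&str> {
    if let Some(rest) = s.strip_prefix('<') {
        let end = rest.find('>')?;
        Some(&rest[end + 1..])
    } else {
        let end = s.find(|c: char| !c.is_ascii_alphabetic()).unwrap_or(s.len());
        if end == 0 { None } else { Some(&s[end..]) }
    }
}

/// Parses `[+-]hh[:mm[:ss]]`; returns (seconds in POSIX sign, remaining input).
fn parse_offset(s: &str) -> Option<(i32, &str)> {
    let (sign, body) = match s.chars().next()? {
        '-' => (-1, &s[1..]),
        '+' => (1, &s[1..]),
        _ => (1, s),
    };
    let end = body
        .find(|c: char| !(c.is_ascii_digit() || c == ':'))
        .unwrap_or(body.len());
    if end == 0 {
        return None;
    }
    let mut total = 0;
    for (part, unit) in body[..end].split(':').zip([3600, 60, 1]) {
        total += part.parse::<i32>().ok()? * unit;
    }
    Some((sign * total, &body[end..]))
}

/// Extracts the offsets from a POSIX TZ string, ignoring the transition rules.
pub fn parse_posix_tz(tz: &str) -> Option<PosixOffsets> {
    let rest = skip_name(tz)?;
    let (std_posix, rest) = parse_offset(rest)?;
    // POSIX offsets are positive west of Greenwich, so negate them.
    let standard = -std_posix;
    if rest.is_empty() || rest.starts_with(',') {
        return Some(PosixOffsets { standard, daylight: None });
    }
    let rest = skip_name(rest)?;
    let daylight = match parse_offset(rest) {
        Some((dst_posix, _)) => -dst_posix,
        // No explicit DST offset: POSIX defaults to one hour ahead of standard.
        None => standard + 3600,
    };
    Some(PosixOffsets { standard, daylight: Some(daylight) })
}
```

Fed the three footers quoted in this thread, the sketch gives -8h/-7h for America/Los_Angeles, 0h/+2h for Antarctica/Troll, and +1h/0h for Europe/Dublin, which shows both the 2-hour shift and the inverted "negative DST" case.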

@nekevss
Contributor

nekevss commented Aug 30, 2024

FWIW, here's a markdown table of the output of find -L /usr/share/zoneinfo/ -maxdepth 3 -type f,l | xargs tail -n1. Although, I think it does pull in some noise from /usr/share/zoneinfo/right/.

@nekevss
Contributor

nekevss commented Aug 30, 2024

The sign in the POSIX TZ string has already been noted, but I just found the quote below in the TZ Variable section of the GNU C Library manual.

This is positive if the local time zone is west of the Prime Meridian and negative if it is east. The hour must be between 0 and 24, and the minute and seconds between 0 and 59.

@robertbastian
Member

I think the question "how to set the ZoneVariant" is an XY problem. For formatting, we need a way to look up a time zone name given an offset (this is the only use case for ZoneVariant). The straightforward solution to this would be to instead of

"ampa": {
   "dt": "Pacific Daylight Time",
   "st": "Pacific Standard Time"
}

store

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
}

This doesn't require any additional lookup at runtime, as we already have the offset, and naturally handles any kind of DST (even multiple).
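A minimal Rust sketch of what that lookup could look like, assuming offsets are keyed as minutes east of UTC. The `SpecificNames` alias and both functions are hypothetical and only mirror the JSON shape above; they are not existing ICU4X data structs.

```rust
use std::collections::BTreeMap;

/// Hypothetical offset-keyed table: metazone ID -> (offset in minutes -> display name).
pub type SpecificNames = BTreeMap<&'static str, BTreeMap<i32, &'static str>>;

/// Builds the "ampa" example from the JSON above.
pub fn sample_names() -> SpecificNames {
    let mut names = SpecificNames::new();
    names.insert(
        "ampa",
        BTreeMap::from([
            (-7 * 60, "Pacific Daylight Time"),
            (-8 * 60, "Pacific Standard Time"),
        ]),
    );
    names
}

/// Looks up the specific name directly from the offset, with no ZoneVariant step.
pub fn specific_name(
    names: &SpecificNames,
    metazone: &str,
    offset_minutes: i32,
) -> Option<&'static str> {
    names.get(metazone)?.get(&offset_minutes).copied()
}
```

An offset with no entry (say -7:30 for "ampa") returns `None`, at which point the formatter would presumably fall back to another format such as localized GMT.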

@nordzilla
Member

nordzilla commented Sep 6, 2024

From @robertbastian

I think the question "how to set the ZoneVariant" is an XY problem. For formatting, we need a way to look up a time zone name given an offset (this is the only use case for ZoneVariant). The straightforward solution to this would be to instead of

"ampa": {
   "dt": "Pacific Daylight Time",
   "st": "Pacific Standard Time"
}

store

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
}

This doesn't require any additional lookup at runtime, as we already have the offset, and naturally handles any kind of DST (even multiple).


I agree; I think data in this format would be ideal.

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
}

This data could be added to supplemental/metaZones.xml in CLDR.

However, there are a few things to consider:


1) Has a metazone ever changed its associated time variants?

If not, the data is straightforward, exactly as shown above.

If so, this data could still reasonably be captured and added to the file.

Consider a hypothetical situation where America_Central (amce) decided to move its standard-time offset for all of its associated time zones by half an hour for one year, and then changed it back to the way it was before:

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-8:00": "Pacific Standard Time"
},
"amce": {
  "usesTimeVariants": {
    "-5:00": "Central Daylight Time",
    "-6:00": "Central Standard Time",
    "_to": "2024-09-06 00:00"
  },
  "usesTimeVariants": {
    "-5:00": "Central Daylight Time",
    "-5:30": "Central Standard Time",
    "_from": "2024-09-06 00:00",
    "_to": "2025-09-06 00:00"
  },
  "usesTimeVariants": {
    "-5:00": "Central Daylight Time",
    "-6:00": "Central Standard Time",
    "_from": "2025-09-06 00:00"
  },
},

This format seems reasonable and has the same structure as the mapping from Time Zone IDs to metazones in the same file.


2) What would happen if a time zone within an associated metazone observes the same time-variant offsets, but transitions among them at different datetimes than other zones within that metazone?

One relevant example of this is the recent proposal for some of the West Coast states to observe permanent Daylight Savings Time:

https://www.opb.org/article/2024/02/20/oregon-bill-to-end-daylight-saving-time-fails-legislature/

If this were the case, then the offset would remain UTC-7 year round, and those time zones, e.g. America/Los_Angeles would just format to Pacific Daylight Time year round.

This all seems okay to me.


3) What would happen if an individual time zone wants to use different offsets than the current time-variant offsets established by the metazone?

I am not aware of any such case like this that exists, but I think there are two reasonable solutions:

A) That time zone could switch to a new metazone (either new or preexisting) that matches its desired offsets. This happens all the time.

B) We could add that offset data to CLDR.

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-7:30": "Pacific Cool New Time",
   "-8:00": "Pacific Standard Time"
},

The time zones that use the prior offsets would go on as usual, and the time zone with the new offset would have its new localized name.

I recall a conversation with @sffc years ago that perhaps daylight_time and standard_time are not great identifiers within the icu4x code base, because sometimes it's formatted as "Summer Time" for example, and in the future it may be possible that there are more than 2 variants.

A format such as this would allow us to be agnostic of naming conventions, instead tying the internationalized name of the variant to an offset.

However, there are a few more considerations to take into account in this case:

3.1) What if a time zone wants to add a new offset, but have the same localized name as another offset?

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-7:30": "Pacific Standard Time",
   "-8:00": "Pacific Standard Time"
},

This probably wouldn't cause a data ambiguity issue, but I think it would be incredibly confusing, as "Pacific Standard Time" would now be semantically ambiguous.

This should not be allowed.

3.2) What if a metazone wants to add a new localized name for an offset that is already present?

"ampa": {
   "-7:00": "Pacific Daylight Time",
   "-7:00": "Pacific Cool New Time",
   "-8:00": "Pacific Standard Time"
},

This would cause a data issue and should not be allowed.


Conclusion

I don't feel that I have the cycles to take on this work myself right now, but I would support collaborating on making this data available (if people agree it is sound).

Here is an example of when the short metazone identifiers were added to that same CLDR file: https://unicode-org.atlassian.net/browse/CLDR-14607

Filing an issue on Jira would be a good next step if we reach a consensus here.

@robertbastian
Member

All questions of the form "what if a timezone wants to do something different than the rest of the metazone" should be answered by creating a new metazone. My expectation is that all zones in a metazone fully agree on offsets today and in the future, but maybe that's not guaranteed.

@nordzilla
Member

All questions of the form "what if a timezone wants to do something different than the rest of the metazone" should be answered by creating a new metazone. My expectation is that all zones in a metazone fully agree on offsets today and in the future, but maybe that's not guaranteed.

That would be much simpler and more stringent. I would agree with imposing these restrictions. I was just trying to think of all the cases.

@sffc
Member Author

sffc commented Sep 7, 2024

My favorite counter-example to this is Antarctica/Troll, which uses a DST shift of 2 hours:

Another counter-example to the 60-minute transition: https://www.atlasobscura.com/places/lord-howe-islands-time

@sffc
Member Author

sffc commented Sep 7, 2024

I agree with the workaround of creating a new metazone if the offset invariants ever break down. Metazones are purely a CLDR/ICU construction, not TZDB, so we have a lot of latitude for how we handle them.

For example, if all US West Coast states decided to abolish daylight savings time and that Pacific Time should be GMT-7 instead of GMT-8 (a proposal I don't support but which is good for illustrative purposes), then we would need to create a new metazone such as amp2 meaning "version 2 of ampa".

It is highly likely that such changes already occurred in the last 50 years, and we should probably look for them in datagen.

@sffc
Member Author

sffc commented Sep 7, 2024

As far as data sources are concerned, it seems perfectly fine to me for this data to be derived from TZDB. Currently ICU4C uses TZDB to determine which zone variant to use when formatting, so if ICU4X used TZDB during datagen, then we should be able to guarantee consistency with ICU4C. ICU4X could manually spawn new "private use" metazones as needed.

@sffc
Member Author

sffc commented Sep 7, 2024

OK, one other issue I realized. There are numerous countries that use their own country name as the metazone. The first one I pulled is "kyrg", Kyrgyzstan:

https://en.wikipedia.org/wiki/Kyrgyzstan_Time

Kyrgyzstan has switched between UTC+5 and UTC+6 multiple times, but presumably the metazone has not changed.

@justingrant

https://en.wikipedia.org/wiki/Kyrgyzstan_Time

Kyrgyzstan has switched between UTC+5 and UTC+6 multiple times, but presumably the metazone has not changed.

Yeah, this was gonna be my concern: cases where oddball metazones are tidally locked to a country. I assume this fact means that the "use the offset only" idea won't work?

@sffc
Member Author

sffc commented Sep 7, 2024

Yeah, this was gonna be my concern: cases where oddball metazones are tidally locked to a country. I assume this fact means that the "use the offset only" idea won't work?

I think it can still "work"; it's just something we need to factor in. A few ways of resolving this:

  1. Should Kyrgyzstan even have a specific (offset-based) time zone name, since it doesn't have a useful meaning? It is a generic (location-based) time zone name, not a specific time zone name. We could just remove it and fall back to the generic time zone name.
  2. If we need to have a specific time zone name, we could just add both UTC+5 and UTC+6 as offsets with the same name.
  3. Or, we could split it into two metazones.

@sffc
Member Author

sffc commented Sep 7, 2024

One other note: I very frequently encounter people using "PST" to mean Pacific Time, not specifically Pacific Standard Time, and similarly with EST and CST and others. For example, it is very common to see people say "let's meet in San Francisco on September 7 at 10am PST", and if you show up at that time according to the TZDB/CLDR definition, unless it is a time zone nerds meetup, you will be an hour late.

What this means: this is all so imprecise anyway, so let's just land something reasonable and otherwise encourage people to use city-based time zone names. Maybe CLDR can focus on adding a short location format, such as "LA Time" or "NYC Time" to use instead of the ambiguous things it currently uses.

@justingrant

Maybe CLDR can focus on adding a short location format, such as "LA Time" or "NYC Time" to use instead of the ambiguous things it currently uses.

Normal people (other than those who are super-familiar with how IANA timezones work, which is a very small Venn diagram overlap with "normal people") don't use "LA Time" or "NYC time". So I'm not sure it'd make sense to add that to CLDR. I understand the desire for consistency, but this seems to be a case where there's no evading the inconsistency of human language use.

@sffc
Member Author

sffc commented Sep 8, 2024

My hypothesis is that "normal people" would understand what you meant by "LA Time", even if they haven't often seen it before, and it is also the most unambiguous definition for an i18n library to produce.

@yumaoka
Member

yumaoka commented Sep 9, 2024

Random comment for earlier replies.

  • IANA TZ Database files have DST flags, but the information is lost in standard zone data binaries. If you just look at the content of a zone data binary file, you cannot tell whether a given time is in DST or not. Of course, you can guess by looking at the offsets around that time. For example, the UTC offset of America/Los_Angeles on 2024-09-01T00:00:00Z is UTC-07:00, but the zone data binary has no info about whether that is DST or not. ICU wants to keep the info to support the old TimeZone API, so the ICU zone compiler was modified to store the flag along with the zone offset transition data.

  • IANA TZ Database contains DST offsets that are not exactly 1 hour. For example, Australia/Lord_Howe advances 30 minutes in DST. Historically, many other zones have used non-1-hour DST changes.

  • A metazone is not associated with specific UTC offsets; it is associated with a set of names. Because North America and Europe assign names associated with standard offsets, you might think standard offsets and metazones are related. Someone commented that metazones with multiple historic standard offsets are oddballs, but I would say North America and Europe are actually the exceptions.

I think the concept of ZoneVariant in the struct is problematic.

@robertbastian
Member

Random observation:

Same time | Formatted with generic TZ
--- | ---
2024-07-01T12:00:00-06:00[America/Denver] | 12:00 Mountain Time
2024-07-01T11:00:00-07:00[America/Phoenix] | 11:00 Mountain Time
2024-07-01T18:00:00Z | 18:00 UTC

@nordzilla
Member

nordzilla commented Sep 9, 2024

From @robertbastian:

Random observation:

Same time | Formatted with generic TZ
--- | ---
2024-07-01T12:00:00-06:00[America/Denver] | 12:00 Mountain Time
2024-07-01T11:00:00-07:00[America/Phoenix] | 11:00 Mountain Time
2024-07-01T18:00:00Z | 18:00 UTC

These are all technically correct, though confusing. They're both Mountain Time. It's just that Denver is in Mountain Daylight Time and Phoenix is in Mountain Standard Time because Arizona does not observe DST.

I would argue that this is a reason why populating the ZoneVariant struct whenever possible is worthwhile.


EDIT:

Though, to clarify, the above "Mountain Time" formats are "Generic non-location format".

The UTS-35 spec defines several formats with fallbacking:

Generic non-location format

Examples: "Pacific Time" (long), "PT" (short)

Generic partial location format

Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)

Generic location format

Examples: "France Time", "Italy Time"

Specific non-location format

Examples: "Pacific Standard Time" (long), "PST" (short), "Pacific Daylight Time" (long), "PDT" (short)

Localized GMT format

Examples: "GMT+03:30" (long), "GMT+3:30" (short), "UTC-03.00" (long), "UTC" (for zero offset)

ISO 8601 time zone formats

Examples: "-0800" (basic), "-08:00" (extended), "Z" (for UTC)

It was years ago, so I'm not sure if the current implementations within ICU4X are exactly the same, but I tried to implement the fallbacking rules according to the spec.

The above strings have enough information available to utilize either Generic location format e.g. Phoenix Time, or Generic partial location format e.g. Mountain Time (Denver).

@justingrant

Generic partial location format

Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)

FWIW, I think this is a nice solution to this problem described above, where if there's a colloquial name for a time zone like "Pacific Time", it's still used but with a disambiguator for less common cases like Arizona.

@sffc
Member Author

sffc commented Sep 10, 2024

The observation about generic non-location being ambiguous is well known and largely working as intended. It should only be used if the location of the event is known from context. Here is the language I wrote for how to select your time zone style in semantic skeleta:

  • Specific: A time zone that unambiguously maps the time of day to an instant, which can be understood independently of the location or time of year. This field could resolve to specific non-location (pattern symbols "z", "zzzz") or offset (pattern symbols "O", "OOOO"), depending on the locale, length, and time zone identity.
  • Generic: A time zone based on the location of an event. This field could resolve to generic non-location (pattern symbols "v", "vvvv"), generic partial-location, or location (pattern symbol "VVVV"), depending on the locale, length, and time zone identity. Do not use this field if the location of the event is unknown from context, because doing so could lead to ambiguity.
  • Location: A time zone based on the identity of the IANA time zone. This field always resolves to the location format (pattern symbol "VVVV").
  • Offset: A time zone based on the time offset from UTC.

@sffc
Member Author

sffc commented Sep 10, 2024

Example use cases where generic time zone style is acceptable:

  • Meet me at the Google San Francisco office at 11:00am Pacific Time.
  • All year round, the bells at St. John's Cathedral strike at 12:00pm Mountain Time.
  • It's best to hike the Grand Canyon before 4:00pm Mountain Time.
  • Your flight departs St. Louis Lambert airport at 6:25pm Central Time.

Note: In most or all of these cases, it would be acceptable to say "local time" or simply drop the qualifier.

Example where generic time is not acceptable and a different style should be used, unless the location is otherwise known from context:

  • The TV show starts at 6:00pm Mountain Time.
  • The teleconference starts at 8:00am Eastern Time.

My point is that there are enough legitimate use cases for generic non-location format, but since it could introduce ambiguity, it should only be used if the developer opts in.

@robertbastian
Member

Generic partial location format

Examples: "Pacific Time (Canada)" (long), "PT (Whitehorse)" (short)

This seems to be the non-ambiguous version of the generic non-location format. We don't seem to support this in ICU4X, however?


What we need for full correctness is a ZoneVariantCalculator that maps (TimeZoneBcp47Id, DateTime<Iso>) -> (UtcOffset, Option<UtcOffset>). It would do this by storing a sequence of ISO minutes with associated offsets for each zone, similar to MetaZonePeriodsV1.

If there is sufficient overlap between the offset list and the metazone list for each location, they could be combined, as the bulk of these structures will be the keys.
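A rough Rust sketch of the shape this could take, under several assumptions: `UtcOffset`, `OffsetPeriod`, `ZoneVariantCalculator`, and `classify` here are hypothetical stand-ins defined in the block (not the ICU4X types), zone IDs are plain strings, each zone's periods are sorted by start time, and times are plain minute counts from some fixed epoch.

```rust
/// Offset in seconds east of UTC (stand-in for a real offset type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UtcOffset(pub i32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneVariant {
    Standard,
    Daylight,
}

/// Offsets in effect from `since_minutes` until the next period starts.
pub struct OffsetPeriod {
    pub since_minutes: i64,
    pub standard: UtcOffset,
    pub daylight: Option<UtcOffset>,
}

pub struct ZoneVariantCalculator {
    /// (zone ID, periods sorted ascending by `since_minutes`).
    zones: Vec<(&'static str, Vec<OffsetPeriod>)>,
}

impl ZoneVariantCalculator {
    pub fn new(zones: Vec<(&'static str, Vec<OffsetPeriod>)>) -> Self {
        Self { zones }
    }

    /// Maps (zone, time) to (standard offset, optional daylight offset).
    pub fn offsets_at(
        &self,
        zone_id: &str,
        minutes: i64,
    ) -> Option<(UtcOffset, Option<UtcOffset>)> {
        let (_, periods) = self.zones.iter().find(|(id, _)| *id == zone_id)?;
        // Count the periods that start at or before `minutes`; the last one applies.
        let idx = periods.partition_point(|p| p.since_minutes <= minutes);
        let period = periods.get(idx.checked_sub(1)?)?;
        Some((period.standard, period.daylight))
    }
}

/// Picks the variant by comparing the actual offset (e.g. from the IXDTF string)
/// against the pair returned by `offsets_at`.
pub fn classify(
    offsets: (UtcOffset, Option<UtcOffset>),
    actual: UtcOffset,
) -> Option<ZoneVariant> {
    let (standard, daylight) = offsets;
    if actual == standard {
        Some(ZoneVariant::Standard)
    } else if daylight == Some(actual) {
        Some(ZoneVariant::Daylight)
    } else {
        None
    }
}
```

Because the daylight offset is stored explicitly rather than derived as "standard plus one hour", this shape would cover cases like Antarctica/Troll, Australia/Lord_Howe, and Europe/Dublin, and a zone that changes its standard offset simply gets a new period.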

@robertbastian
Member

robertbastian commented Sep 12, 2024

Re generic partial location format, it sounds like we're meant to detect when a metazone is ambiguous, and add the location to it. We can do that; I've found a lot of ambiguous metazones in #5515. We can extend the return value of MetazoneCalculator with an is_ambiguous flag, in which case the formatter would add the location (or the offset if locations aren't available).
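A small Rust sketch of how a formatter might consume such a flag. Everything here is hypothetical: `MetazoneLookup` is a stand-in for an extended MetazoneCalculator return value, and the "{name} ({qualifier})" pattern is hard-coded only to match the English "Pacific Time (Canada)" example, whereas a real implementation would take the pattern from locale data.

```rust
/// Hypothetical result of a metazone lookup, extended with the `is_ambiguous`
/// flag suggested above. Not an existing ICU4X type.
pub struct MetazoneLookup {
    pub generic_name: &'static str,
    pub is_ambiguous: bool,
}

/// Formats the generic name, appending a disambiguator only when needed.
/// `location` is the location name if available; `offset` is the fallback.
pub fn format_generic(lookup: &MetazoneLookup, location: Option<&str>, offset: &str) -> String {
    if !lookup.is_ambiguous {
        return lookup.generic_name.to_string();
    }
    // Prefer the location; fall back to the offset if no location is available.
    let qualifier = location.unwrap_or(offset);
    // Assumed English-style pattern; real code would use the locale's pattern.
    format!("{} ({})", lookup.generic_name, qualifier)
}
```

With an ambiguous "Mountain Time" lookup, this produces "Mountain Time (Denver)" when the location is known and "Mountain Time (GMT-7)" when only the offset string is available.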

@sffc
Member Author

sffc commented Sep 12, 2024

What we need for full correctness is a ZoneVariantCalculator that maps (TimeZoneBcp47Id, DateTime<Iso>) -> (UtcOffset, Option<UtcOffset>). It would do this by storing a sequence of ISO minutes with associated offsets for each zone, similar to MetaZonePeriodsV1.

If there is sufficient overlap between the offset list and the metazone list for each location, they could be combined, as the bulk of these structures will be the keys.

LGTM

robertbastian added a commit that referenced this issue Sep 16, 2024
#5466

Supersedes #5515

---------

Co-authored-by: Shane F. Carr <[email protected]>
8 participants