Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Holistic zerovec abstraction design #5523

Open
9 of 16 tasks
Manishearth opened this issue Sep 9, 2024 · 7 comments
Open
9 of 16 tasks

Holistic zerovec abstraction design #5523

Manishearth opened this issue Sep 9, 2024 · 7 comments
Assignees
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake
Milestone

Comments

@Manishearth
Copy link
Member

Manishearth commented Sep 9, 2024

See also: #5561 for a holistic design for the core vector types

Jumping off of #5520 (comment)

I think we should try to make our zerovec abstractions a bit more holistically designed. I have these goals in mind:

  • The unsafe code in the crate should be well organized and minimal.
  • The abstractions in the crate should be at the right level of genericity. Unsafe abstractions that are super generic tend to be harder to review and safely maintain, but if done right they can cover a lot more use cases without complicating things too much.
  • Users of the crate should almost always be able to avoid custom impls of ULE/VarULE/etc unless they're trying to do really clever packing1. This means that these abstractions should provide acceptably optimal data size. It is okay if there are theoretically better ways of doing things as long as we stay reasonably close to the theoretical maximum.
  • Users of the crate should usually have a way to avoid using custom derives, though we should still maintain useable and versatile custom derives. What ICU4X chooses to do here ought to be a separate discussion.

For the purposes of this discussion, ICU4X is considered a typical large-scale user, as it is the primary user.


The design I've had in mind is the following. It roughly matches @sffc's design in #5520 (comment) but goes in more depth on how the unsafe code should be structured.

  • Step 1: Improve structure of unsafe code
    • Refactor VarZeroVec indexing, construction code to be reusable with out-of-band length Better abstractions for splitting lengths out from VarZeroVec #5378 (planned for 2.0)
      • A strong goal here is that I want the code dealing with the VZV buffer to basically be in one place. We've iterated a bunch of times on how we'd like that buffer to eventually work, and this code is messy and previously was hard to review, so consolidating it is quite important to me. I even find the gnarly mutation code to be something that's worth avoiding if possible since it's another place where tricky VZV buffer invariants live, but perhaps that's too big an ask and the mutation is useful. ICU4X doesn't use it, we always use the wholesale conversion code that constructs from a slice/vec.
    • Update MultiFieldsULE to use this, and supply length from const param MultiFieldsULE should enforce the number of items #5240 (planned for 2.0)
    • Potentially revisit the need to maintain VZV mutation code (unplanned for 2.0. worth discussing quickly). Less mutable, cleaner design for ZeroVec/VarZeroVec #5583
  • Step 2: Perform optimizations enabled by the codesharing
  • Step 3: Actual holistic abstractions. We should end up with:
    • ZeroVec<T>, VarZeroVec<T, Format>, plus slice types (exists)
    • (public but largely "internal") MultiFieldsULE<N, Format> type. Would enjoy suggestions on the name2 since it's more than just fields. Maybe UntypedMultiVarULE. (planned for 2.0)
    • Macros for simple wrappers for both ULE and VarULE. Add simpler macro versions of ULE/VarULE macros to reduce syn deps #5127 (unplanned for 2.0, but is a purely internal implementation change that can happen whenever we want)
    • VarTupleULE that represents (ULE, VarULE). (already exists, would like name suggestions).
    • TupleNULE that represents (ULE, ULE, ...). We've done a good job minimizing unsafe here. (already exists, would like name suggestions)
    • VarTupleNULE that represents (VarULE, VarULE, ...). Implemented using MultiFieldsULE, basically a typed version of the same. Should not have too much unsafe, ideally using macros or something to do the N-tuples. I'd like to have some of this on track for 2.0, I'm okay with not getting all of it but I want to try.
    • OptionULE, OptionVarULE that represent Option<T>. We already have these, but I'd like to potentially switch these over to using VarTupleULE in their internal implementation. (No change planned for 2.0, nice-to-have)
  • Step 4 (future, not yet designed): Better bitpacking abstractions. In the long run, I'd like us to do a similar exercise for generalizing our bitpacking work. I haven't put much design thought into this so I'm not going to actually write a full design, nor do I wish to milestone this at the moment, but it's worth giving a rough sketch:
    • We should have more generic types that work like NichedOption, but aren't just Option. Ideally they let us do more complex enums.
    • It should be possible to do more bitpacky things with the VarZeroVec array type.
    • It should be possible to do more bitpacky things with bitpacky TupleULE types and the custom derive.

I am not fully happy with the names of these generic types, but we can bikeshed and we can punt naming past 2.0. I would like to see if we can come up with something better than TupleN, VarTuple, and VarTupleN (or VarVar). I don't really find VarVar to be evocative of the right thing, it sounds nested rather than just being "two Vars". TupleN and VarTupleN sound about right to me, but we then need a name for the (ULE, VarULE) type. Maybe MixedTuple?

Overall I think this is work I can get done in time for 2.0, at least sufficiently so that we're not blocked on stable data formats in the future, even if the zerovec APIs may change a bit.

cc @sffc @robertbastian

Footnotes

  1. In the long run it would be nice to support that as well, for now I am comfortable not trying to cover that too much in the current holistic design.

  2. I don't consider the bikesheds here to be 2.0 blockers, but I do wish to solicit name suggestions.

@Manishearth Manishearth added the C-zerovec Component: Yoke, ZeroVec, DataBake label Sep 9, 2024
@Manishearth Manishearth added this to the Utilities 1.0 milestone Sep 9, 2024
@sffc
Copy link
Member

sffc commented Sep 9, 2024

Suggestion: something I needed in #5510 is a dataful enum ULE. Maybe OptionULE could be generalized.

@Manishearth
Copy link
Member Author

Yeah. I've stuck that in step 4 but we should start thinking about designs now

@sffc
Copy link
Member

sffc commented Sep 10, 2024

So the most basic dataful enum representation is to spend a byte on the discriminant and then have a single VarULE value (we should consider supporting regular ULEs, too, but I'll focus on VarULE). But, as I did in #5510, you don't normally need the whole discriminant byte, so I took half of it and used it as its own standalone piece of metadata.

A key question to answer would be, can the inner VarULEs depend on the extra data in the discriminant byte? Doing so might mean that we need a new trait such as VarULEWithInfo that takes an extra info variable in validate_byte_slice and friends. I think this is a bit of a scope creep. An easy solution to punt this back to userland would be to say, if you want to make the VarULE depend on the discriminant byte extra data, just add more discriminants (enum variants).

Okay, with that in mind, a design could be

pub enum VarEnum2<A, B> {
    A(A),
    B(B),
}

pub struct VarEnum2WithMetadata<A, B> {
    metadata: u8,
    value: VarEnum2<A, B>
}

impl<A, B> VarEnum2WithMetadata<A, B> {
    pub fn try_new(metadata: u8, value: VarEnum2<A, B>) -> Result<Self> {
        if metadata & 0x80 != 0 {
           Err(...)
        } else {
            Ok(Self { metadata, value })
        }
    }
}

pub struct VarEnum2ULE<A, B> {
    _a: PhantomData,
    _b: PhantomData,
    /// Representation:
    /// - First byte: discriminant and metadata
    /// - Remainder: either A or B
    bytes: [u8]
}

impl<A, B> VarEnum2ULE<A, B> {
    pub fn get(&self) -> VarEnum2WithMetadata<&A, &B> {
        // ...
    }
}

And then we could add additional numbers of variants (VarEnum3, ...) as needed.

Having this at hand, in addition to the proposals in the OP, should allow me to remove unsafe code from PluralElementsPackedULE.

@sffc
Copy link
Member

sffc commented Sep 10, 2024

Another thought: we could build a single type such as VarEnum8ULE with more than enough variants, and then VarEnum2ULE, OptionULE, ... just become typedefs or thin wrappers where the unused variants are the empty type.

@Manishearth
Copy link
Member Author

Yeah, this is roughly what I'm thinking for enums, but there may be reasons to support some kind of packing.

@Manishearth Manishearth self-assigned this Sep 13, 2024
@sffc
Copy link
Member

sffc commented Sep 17, 2024

PXL_20240917_004449866 MP

@Manishearth
Copy link
Member Author

Discussion between @Manishearth and @sffc:

Design principle: Every data model the macro supports should also be supported via publicly exported ZeroVec types. This may not lead to 100% the same experience1, but it should be possible.

Footnotes

  1. So, you may need to nest tuples to make something work

Manishearth added a commit that referenced this issue Sep 25, 2024
It is impossible for an IndexN array to need more than a length integer
of size N, anyway, the max index is always `>=` the length.


Part of #5523

Builds on #5593


We could in theory just have a `VZVFormatCombo<Index, Len>` type that
allows free selection, however I'm trying to keep this minimal. Overall
the main use case for that is picking things like "a small array of
;argely-sized elements" and we could just expose Index16Len8 for that. I
can see that being useful for things like
#5580, though it also feels
like a data microoptimization.


The "total" lines in fingerprints.csv are interspersed in giant diffs,
and this basically only gets a max of 2-6 byte wins per data, but the
overall data size went down by 200KB. Not amazing, not terrible.

```rust
[18:26:22] मanishearth@manishearth-glaptop2 ~/dev/icu4x/provider/data ^_^ 
$ rg total | awk '{ gsub(/B,/, "", $3); s +=$3} END{print s}' 
23501369
[18:26:08] मanishearth@manishearth-glaptop2 ~/dev/icu4x/provider/data ^_^ 
$ rg total | awk '{ gsub(/B,/, "", $3); s +=$3} END{print s}' 
23391499
```
Manishearth added a commit that referenced this issue Nov 7, 2024
The VarULE counterpart of TupleNULE

Part of #5523. Planned to be
used in #4437

I'm not super happy with the naming with this vs VarTupleULE, but I've
tried to make it clearer with the module names and it's fine for now. We
can rename as desired since zerovec isn't on the ICU4X stability track.

I do plan to add serde/etc impls but that's going to be a separate PR.


<!--
Thank you for your pull request to ICU4X!

Reminder: try to use [Conventional
Comments](https://conventionalcomments.org/) to make comments clearer.

Please see
https://github.com/unicode-org/icu4x/blob/main/CONTRIBUTING.md for
general
information on contributing to ICU4X.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake
Projects
Status: Being worked on
Development

No branches or pull requests

2 participants