
Any interest in a new directory hashing format? #872

Open
Ericson2314 opened this issue Mar 12, 2025 · 6 comments
Labels
buck2 development (Related to the development of buck2 itself) · question (Further information is requested) · remote execution (Of relation to the Remote Execution integrations)

Comments

@Ericson2314 commented Mar 12, 2025

This is a follow-up to #866. I'm still thinking about ways Nix and Buck2 can share algorithms for the same problems to encourage interop.

We Nix devs can implement the Bazel RBE protobuf Merkle format in Nix, and we probably should because it is widespread in the ecosystem, but it has some flaws for our use-case that, IMO, make a new format worthwhile. I see now that Buck2 has the trait infra to potentially support many such formats, so I hope this is not ipso facto a ridiculous request :).

Here are the flaws and fixes:

  1. The first obvious flaw with RBE is that it hashes protobuf. The extensibility is nice, but the non-canonicity is not (https://protobuf.dev/programming-guides/serialization-not-canonical/). This is especially important for Nix, where the content-addressing can extend to the choice of file names themselves. (How things are hashed can be exposed by determining file names based on that hash, in contrast to content-addressing just being an implementation detail.) Something like https://github.com/diem/bcs might be more appropriate; see the sketch after this list.

  2. I see that it inherits a flaw from git, which is that there are no single unambiguous hashes for files and symlinks.

    You can have a directory hash, or a file hash, but you can't have a hash for a single thing which may be any one of a file, directory, or symlink. You can hash *Node types, but those hashes are not used in the hash of Directory.

    This flaw IMO boils down to the lack of sum types in protobuf. One would like a directory to look something like Map<FileName, Hash<EntryValue>>, where EntryValue is one of a file, executable, or symlink, possibly with other information (timestamps?, extra perms?). But such a structure is/involves tagged unions, which are awkward in protobuf, so instead they partition the entries into per-variant collections, and then we lose the single-unambiguous-child-hash property.

    A practical ramification, for Nix, of having no cheap way to go from Map<FileName, Hash<EntryValue>> to a directory is that it makes it a little bit harder to describe store sandboxes. We would like to cheaply and easily say "build with a store containing just these root store objects and their references", which means going from:

    1. a collection of root objects (by content-address)
    2. a collection of objects closed under reference closure (by content-address)
    3. a directory of that collection, with the names calculated from the content-addresses (by content-address)

    But with a format like Bazel's RBE, we can't just use digests for the above, but must use directory entries, which are rather bulkier (more metadata) and not fixed-size (symlink targets). This is not insurmountable, but it is less nice than working with hashes of that information.

    Rust, and BCS, support sum types, again making for an easy fix to this.

  3. A final thing that @edef1c has enlightened me about is using Radix trees for directories, to enable protocols which can utilize Merkle inclusion proofs to securely allow looking up single directories without first verifying the entire directory object against its hash. This would be useful for very wide directories like our stores (more likely for whole-system network booting at "deploy time" than for the dependencies of any individual build step at "build time"). Implementing the radix trie correctly makes the format much more involved, and there is also the question of tuning the radix, but the netboot use case is quite compelling to me.
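
For concreteness, here is a minimal sketch of how fixes 1 and 2 compose (assuming the serde, bcs, and sha2 Rust crates; the exact EntryValue shape is illustrative, not a settled proposal): the sum type gives every child a single unambiguous hash, and BCS gives a canonical byte encoding to feed that hash.

use serde::Serialize;
use sha2::{Digest as _, Sha256};
use std::collections::BTreeMap;

#[derive(Serialize)]
enum EntryValue {
    File { executable: bool, content: Vec<u8> },
    Symlink { target: String },
    Directory(BTreeMap<String, [u8; 32]>), // FileName -> Hash<EntryValue>
}

// One hash function covers files, symlinks, and directories alike.
fn hash_entry(entry: &EntryValue) -> [u8; 32] {
    // BCS admits exactly one encoding per value, so this hash is well-defined.
    let bytes = bcs::to_bytes(entry).expect("in-memory serialization");
    Sha256::digest(&bytes).into()
}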

I suppose whether Meta thinks performant FUSE random access for very wide directories is worth the complexity is more of an EdenFS than Buck2 question.

@sluongng (Contributor)

I wonder if https://github.com/bazelbuild/remote-apis is a more appropriate place to raise this discussion. Because Buck2 currently represents only the client-side implementation of a remote build protocol. To make a new protocol work, you would need a server-side implementation supporting such a protocol as well.

The first obvious flaw with RBE is that it hashes protobuf. The extensibility is nice, but the non-canonicity is not (https://protobuf.dev/programming-guides/serialization-not-canonical/).

I think this is a fair criticism. The remote API spec does help clarify how the digest of a proto message MUST be computed to help tighten the rule a bit further.

// When a `Digest` is used to refer to a proto message, it always refers to the
// message in binary encoded form. To ensure consistent hashing, clients and
// servers MUST ensure that they serialize messages according to the following
// rules, even if there are alternate valid encodings for the same message:
//
// * Fields are serialized in tag order.
// * There are no unknown fields.
// * There are no duplicate fields.
// * Fields are serialized according to the default semantics for their type.
//
// Most protocol buffer implementations will always follow these rules when
// serializing, but care should be taken to avoid shortcuts. For instance,
// concatenating two messages to merge them may produce duplicate fields.

https://github.com/bazelbuild/remote-apis/blob/2721568dea746d81a98a116ee222c8da50bbcf5c/build/bazel/remote/execution/v2/remote_execution.proto#L979-L991

This is especially important for Nix, where the content-addressing can extend to the choice of file names themselves. (How things are hashed can be exposed by determining file names based on that hash, in contrast to content-addressing just being an implementation detail.)

I am a bit confused by this part.
It would be nice to get a concrete example here.

But with a format like Bazel's RBE, we can't just use digests for the above, but must use directory entries, which are rather bulkier (more metadata) and not fixed-size (symlink targets). This is not insurmountable, but it is less nice than working with hashes of that information.

I also don't understand this claim.
A binary serialization of a proto message can be quite compact.
And I don't fully get why "fixed-size" is needed in the serialization format of a generic directory tree of variable size? Perhaps I am missing some context here.

using Radix trees for directories, to enable protocols which can utilize Merkle inclusion proofs to securely allow looking up single directories without first verifying the entire directory object against its hash.

There are several discussions over in remote-apis about how to better represent trees in a future version of the protocol.

I think your problem is very interesting and it would be nice to get a concrete analysis to compare how a Radix tree would help improve things over the existing format.

@Ericson2314 (Author) commented Mar 12, 2025

I wonder if https://github.com/bazelbuild/remote-apis is a more appropriate place to raise this discussion.

I'm happy to move it there if you prefer.

I sort of think the broader venue might make it a tougher sell, especially for the sum-type stuff, which will be confusing for those using languages that don't have sum types / only have a library-level best attempt like std::variant. In other words, if there is something Nix and Buck2 could mutually agree upon first, and then take to the others, I think that would be highly valuable!

Because Buck2 currently represents only the client-side implementation of a remote build protocol. To make a new protocol work, you would need a server-side implementation supporting such a protocol as well.

I see, the build side at Meta is either off-the-shelf or proprietary? In any event, Nix does both sides (dating back to when we didn't have a daemon or remote execution at all).


I think this is a fair criticism.

OK, glad at least that part made sense :).

The remote API spec does help clarify how the digest of a proto message MUST be computed to help tighten the rule a bit further.

Yes, the "serializer should do the right thing" rule makes sense to me. But I would want a deserializer that is just as strict, too. I am not sure many protobuf implementations support that as one step, but yes, one can just reserialize to the canonical form and check that it matches the input. It's not too hard.
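
For example, a strict decode could be a decode-then-reencode round trip. A minimal sketch, assuming the prost crate, generic over any generated message type (such as the remote-apis Directory):

use prost::Message;

// Strict decode: decode, re-encode canonically, and reject on mismatch.
fn decode_strict<M: Message + Default>(bytes: &[u8]) -> Result<M, &'static str> {
    let msg = M::decode(bytes).map_err(|_| "malformed protobuf")?;
    // prost encodes fields in tag order with default proto3 semantics, so a
    // byte-for-byte mismatch flags non-canonical input (unknown fields,
    // duplicate fields, out-of-order encodings, ...).
    if msg.encode_to_vec() != bytes {
        return Err("input was not in canonical form");
    }
    Ok(msg)
}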

[...] This is especially important for Nix, where the content-addressing can extend to the choice of file names themselves. [...]

I am a bit confused by this part. It would be nice to get a concrete example here.

Sure!

First, a disclaimer: I'm going to just assume all store objects are content-addressed for now with Nix. (We also have "input-addressed" store objects, which really philosophically muddle things. But I would like to basically deprecate them, so it is not fantastical to ignore them.)

An (entirely content-addressed) Nix store is semantically a Set<StoreObject>. We materialize it as a directory (Map<Name, FileSystemObject>) with a computeFileNameFromContent : StoreObject -> FileName. computeFileNameFromContent is doing content-addressing that is directly user-visible --- user code gets to see and observe these file names. Store objects, which refer to other store objects via these file names, likewise make the choice of content-addressing algorithm quite "viral", in that it affects just about every absolute path found in build artifacts.
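
To illustrate (schematic only; this is not Nix's actual store-path algorithm, which uses a truncated, base-32-encoded hash):

// The hash appears in the user-visible file name itself, so the choice of
// hashing scheme leaks into every path that names this object.
fn compute_file_name_from_content(digest: &[u8; 32], name: &str) -> String {
    let hash_part: String = digest[..16].iter().map(|b| format!("{:02x}", b)).collect();
    format!("{}-{}", hash_part, name)
}

A store object then sits at /nix/store/<that-file-name>, and any build artifact that references it embeds that path verbatim.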

Taking a step back, I think (but am not sure) that Buck 1 & 2, Bazel, Meson, CMake, going back to Make, are all in the tradition where output and input files are named at least partially according to user-chosen / user-meaningful rule/action names. Nix breaks from that tradition by "leaking" the storage model: all your "stuff" goes in the giant flat "store" directory. Our store object file names currently do have a name part, but this just muddies the philosophical difference --- it would be a more "honest" design if the store object names were just (text-encoded) hashes, completely human-unreadable.

Hope that helps!

I also don't understand this claim. A binary serialization of a proto message can be quite compact. And I don't fully get why "fixed sized" is needed for the serialization format for a generic directory tree of variable size? Perhaps I am missing some context here.

Here's another way of looking at it: are we modeling file systems or "file system objects"? When we model file systems, we know the root objects are directories, at least on Unix (single root) and Windows (multiple roots). So one usually gets a conceptual model like:

struct Directory { ...FileSystemObject... } 

enum FileSystemObject { ...Directory... }

(where my ...Name... placeholders indicate what is used to express the mutual recursion.)

When we Merklize it (the type argument to Digest is phantom), we likewise replace one or more of those usages with hashes instead, like this:

struct Directory { ...Digest<FileSystemObject>... } 

enum FileSystemObject { ...Directory... }

(this is how I think of Git)

or like this:

struct Directory { ...FileSystemObject... } 

enum FileSystemObject { ...Digest<Directory>... }

(this is how I think of Bazel RBE's format)

With Nix however, we're principally concerned with modeling not the entire store as a file system object, but modeling the contents of each file system object, which can equally be directories, or individual files, or individual symlinks. We thus don't need any mutual recursion at all.

Our model is thus this (matching our docs https://nix.dev/manual/nix/2.26/store/file-system-object):

enum FileSystemObject {
    File { executable: Bool, content: Vec<u8>, },
    Symlink { target: String },
    Directory { children: Map<String, FileSystemObject> },
}

Ignoring the radix trie part, the model I lean towards for a Merkle Dag is:

enum FileSystemObject {
    File { executable: Bool, content: Digest<Vec<u8>>, },
    Symlink { target: String },
    Directory { children: Map<String, Digest<FileSystemObject>> },
}

This to me has two advantages:

  1. Conceptual simplicity: the single unified FileSystemObject content-address. A directory is just Map<String, Digest<FileSystemObject>>, and the content-address (Digest<FileSystemObject>) of an object is always the same whether it is the root object or the child of a directory.

  2. Easy directory building, a corollary of Map<String, Digest<FileSystemObject>>, where as much info as possible is "behind the hash". (FWIW, since Unix makes the dubious choice of putting permissions in the inode and not the dirent, one could say "all things with the same content-address have the same inode number", and cause chaos with directory hardlinks. This is very tempting to me :).)

I might add that the 2 rounds of hashing in the file case are not strictly necessary, but I think that is better than the single round of hashing git does (where it prefixes the file content with "blob"). I think Bazel RBE is right that the hash of the raw file needs to be part of the Merkle DAG; otherwise you break compatibility with too many blob stores for no reason.
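
To make the unified content-address concrete, here is a minimal hashing sketch (assuming the sha2 crate; the encoding is ad hoc for illustration, where a real format would use something canonical like BCS):

use sha2::{Digest as _, Sha256};
use std::collections::BTreeMap;

type Digest32 = [u8; 32];

enum FileSystemObject {
    File { executable: bool, content: Digest32 },       // Digest<Vec<u8>>
    Symlink { target: String },
    Directory { children: BTreeMap<String, Digest32> }, // Digest<FileSystemObject>
}

// The same function addresses roots and children alike.
fn hash_object(obj: &FileSystemObject) -> Digest32 {
    let mut h = Sha256::new();
    match obj {
        FileSystemObject::File { executable, content } => {
            h.update([0u8, *executable as u8]); // variant tag + executable bit
            h.update(content);
        }
        FileSystemObject::Symlink { target } => {
            h.update([1u8]);
            h.update((target.len() as u64).to_le_bytes());
            h.update(target.as_bytes());
        }
        FileSystemObject::Directory { children } => {
            h.update([2u8]);
            // BTreeMap iterates in sorted key order, keeping the encoding canonical.
            for (name, digest) in children {
                h.update((name.len() as u64).to_le_bytes());
                h.update(name.as_bytes());
                h.update(digest);
            }
        }
    }
    h.finalize().into()
}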

There are several discussions over in remote-apis about how to better represent trees in a future version of the protocol.

Yes, maybe I should at least take the Radix trie part of this over to that repo. It is less exotic, in that it involves no "please think with sum types" culture shock, and also less Nix-specific, in that it is pretty easy to imagine random access of large directories as a problem in isolation.

@Ericson2314 (Author)

Oh, I forgot about oneof. I guess proto3 is better for sum types than I remembered. That gives me more hope for discussing this in https://github.com/bazelbuild/remote-apis.

@alexlian added the remote execution, buck2 development, and question labels on Mar 13, 2025
@Ericson2314 (Author) commented Mar 13, 2025

(Note: I have been talking more with @edef1c, and I now see a concrete benefit to more-than-hash directory children: they can supply the dirent d_type field without further IO.)

In other words

enum FileSystemObject {
    File { executable: Bool, content: Digest<Vec<u8>>, },
    Symlink { target: Digest<String> },
    Directory { children: Digest<Map<String, FileSystemObject>> },
}

might be best:

  • FileSystemObject is still fixed-size
  • Directory has enough info (via the enum tag) for efficient FUSE

(Concretely, the last point means that to create a directory, you must provide enough info up front that the d_type tag is available. Even if the underlying file system knows nothing of this format per se, Map<String, FileSystemObject> provides enough info that it can build out its own directory object with no IO.)
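
A sketch of that FUSE benefit (assuming the libc crate on Linux; the enum mirrors the one above): readdir can fill in d_type from the variant tag alone, with no extra reads from the CAS.

type Digest32 = [u8; 32];

enum FileSystemObject {
    File { executable: bool, content: Digest32 },
    Symlink { target: Digest32 },
    Directory { children: Digest32 },
}

// d_type comes straight from the variant tag carried alongside the digest.
fn d_type(obj: &FileSystemObject) -> u8 {
    match obj {
        FileSystemObject::File { .. } => libc::DT_REG,
        FileSystemObject::Symlink { .. } => libc::DT_LNK,
        FileSystemObject::Directory { .. } => libc::DT_DIR,
    }
}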

That substantially winnows down my critique of the existing format, basically saying that except for unbounded metadata and the raw symlink target, the Node* types are OK.

@sluongng (Contributor)

Yeah, I think you've got the hang of it now.

So instead of using oneof, the remote execution proto defines a directory like this:

message Directory {
  // The files in the directory.
  repeated FileNode files = 1;

  // The subdirectories in the directory.
  repeated DirectoryNode directories = 2;

  // The symlinks in the directory.
  repeated SymlinkNode symlinks = 3;

  // The node properties of the Directory.
  reserved 4;
  NodeProperties node_properties = 5;
}

with FileNode and DirectoryNode being pointers to other File and Directory messages in the CAS. All fields in proto3 are treated as optional, so this is effectively the list of enums you wanted.

I guess what's missing here is the "key" portion of the Map<String, FileSystemObject>, which means that you would have to loop through all the *Node->name fields to find what you need, but that's usually what you would need to do anyway at the library level to construct such a map. And there are multiple FUSE implementations built on top of the existing spec as well.
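
For instance, rebuilding the map view is one linear pass over the three lists (a sketch; the structs below are simplified stand-ins for the generated remote-apis types, not the real generated code):

use std::collections::BTreeMap;

struct FileNode { name: String, digest: String, is_executable: bool }
struct DirectoryNode { name: String, digest: String }
struct SymlinkNode { name: String, target: String }

struct Directory {
    files: Vec<FileNode>,
    directories: Vec<DirectoryNode>,
    symlinks: Vec<SymlinkNode>,
}

enum Entry {
    File { digest: String, executable: bool },
    Dir { digest: String },
    Symlink { target: String },
}

// Fold the three partitioned `repeated` fields into Map<String, Entry>.
fn to_map(d: &Directory) -> BTreeMap<String, Entry> {
    let mut m = BTreeMap::new();
    for f in &d.files {
        m.insert(f.name.clone(), Entry::File { digest: f.digest.clone(), executable: f.is_executable });
    }
    for sub in &d.directories {
        m.insert(sub.name.clone(), Entry::Dir { digest: sub.digest.clone() });
    }
    for s in &d.symlinks {
        m.insert(s.name.clone(), Entry::Symlink { target: s.target.clone() });
    }
    m
}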

@Ericson2314 (Author)

@sluongng Yes, I am now fine with the 3 partitioned repeated fields as an acceptable wire encoding.

My initial problem was less with doing the partitioning into 3 lists than with having the information to do so, since I wanted the file/dir/symlink type to be "behind the digest" of the child. Now that I want this info not "behind the digest" but alongside it, for the sake of dirent_t's d_type, I'm on board with a design that makes the 3-way partition possible and acceptable.

For Nix's purposes, the type + digest, or the *Node types without the name, would collectively be the content-address, rather than just the digest alone (which is insufficient). For example, we would include that on our "store object manifests" as an alternative to the "NAR hash" we use today. There is nothing wrong with such a content-address being semi-transparent, rather than a single opaque hash, for our purposes.

The remaining criticism is that it would still be nice for the content-address to be fixed-size, however. That is where wishing the symlink target were also hashed comes into play. (The perms and other metadata can, I think, also be arbitrarily large, but we can just reject those when parsing, which we should do anyway, since they represent degrees of freedom outside our model.)
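
In other words, something like this hypothetical fixed-size "type + digest" content-address (not an existing Nix or remote-apis type; every variant is the same small size once the symlink target is hashed too):

type Digest32 = [u8; 32];

enum ContentAddress {
    File { executable: bool, digest: Digest32 }, // digest of the raw file bytes
    Symlink { digest: Digest32 },                // digest of the target string
    Directory { digest: Digest32 },              // digest of the children
}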

(Also remaining are the serialized-format canonicity issue and the wide-directory random-access issue (probably needing a trie), but those are orthogonal.)
