Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(gossip): reduce gossip table memory usage #526

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

dnut
Copy link
Contributor

@dnut dnut commented Jan 31, 2025

The gossip table was previously using a regular hashmap to hold up to 1 million instances of GossipData, which is a 732 byte tagged union due to ContactInfo being 728 bytes. This means the hashmap would be at least 732 MB when full. Running on mainnet I witnessed it growing to over 1 GB.

The bulk of this data is unused due to the union being so large. Most of the union fields are actually much smaller, around 50-100 bytes. There are only around ~5000 contact infos being stored, but each of the ~1 million other elements are being allocated 732 bytes since they are instances of the same union.

The solution for this is to store each union variant's unwrapped type in a separate array. So if an item is 50 bytes, it only uses 50 bytes in the array, instead of using 732 bytes as union. This is implemented in SplitUnionList, which contains a separate ArrayList for every union field. GossipMap is a new struct that wraps this, mapping from GossipKey to the items within SplitUnionList.

I still need to do more rigorous benchmarking, but initial tests are showing that the gossip table is about 20% of its previous size.

dnut added 5 commits January 28, 2025 22:00
when overwriting, there was a bug (recently introduced) where the overwritten item is supposed to be returned to the caller, and they deinit it. instead, the *new* item was being returned as if it was the overwritten item. the caller would deinit the new item, leaving corrupted memory in the gossip table (which would eventually be double freed). the overwritten item would never be freed, so it leaked

also adds a test and adds a stack trace to an error log
…usize index, just using tagged index

this simplificaiton means you can no longer index with an integer over all items. this is a limitation, but this extra indexing makes the data structure use a surprising amount of memory. instead i've opted to make the data structure a simpler building block, and the user of the struct can implement a wrapper with this indexing if needed.

in this case, its internal indexing was redundant with indexing that GossipMap is already doing. so by stripping out this indexing, GossipMap doesn't need any extra logic to handle it. i've just simplified the code and reduced the memory usage.
@dnut dnut changed the title perf(gossip): reduce memory usage perf(gossip): reduce gossip table memory usage Jan 31, 2025
@dnut dnut marked this pull request as ready for review January 31, 2025 03:15
Copy link
Contributor

@0xNineteen 0xNineteen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few nits - otherwise lgtm

@@ -191,6 +191,25 @@ const GossipDataTag = enum(u32) {
ContactInfo,
RestartLastVotedForkSlots,
RestartHeaviestFork,

pub fn Value(self: GossipDataTag) type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value lowercase methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

types are conventionally capitalized

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it being capital is correct


const assert = std.debug.assert;

pub const GossipMap = struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to add a doc comment about this struct with more context on why we have it/the pros of it

};
}

pub const Entry = struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these nested structs make sense but also feel off and make it difficult to read - feel like most of these structs can be outside of the struct - since all the logic is already enclosed in map.zig - so the context/space is already defined okay i think

self.logger.err().logf("failed to generate push messages: {any}", .{e});
self.logger.err().logf(
"failed to generate push messages: {any}\n{any}",
.{ e, @errorReturnTrace() },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encountered this error message and it wasn't very helpful for debugging the issue. Adding the stack trace enabled me to figure out the problem. I'd actually like to move towards this being a standard pattern any time we log an error. I was considering a few options like:

  • service manager can do this whenever any a service exits with an error
  • the logger could have a method that's specifically for logging errors and capturing their stack trace (not sure if this would work)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting idea. I understand how the error return trace could be helpful here, but I also find it very noisy and distracting. If this is an error message that could happen, but isn't fatal, and happens somewhat regularly (example: "NoPeers found"), I'd rather not commit code with the error return trace printing.

This is like leaving random debug prints in the code, it can be helpful for debugging, but it's annoying to others using the code. Unless I'm specifically trying to debug why this error message happened, I wouldn't like to see the trace.

@@ -64,6 +64,188 @@ pub fn RecyclingList(
};
}

pub fn SplitUnionList(TaggedUnion: type) type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to add a doc comment explaining the high level of this helper

}
}

pub const Entry = struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move this to be above getEntry

pub fn next(self: *Iterator) ?Entry {
defer self.cursor += 1;
return if (self.cursor < self.keys.len) .{
.key_ptr = &self.keys[self.cursor],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just give iterator self and then call getEntryByIndex ?


// replace data with newly swapped value
const entry = self.store.getEntryByIndex(entry_index);
// gossip_data now points to the element which was swapped in and needs updating
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think this comment is off


var keys = std.ArrayList(GossipKey).init(std.testing.allocator);
defer keys.deinit();
// var data = std.ArrayList(GossipVersionedData).init(std.testing.allocator);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need this commented out code?

for (0..num_items) |i| {
const versioned_value = gossip_values[i];
var gossip_values = gossip_table.store.iterator();
for (0..num_items) |_| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not:

Suggested change
for (0..num_items) |_| {
while (gossip_values.next()) |versioned_value|

?

@@ -46,7 +47,7 @@ pub fn AutoArrayHashSet(comptime T: type) type {
///
/// Analogous to [Crds](https://github.com/solana-labs/solana/blob/e0203f22dc83cb792fa97f91dbe6e924cbd08af1/gossip/src/crds.rs#L68)
pub const GossipTable = struct {
store: AutoArrayHashMap(GossipKey, GossipVersionedData),
store: GossipMap = .{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add a default value here and not initialize in the init function instead? I'd rather us use the init function if it already exists, default values are a bit of a smell.

const metadata = self.map.metadata.items[index];
if (metadata.timestamp_on_insertion >= self.minimum_insertion_timestamp) {
switch (self.map.tagOfIndex(index)) {
inline .ContactInfo => |tag| return self.map.getTypedPtr(tag, index),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a useless inline keyword, same thing on the next line. I'd rather you just use getTypedPtr(.ContactInfo, no need to be fancy as it's just harder to read.

.ContactInfo => {
const did_remove = self.contact_infos.swapRemove(table_len);
std.debug.assert(did_remove);
self.contact_infos.put(entry_index, {}) catch unreachable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these can't fail.

.ContactInfo => |ci| if (ci.getSocket(.gossip)) |addr| {
if (addr.eql(&gossip_addr)) return try ci.clone();
switch (self.store.tagOfIndex(index)) {
inline .ContactInfo => |tag| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing again, useless inline and just complicates the code.


pub const Index = struct {
tag: Tag,
index: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making the sub-index a non-exhaustive enum for type safety?

@@ -77,6 +77,28 @@ pub fn ReturnType(comptime FnPtr: type) type {
};
}

/// Same as std.EnumFieldStruct, except every field may be a different type
pub fn EnumStruct(comptime E: type, comptime Data: fn (E) type) type {
@setEvalBranchQuota(@typeInfo(E).Enum.fields.len + 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the + 5 come from?

for (&struct_fields, @typeInfo(E).Enum.fields) |*struct_field, enum_field| {
const T = Data(@field(E, enum_field.name));
struct_field.* = .{
.name = enum_field.name ++ "",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ++ ""?

.type = T,
.default_value = null,
.is_comptime = false,
.alignment = if (@sizeOf(T) > 0) @alignOf(T) else 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why special case @sizeOf(T) == 0? I don't think align(0) does what you think it does. It selects the default alignment for the target, which we don't care about here. (it also doesn't exist in Zig 0.14 anymore).

const Tag = @typeInfo(TaggedUnion).Union.tag_type.?;

return struct {
lists: sig.utils.types.EnumStruct(Tag, List),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to have an ArrayList(u8) thats bump allocated from? Where each item is struct { Tag, DataType } (user reads Tag first, then fieldParentPtr the struct to get the Data ptr).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

4 participants