feat(event cache): unload a linked chunk whenever we get a limited sync #4694

Merged
merged 7 commits into main from bnjbvr/unload-chunk on Feb 24, 2025

Conversation

Member

@bnjbvr bnjbvr commented Feb 19, 2025

This implements unloading the linked chunk, so as to free memory on the one hand, and avoid some weird corner cases like #4684 on the other hand.

Unloading a linked chunk happens in two steps (see the sketch after this list):

  • first, load the last chunk from storage,
  • then, replace the current linked chunk with that last chunk.
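
In code terms, here is a minimal, self-contained sketch of that two-step flow. All the types below are simplified stand-ins rather than the SDK's real ones; only the `replace_with` name comes from the review discussion further down.

```rust
// Hedged sketch: simplified stand-in types, not the SDK's actual API.

#[derive(Clone, Debug, PartialEq)]
struct Chunk {
    events: Vec<String>,
}

/// Stand-in for the in-memory linked chunk of a room.
struct LinkedChunk {
    chunks: Vec<Chunk>,
}

impl LinkedChunk {
    /// Replace the whole in-memory chain with a single chunk, dropping
    /// everything that was loaded before it.
    fn replace_with(&mut self, last: Chunk) {
        self.chunks = vec![last];
    }
}

/// Stand-in for the event cache store (persisted chunks, oldest first).
struct Store {
    persisted: Vec<Chunk>,
}

impl Store {
    /// Step 1: load only the most recent chunk from storage.
    fn load_last_chunk(&self) -> Option<Chunk> {
        self.persisted.last().cloned()
    }
}

/// Step 2: swap the current linked chunk for that last chunk.
fn unload_to_last_chunk(store: &Store, linked: &mut LinkedChunk) {
    if let Some(last) = store.load_last_chunk() {
        linked.replace_with(last);
    }
}
```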

Then, we make use of that functionality whenever we receive a gap via sync. This resolves the situation where we start with a hot cache store, that has one old event E1; the room's state is actually [E1, E2, E3], and the last sync returns [Gap, E3]. In this case, since we don't render gaps yet in the timeline, the timeline would show [E1, E3], making it look like we missed event E2; although the next pagination would make it appear. Instead, we here unload the linked chunk to its last chunk (E3), so that it clears [E1] from rendering, and the next paginations will start from the latest gap.
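
Continuing the stand-in sketch above, the scenario described here plays out as follows: only E1 is loaded in memory, the store's most recent chunk holds E3 (written when the gappy sync arrived), and unloading leaves only E3 to render, with E2 recovered later through back-pagination.

```rust
fn main() {
    // Persisted state: an old chunk with E1, and the latest chunk with E3
    // (written when the gappy sync came in).
    let store = Store {
        persisted: vec![
            Chunk { events: vec!["E1".into()] },
            Chunk { events: vec!["E3".into()] },
        ],
    };

    // Hot cache: only the old event E1 is loaded in memory.
    let mut linked = LinkedChunk { chunks: vec![Chunk { events: vec!["E1".into()] }] };

    // Limited (gappy) sync received: unload down to the last chunk.
    unload_to_last_chunk(&store, &mut linked);

    // Only E3 remains to render; E2 will come back via pagination through the gap.
    assert_eq!(linked.chunks, vec![Chunk { events: vec!["E3".to_string()] }]);
}
```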

Fixes #4684.
Part of #3280.

@bnjbvr bnjbvr requested a review from a team as a code owner February 19, 2025 15:15
@bnjbvr bnjbvr requested review from andybalaam and Hywan and removed request for a team and andybalaam February 19, 2025 15:15
Member

@Hywan Hywan left a comment


That's really exciting. Thank you for working on this! That's exactly what I had in mind and what we talked about together. Super happy we are aligned on this.

The novelty, compared to what I was imagining, is the replace_with API, which I find pretty elegant. Kudos for that.

I've left a couple of comments about possible unsafety. I don't think the way your patch is implemented actually creates unsafety, but I believe marking one or two of the methods unsafe is essential.

Yes, tests are missing, but I know this is a first pass and that you'll write them.

⚠️ My main concern, though, is the following. The user of the EventCache, and thus of the Timeline, will see an Update::Clear, then an Update::NewItemsChunk. Translated by linked_chunk::AsVector, this gives a VectorDiff::Clear, then a VectorDiff::PushBack. Basically, the timeline will “blink”/“flash”. This is not ideal at all, knowing that it can happen pretty often…

I see two solutions here:

  • Either we write a heuristic in AsVector:
    • when a VectorDiff::Clear is followed by a VectorDiff::PushBack or other insertions, it can be folded/merged into a VectorDiff::Reset { values } (see the sketch after this list);
    • however, the Timeline will re-create the timeline items with new unique IDs, so the renderer on the app side will not be able to compute a clean diff, and will… “blink”/“flash” again (all timeline items will be dropped, and new items will be re-created);
    • we could optimise that on the Timeline side by re-using the same unique ID for items that have been removed and re-inserted, based on their event $event_id, but I think that starts to create many complications.
  • Or, instead of emitting an Update::Clear, we emit a series of Update::RemoveChunk updates until only one chunk remains. This slightly changes the approach: instead of having a replace_with, we get a remove_all_except_last. The underlying code remains the same, but the Updates are different.
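
For the first option, here is a minimal sketch of what such a folding heuristic could look like, using a simplified stand-in VectorDiff enum rather than the real eyeball-im type (names and shapes below are illustrative only, not the SDK's actual AsVector code):

```rust
// Hedged sketch with a simplified stand-in VectorDiff.
enum VectorDiff<T> {
    Clear,
    PushBack { value: T },
    Reset { values: Vec<T> },
}

/// Fold a leading `Clear` followed only by `PushBack`s into a single `Reset`,
/// so subscribers see one atomic replacement instead of a "blink".
fn fold_clear_then_pushes<T: Clone>(diffs: Vec<VectorDiff<T>>) -> Vec<VectorDiff<T>> {
    match diffs.split_first() {
        Some((VectorDiff::Clear, rest))
            if !rest.is_empty()
                && rest.iter().all(|d| matches!(d, VectorDiff::PushBack { .. })) =>
        {
            let values = rest
                .iter()
                .map(|d| match d {
                    VectorDiff::PushBack { value } => value.clone(),
                    _ => unreachable!(),
                })
                .collect();
            vec![VectorDiff::Reset { values }]
        }
        _ => diffs,
    }
}
```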

I am not inclined to approve this PR until we have a consensus around this question; I know you understand that. It doesn't mean your work is not good: it is excellent, and I couldn't do better myself. Congratulations on that. I think, however, that we must answer these fundamental questions before moving forward.

Hywan added a commit to Hywan/matrix-rust-sdk that referenced this pull request Feb 19, 2025
This patch updates `Update::RemoveChunk` to emit `VectorDiff::Remove`.
Until now, `RemoveChunk` expected the chunk to be empty, because that is
how it has been used so far. However, with matrix-org#4694, this can
change quickly.
Hywan added a commit that referenced this pull request Feb 20, 2025
This patch updates `Update::RemoveChunk` to emit `VectorDiff::Remove`.
Until now, `RemoveChunk` expected the chunk to be empty, because that is
how it has been used so far. However, with #4694, this can change
quickly.
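
As a purely illustrative aside (not the SDK's actual code), one plausible way to translate a RemoveChunk update into vector diffs is to emit one removal per item still held by the removed chunk:

```rust
// Hedged sketch with a simplified stand-in VectorDiff; the real mapping
// lives in the linked_chunk::AsVector machinery.
enum VectorDiff {
    Remove { index: usize },
}

/// Removing a chunk of `len` items that starts at `offset` in the flattened
/// vector: every removal targets the same index, since later items shift left.
fn remove_chunk_diffs(offset: usize, len: usize) -> Vec<VectorDiff> {
    (0..len).map(|_| VectorDiff::Remove { index: offset }).collect()
}
```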
@bnjbvr bnjbvr force-pushed the bnjbvr/unload-chunk branch from d0d20a3 to b147456 on February 20, 2025 13:46

codecov bot commented Feb 20, 2025

Codecov Report

Attention: Patch coverage is 91.86047% with 7 lines in your changes missing coverage. Please review.

Project coverage is 85.91%. Comparing base (bdf5fad) to head (df6108c).
Report is 14 commits behind head on main.

Files with missing lines                            Patch %   Lines
crates/matrix-sdk/src/event_cache/room/mod.rs       81.25%    6 Missing ⚠️
crates/matrix-sdk/src/event_cache/room/events.rs    85.71%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4694      +/-   ##
==========================================
+ Coverage   85.90%   85.91%   +0.01%     
==========================================
  Files         292      292              
  Lines       33850    33903      +53     
==========================================
+ Hits        29078    29128      +50     
- Misses       4772     4775       +3     


@bnjbvr
Member Author

bnjbvr commented Feb 20, 2025

⚠️ My main concern, though, is the following. The user of the EventCache, and thus of the Timeline, will see an Update::Clear, then an Update::NewItemsChunk. Translated by linked_chunk::AsVector, this gives a VectorDiff::Clear, then a VectorDiff::PushBack. Basically, the timeline will “blink”/“flash”. This is not ideal at all, knowing that it can happen pretty often…

For what it's worth, we discussed this offline and came to the conclusion that correctness is more important than performance here. In the absence of this crucial fix, it might look like there are missing messages in a timeline. I also suspect that the batching at the output of the timeline's subscription would mostly hide the problem described here (or result in a timeline "flash" if the timeline happened to be open while a new gappy sync arrives), but let's proceed in multiple steps.

Member

@Hywan Hywan left a comment


It's even better! Well done.

I suspect we have a bug, and that's why I can't approve the PR for the moment; please see my feedback.

Comment on lines 1331 to 1344
// Run pagination once: it will consume prev-batch2 first, which is the most
// recent token, which returns an empty batch, thus indicating the start of the
// room.
let pagination = room_event_cache.pagination();

let outcome = pagination.run_backwards_once(20).await.unwrap();
assert!(outcome.reached_start);
assert!(outcome.events.is_empty());
assert!(stream.is_empty());

// Next, we lazy-load a next chunk from the store, and get the initial, empty
// default events chunk.
let outcome = pagination.run_backwards_once(20).await.unwrap();
assert!(outcome.reached_start.not());
assert!(outcome.events.is_empty());
assert!(stream.is_empty());
Member

@Hywan Hywan Feb 21, 2025


⚠️

What? We reach the start of the timeline, then we paginate again, and we no longer reach the start of the timeline?

How is the Timeline supposed to know it has to paginate once again if reached_start is set to true? Is this a bug?

Member Author


Good catch! This was happening because there was an inconsistency between the network (which indicated that we've reached the start of the room) and the persisted storage on disk (where we may have an empty initial events chunk before the final gap we just resolved).

I will add a commit that makes sure to override this value based on the current state of the chunk first, before falling back to the reached_start value obtained from the network if we couldn't figure it out ourselves (i.e. there wasn't any previous chunk).

In the future, we should consider not having empty chunks in the first place, as you hinted on Matrix, but I'd like to keep this PR smallish and land it as soon as possible, as it's important for correctness (getting rid of empty chunks is more of an optimization, in my opinion).

…orage updates

And rename it accordingly to `RoomEvents::store_updates`.

Note: no changelog, because this is an internal API only.
@bnjbvr bnjbvr force-pushed the bnjbvr/unload-chunk branch from daff99c to 2adce44 on February 24, 2025 11:50
@bnjbvr bnjbvr requested a review from Hywan February 24, 2025 11:53
Member

@Hywan Hywan left a comment


I think we are good now!

@@ -1335,7 +1331,7 @@ async fn test_no_gap_stored_after_deduplicated_backpagination() {
     let pagination = room_event_cache.pagination();
 
     let outcome = pagination.run_backwards_once(20).await.unwrap();
-    assert!(outcome.reached_start);
+    assert!(outcome.reached_start.not());
Member


Spotted!

Member Author

@bnjbvr bnjbvr Feb 24, 2025


Tweaked the comment above, thanks!

…tween network and disk

It could be that we have a mismatch between network and disk after
running a back-pagination:

- the network indicates the start of the timeline, i.e. there's no
  previous-batch token,
- but in the persisted storage, we do have an initial empty events chunk.

Because of this, we could have weird transitions from "I've reached the
start of the room" to "I haven't actually reached it" when calling the
`run_backwards()` method manually.

This patch rewrites the logic for returning `reached_start`, so that
it's more precise:

- when reloading an events chunk from disk, rely on the previous-chunk
  property to indicate whether we've reached the start of the timeline,
  thus avoiding unnecessary back-paginations;
- after resolving a gap via the network, override the result of
  `reached_start` with a boolean indicating that 1. there are no more
  gaps, and 2. there's no previous chunk (actual or lazily-loaded).

In the future, we should consider NOT having empty events chunks, if we
can.
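
A hedged sketch of the decision logic described in this commit message, using simplified stand-in types rather than the SDK's real ones:

```rust
// Hedged sketch, not the SDK's actual implementation.
struct ChunkState {
    /// Is there a previous chunk, either already loaded or still lazily
    /// loadable from the store?
    has_previous_chunk: bool,
    /// Are there remaining gaps (previous-batch tokens) left to resolve?
    has_remaining_gaps: bool,
}

/// Case 1: an events chunk was reloaded from disk; the previous-chunk link
/// alone tells us whether the start of the timeline was reached.
fn reached_start_after_reload(state: &ChunkState) -> bool {
    !state.has_previous_chunk
}

/// Case 2: a gap was resolved via the network; the network's answer is
/// overridden by the local chunk state.
fn reached_start_after_gap_resolution(state: &ChunkState) -> bool {
    !state.has_remaining_gaps && !state.has_previous_chunk
}
```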
@bnjbvr bnjbvr force-pushed the bnjbvr/unload-chunk branch from 2adce44 to df6108c on February 24, 2025 13:32
@bnjbvr bnjbvr enabled auto-merge (rebase) February 24, 2025 13:33
@bnjbvr bnjbvr merged commit f3f37a3 into main Feb 24, 2025
41 checks passed
@bnjbvr bnjbvr deleted the bnjbvr/unload-chunk branch February 24, 2025 13:47
Successfully merging this pull request may close these issues.

fix(sdk): Timeline can still have missing events