Performance regression after salsa migration #19404
Thanks for the report and the profiles.
Yeah, it sometimes helps, other times it doesn't: it depends largely on your crate graph and editor. To expand on the "editor" part, the differences can be attributed to the use of semantic token highlighting. Providing semantic token highlighting is largely the same work that cache priming ("indexing") does. VS Code supports semantic tokens, Zed does not, so in VS Code, resolving the first request will be blocked on the semantic tokens request completing. Looking at a subsection (see below) of the profile of new Salsa, I think it's interesting that rust-analyzer is spending 98 seconds resolving imports. I don't know what's going on there, but perhaps you can provide more details along the lines of:
```
98262ms resolve_imports loop (27 calls)
40ms record_resolved_import (8622 calls)
13ms PerNs::filter_visibility (64690 calls)
98177ms resolve_import @ import_path = (76022 calls)
49ms PerNs::filter_visibility (222119 calls)
90635ms crate_def_map_query @ name = "" (6 calls)
90631ms DefCollector::collect (6 calls)
90631ms DefCollector::resolution_loop (6 calls)
90631ms resolve_macros loop (6 calls)
19424ms file_item_tree_query @ file_id = MacroFile(Id(a8007)) (1129 calls)
19320ms parse_macro_expansion (1129 calls)
23ms TopEntryPoint::parse @ self = MacroItems (1129 calls)
19227ms macro_expand (1129 calls)
62ms parse @ file_id = EditionedFileId { editioned_file_id: EditionedFileId(FileId(777), Edition2021) } (94 calls)
62ms SourceFile::parse (94 calls)
62ms parse_text (94 calls)
12ms LexedStr::new (94 calls)
14ms TopEntryPoint::parse @ self = SourceFile (94 calls)
32ms build_tree (94 calls)
71175ms resolve_imports loop (49 calls)
58ms record_resolved_import (5167 calls)
21ms PerNs::filter_visibility (90810 calls)
71100ms resolve_import @ import_path = (27761 calls)
12ms PerNs::filter_visibility (53399 calls)
69323ms crate_def_map_query @ name = "" (9 calls)
69321ms DefCollector::collect (9 calls)
69321ms DefCollector::resolution_loop (9 calls)
69321ms resolve_macros loop (9 calls)
9883ms file_item_tree_query @ file_id = MacroFile(Id(a8006)) (101 calls)
9866ms parse_macro_expansion (101 calls)
9850ms macro_expand (101 calls)
19ms parse @ file_id = EditionedFileId { editioned_file_id: EditionedFileId(FileId(3747), Edition2021) } (28 calls)
19ms SourceFile::parse (28 calls)
19ms parse_text (28 calls)
10ms build_tree (28 calls)
59433ms resolve_imports loop (31 calls)
24ms record_resolved_import (1763 calls)
59406ms resolve_import @ import_path = (3530 calls)
47747ms crate_def_map_query @ name = "" (4 calls)
47746ms DefCollector::collect (4 calls)
47746ms DefCollector::resolution_loop (4 calls)
47746ms resolve_macros loop (4 calls)
15304ms file_item_tree_query @ file_id = MacroFile(Id(a8000)) (797 calls)
15239ms parse_macro_expansion (797 calls)
18ms TopEntryPoint::parse @ self = MacroItems (797 calls)
15165ms macro_expand (797 calls)
10ms parse @ file_id = EditionedFileId { editioned_file_id: EditionedFileId(FileId(3852), Edition2021) } (52 calls)
10ms SourceFile::parse (52 calls)
10ms parse_text (52 calls)
```
|
Why would new Salsa change anything with import resolution? |
No clue! |
Hm, to give you a rough idea, I see ~3000 instances of
We do use a few of our own procedural macros pretty heavily, but neither of the two you mentioned. I am not sure I am familiar enough to say how they compare complexity-wise to those you mentioned, but I'd guess they are of moderate complexity.
Yes. This is essentially a client-side UI app so there is one "entry-point" crate and that has ~50 direct dependencies on others in the workspace (and transitive dependencies on everything else). |
@chesleypappas Can you give us the logs with |
I generated the logs, but they are 550MB and contain data that I would need to redact before I could share them. Given the size, I am afraid that would be a time-consuming process. I got in touch with David on Discord and he requested I build a special branch of rust-analyzer with additional analysis-stats, but the results didn't reveal anything conclusive. After that, since I had it pulled down and building locally anyway, I figured I would play around with adding some deeper tracing to find the bottleneck against my codebase, since I cannot share it. So far, it seems to be related to the expansion of proc_macros. In particular, I have traced down to this lock acquisition and the subsequent request two lines below that (with separate spans), and it appears that is where the bottleneck lies in our case. Here is an excerpt from the profile log of a single
Note that this was taken using
Based on this, I suspect that the single proc-macro expansion server process has become a bottleneck for some reason, although it doesn't really make much sense to me what might have changed to cause this, since it seems quite far removed from the recent Salsa migration. |
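(As a rough illustration of the measurement described above: the sketch below times the wait on a shared server handle separately from the request made while holding the lock. The `ServerProcess`, `ExpandRequest`, and `ExpandResponse` types are hypothetical stand-ins, not rust-analyzer's actual types.)
```
use std::sync::Mutex;
use std::time::Instant;

// Hypothetical stand-ins for the real proc-macro server handle and message types.
struct ServerProcess;
struct ExpandRequest;
struct ExpandResponse;

impl ServerProcess {
    fn send(&mut self, _req: ExpandRequest) -> ExpandResponse {
        // The real implementation writes the request to the server process and
        // blocks until the reply arrives.
        ExpandResponse
    }
}

// Time the lock acquisition and the request separately, so contention on the
// shared handle can be distinguished from a slow server.
fn expand_with_timing(server: &Mutex<ServerProcess>, req: ExpandRequest) -> ExpandResponse {
    let before_lock = Instant::now();
    let mut guard = server.lock().unwrap();
    let lock_wait = before_lock.elapsed();

    let before_send = Instant::now();
    let resp = guard.send(req);
    let send_time = before_send.elapsed();

    eprintln!("lock wait: {lock_wait:?}, request round trip: {send_time:?}");
    resp
}

fn main() {
    let server = Mutex::new(ServerProcess);
    let _ = expand_with_timing(&server, ExpandRequest);
}
```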
That sounds very confusing, as you pointed out the salsa migration can't really have much to do with the locking contention here 😕 |
For what it's worth, you can build rust-analyzer with debug symbols by running |
Got a chance to dig into this slightly deeper yesterday. I have been drilling down to the latency of expanding a specific I'll keep digging and see if I can understand how the |
The only difference in the It would be fairly confusing if the lock is more contended now, given we haven't actually changed logic relevant to that anywhere. |
Interesting. Could the IDs being random potentially have an impact on data locality/cache hit rates, or am I off base? Also, I am pretty sure there are more context IDs being passed now, not just "different" IDs.
I hacked together a span to measure a "time till first byte" of the response from the proc-macro server, and it is definitely taking longer to receive the response, not deserialize it.
Yes, there are inflated latencies to acquire a lock on the mutex, but the data suggests that is due to the higher latency in receiving the response from the proc-macro server itself (i.e. it is a symptom, not the root cause). The overall number of requests to the proc-macro server is approximately the same, and it isn't just the lock acquisition that is taking longer, it is the actual requests after the lock is acquired.
I agree, that is why I started digging into the duration of the requests themselves. |
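(Similarly, a minimal sketch of the "time till first byte" idea mentioned above: read the first byte of the reply before touching the rest, so transport latency can be separated from deserialization. The line-delimited framing here is an assumption for illustration, not the actual proc-macro protocol.)
```
use std::io::{BufRead, BufReader, Read};
use std::time::{Duration, Instant};

// Measure how long the first byte of the reply takes to arrive, separately from
// reading (and later deserializing) the rest of the message.
fn read_reply_with_ttfb(stream: impl Read) -> (Duration, String) {
    let start = Instant::now();
    let mut reader = BufReader::new(stream);

    let mut first = [0u8; 1];
    reader.read_exact(&mut first).unwrap(); // blocks until the server starts replying
    let ttfb = start.elapsed();

    let mut rest = String::new();
    reader.read_line(&mut rest).unwrap(); // remainder of the (assumed) line-delimited reply
    let reply = format!("{}{}", first[0] as char, rest.trim_end());
    (ttfb, reply)
}

fn main() {
    // Stand-in for the server's stdout; a real measurement would read the child process pipe.
    let fake_reply = std::io::Cursor::new(b"{\"ok\":true}\n".to_vec());
    let (ttfb, reply) = read_reply_with_ttfb(fake_reply);
    println!("time till first byte: {ttfb:?}, reply: {reply}");
}
```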
It'd be kinda wild/weird if the removal of the Salsa feature in the span crate (and therefore, the non-Salsa fallback) caused this slowdown. I am 99% sure that this isn't the cause, but... I'll confirm by re-introducing that feature and fallback. |
No, semantically those IDs are not used at all within the proc-macro server nor within de/serialization. They only exist to be moved around within the expansions.
That would be surprising and potentially indicative of a bug. Or are you maybe observing more expansions? (this would sound like an incrementality issue)
Just to be sure, the tests are done against the same rust toolchain right? (just to be safe you are not accidentally measuring a different proc-macro server instead)
This is very weird, so that implies either serializing is taking longer (which can't be related to salsa), the proc-macro server is slower (which can't be related to salsa, as it shouldn't even be built against the salsa branch yet, nor should it depend on salsa), or deserializing takes longer on the client side, which again can't be salsa related 😕 |
I doubt that, the feature was a fairly recent addition in the first place, and the entirety of salsa should be dead code in the proc-macro server anyways. |
Are you sure it was the Salsa PR? I suspect it was #19337, which did touch the scheduling of proc macros (although I can't tell why it would cause this). |
I was able to add tracing inside the proc-macro-server itself and for identical macro expansions using identical proc-macro-server builds, the "pre-salsa" request took ~100ms and the "post-salsa" request took ~2400ms. I haven't confirmed the specific commit that caused the regression, but it occurred between the 3/10 release and the 3/17 release which was mostly the salsa migration if I understand correctly. Attached are the request payloads for the two requests. |
Hm. Interestingly, I could have sworn that the last time I looked there were more span_data_table indices in the slow request... Perhaps I was mistaken. In any case, I will try to find time today to add deeper tracing to the proc-macro-server to see where the additional latency is happening. |
The PR I referred to also landed in that timeframe, and it is a more likely candidate. |
If you can bisect the commit that caused this, that will be very helpful. |
Sure, I will try to do that today. In the meantime, yesterday I added deeper tracing and arrived all the way down to where the Rust-internal proc-macro expansion logic is being called here, and determined that basically all of the performance regression is happening inside that call (despite the inputs being nearly identical). I even logged the
I will work on confirming the precise commit where the regression occurs, but I wonder if perhaps this is a side-effect of incrementality somehow? (Grain of salt: I have not read through any of the PRs/implementation details and am just reasoning based on the word "incrementality".) Like perhaps the rstest proc-macro is slow to expand the first time it is expanded within an incremental "unit" (not sure if "compilation unit" is the correct term), but since we have a large number of crates, that is a large number of "cold" expansions. Anyway, I will keep you all posted on my findings regarding the offending commit. |
If the slowdown is happening inside the proc-macro server, then that sounds to me like you are comparing two different proc-macro servers right now. Swapping out rust-analyzer cannot have a performance impact on the proc-macro server itself if the inputs are the same. And if the only difference in the inputs is the syntax context numbers (not their amount, just the IDs themselves), then we can say the inputs are equal in this regard (the proc-macro server does not inspect the IDs currently). |
I was definitely using the same proc-macro server, because it was the one that I had built myself with the additional tracing. The only thing I changed between runs was the version of the rust-analyzer extension in VSCode, and the tracing was still being emitted so it was still using my build. |
I have also identified the commit where the performance regression occurs: 7edfeb9 I haven't looked through it in detail, but from glancing at it, it seems wholly unrelated... |
Oh wow, I think I have isolated the specific lines that changed to here. On a whim, I reverted that specific change back to

```
Some(if pkg.is_member {
    cargo.workspace_root().to_path_buf()
} else {
    pkg.manifest.parent().to_path_buf()
})
```

and the performance returned to normal. And now it occurs to me that the only other difference in the fast/slow request payloads above was the presence/absence of a
@Veykril, I don't understand the context of the change, but could it be as simple as changing that back to pass the directory? |
Ugh, yeah, that change is actually plain wrong! I mixed up how this is used: I thought the working directory is looked up for the proc-macro definition crate, but it's looked up in the calling crate! Very odd that this has a performance impact for you, though; if anything, this should cause some macros to just not work. |
#19474 will fix this, and it should also prevent us from messing up the cwd here again. Sorry for the time loss, and thanks a lot for debugging this! |
Though I am really curious how this manifested as a performance regression for you 😅 |
@Veykril No problem, it was an interesting side-project! Thanks for the guidance and the quick fix!
Yeah, my thoughts exactly! If I had to guess, I suspect it may have to do with how rust-analyzer interacts with Amazon's internal build tooling (which I definitely don't fully understand). I may poke around to see what the default cwd of the proc-macro-server is in relation to the rest of my workspace; perhaps that will give me a clue... |
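(For anyone wanting to do the same check, a trivial probe along these lines, hypothetical and not part of rust-analyzer, prints the working directory a server process actually ends up with:)
```
use std::env;

fn main() {
    // Print the process working directory so it can be compared against the
    // workspace root / package manifest directory the client intended to pass.
    match env::current_dir() {
        Ok(dir) => eprintln!("proc-macro server cwd: {}", dir.display()),
        Err(err) => eprintln!("could not determine cwd: {err}"),
    }
}
```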
rust-analyzer version: 0.3.2345-standalone (b0632f7 2025-03-16)
rustc version: rustc 1.85.0 (4d91de4e4 2025-02-17)
editor or extension: VSCode version 1.98, rust-analyzer extension version 0.3.2345
relevant settings: Mostly N/A. I tried disabling cache priming as mentioned in #19402, and while initial startup was faster, the first request became so slow that overall performance seemed about the same.
repository link (if public, optional): Private
code snippet to reproduce: Monorepo containing one Rust workspace with ~100 member crates
See profile output with the regression: ra_profile_0_3_2345.txt
For comparison, here are the profile logs after downgrading back to 0.3.2337: ra_profile_0_3_2337.txt
TL;DR: The new version takes 112674ms in `parallel_prime_caches` where the old version took 25970ms in `parallel_prime_caches`.
Additionally, I noticed that the new version seems to randomly hang indefinitely at 100% CPU usage after using it for a few minutes (then I need to restart, which then needs to re-index). I haven't included the logs for this, because they are extremely large and contain a lot of files/paths/etc. (the new-version log file is ~750MB using `RA_LOG=info`) that I would need to redact before I could share them here.