Performance regression after salsa migration #19404

Closed
chesleypappas opened this issue Mar 20, 2025 · 28 comments · Fixed by #19474
Labels
A-perf performance issues C-bug Category: bug

Comments

@chesleypappas

rust-analyzer version: 0.3.2345-standalone (b0632f7 2025-03-16)

rustc version: rustc 1.85.0 (4d91de4e4 2025-02-17)

editor or extension: VSCode version 1.98, rust-analyzer extension version 0.3.2345

relevant settings: Mostly N/A. I tried disabling cache priming as mentioned in #19402, and while initial startup was faster, the first request becomes so slow that overall performance seems about the same.

repository link (if public, optional): Private

code snippet to reproduce: Monorepo containing one Rust workspace with ~100 member crates

See profile output with the regression:
ra_profile_0_3_2345.txt

For comparison, here are the profile logs after downgrading back to 0.3.2337:
ra_profile_0_3_2337.txt

TL;DR: the new version spends 112674ms in parallel_prime_caches where the old version spent 25970ms.

Additionally, I noticed that the new version seems to randomly hang indefinitely at 100% CPU usage after a few minutes of use, at which point I have to restart it (which then needs to re-index). I haven't included the logs for this, because they are extremely large (the new version's log file is ~750MB using RA_LOG=info) and contain a lot of file names/paths/etc. that I would need to redact before I could share them here.

@chesleypappas chesleypappas added the C-bug Category: bug label Mar 20, 2025
@davidbarsky
Contributor

davidbarsky commented Mar 20, 2025

Thanks for the report and the RA_PROFILE logs. A 4x regression is, ah, suboptimal.

I tried disabling cache priming as mentioned in #19402, and while initial startup was faster, the first request becomes so slow that overall performance seems about the same.

Yeah, it sometimes helps, other times it doesn't: it depends largely on your crate graph and editor. To expand on the "editor" part, editor differences can be attributed to the usage of semantic token highlighting. Providing semantic token highlighting is largely the same work that cache priming ("indexing") does. VS Code supports semantic tokens, Zed does not, so in VS Code the first request will be blocked on the semantic tokens request completing.

Looking at a subsection (see below) of the profile with new Salsa, I think it's interesting that rust-analyzer is spending 98 seconds resolving imports. I don't know what's going on there, but perhaps you can share more details along the lines of:

  1. How do you use `use`?
  2. Do you have a lot of macros, or use any particularly complicated macros like cxx or sqlx? (I'm noticing crate_def_map_query calls waiting on a macro expansion.)
  3. Does your workspace crate graph narrow dramatically because one crate consumes a lot of other dependent crates?
```
98262ms resolve_imports loop (27 calls)
  40ms record_resolved_import (8622 calls)
    13ms PerNs::filter_visibility (64690 calls)
  98177ms resolve_import @ import_path = (76022 calls)
    49ms PerNs::filter_visibility (222119 calls)
    90635ms crate_def_map_query @ name = "" (6 calls)
      90631ms DefCollector::collect (6 calls)
        90631ms DefCollector::resolution_loop (6 calls)
          90631ms resolve_macros loop (6 calls)
            19424ms file_item_tree_query @ file_id = MacroFile(Id(a8007)) (1129 calls)
              19320ms parse_macro_expansion (1129 calls)
                23ms TopEntryPoint::parse @ self = MacroItems (1129 calls)
                19227ms macro_expand (1129 calls)
            62ms parse @ file_id = EditionedFileId { editioned_file_id: EditionedFileId(FileId(777), Edition2021) } (94 calls)
              62ms SourceFile::parse (94 calls)
                62ms parse_text (94 calls)
                  12ms LexedStr::new (94 calls)
                  14ms TopEntryPoint::parse @ self = SourceFile (94 calls)
                  32ms build_tree (94 calls)
71175ms resolve_imports loop (49 calls)
  58ms record_resolved_import (5167 calls)
    21ms PerNs::filter_visibility (90810 calls)
  71100ms resolve_import @ import_path = (27761 calls)
    12ms PerNs::filter_visibility (53399 calls)
    69323ms crate_def_map_query @ name = "" (9 calls)
      69321ms DefCollector::collect (9 calls)
        69321ms DefCollector::resolution_loop (9 calls)
          69321ms resolve_macros loop (9 calls)
            9883ms file_item_tree_query @ file_id = MacroFile(Id(a8006)) (101 calls)
              9866ms parse_macro_expansion (101 calls)
                9850ms macro_expand (101 calls)
            19ms parse @ file_id = EditionedFileId { editioned_file_id: EditionedFileId(FileId(3747), Edition2021) } (28 calls)
              19ms SourceFile::parse (28 calls)
                19ms parse_text (28 calls)
                  10ms build_tree (28 calls)
59433ms resolve_imports loop (31 calls)
  24ms record_resolved_import (1763 calls)
  59406ms resolve_import @ import_path = (3530 calls)
    47747ms crate_def_map_query @ name = "" (4 calls)
      47746ms DefCollector::collect (4 calls)
        47746ms DefCollector::resolution_loop (4 calls)
          47746ms resolve_macros loop (4 calls)
            15304ms file_item_tree_query @ file_id = MacroFile(Id(a8000)) (797 calls)
              15239ms parse_macro_expansion (797 calls)
                18ms TopEntryPoint::parse @ self = MacroItems (797 calls)
                15165ms macro_expand (797 calls)
            10ms parse @ file_id = EditionedFileId { editioned_file_id: EditionedFileId(FileId(3852), Edition2021) } (52 calls)
              10ms SourceFile::parse (52 calls)
                10ms parse_text (52 calls)
```

@ChayimFriedman2
Contributor

Why would new Salsa change anything with imports resolution?

@davidbarsky
Contributor

Why would new Salsa change anything with imports resolution?

No clue!

@chesleypappas
Author

  1. How do you use `use`?

Hm, to give you a rough idea, I see ~3000 instances of `use crate::...`, ~1000 instances of `use super::...` (but only ~200 of those are not `use super::*;`, which I suspect are in test mods, if that matters), and ~10,000 instances of `use <other_deps>::...`. Does that answer your question?

  2. Do you have a lot of macros, or use any particularly complicated macros like cxx or sqlx? (I'm noticing crate_def_map_query calls waiting on a macro expansion.)

We do use a few of our own procedural macros pretty heavily, but neither of the two you mentioned. I am not sure I am familiar enough to say how they compare complexity-wise to those you mentioned, but I'd guess they are of moderate complexity.

  3. Does your workspace crate graph narrow dramatically because one crate consumes a lot of other dependent crates?

Yes. This is essentially a client-side UI app so there is one "entry-point" crate and that has ~50 direct dependencies on others in the workspace (and transitive dependencies on everything else).

@ChayimFriedman2
Contributor

@chesleypappas Can you give us the logs with RA_LOG=hir_def::nameres=debug? This should give us more insight into the name resolution process.

@chesleypappas
Author

chesleypappas commented Mar 21, 2025

I generated the logs, but they are 550MB and contain data that I would need to redact before I could share them, and given the size, I'm afraid that would be a time-consuming process. I got in touch with David on Discord and he asked me to build a special branch of rust-analyzer with additional analysis-stats, but the results didn't reveal anything conclusive.

After that, since I had it pulled down and building locally anyway, I figured I would play around with adding some deeper tracing to find the bottleneck against my codebase, since I cannot share it. So far, it seems to be related to the expansion of proc macros. In particular, I have traced it down to this lock acquisition and the subsequent request two lines below it (with separate spans), and that appears to be where the bottleneck lies in our case. Here is an excerpt from the profile log of a single module crate from our codebase:

   11168ms crate_def_map_query @ name = "common_transform_types" 
     11083ms DefCollector::collect
       11083ms DefCollector::resolution_loop
         11083ms resolve_macros loop
           10895ms file_item_tree_query @ file_id = MacroFile(Id(5ab21))  (1810 calls)
             10637ms parse_macro_expansion (1810 calls)
                 165ms TopEntryPoint::parse @ self = MacroItems  (1810 calls)
               10235ms macro_expand (1810 calls)
                   101ms built_in_derive_expand (780 calls)
                      20ms TopEntryPoint::parse @ self = MacroItems  (780 calls)
                 10086ms proc_macro_expand (1021 calls)
                    2843ms ProcMacroServer::send_task @ macro = rstest cargo_pkg_name = "common-transform-types"
                      1330ms send_task_aquire_lock
                      1513ms send_task_send_request
                      38ms ProcMacroServer::send_task @ macro = rstest cargo_pkg_name = "common-transform-types"
                        34ms send_task_aquire_lock
                      14ms ProcMacroServer::send_task @ macro = NetworkParsed cargo_pkg_name = "common-transform-types"
                        13ms send_task_aquire_lock
                      14ms ProcMacroServer::send_task @ macro = AsyncDeserialize cargo_pkg_name = "common-transform-types"
                        14ms send_task_aquire_lock
                      27ms ProcMacroServer::send_task @ macro = derive_test_only cargo_pkg_name = "common-transform-types"
                        26ms send_task_aquire_lock
                      14ms ProcMacroServer::send_task @ macro = NetworkParsed cargo_pkg_name = "common-transform-types"
                        13ms send_task_aquire_lock
                     164ms ProcMacroServer::send_task @ macro = AsyncDeserialize cargo_pkg_name = "common-transform-types"
                       163ms send_task_aquire_lock
                    1546ms ProcMacroServer::send_task @ macro = Serialize cargo_pkg_name = "common-transform-types"
                      1545ms send_task_aquire_lock
                      17ms ProcMacroServer::send_task @ macro = derive_test_only cargo_pkg_name = "common-transform-types"
                        16ms send_task_aquire_lock
                      20ms ProcMacroServer::send_task @ macro = NetworkParsed cargo_pkg_name = "common-transform-types"
                        19ms send_task_aquire_lock
                      22ms ProcMacroServer::send_task @ macro = derive_test_only cargo_pkg_name = "common-transform-types"
                        22ms send_task_aquire_lock
                      22ms ProcMacroServer::send_task @ macro = derive_test_only cargo_pkg_name = "common-transform-types"
                        21ms send_task_aquire_lock
                      28ms ProcMacroServer::send_task @ macro = AsyncDeserialize cargo_pkg_name = "common-transform-types"
                        27ms send_task_aquire_lock
                    1452ms ProcMacroServer::send_task @ macro = AsyncDeserialize cargo_pkg_name = "common-transform-types"
                      1450ms send_task_aquire_lock
                    2913ms ProcMacroServer::send_task @ macro = derive_test_only cargo_pkg_name = "common-transform-types"
                      2912ms send_task_aquire_lock
                      68ms ProcMacroServer::send_task @ macro = NetworkParsed cargo_pkg_name = "common-transform-types"
                        68ms send_task_aquire_lock
              78ms resolve_imports loop (9 calls)
                18ms record_resolved_import (205 calls)
                52ms resolve_import @ import_path = cfg_test_attr_derive::derive_test_only  (1477 calls)
                  11ms PerNs::filter_visibility (2351 calls)

Note that this was taken using RA_PROFILE=*>10, so the missing send_task_send_request spans were less than 10ms. (Also note: This was from a debug build, so the precise durations are likely inflated, but I would assume that the overall conclusion remains valid)

Based on this, I suspect that the single proc-macro expansion server process has become a bottleneck for some reason, although it doesn't make much sense to me what could have changed to cause this, since it seems quite far removed from the recent Salsa migration.
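
For reference, here is roughly the shape of the instrumentation I added for the send_task_aquire_lock / send_task_send_request spans above, as a minimal sketch using the tracing crate. The ProcMacroServer / ServerProcess types below are simplified placeholders rather than rust-analyzer's actual internals; the point is just to time the lock wait and the request/response round-trip separately:

```rust
use std::sync::Mutex;
use tracing::info_span;

// Simplified placeholders, not rust-analyzer's real types.
struct ServerProcess;

impl ServerProcess {
    fn send_request(&mut self, _req: &str) -> String {
        // Stand-in for writing the request and reading the response.
        String::new()
    }
}

struct ProcMacroServer {
    process: Mutex<ServerProcess>,
}

impl ProcMacroServer {
    fn send_task(&self, req: &str) -> String {
        // Span around waiting for the shared server handle.
        let mut process = {
            let _span = info_span!("send_task_aquire_lock").entered();
            self.process.lock().unwrap()
        };
        // Separate span around the actual request/response round-trip.
        let _span = info_span!("send_task_send_request").entered();
        process.send_request(req)
    }
}

fn main() {
    let server = ProcMacroServer { process: Mutex::new(ServerProcess) };
    let _response = server.send_task("expand rstest");
}
```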

@Veykril Veykril added the A-perf performance issues label Mar 22, 2025
@Veykril
Member

Veykril commented Mar 22, 2025

That sounds very confusing; as you pointed out, the salsa migration can't really have much to do with the lock contention here 😕

@davidbarsky
Contributor

For what it's worth, you can build rust-analyzer with debug symbols by running `cargo xtask install --server --dev-rel`. No need for this to be an either/or.

@chesleypappas
Author

Got a chance to dig into this slightly deeper yesterday. I have been drilling down into the latency of expanding a specific rstest proc macro in our codebase. In the pre-salsa version, the ProcMacroServer takes ~100ms to process the request, but in the post-salsa version it takes ~1000-1500ms. I enabled debug logs to capture the actual request payload sent in each version and found that they are almost identical, with the exception of the span_data_table field. I haven't had time to dig into what that content actually means, but the post-salsa value being sent is somewhat longer than the pre-salsa one; presumably this is related to why the ProcMacroServer is taking longer to expand it. I am also fairly confident that the ProcMacroServer did not change in any meaningful way between versions, but even if it did, I don't think that running the extension in debug through the VSCode launch configuration does anything to swap out the ProcMacroServer.

I'll keep digging and see if I can understand how the span_data_table parameter is constructed, why it changed, and why it makes the expansion slower.

@Veykril
Member

Veykril commented Mar 26, 2025

The only difference in the span_data_table, I imagine, will be that the encoded syntax context IDs are no longer linear, but instead kind of random due to the way IDs changed. So I don't think that's it. Is it taking longer on the proc-macro server side or on the rust-analyzer client side? Is it taking longer in the de/serialization logic, or is more stuff maybe blocking on the proc-macro server mutex on the client side?

It would be fairly confusing if the lock is more contended now, given we haven't actually changed logic relevant to that anywhere.

@chesleypappas
Author

the encoded syntax context IDs are no longer linear, but instead kind of random due to the way IDs changed

Interesting. Could the IDs being random potentially have impacts on the data locality/cache hit rates, or am I off base? Also, I am pretty sure there are more context IDs being passed now, not just "different" IDs.

Is it taking longer on the proc-macro server side or on the rust-analyzer client side.

I hacked together a span to measure a "time till first byte" of the response from the proc-macro server, and it is definitely taking longer to receive the response, not deserialize it.

is more stuff maybe blocking on the proc-macro server mutex on the client side?

Yes, there are inflated latencies to acquire a lock on the mutex, but the data suggests that is due to the higher latency in receiving the response from the proc-macro server itself (i.e. it is a symptom, not the root cause). The overall number of requests to the proc-macro server is approximately the same, and it isn't just the lock acquisition that is taking longer, it is the actual requests after the lock is acquired.

It would be fairly confusing if the lock is more contended now, given we haven't actually changed logic relevant to that anywhere.

I agree, that is why I started digging into the duration of the requests themselves.
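
For clarity, the "time till first byte" measurement above is just a timer around the first read of the response stream. A rough sketch of the idea, assuming a plain Read-based transport and ignoring the real protocol's framing and message types:

```rust
use std::io::{BufRead, BufReader, Read};
use std::time::{Duration, Instant};

/// Measure how long the first byte of the response takes to arrive, then read
/// the rest of one (line-delimited) message. Sketch only: the real proc-macro
/// protocol has its own framing.
fn read_response_with_ttfb(reader: impl Read) -> std::io::Result<(String, Duration)> {
    let mut reader = BufReader::new(reader);
    let start = Instant::now();

    let mut first = [0u8; 1];
    reader.read_exact(&mut first)?; // blocks until the server starts replying
    let ttfb = start.elapsed();

    let mut rest = String::new();
    reader.read_line(&mut rest)?; // remainder of the message
    let mut response = String::from_utf8_lossy(&first).into_owned();
    response.push_str(&rest);
    Ok((response, ttfb))
}

fn main() -> std::io::Result<()> {
    // Example against an in-memory "response"; in practice this would wrap the
    // proc-macro server's stdout.
    let (response, ttfb) = read_response_with_ttfb(&b"{\"ok\":true}\n"[..])?;
    eprintln!("time till first byte: {ttfb:?}, response: {response}");
    Ok(())
}
```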

@davidbarsky
Contributor

It would be fairly confusing if the lock is more contended now, given we haven't actually changed logic relevant to that anywhere.

It'd be kinda wild/weird if the removal of the Salsa feature in the span crate (and therefore, the non-Salsa fallback) caused this slowdown. I am 99% sure that this isn't the cause, but... I'll confirm by re-introducing that feature and fallback.

@Veykril
Member

Veykril commented Mar 26, 2025

Interesting. Could the IDs being random potentially have impacts on the data locality/cache hit rates, or am I off base?

No, semantically those IDs are not used at all within the proc-macro server nor within de/serialization. They only exist to be moved around within the expansions.

Also, I am pretty sure there are more context IDs being passed now, not just "different" IDs.

That would be surprising and potentially indicative of a bug. Or are you maybe observing more expansions? (this would sound like an incrementality issue)

I hacked together a span to measure a "time till first byte" of the response from the proc-macro server, and it is definitely taking longer to receive the response, not deserialize it.

Just to be sure, the tests were done against the same Rust toolchain, right? (Just to be safe that you are not accidentally measuring a different proc-macro server instead.)

The overall number of requests to the proc-macro server is approximately the same, and it isn't just the lock acquisition that is taking longer, it is the actual requests after the lock is acquired.

This is very weird. It implies that either serializing is taking longer (which can't be related to salsa), the proc-macro server is slower (which can't be related to salsa, as it shouldn't even be built against the salsa branch yet, nor should it depend on salsa), or deserializing takes longer on the client side, which again can't be salsa related 😕

@Veykril
Member

Veykril commented Mar 26, 2025

It'd be kinda wild/weird if the removal of the Salsa feature in the span crate (and therefore, the non-Salsa fallback) caused this slowdown. I am 99% sure that this isn't the cause, but... I'll confirm by re-introducing that feature and fallback.

I doubt that; the feature was a fairly recent addition in the first place, and the entirety of salsa should be dead code in the proc-macro server anyway.

@ChayimFriedman2
Contributor

Are you sure it was the Salsa PR? I suspect it was #19337, which did touch the scheduling of proc macros (although I can't tell why it would cause this).

@chesleypappas
Author

I was able to add tracing inside the proc-macro-server itself, and for identical macro expansions using identical proc-macro-server builds, the "pre-salsa" request took ~100ms and the "post-salsa" request took ~2400ms. I haven't confirmed the specific commit that caused the regression, but it occurred between the 3/10 release and the 3/17 release, which was mostly the salsa migration if I understand correctly.

Attached are the request payloads for the two requests.

slow_request.txt

fast_request.txt

@chesleypappas
Author

Hm. Interestingly, I could have sworn that the last time I looked there were more span_data_table indices in the slow request... Perhaps I was mistaken. In any case, I will try to find time today to add deeper tracing to the proc-macro-server to see where the additional latency is happening.

@ChayimFriedman2
Contributor

The PR I referred to also landed in that timeframe, and it is a more likely candidate.

@ChayimFriedman2
Contributor

If you can bisect the commit that caused this, that will be very helpful.

@chesleypappas
Author

Sure, I will try to do that today. In the meantime: yesterday I added deeper tracing and got all the way down to where the Rust-internal proc-macro expansion logic is called here, and determined that basically all of the performance regression is happening inside that call (despite the inputs being nearly identical). I even logged the parsed_attributes and parsed_body, and they are also effectively identical as near as I can tell.

I will work on confirming the precise commit where the regression occurs, but I wonder if perhaps this is a side-effect of incrementality somehow? (grain of salt: I have not read through any of the PRs/implementation details and am just reasoning based on the word "incrementality") Like perhaps the rstest proc-macro is slow to expand the first time it is expanded within an incremental "unit" (not sure if "compilation unit" is the correct term), but since we have a large number of crates, that is a large number of "cold" expansions.

Anyway, I will keep you all posted on my findings regarding the offending commit.

@Veykril
Member

Veykril commented Mar 28, 2025

If the slowdown is happening inside the proc-macro server, then that sounds to me like you are comparing two different proc-macro servers right now. Swapping out rust-analyzer cannot have a performance impact on the proc-macro server itself if the inputs are the same. And if the only difference in the inputs is the syntax context numbers (not their amount, just the IDs themselves), then we can say the inputs are equal in this regard (the proc-macro server does not currently inspect the IDs).

@chesleypappas
Author

If the slowdown is happening inside the proc-macro server, then that sounds to me like you are comparing two different proc-macro servers right now.

I was definitely using the same proc-macro server, because it was the one that I had built myself with the additional tracing. The only thing I changed between runs was the version of the rust-analyzer extension in VSCode, and the tracing was still being emitted so it was still using my build.

@chesleypappas
Author

I have also identified the commit where the performance regression occurs: 7edfeb9

I haven't looked through it in detail, but from glancing at it, it seems wholly unrelated...

@chesleypappas
Author

chesleypappas commented Mar 28, 2025

Oh wow, I think I have isolated it to the specific lines that changed here. On a whim, I reverted that specific change back to:

Some(if pkg.is_member {
    cargo.workspace_root().to_path_buf()
} else {
    pkg.manifest.parent().to_path_buf()
})

And the performance returned to normal. It now occurs to me that the only other difference between the fast/slow request payloads above was the presence/absence of a current_dir field. I didn't think anything of it at the time, because I assumed that if the directory were important, it would cause an error, not a performance regression.

@Veykril, I don't understand the context of the change, but could it be as simple as changing that back to pass the directory?

@Veykril
Member

Veykril commented Mar 28, 2025

Ugh, yeah, that change is actually plain wrong! I mixed up how this is used: I thought the working directory is looked up for the proc-macro definition crate, but it's looked up for the calling crate!

Very odd that this has a performance impact for you, though; if anything, this should cause some macros to just not work.

@Veykril
Member

Veykril commented Mar 29, 2025

#19474 will fix this, and it should also prevent us from messing up the cwd here again. Sorry for the time loss, and thanks a lot for debugging this!

@Veykril
Member

Veykril commented Mar 29, 2025

Though I am really curious how this manifested as a performance regression for you 😅

@chesleypappas
Author

#19474 will fix this, and it should also prevent us from messing up the cwd here again. Sorry for the time loss, and thanks a lot for debugging this!

@Veykril No problem, it was an interesting side-project! Thanks for the guidance and the quick fix!

Though I am really curious how this manifested as a performance regression for you 😅

Yeah, my thoughts exactly! If I had to guess, I suspect it may have to do with how rust-analyzer interacts with Amazon's internal build tooling (which I definitely don't fully understand). I may poke around to see what the default cwd of the proc-macro-server is in relation to the rest of my workspace; perhaps that will give me a clue...
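
For example, something as simple as logging std::env::current_dir() at proc-macro-server startup should show what it defaults to. A trivial sketch (where exactly this would live in the real server is an assumption on my part):

```rust
fn main() {
    // Log the working directory the proc-macro server was spawned with, to
    // compare it against the workspace root / the calling crate's directory.
    match std::env::current_dir() {
        Ok(cwd) => eprintln!("proc-macro server cwd: {}", cwd.display()),
        Err(err) => eprintln!("could not determine cwd: {err}"),
    }
}
```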
