Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attribution for "insert_backedges" invalidations #41913

Merged
merged 1 commit into from
Aug 18, 2021

Conversation

timholy
Copy link
Member

@timholy timholy commented Aug 18, 2021

Invalidations can be immediate (when an existing MethodInstance gets
invalidated by a new method definition) or delayed. The latter occurs
during deserialization: when a package was built, a particular call
dispatches to Method 1, but when loaded (due to different loading
orders and dependencies) it should instead dispatch to Method 2. These
delayed invalidations are not particularly common, and perhaps because
of this SnoopCompile has never supported them well: they have merely
been dumped to the console during invalidation_tree
construction. However, in larger software stacks they seem to become
more common, and can dramatically affect precompilation success.

This simple PR identifies "causes" for such delayed invalidations,
allowing SnoopCompile to determine (in most cases) the particular
definition or deletion that triggered the change in dispatch.

Using this in conjunction with local changes to SnoopCompile (to be submitted once I get a Base.VERSION number for this change), here's a demonstration on OrdinaryDiffEq (as a screenshot to preserve color):

image

That corresponds to this code block, which until this change had never drawn my attention but which turns out to contribute 19 separate invalidations, including some expensive-to-compile methods like OrdinaryDiffEq.perform_step!, seemingly dependent (in ways not yet fully elucidated) on package build & loading order.

CC @chriselrod, @ChrisRackauckas.

Invalidations can be immediate (when an existing MethodInstance gets
invalidated by a new method definition) or delayed. The latter occurs
during deserialization: when a package was built, a particular call
dispatches to Method 1, but when loaded (due to different loading
orders and dependencies) it should instead dispatch to Method 2. These
delayed invalidations are not particularly common, and perhaps because
of this SnoopCompile has never supported them well: they have merely
been dumped to the console during `invalidation_tree`
construction. However, in larger software stacks they seem to become
more common, and can dramatically affect precompilation success.

This simple PR identifies "causes" for such delayed invalidations,
allowing SnoopCompile to determine (in most cases) the particular
definition or deletion that triggered the change in dispatch.
@timholy timholy force-pushed the teh/insert_backedges_callee branch from 38e9b93 to 9708b66 Compare August 18, 2021 08:09
@ChrisRackauckas
Copy link
Member

Yeah, that doesn't look good... we'll need to chat about this one @chriselrod haha. Could that instead be static_num_threads()?

@vtjnash vtjnash merged commit 2cb8e17 into master Aug 18, 2021
@vtjnash vtjnash deleted the teh/insert_backedges_callee branch August 18, 2021 15:36
@timholy
Copy link
Member Author

timholy commented Aug 18, 2021

Thanks for the review, @vtjnash

@chriselrod
Copy link
Contributor

chriselrod commented Aug 18, 2021

That corresponds to this code block, which until this change had never drawn my attention

Yeah, that doesn't look good... we'll need to chat about this one @chriselrod haha. Could that instead be static_num_threads()?

Yeah, was kind of hoping people wouldn't pay too much attention to that.

Roughly speaking, num_threads() is defined to return Static.static($(Sys.CPU_THREADS)), so that you can specialize on this as a compile time constant.
In the __init__(), it checks whether this is actually equal to Threads.nthreads(). If not, it will @eval to update the definition.
Thus this invalidation should not be occurring if you start Julia with -tauto, for example.

CPUSummary.jl and HostCPUFeatures.jl do this for a lot of attributes, but only the number of threads is likely to be wrong if you precompiled it on the same system that is running it.
Many other features are likely to be wrong if you create a system image on one system and move it to another, or access Julia through a shared filesystem where different nodes have different architectures.

Also, if you don't want num_threads() to return a compile-time constant static int, just use Threads.nthreads().

@timholy
Copy link
Member Author

timholy commented Aug 18, 2021

This seems like a case where we want different precompile caches depending on the number of threads. Or perhaps LV should compile the method for multiple threading options.

EDIT: for the record, I don't think I was changing my number of threads. I'll have a SnoopCompile branch up shortly that might make it easier to investigate. Maybe the parallel precompilation is a factor?

timholy added a commit to timholy/SnoopCompile.jl that referenced this pull request Aug 18, 2021
These were printed to the console but not otherwise stored.
Storing them allows one to give them more emphasis and perform
analysis. With JuliaLang/julia#41913,
it becomes possible to attribute them to specific method
definitions or deletions. (Of course there might be multiple
reasons, but we need to identify at least one in order to make
progress.)
timholy added a commit to timholy/SnoopCompile.jl that referenced this pull request Aug 18, 2021
These were printed to the console but not otherwise stored.
Storing them allows one to give them more emphasis and perform
analysis. With JuliaLang/julia#41913,
it becomes possible to attribute them to specific method
definitions or deletions. (Of course there might be multiple
reasons, but we need to identify at least one in order to make
progress.)
@chriselrod
Copy link
Contributor

This seems like a case where we want different precompile caches depending on the number of threads. Or perhaps LV should compile the method for multiple threading options.

EDIT: for the record, I don't think I was changing my number of threads. I'll have a SnoopCompile branch up shortly that might make it easier to investigate. Maybe the parallel precompilation is a factor?

When precompiling, it should be set to Sys.CPU_THREADS (assuming it agrees with what Hwloc.jl reports) without checking the value of Threads.nthreads().
It is only on __init__ that it checks the value of Threads.nthreads(). So as long as Threads.nthreads() >= Sys.CPU_THREADS, it should not be invalidating anything.
Were you using this many threads when starting Julia?

I will switch Polyester away from num_threads(). Options include:

  1. Threads.nthreads()- the disadvantage here is that Threads.nthreads() can be larger than Sys.CPU_THREADS. ThreadingUtilities does not support this at the moment and will need an adjustment. Polyester's static scheduling will get worse performance with that many threads; variable workloads that could benefit are not a good use case for Polyester anyway, so maybe I could just let people segfault if they try it with Polyester.
  2. min(Threads.nthreads(), Int(sys_threads())) - sys_threads() shouldn't be invalidated often, but I suspect it will be on JuliaHub, making this a bit of a non-starter.
  3. min(Threads.nthreads(), Sys.CPU_THREADS::Int) - Type assert because Sys.CPU_THREADS is a non-const global. Still, overhead should be small enough that this is my favorite option of the three.

@timholy
Copy link
Member Author

timholy commented Aug 18, 2021

I propose we move the rest of the discussion to JuliaSIMD/CPUSummary.jl#3

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
Invalidations can be immediate (when an existing MethodInstance gets
invalidated by a new method definition) or delayed. The latter occurs
during deserialization: when a package was built, a particular call
dispatches to Method 1, but when loaded (due to different loading
orders and dependencies) it should instead dispatch to Method 2. These
delayed invalidations are not particularly common, and perhaps because
of this SnoopCompile has never supported them well: they have merely
been dumped to the console during `invalidation_tree`
construction. However, in larger software stacks they seem to become
more common, and can dramatically affect precompilation success.

This simple PR identifies "causes" for such delayed invalidations,
allowing SnoopCompile to determine (in most cases) the particular
definition or deletion that triggered the change in dispatch.
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
Invalidations can be immediate (when an existing MethodInstance gets
invalidated by a new method definition) or delayed. The latter occurs
during deserialization: when a package was built, a particular call
dispatches to Method 1, but when loaded (due to different loading
orders and dependencies) it should instead dispatch to Method 2. These
delayed invalidations are not particularly common, and perhaps because
of this SnoopCompile has never supported them well: they have merely
been dumped to the console during `invalidation_tree`
construction. However, in larger software stacks they seem to become
more common, and can dramatically affect precompilation success.

This simple PR identifies "causes" for such delayed invalidations,
allowing SnoopCompile to determine (in most cases) the particular
definition or deletion that triggered the change in dispatch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
latency Latency
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants