Invalidations #3

Open
timholy opened this issue Aug 18, 2021 · 20 comments

@timholy

timholy commented Aug 18, 2021

Continuing from JuliaLang/julia#41913:

It's not just Polyester:

 deleting num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42 invalidated:
   mt_backedges:  1: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for Polyester.worker_size() (1 children)
                  2: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for Polyester._batch_no_reserve(::Polyester.var"#11#12", ::UInt16, ::UInt32, ::UInt16, ::UInt64, ::UInt64, ::UInt64, ::Static.StaticInt{1}, ::Static.StaticInt{1}) (1 children)
                  3: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.div_dispatch!(::Matrix{Float64}, ::Matrix{Float64}, ::Matrix{Float64}, ::Val{true}, ::Val{true}) (1 children)
                  4: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.nmuladd!(::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::Int64, ::Int64, ::Int64) (1 children)
                  5: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.multithread_rdiv!(::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::Int64, ::Int64, ::Int64, ::Val{false}, ::Static.StaticInt{2}) (1 children)
                  6: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.div_dispatch!(::LinearAlgebra.Transpose{Float64, Matrix{Float64}}, ::LinearAlgebra.Transpose{Float64, Matrix{Float64}}, ::LinearAlgebra.Transpose{Float64, Matrix{Float64}}, ::Val{true}, ::Val{true}) (1 children)
                  7: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.multithread_rdiv!(::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::Int64, ::Int64, ::Int64, ::Val{true}, ::Static.StaticInt{2}) (1 children)
                  8: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve._nthreads() (1 children)
                  9: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.div_dispatch!(::LinearAlgebra.Transpose{Float64, Matrix{Float64}}, ::LinearAlgebra.Transpose{Float64, Matrix{Float64}}, ::LinearAlgebra.Transpose{Float64, Matrix{Float64}}, ::Val{false}, ::Val{true}) (1 children)
                 10: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.multithread_rdiv!(::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::Int64, ::Int64, ::Int64, ::Val{false}, ::Static.StaticInt{1}) (1 children)
                 11: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.multithread_rdiv!(::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, ::Int64, ::Int64, ::Int64, ::Val{true}, ::Static.StaticInt{1}) (1 children)
                 12: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for TriangularSolve.div_dispatch!(::Matrix{Float64}, ::Matrix{Float64}, ::Matrix{Float64}, ::Val{false}, ::Val{true}) (1 children)
                 13: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for RecursiveFactorization.recurse!(::StrideArraysCore.PtrArray{Tuple{Int64, Int64}, (true, true), Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}, ::Val{true}, ::Int64, ::Int64, ::Int64, ::StrideArraysCore.PtrArray{Tuple{Int64}, (true,), Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, ::Int64, ::Int64) (1 children)
                 14: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for RecursiveFactorization.apply_permutation_threaded!(::StrideArraysCore.PtrArray{Tuple{Int64}, (true,), Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, ::StrideArraysCore.PtrArray{Tuple{Int64, Int64}, (true, false), Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}) (1 children)
                 15: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for RecursiveFactorization.apply_permutation_threaded!(::StrideArraysCore.PtrArray{Tuple{Int64}, (true,), Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, ::Any) (1 children)
                 16: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for RecursiveFactorization.apply_permutation_threaded!(::StrideArraysCore.PtrArray{Tuple{Int64}, (true,), Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, ::StrideArraysCore.PtrArray{Tuple{Int64, Int64}, (true, true), Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}) (1 children)
                 17: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for DiffEqBase.var"#_#32"(::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::DiffEqBase.LUFactorize, ::Vector{Float64}, ::Matrix{Float64}, ::Vector{Float64}, ::Bool) (1 children)
                 18: signature num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:94 (formerly num_threads() in CPUSummary at /home/tim/.julia/dev/CPUSummary/src/topology.jl:42) triggered MethodInstance for OrdinaryDiffEq.compute_step!(::OrdinaryDiffEq.NLSolver{NLNewton{Rational{Int64}, Rational{Int64}, Rational{Int64}}, true, Vector{Float64}, Float64, Nothing, Float64, Ordin...

The way I have it set:

julia> Sys.CPU_THREADS
4

julia> Threads.nthreads()
1

which may explain why I'm seeing those invalidations and others may not?

@chriselrod
Member

chriselrod commented Aug 18, 2021

which may explain why I'm seeing those invalidations and others may not?

You shouldn't when starting Julia with -t4.

If we want to fix these, the solution is to have those libraries / functions stop using num_threads().
I don't think the invalidations of num_threads are avoidable if we want it to maintain the current behavior.
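For readers following along, here is a minimal sketch of why redefining a zero-argument method invalidates its callers. The module `ToyTopology` and the functions `toy_num_threads` and `toy_worker_size` are hypothetical stand-ins, not CPUSummary's or Polyester's actual code, and the sketch assumes it is run at top level (script or REPL).

```julia
# Hypothetical stand-in for a package that bakes a thread count into a method.
module ToyTopology
toy_num_threads() = 4                      # value fixed when the module is defined
toy_worker_size() = toy_num_threads() - 1  # caller; the constant can be inlined here
end

# First call compiles toy_worker_size against the original toy_num_threads.
before = ToyTopology.toy_worker_size()

# Redefining the method (as a load-time hook might do when the runtime value
# differs) invalidates the compiled caller, which then has to be recompiled.
@eval ToyTopology toy_num_threads() = 1

# This call triggers recompilation of toy_worker_size against the new method.
after = ToyTopology.toy_worker_size()
```

The recompilation is cheap for a toy like this, but the same mechanism applies to every compiled caller of the redefined method, which is what the `mt_backedges` list in the report above enumerates.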

@timholy
Author

timholy commented Aug 18, 2021

See the good news in SciML/DifferentialEquations.jl#786 (emergency is over 🙂 ). We should still poke at this but it isn't urgent.

@timholy
Author

timholy commented Aug 19, 2021

I did just start Julia with julia -t4 and still saw those invalidations. I'm wondering if there's some compilation-state dependence, and in particular whether it matters whether you build CPUSummary via ] precompile or via using SomePackageThatForcesItToBuild.

With master for Julia and ] dev SnoopCompile SnoopCompileCore (or just "regular" when 2.8 comes out) it's pretty easy to check:

using SnoopCompileCore
invalidations = @snoopr using OrdinaryDiffEq, ModelingToolkit;
using SnoopCompile
trees = invalidation_trees(invalidations)

and then look for a num_threads tree.

Again, I don't think this is an issue of primary importance, but it does seem worth keeping track of for a rainy day. The perform_step! recompilation is ~0.5s, so not entirely cheap (but not catastrophic either).

@chriselrod
Member

chriselrod commented Aug 19, 2021

using SnoopCompileCore
invalidations = @snoopr using OrdinaryDiffEq, ModelingToolkit;
using SnoopCompile, CPUSummary
trees = invalidation_trees(invalidations);
ctrees = filtermod(CPUSummary, trees)

I get

julia> ctrees = filtermod(CPUSummary, trees)
1-element Vector{SnoopCompile.MethodInvalidations}:
 inserting convert(S::Type{<:Union{Number, T}}, p::MultivariatePolynomials.AbstractPolynomialLike{T}) where T in MultivariatePolynomials at /home/chriselrod/.julia/packages/MultivariatePolynomials/vqcb5/src/conversion.jl:65 invalidated:
   mt_backedges: 1: signature Tuple{typeof(convert), Type{Hwloc.Attribute}, Any} triggered MethodInstance for CPUSummary.safe_topology_load!() (1 children)


julia> Threads.nthreads(), Sys.CPU_THREADS
(8, 8)

In another Julia session

julia> ctrees = filtermod(CPUSummary, trees)
2-element Vector{SnoopCompile.MethodInvalidations}:
 inserting convert(S::Type{<:Union{Number, T}}, p::MultivariatePolynomials.AbstractPolynomialLike{T}) where T in MultivariatePolynomials at /home/chriselrod/.julia/packages/MultivariatePolynomials/vqcb5/src/conversion.jl:65 invalidated:
   mt_backedges: 1: signature Tuple{typeof(convert), Type{Hwloc.Attribute}, Any} triggered MethodInstance for CPUSummary.safe_topology_load!() (1 children)

 deleting num_threads() in CPUSummary at /home/chriselrod/.julia/packages/CPUSummary/dEmFX/src/topology.jl:42 invalidated:
   backedges: 1: superseding num_threads() in CPUSummary at /home/chriselrod/.julia/packages/CPUSummary/dEmFX/src/topology.jl:42 with MethodInstance for CPUSummary.num_threads() (2 children)


julia> Threads.nthreads(), Sys.CPU_THREADS
(1, 8)

(ode) pkg> st CPUSummary
      Status `~/Documents/progwork/julia/env/ode/Project.toml`
  [2a0fbf3d] CPUSummary v0.1.2

So it appears to be working as intended for me.

@chriselrod
Member

chriselrod commented Aug 19, 2021

Again, I don't think this is an issue of primary importance, but it does seem worth keeping track of for a rainy day. The perform_step! recompilation is ~0.5s, so not entirely cheap (but not catastrophic either).

I strongly suspect this is enough to favor Threads.nthreads() there.
If you or @ChrisRackauckas have a benchmark I can run to confirm negligible runtime difference, I'll do that. I'll also try a few microbenchmarks.

@timholy
Author

timholy commented Aug 19, 2021

I have

$ env | grep -i thread
JULIA_CPU_THREADS=4

Is that possibly problematic? This is on an Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz (6 physical cores).

@chriselrod
Member

Ah, yes. It doesn't actually use Sys.CPU_THREADS, instead preferring to use information from Hwloc (with adjustments for ARM Macs -- I think Hwloc might have more details to handle things better).
Presumably you won't get invalidations when starting with 12 threads.

Perhaps, in the case of disagreement, I should favor Sys.CPU_THREADS over the actual number of threads, assuming the disagreement is because of a deliberate user choice.

@timholy
Author

timholy commented Aug 31, 2022

At least on nightly, and when starting with -t4, the only thing that seems to be holding back good precompilation of LV-generated code is the redefinition of cache_size:

@eval cache_size(::Union{Val{3},StaticInt{3}}) = $(static(cache_l3_per_core * nc))
which was formerly defined on
@eval cache_size(::Union{Val{$i},StaticInt{$i}}) = $(static(csi))

Not easy for me to fix because method redefinition like this is not "typical" (I recognize you do amazing, atypical things) and I don't know the motivations well enough to offer an alternative.

CC @Tokazama

@timholy
Author

timholy commented Aug 31, 2022

I'd be happy to offer an integration test that checks for new inference when running the precompiled workload for a demo consumer of LoopVectorization. If you want it, just let me know which repo I should submit it to.

@chriselrod
Member

Not easy for me to fix because method redefinition like this is not "typical" (I recognize you do amazing, atypical things) and I don't know the motivations well enough to offer an alternative.

We should stop doing that.

@chriselrod
Member

I'd be happy to offer an integration test that checks for new inference when running the precompiled workload for a demo consumer of LoopVectorization. If you want it, just let me know which repo I should submit it to.

Which repo do you think would be best? LoopVectorization.jl itself, or something that depends on it like TriangularSolve.jl or RecursiveFactorization.jl?

@timholy
Author

timholy commented Aug 31, 2022

Probably LV itself. The only issue to be aware of is that tracking down the origin of breakage might require a bit of hunting: if a PR to, say, this package breaks the integration test, then you won't know you've broken it until you next run the tests of LoopVectorization.jl. Unless you like the idea of running that specific test in several of LV's dependencies? You can see something similar to what I mean in CodeTracking, which exists to serve Revise:

@Tokazama
Member

Apologies, I'm not sure what the original motivation for redefining it like this was.

@ChrisRackauckas
Member

There's https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/.github/workflows/Downstream.yml which is a quick way to set up a bunch of integration tests on subsets of downstream package tests.

@timholy
Author

timholy commented Aug 31, 2022

method redefinition

We should stop doing that.

One thing to check: are you aware that you can use the precompilation process to your advantage? Your package can contain

const some_value_or_type_that_must_be_known_to_inference = begin
    # Some complicated computation, calling lots of functions, which may not be inferrable
end

and the only thing that gets written to the .ji cache file is some_value_or_type_that_must_be_known_to_inference itself. In other words, that block only runs at precompile time, it doesn't run when you load the package.

Of course, if you need to do some things in __init__, then this won't help.
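A minimal sketch of the `const ... = begin ... end` pattern described above. The module `ToyCacheInfo`, the helper `query_cache_bytes`, and the numbers in it are hypothetical; they only stand in for a slow, poorly inferred hardware query.

```julia
module ToyCacheInfo

# Hypothetical helper standing in for a slow, non-inferable hardware query.
function query_cache_bytes()
    info = Dict{String,Any}("l1" => 32 * 1024, "l2" => 1024 * 1024)
    return Int(info["l2"])   # the lookup itself yields Any, so inference struggles here
end

# Evaluated once, when the module is defined (during precompilation for a package).
# Only the resulting Int is kept, so callers see a typed constant.
const L2_BYTES = begin
    query_cache_bytes()
end

# Because L2_BYTES is a typed constant, this function is fully inferable.
half_l2() = L2_BYTES ÷ 2

end # module
```

One caveat with this approach: the constant reflects the machine and settings in effect when the package was precompiled, so values that can legitimately differ between sessions (such as the thread count) are a poor fit for it.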

For things like setting the number of threads, can LV basically do the same thing we do with LLVM multiversioning? @turbo could emit a block that starts with

if Threads.nthreads() == 1
    # single-threaded implementation
elseif Threads.nthreads() == 6 # my laptop has 6 physical cores
    # 6-thread implementation
else
    @debug "Non-optimized implementation"
    # fallback
end

For users who might want to customize the default number (I typically use 4 threads to reserve a couple for something besides Julia) we could use Preferences.
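To make the branching idea concrete, here is a runnable sketch that picks an implementation from the runtime thread count instead of redefining a method at load time. The functions `sum_serial`, `sum_threaded`, and `sum_multiversioned` are hypothetical illustrations and not what `@turbo` emits; the sketch only distinguishes one thread from more than one.

```julia
# Hypothetical kernels standing in for generated code.
sum_serial(x) = sum(x)

function sum_threaded(x, nchunks)
    # Split the index range into contiguous pieces and sum each piece in its own task.
    chunks = Iterators.partition(eachindex(x), cld(length(x), nchunks))
    tasks = [Threads.@spawn sum(@view x[c]) for c in chunks]
    return sum(fetch, tasks)
end

# Both versions are compiled ahead of time; the branch is resolved at run time,
# so no method has to be redefined (and nothing is invalidated) at load time.
function sum_multiversioned(x)
    nt = Threads.nthreads()
    if nt == 1 || isempty(x)
        return sum_serial(x)
    else
        return sum_threaded(x, nt)
    end
end
```

Calling `sum_multiversioned(collect(1:1000))` takes the serial path under `julia -t1` and the threaded path under `julia -t4`, with the same compiled code in both sessions.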

@timholy
Author

timholy commented Aug 31, 2022

@ChrisRackauckas, do you have a link to whatever sits on the opposite side of that workflow? It looks useful but I wasn't sure how to trigger it.

@ChrisRackauckas
Member

https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/test/runtests.jl#L5-L17

It just grabs the group and runs the subset of the tests.


@chriselrod
Member

Not easy for me to fix because method redefinition like this is not "typical" (I recognize you do amazing, atypical things) and I don't know the motivations well enough to offer an alternative.

You give me too much credit!
I'd meant to start working on cache-based blocking in LoopVectorization, but started working on the rewrite instead.

This was added for that, under the theory it's unlikely to change normally.
Then, more recently, I decided to start redefining L3 cache sizes based on how many threads we have, so code using it won't try to use more than its "share".
This causes invalidations, but is maybe helpful for packages like Octavian.

All that said, one fix was to remove it from LoopVectorization:
JuliaSIMD/LoopVectorization.jl@def5ad1
A second fix was to define the cache as cache per core:
e6f6461
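One way to read the "cache per core" idea, as a sketch only: keep a per-core constant that never changes, and compute a thread-dependent share at call time. The constant `TOY_L3_BYTES_PER_CORE`, its value, and `toy_l3_share` are hypothetical and are not taken from the linked commits.

```julia
# Hypothetical value; a real package would obtain it from a hardware query.
const TOY_L3_BYTES_PER_CORE = 2 * 1024 * 1024

# The share for `ncores` cores is computed when called, so no method is redefined
# when the thread count differs between sessions.
toy_l3_share(ncores::Integer) = TOY_L3_BYTES_PER_CORE * ncores
```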

Long term, I'm not overly concerned about this library.
The rewrite will get cache sizes via
https://llvm.org/doxygen/classllvm_1_1TargetTransformInfo.html#a11e8f29aef00ec6b5ffe4bfcc9e965f4
and should hopefully play well with whatever multi-versioning scheme we're using. But we'll see what issues arise when we get there, and that's still a long ways off at the moment.

@chriselrod
Member

chriselrod commented Aug 31, 2022

For things like setting the number of threads, can LV basically do the same thing we do with LLVM multiversioning? @turbo could emit a block that starts with

@tturbo or @turbo threads=true could do something like that, but we probably only need the check vs 1.
