diff --git a/docs/design/coreclr/jit/DeabstractionAndConditionalEscapeAnalysis.md b/docs/design/coreclr/jit/DeabstractionAndConditionalEscapeAnalysis.md new file mode 100644 index 00000000000000..9b75f5a8c982a8 --- /dev/null +++ b/docs/design/coreclr/jit/DeabstractionAndConditionalEscapeAnalysis.md @@ -0,0 +1,804 @@ +# De-Abstraction and Conditional Escape Analysis + +There are interesting, important, and optimizable patterns where objects of several different types collaborate. + +For example, consider the abstract enumeration supported by `IEnumerable`. Here an enumerable `o` of some type (say `O`) can produce an enumerator `e` of some type (say `E`) that then operates on `o`. The typical pattern is: +```C# +O o = ... +foreach(T t in o) { ... t } +``` +Under the covers, this requires creation of (or access to) a ref class or boxed value class `e` (Footnote 1). + +In the case where both `O` and `E` are concrete types known to the C# compiler, many optimizations are possible, and indeed, something like this generates quite efficient code: +```C# +int[] o = ...; +foreach (int i in o) { } +``` +and for other collection types, there are various patterns (enumerator structs) that the C# compiler can rely on avoid the overhead of enumerator creation and interface calls. So for instance enumerating over a `List` is also fairly efficient. + +But when either type `O` or `E` is unknown, enumeration can involve a fair amount of overhead—an object allocation for `e` and two interface calls per produced element `t`. + +We call this overhead the "abstraction penalty." + +This penalty shows up readily in simple benchmarks. For example +```C# +public class ArrayDeAbstraction +{ + static readonly int[] s_ro_array = new int[512]; + + [MethodImpl(MethodImplOptions.NoInlining)] + IEnumerable get_opaque_array() => s_ro_array; + + [Benchmark(Baseline = true)] + public int foreach_static_readonly_array() + { + int sum = 0; + foreach (int i in s_ro_array) sum += i; + return sum; + } + + [Benchmark] + public int foreach_static_readonly_array_via_interface() + { + IEnumerable o = s_ro_array; + int sum = 0; + foreach (int i in o) sum += i; + return sum; + } + + [Benchmark] + public int foreach_opaque_array_via_interface() + { + IEnumerable o = get_opaque_array(); + int sum = 0; + foreach (int i in o) sum += i; + return sum; + } +} +``` +| (.NET 9) Method | Mean | Ratio | Allocated | +|------------------------------------------------------------- |-----------:|------:|----------:| +| foreach_static_readonly_array | 153.4 ns | 1.00 | - | +| foreach_static_readonly_array_via_interface | 781.2 ns | 5.09 | 32 B | +| foreach_opaque_array_via_interface | 843.2 ns | 5.50 | 32 B | +| foreach_opaque_array_via_interface (no PGO) | 2,076.4 ns | 13.54 | 32 B | +| foreach_static_readonly_array_via_interface (no PGO) | 2,304.5 ns | 15.03 | 32 B | + +With PGO the JIT can now learn about and optimize for likely types of objects at runtime. That information helps the JIT to reduce the abstraction penalty substantially, but as you can see from the above, even with PGO, the penalty is still quite high, around 5x! (Footnote 2) + +Just for reference, here is the inner loop from the `foreach_static_readonly_array` case. We will contrast this with the more abstract cases as we go along. In this loop the array index computation has been strength reduced (hence `rcx` is a pointer into the array incrementing by 4), the loop is down-counting, and the loop top has been aligned. 
+```asm + align [8 bytes for IG03] + +G_M1640_IG03: ;; offset=0x0020 + add eax, dword ptr [rcx] + add rcx, 4 + dec edx + jne SHORT G_M1640_IG03 +``` + +This note explores approaches to fully removing the abstraction penalty in the more abstract cases above. Roughly speaking, we want to be able to do at JIT time what the C# compiler is able to do at compile time, and so have all three benchmarks deliver similar performance. + +## Analysis of .NET 9 Codegen + +In what follows we're going to use .NET 9 with PGO as our baseline, and see what challenges the JIT faces in removing the remaining abstraction penalty. +Some of these we have already overcome, and some remain. + +To do this, let's take a close look at the code produced by .NET 9 for the `opaque` case (which is the most complex). + +The C# compiler translates the `foreach` into simpler constructs, and the resulting code the JIT sees is more like the following: +```C# + IEnumerable o = get_opaque_array(); + int sum = 0; + IEnumerator e = o.GetEnumerator(); + try + { + while (e.MoveNext()) + { + int i = e.GetCurrent(); + sum += i; + } + } + finally + { + e.Dispose(); + } + return sum; +``` +There are four interface call sites, a loop, and a try-finally. + +In the "tier0-instr" phase of TieredPGO, the JIT adds class probes at each interface call site, and as this version of code runs, these probes observe that `O` (the concrete type of `o`) is very likely `int[]` and `E` is very likely `SZGenericArrayEnumerator` (see Footnote 3). + +Based on this PGO data, the JIT first translates the above into something like the following, where each interface call is now expanded by Guarded Devirtualization (aka GDV): +```C# + IEnumerable o = get_opaque_array(); + int sum = 0; + + // -------------- Begin GDV "diamond" + IEnumerator tt = null; + if (o.GetType().Equals(typeof(int[]))) + { + auto a = (int[]) o; + tt = a.GetEnumerator(); + } + else + { + tt = o.GetEnumerator; + } + e = tt; + // -------------- End GDV "diamond" + + try + { + // -------------- Begin GDV "diamond" + bool b0 = false; + if (e.GetType().Equals(typeof(SZGenericArrayEnumerator))) + { + auto ea = (SZGenericArrayEnumerator) e; + b0 = ea.MoveNext(); + } + else + { + b0 = e.MoveNext(); + } + // -------------- End GDV "diamond" + + if (b0) + { + do + { + int i = e.GetCurrent(); // also a "diamond" + sum += i; + } + while (e.MoveNext()); // also a "diamond" + } + } + finally + { + e.Dispose(); // also a "diamond" + } + return sum; +``` +Here we have shown the GDV expansions for the first two call sites, but left +the other three unexpanded. so as not to clutter things too much. (Note the call to `MoveNext` has been duplicated here by the JIT, so there are now 5 interface call sites; this duplication comes from an optimization known as loop inversion). + +This may not look like an improvement, but the key point is that in the `if` part of each GDV diamond, the interface calls are now made on objects whose types are known exactly, so these calls can be devirtualized and inlined. And thanks to PGO, we know that the `if` true cases for the diamonds are the important cases for performance. + +And that's almost what happens. + +It turns out that in .NET, arrays implement interfaces differently than most types do (see Footnote 4). So in .NET 9 the normal interface devirtualization logic does not work on array interface calls. And the JIT has a heuristic that says that a GDV expansion is not beneficial unless the call can be both devirtualized and inlined. 
So the above is a bit of a lie—the .NET 9 JIT does not produce the first diamond. + +And that un-optimized call usually allocates an enumerator object, which is something we'd like to avoid. So that leads us to the first challenge in removing the abstraction overhead. + +#### Challenge 1: Array Interface Call Devirtualization + +We are going to gloss over the details in this writeup. For more information, see [dotnet/runtime#108153](https://github.com/dotnet/runtime/pull/108153). + +Devirtualization of array interface calls is supported in .NET 10. + +There's something else to note in the example above (now we're assuming the initial GDV diamond is in fact created by the JIT). We see that the behavior of all these GDV tests is highly correlated. If the upper test discovers that `o` is `int[]`, the lower tests will all discover that `e` is `SZGenericArrayEnumerator`. And even if we didn't know this was the case, we could trace the connection because the result of the upper GDV diamond (`e`) is the thing that is tested in the lower GDV diamonds. So this pattern (where one GDV diamond feeds others) is going to be something we key off in later on. + +### The "Simple" Case + +Before looking more deeply at the `opaque` case, let's dig into the `...readonly_array_via_interface` version. Here the program starts with an `int[]` but casts it to `IEnumerable` for the `foreach`. This seems like a small thing—the type of the collection has been only temporarily obscured—but as you can see from the overhead numbers, it's not. + +The big difference between these versions is in `readonly_array_via_interface` the JIT is able to figure out an exact type for `o` without GDV; it simply has to propagate the type forward. That is, it more or less changes +```C# + IEnumerable o = s_ro_array; +``` +into +```C# + int[] o = s_ro_array; +``` +And that, coupled with the devirtualization and inlining of `o.GetEnumerator`, lets the JIT resolve the lower GDVs, as the inline body reveals the exact type for `E` as well. (Footnote 5). + +You would think the resolution of the lower GDVs, and the inlines of the relatively simple `MoveNext`, `GetCurrent`, and (the do-nothing) `Dispose` methods would enable the JIT to know everything about how the enumerator object is used, and in particular, see that that object cannot "escape" (that is, if newly allocated, it need not live past the end of the method). + +But here we run into a phase ordering problem. It takes a little while for the JIT to propagate these types around, and escape analysis runs fairly early, and so it turns out without some help the JIT learns too late what type of enumerator it has, so the lower GDVs are still there when escape analysis runs, and the JIT has to honor the fact that at that point it still thinks those GDVs can fail, and so invoke an unknown `MoveNext` method, which might cause the enumerator to escape. + +All that is a long winded way of saying that while the JIT eventually is able to do a lot of optimization, it doesn't happen in the right order, and some decisions get locked in early. + +#### Challenge 2: Slow Type Propagation + +While inelegant, there is some prior art for giving the JIT a boost in such cases, at least for some key types. We can annotate the type as `[Intrinsic]` and have the JIT consult an oracle when it sees this annotation, asking the runtime if it happens to know what type of enumerator will be produced. 
+ +[dotnet/runtime#108153](https://github.com/dotnet/runtime/pull/108153) added this annotation and the necessary oracular support in the JIT and runtime. + +#### Challenge 3: Inlining of GetEnumerator for Arrays + +Now it turns out that for arrays, interface methods are implemented by a companion `SZArrayHelper` class, and the body of the `GetEnumerator` is a bit more complicated than you might guess: +```C# + [Intrinsic] + [MethodImpl(MethodImplOptions.AggressiveInlining)] + internal IEnumerator GetEnumerator() + { + // ! Warning: "this" is an array, not an SZArrayHelper. See comments above + // ! or you may introduce a security hole! + T[] @this = Unsafe.As(this); + int length = @this.Length; + return length == 0 ? SZGenericArrayEnumerator.Empty : new SZGenericArrayEnumerator(@this, length); + } +``` +To ensure this method gets inlined we also added `AggressiveInlining`; this was also done in [dotnet/runtime#108153](https://github.com/dotnet/runtime/pull/108153). + +Here `Empty` is a static field in a generic class. The examples we have been discussing all have `T == int`, but in general `T` could be a ref class (say `string`) in which case the `SZArrayHelper` class is a shared class, and the JIT generally was reluctant to inline methods that access statics of shared classes. We relaxed that restriction in [dotnet/runtime#109256](https://github.com/dotnet/runtime/pull/109256). + +Now with `GetEnumerator` inlined the JIT can see almost everything about how the enumerator is produced when the collection type is an array. This plus turns out to be enough to unblock stack allocation of the enumerator in the `...readonly_array_via_interface` example. But it's still not sufficient to allow "promotion" of the object fields (aka scalar replacement)—for full optimization the JIT must be able to treat the fields of the enumerator as if they were local variables. + +#### Challenge 4: Empty Array Enumeration + +The generic static `Empty` is there because there is a special case optimization in the BCL for empty arrays. Since enumeration of an empty array does not require mutable state, an immutable static enumerator object can be used instead of allocating a new object. So although the JIT knows the exact type of `e` it does not know (in general) which object instance is being used. This ambiguity blocks promotion. + +To resolve this we again dip back into our special knowledge of `GetEnumerator`. We want the JIT to "undo" this empty array optimization if it is able to stack allocate the enumerator, because the whole point of the optimization was to avoid heap allocation, and the JIT has avoided heap allocation by means of escape analysis (in other words, what used to be expensive is now cheap). But we don't want the JIT to undo this unless escape analysis succeeds. So if the JIT is able to successfully stack allocate an `SZGenericArrayEnumerator` it then checks if that allocation was from a call to `SZArrayHelper.GetEnumerator` and if so, it simply patches the code to always go down the "allocation" path. (Footnote 6) + +[dotnet/runtime#109237](https://github.com/dotnet/runtime/pull/109237) contains the logic needed for the JIT to recognize and prune away the empty array path, if the enumerator is stack allocated. + +#### Challenge 5: Address Exposure + +We're getting closer to our goal, in the case of `...readonly_array_via_interface`. However promotion is still not happening. Before the JIT will promote a local object's fields it must ensure it knows about all the accesses to the fields. 
But in our case the propagation of the "stack address" from the allocation site in the inlined `GetEnumerator` to the uses in the inlined `MoveNext` and other methods is not happening in a timely fashion, because those uses are in a loop. + +And, furthermore, there is a use in the `finally` clause. Even though the `Dispose` method doesn't do anything, there is a residual null check and so the enumerator is referenced there. + +So we made some enhancements to our early address propagation to resolve both these issues; see [dotnet/runtime#109182](https://github.com/dotnet/runtime/pull/109182) and [dotnet/runtime#109190](https://github.com/dotnet/runtime/pull/109190). + +#### Challenge 6: Cloning + +With all that in place, the abstraction for the `...readonly_array_via_interface` is largely gone: the enumerator object is now replaced by locals, one per field. And so the JIT is poised to apply its full suite of loop optimization techniques, in particular IV analysis, bounds check elimination, and so on. + +However, there is one last hurdle to overcome: the loop "header" is also the start of a try region. The JIT often relies on cloning to enable bounds check elimination: it analyzes the loop body, and if it sees that the array accesses in the loop can be proven to be in-bounds by a check made before the loop starts, it duplicates the loop, adds the checks, and so ends up with a "fast path" loop with no (or fewer) bounds checks, and a slow path loop with full checking. (Footnote 7). + +So we did work to enable cloning when the loop header is also a try entry [dotnet/runtime#108604](https://github.com/dotnet/runtime/pull/108604). But when working on this we realized it was going to lead to a lot more cloning, and so we first had to put some more realistic limits on how much code the JIT would be willing to duplicate to try and optimize [dotnet/runtime#108771](https://github.com/dotnet/runtime/pull/108771). + +### Simple case (as of 01-Dec-24) + +With all that the "simple" case `...readonly_array_via_interface` is now a good deal faster than it was in .NET 9, but still not perfect: + +| Method | Mean | Ratio | Allocated | +|------------------------------------------------------------- |-----------:|------:|----------:| +| foreach_static_readonly_array | 153.4 ns | 1.00 | - | +| foreach_static_readonly_array_via_interface (.NET 10) | 295.7 ns | 1.93 | - | +| foreach_static_readonly_array_via_interface (.NET 9) | 781.2 ns | 5.09 | 32 B | +| foreach_static_readonly_array_via_interface (.NET 9, no PGO) | 2,304.5 ns | 15.03 | 32 B | + +It's worth calling out that the .NET 10 version does not heap allocate. The enumerator object is now on stack (and promoted). + +The inner loop is middle-entry, up-counting and not strength reduced or IV-widened. There are still some artifacts from the way the inlined enumeration happens that we need to iron out to unblock these last few optimizations. But we're very close. + +```asm + align [6 bytes for IG03] + +G_M36467_IG03: ;; offset=0x0020 + mov edx, edx + add eax, dword ptr [rcx+4*rdx+0x10] + +G_M36467_IG04: ;; offset=0x0026 + inc edx + cmp edx, 512 + jb SHORT G_M36467_IG03 +``` + +#### Challenge 7: Fully Enabling Loop Opts + +To be determined! + +### The "complex" case + +Now let's turn our attention back to the more complex case, where the only way to learn the collection type is via PGO. Did it see any benefit from all the work above? 
It did, a little: + +| Method | Mean | Ratio | Allocated | +|------------------------------------------------------------- |-----------:|------:|----------:| +| foreach_static_readonly_array | 153.4 ns | 1.00 | - | +| foreach_opaque_array_via_interface (.NET 10) | 680.4 ns | 4.44 | 32 B | +| foreach_opaque_array_via_interface (.NET 9) | 843.2 ns | 5.50 | 32 B | +| foreach_opaque_array_via_interface (.NET 9 ,no PGO) | 2,076.4 ns | 13.54 | 32 B | + +But it still allocates, so the enumerator object is still a heap object. + +We now need to deal with having the upper GDV diamond, so in the code below, we cannot resolve the GDV tests, since we are not 100% sure which type of object we have for `e`. PGO can tell us the likely type, and GDV can guess for that type, but we don't have any guarantees. + +How are we going to arrange to have these guarantees? If you've been reading the footnotes, or perhaps have a good intuitive feel for this sort of thing, the answer is that we are going to clone and create a specialized version of this code for the case where the upper GDV test succeeds; once we do that this case more or less reduces to the case above. (Footnote 8) + +Easier said than done. Cloning takes JIT time and creates a lot more code, so we really only want the JIT to do this transformation if it will in fact enable stack allocation (and promotion), but the JIT seemingly has to clone and try to optimize in order to find out if it is worth cloning (and then perhaps undo it all if it is not worthwhile). This kind of "do-check-undo" approach to optimization might be feasible in compilers that have generous compile time budgets, but the JIT does not. We need to know up front whether the whole thing is going to pay off. + +What this means in practice is that when we run escape analysis, we need it to analyze the code as if we'd cloned and split off the good path from the upper GDV diamond, without actually having done so. And, let's not forget the empty array optimization is still in there possibly confusing things too. + +### Conditional Escape Analysis + +Thanks to PGO and GDV, if we knew all the lower GDV tests succeeded, then we would know the enumerator object could not escape. So the first key insight is that the problematic cases are the ones where the lower GDV tests fail. The second key insight is that the failing side of each GDV diamond is particularly simple: an interface call where the enumerator is the `this`. And the third key insight is that all the lower GDV tests are equivalent and only depend on the value produced at the bottom of the upper GDV diamond. + +So when doing escape analysis, if we can determine that all escaping accesses are interface calls of that form, and all are made under a failing GDV test checking the type of the object produced by the upper GDV diamond, we've conditionally proven that those references won't cause escape, if we can arrange for the GDV test to always succeed, which we can do by cloning the entire region from the enumerator allocation site in the upper diamond down through all the uses in the lower diamonds. + +#### Escape Analysis + +To put the changes needed in escape analysis into context, we first need to understand a little about how it works. The JIT arranges things so that each explicit object allocation site in a method stores its results to a unique local variable (say `Ai`). + +The JIT then analyzes all uses of all variables. 
+* if a variable is "address exposed" then it is marked as escaping (this means more or less that the JIT cannot reliably track where all reference to that variable might occur) +* if a variable's value is copied to another variable (say `X = Y`), and edge is added to the connection graph `X -> Y`, meaning if `X` escapes then so will `Y`. +* if a variable is otherwise used, the JIT does local analysis to decide +if that use is escaping. + +The JIT then marks any variable connected to an escaping variable as escaping. This process continues until it reaches transitive closure. + +If a given `Ai` escapes, then the associated allocation escapes and must be heap allocated. Otherwise the allocation may be eligible for stack allocation (other restrictions apply, which further winnows down this set). + +The eventual set of surviving allocations are then transformed to be new struct-valued locals, and references to the associated `Ai` in the JIT IR are rewritten to be the address of this struct. + + Let's see how this plays out in a simplified example. Suppose we have + ```C# + IEnumerable e = o.GetEnumerator(); + o.MoveNext(); + ``` + After GDV expansion and inlining, this will look something like: + ```C# + IEnumerable t = null; + if (o.Type == O) + { + ac = new E(); + t = ac; + } + else + { + t = o.GetEnumerator(); + } + e = t; + + if (e.Type == E) + { + // inlined e.MoveNext + ea = (E) e; + // enumerator ref may be copied to other locals + tt = ea; + tt.field--; + } + else + { + e.MoveNext(); + } + ``` +First let's just do the normal escape analysis. The analysis is flow-insensitive so we can look at the operations in the example in any order. We'll go top to bottom: +* `t = null`: no impact +* `o.Type == O`: no impact +* `ac = new E`: no impact +* `t = ac`: add edge `t -> ac` +* `t = o.GetEnumerator()`: `o` escapes +* `e -> t`: add edge `e -> t` +* `e.Type == T`: no impact +* `ea = (E) e`: add edge `ea -> e` +* `tt = ea` : add edge `tt -> ea` +* `tt.field--`: no impact +* `e.MoveNext()`: `e` escapes + +Then closure computation runs: +* `o` escapes (no impact) +* since `e` escapes and `e -> t`, `t` escapes +* since `t` escapes and `t -> ac`, `ac` escapes + +Since `ac` escapes, the `new E()` must be a heap allocation. + +#### Pseudo-Variables + +If you look closely at the simplified example, `new E()` cannot actually escape, because if that path of the upper GDV is taken, then the `e.MoveNext()` on the failing side of the lower GDV cannot happen. Establishing that requires conditional escape analysis. + +To model conditional escape, the JIT first identifies allocation sites like +`ac = new E()` that happen on the successful side of GDV diamonds, where a reference to the allocated object can become the GDV result, and finds the local that holds the result of that GDV (here it is `e`). For each such `E, e` pair the JIT creates a new pseudo-variable `P` that will connect up references to `e` or copies thereof that happen under a failed GDV guard that tests the type of `e` against `E`. + +So it creates a mapping `(e, E) -> P`. + +Escape analysis then runs as normal, save that any otherwise escaping variable reference under a failed `(e.Type == E)` test is modelled as an assignment to `P`. Thus we have (ignoring the statements that have no impact) +* `t = ac`: add edge `t -> ac` +* `t = o.GetEnumerator()`: `o` escapes +* `e -> t`: add edge `e -> t` +* `ea = (E) e`: add edge `ea -> e` +* `tt = ea` : add edge `tt -> ea` +* `e.MoveNext()`: `e` escapes, but model this as `P -> e`. 
+ +Then closure computation runs: +* `o` escapes (no impact) + +So we have proven that if can we arrange things so that any GDV test of the form `(e.Type == E)` succeeds, then the `new E()` does not escape and can be stack allocated. (Footnote 9) + +Note it is also possible (in some other example) that there is an escaping reference that is not under a suitable failing GDV. In such a case the object escapes and is heap allocated. So it's only when all potentially escaping references are under a suitable failing GDV that we move on to the next stage to attempt stack allocation. + +#### Challenge 8: Flow-Insensitivity + +You will notice in the algorithm sketch above we are implicitly relying on certain ordering properties in the code, e.g. the allocation at `ac = new E()` reaches to `e = t` via a chain of temp copies, and that that assignment to `e` reaches to at each `(e.Type == E)` test. But since the analysis is flow-insensitive, we have not yet proven any of this. + +To establish these facts the JIT also tracks all appearances of all variables that might refer to the object, and then, after the initial round of escape analysis, relies on dominance information to demonstrate that those appearances all must refer to the allocated object. + +With those flow relationships established, we have finally shown that *if* we can do suitable cloning, the object will not escape. + +#### Challenge 9: Cloning + +If you refer back to the expanded `foreach` above, you will notice that the lower GDV tests on `e` are located within a loop, and that loop is within a try/finally region. To fully split off this code the JIT must also clone along the path from the allocation site to the assignment to `e`, and any code that can refer to `e` thereafter. + +The JIT has had the ability to clone loops for a long time, but not this more general kind of region-based cloning, and especially not when the cloning involves cloning a try region. + +Cloning a try region turns out to be fairly involved; in addition to the `try` region itself, cloning must also encompass any handler (here a `finally`), any code outside the try that facilitates executing the handlers (aka the `callfinally` region), any promises made to create suitable throw helper blocks within the try or handler (aka `ACDs`), any nested `try` regions (within the `try` or the `finally`), and the associated EH table entries. + +Support for cloning a try was added in [runtime/dotnet#110020](https://github.com/dotnet/runtime/pull/110020). + +The plan is to use this as part of a larger region-based cloning effort. + +There are other considerations to handle here too. A method may have multiple instances of conditional escape for different objects. If the cloning regions for these overlap, then the problem is more complicated. Our plan for now is only allow cloning in cases where there is no overlap. (Footnote 10) + +If for some reason cloning can't (or won't) be done, then the JIT marks the associated pseudo variable as escaping, and restarts the escape closure analysis. + +If we do clone, we make the cloned version the "fast path" and the original the "slow path". + +When cloning we must also modify the cloned GDV tests to always succeed. We would figure this out on our own in time, but leaving the code for the failing GDV tests intact will cause later phases to assume that code can be reached, and address expose `e`. + +We also modify the GDV tests in the original code to always fail. 
This is not strictly necessary, but we do not want later phases to re-clone any loops here in the hope that those GDV tests might pass. (Footnote 11)
+
+#### Challenge 10: Finally Cloning
+
+Once the JIT has cloned to prevent escape, and swapped out the set of enumerator-referencing vars to enable promotion, what's left? One thing we need to consider is that the cloned (fast) path and original (slow) path likely involve try/finallys, and because they may share other variable references (for non-enumerator things) we may need to somewhat aggressively optimize the slow path in order to ensure we get the best code in the fast path. In particular the JIT may need to also do "finally cloning" along the slow path even if it believes this code is very unlikely to execute.
+
+It looks like the existing heuristic to not clone any finally that seems to be rarely executed almost never fires, so we can perhaps simply relax that; alternatively, we can mark the finally in some way so that a different heuristic can be applied to the slow/cold finally.
+
+This was addressed by [dotnet/runtime#110483](https://github.com/dotnet/runtime/pull/110483).
+
+#### Challenge 11: Profit?
+
+Once we've established that we can indeed clone, one last question remains: is the benefit from all this code duplication worthwhile? Our initial experience indicates that simply stack allocating an object may not provide enough benefit; we also need to ensure the object can be promoted. (Footnote 12)
+
+To help ensure stack allocation, we extend the cloning to also re-write all of the local vars that might reference the `new E()`, so that downstream phases (also somewhat flow-insensitive) do not see references to the same locals in the cloned code and in the original code.
+
+#### Challenge 12: Multi-Guess GDV
+
+The JIT is also able to expand a GDV into multiple guesses. This is not (yet) the default, but it would be ideal if it were compatible with conditional escape analysis. Under multi-guess GDV we may see a conditional escape protected by two or more failing guards, and more than one conditional enumerator allocation under successful (and perhaps some failing) upper GDV guards.
+
+Generalizing what we have above to handle this case seems fairly straightforward, provided we are willing to create one clone region per distinct GDV guard type.
+
+If (say for two-guess GDV) the likelihoods of the two guesses are similar, we may also need to handle the case where the order of the guards can change in the lower GDV region, because our class profiling is somewhat stochastic, each opportunity is profiled separately, and the JIT will always test for the most likely outcome at each point. Say one lower GDV profile shows A-55%, B-45%, other-0%, and a second lower GDV profile has B-55%, A-45%, other-0%. The first GDV will test for A outermost; the second, B.
+
+Another possibility is that the upper GDV diamond contains multiple allocations of the same enumerator type (eg `A` is `int[]` and `B` is some user-written class that wraps an `int[]` and delegates enumeration to the array enumerator; see the sketch below). This is similar in spirit to the empty array case -- the JIT can show (via cloning) neither enumerator escapes, but not know which enumerator object is in use, so is unable to promote the field accesses.
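+
+As a concrete illustration of that last case, such a wrapper might look like the following (this type is hypothetical, not something in the libraries):
+```C#
+using System.Collections;
+using System.Collections.Generic;
+
+// Hypothetical user-written collection: it hands enumeration off to the
+// underlying int[], so both the int[] guess and the wrapper guess in a
+// two-guess GDV produce an SZGenericArrayEnumerator<int>, just from two
+// different allocation sites.
+public sealed class IntArrayWrapper : IEnumerable<int>
+{
+    private readonly int[] _array;
+
+    public IntArrayWrapper(int[] array) => _array = array;
+
+    public IEnumerator<int> GetEnumerator() => ((IEnumerable<int>)_array).GetEnumerator();
+
+    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
+}
+```
+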
+ +#### Challenge 13: Exact Multi-Guess GDV + +Under NAOT there is one other similar possibility: thanks to whole-program analysis, ILC can enumerate all the classes that can implement, say, `IEnumerable`, and if there are just a small number of possible implementing classes, tell the JIT to guess for each in turn. Handling this would require some extra work as well; the JIT could clone for each known class and then the original code would become unreachable. + +## `List` + +`List` is another very common concrete collection type. Let's see how deabstraction fares when the collection is a list. + +Similar to the array case we'll look at 3 examples: one where the language compiler knows the concrete type, one where the JIT can discover this via normal optimization, and one where the JIT can only learn about the type via GDV: + +```C# + List m_list = Enumerable.Repeat(0, 512).ToList(); + + [MethodImpl(MethodImplOptions.NoInlining)] + IEnumerable get_opaque_list() => m_list; + + [Benchmark] + public int foreach_member_list() + { + List e = m_list; + int sum = 0; + foreach (int i in e) sum += i; + return sum; + } + + [Benchmark] + public int foreach_member_list_via_interface() + { + IEnumerable e = m_list; + int sum = 0; + foreach (int i in e) sum += i; + return sum; + } + + [Benchmark] + public int foreach_opaque_list_via_interface() + { + IEnumerable e = get_opaque_list(); + int sum = 0; + foreach (int i in e) sum += i; + return sum; + } +``` +Once again we'll use .NET 9 as a baseline: + +| (.NET 9) Method | Mean | Ratio | Allocated | +|---------------------------------- | ---------: | ------: | --------: | +| foreach_member_list | 247.1 ns | 1.00 | - | +| foreach_member_list_via_interface | 832.7 ns | 3.37 | 40 B | +| foreach_opaque_list_via_interface | 841.7 ns | 3.41 | 40 B | +| foreach_member_list_via_interface (no PGO) | 3,269.2 ns | 13.23 | 40 B | +| foreach_opaque_list_via_interface (no PGO) | 3,462.4 ns | 14.02 | 40 B | + +In relative terms the abstraction penalty is better than for arrays, but since `List` is a wrapper around an array, we might want to consider the array case as the true baseline (~150ns) in which case the penalty is more like the 5.5x we see in .NET 9 for arrays. + +And clearly PGO plays a big role in getting the penalty that low; without it the penalty is ~13x over the base list case, and 21x over the base array case. + +Why is list enumeration relatively more costly? Lets's look at the implementation: +```C# + public Enumerator GetEnumerator() => new Enumerator(this); + + IEnumerator IEnumerable.GetEnumerator() => + Count == 0 ? SZGenericArrayEnumerator.Empty : + GetEnumerator(); + + public struct Enumerator : IEnumerator, IEnumerator + { + private readonly List _list; + private int _index; + private readonly int _version; + private T? 
_current; + + internal Enumerator(List list) + { + _list = list; + _index = 0; + _version = list._version; + _current = default; + } + + public void Dispose() + { + } + + public bool MoveNext() + { + List localList = _list; + + if (_version == localList._version && ((uint)_index < (uint)localList._size)) + { + _current = localList._items[_index]; + _index++; + return true; + } + return MoveNextRare(); + } + + private bool MoveNextRare() + { + if (_version != _list._version) + { + ThrowHelper.ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion(); + } + + _index = _list._size + 1; + _current = default; + return false; + } + + public T Current => _current!; + + } +``` +A couple of things stand out here: +* There is some extra state (the `_version` field and related checks) to try and guard against modification of the list while enumeration is active. We didn't see that for arrays because the array length can't be modified after the array is created, but the number of list elements can vary. +* The enumerator does not cache the underlying array instance, so `MoveNext()` has to first fetch the list from the enumerator, then fetch the array from the list, then fetch the element from the array. This is a three-level dependent chain of memory accesses. (Footnote 13) +* This is a `struct` enumerator, not a `class` enumerator. This is a common pattern to avoid allocation in cases where enumeration is not quite as simple as arrays and the language compiler knows the collection type. And indeed in the base case there is no heap allocation. But not so in the other cases. +* The `GetEnumerator` implementation also has a variant of the empty-collection optimization; interestingly enough it delegates the empty case to the static empty array enumerator, so `GetEnumerator` does not return an object of definite type. (Footnote 14) +* There is effectively a double bounds-check, as first the enumerator checks against the size of the list, and then the array checks against the length of the array. + +As we mentioned above, the `struct` enumerator only avoids allocation when the collection type is known to the language compiler; if we disguise the type as in the second benchmark, or make it unknowable as in the third, the struct instance produced by `GetEnumerator` is boxed so that it becomes an object that can participate in interface dispatch (for the subsequent `MoveNext`, etc). Since the concrete type is unknowable and the list length not knowable either, the JIT can't resolve these dispatches via dataflow analysis and must rely on GDV to devirtualize the calls. Thus we see the second and third cases take the same amount of time. + +Happily most of the work we outlined above for the ref class array enumerator works perfectly well for a boxed struct list enumerator. The only real difference is that there is an extra copy from the struct returned by `GetEnumerator()` to the box payload, but once the box is stack allocated, both the struct fields and the box payload fields are promotable, and the JIT can see through the copy. We can also do the same empty collection trick we do for arrays and force all cases to create a new enumerator (which is now cheap). Currently we don't do this, so enumerating an empty list is not quite as well optimized as it could be. + +In cases where the loop body is very simple (like the above) The JIT is also able to clean up some of the cost of the `_version` checks. 
While the JIT can't prove that the list won't be modified by some other thread, it is allowed to make optimizations that can change the behavior of "racy" code. So unless there is some kind of synchronization point (method call, etc) in the loop body, the JIT is free to assume the list hasn't changed, so the `_version` field is a loop invariant. It can also remove the list-based bounds check, but currently the array check still remains; this is the case in all versions. + +(as of 07-Dec-24) + +| Method | Mean | Ratio | Allocated | +|------------------------------------------- | ---------: | ------: | --------: | +| foreach_member_list (.NET 9) | 247.1 ns | 1.00 | - | +| foreach_opaque_list_via_interface (.NET 10)| 247.8 ns | 1.00 | - | +| foreach_member_list_via_interface (.NET 10)| 249.5 ns | 1.01 | - | +| foreach_member_list_via_interface (.NET 9) | 832.7 ns | 3.37 | 40 B | +| foreach_opaque_list_via_interface (.NET 9) | 841.7 ns | 3.41 | 40 B | + +The upshot is the inner loop currently looks like this (where we are summing a `List`): +```asm +G_M17657_IG03: ;; offset=0x0040 + mov r8, gword ptr [rcx+0x08] + cmp edx, dword ptr [r8+0x08] + jae G_M17657_IG12 + mov r8d, dword ptr [r8+4*rdx+0x10] + inc edx + add ebx, r8d + ;; size=24 bbWeight=513.74 PerfScore 4366.81 +G_M17657_IG04: ;; offset=0x0058 + cmp edx, eax + jb SHORT G_M17657_IG03 +``` + +It might be that with proper inversion, further cloning, etc, we could eliminate the bounds check, hoist out the fetch of the array, and strength reduce here. + +## Other Collection Types + +I have also looked into `HashSet` and `Dictionary` KVP enumeration. Both of these look pretty good, though (as with list) there is some inefficiency in the residual loop optimizations in both the baseline version and the conditional-escape optimized variants that we might be able to address. + +## Other Limitations + +As we've seen from the above, *if* we can arrange to stack allocate the enumerator, we can greatly reduce the abstraction penalty from enumeration. But there are various limitations on stack allocation itself that will sometimes block this from happening. Let's look at a few of them. + +### Allocations in loops + +The optimization that turns a heap allocation into a stack allocation does so by creating a fixed set of slots on the stack frame for the object. This doesn't work properly if the allocation site can be visited more than once during a call to a method, since it's possible the storage allocated from an early visit is still in use (aka "live") when a later visit happens. Note if there are multiple visits to a bit of code in a method it generally means the method contains a loop (Footnote 15). + +There are a couple of different ways to try and handle these cases. One is to replace the fixed stack allocation with a dynamic one. There is already a mechanism (`stackalloc` in C#) for this, and we could consider using it, but this currently won't interact well with promotion, so the subsequent optimizations won't happen. And there is some extra up-front cost to each method call. And the stack itself is a limited resource, so we don't want to be allocating a potentially large amount of space. So there are tradeoffs here that are not yet well understood. + +The second is to prove that the stack allocated instance is no longer live on any subsequent visit, so that the fixed bit of storage can be recycled. This is perhaps doable, and should be the common case with enumerators, so we will likely be considering it as an enhancement. 
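+
+A minimal sketch of the shape in question (the helper below is made up for illustration): the enumerator allocation inside the inner `foreach` is reached once per outer iteration, so a fixed set of stack slots is only safe if the JIT can show that no enumerator from an earlier iteration is still live when the allocation site is revisited (true here, since each enumerator dies at the end of its inner loop).
+```C#
+using System.Collections.Generic;
+
+// Illustrative only: an enumerator allocation site that sits inside a loop.
+static int SumAll(IEnumerable<int>[] collections)
+{
+    int sum = 0;
+    foreach (IEnumerable<int> collection in collections)
+    {
+        // GetEnumerator() runs once per outer iteration; the enumerator it
+        // produces is dead by the time the next iteration reaches this point.
+        foreach (int i in collection)
+        {
+            sum += i;
+        }
+    }
+    return sum;
+}
+```
+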
+ +### Multiple Enumerations + +What if there are multiple enumerations happening in a method? Currently we require that each instance be separate from the others; this is checked by ensuring that the cloning regions are disjoint. So for example if there is a nested `foreach` only one of the two will end up being eligible. It turns out with the current implementation that we'll chose the outer one, since we encounter opportunities via RPO and so outer to inner. Likely it would be better to choose the inner one. (Footnote 16) + +It also turns out that if there are sequential `foreach` of the same local that the C# compiler will re-use the same enumerator var. Luckily for the JIT these two different sets of defs and uses are distinct and our current analysis handles it without any special casing. + +I haven't yet checked what happens if there are nested `foreach` with the same collection, presumably this can't share the enumerator var and ends up just like the nested case. + +Another area to explore are `Linq` methods that take two (or more) `IEnumerable` as arguments, for instance `Zip` or `Union` or `Intersect`. Some of these (but not all) will look like nested cases. + +### GDV Chaining + +In some cases two enumerator methods are called back to back without any control flow. An optimization known as GDV chaining can alter what would be two adjacent diamonds into a slighly different flow shape, and when this happens the "slow path" call in the second GDV will no longer be dominated by its own check, appear to be reachable by both paths from the upper GDV check. Seems like we ought to either disable GDV chaining for these types of tests, or else enhance the flow logic to understand the shape produced by GDV chaining. + +## Linq + +Ideally the work above would show improvements in `Linq` methods. However the situation with `Linq` is more complex (and one can safely infer from this that we are not yet seeing many improvements). Let's dig in a little. + +First, let's contrast a "naive" `Linq` implementation for `Where` against the actual implementation, both for .NET 9 and with the changes above, on an array: +```C# + public static IEnumerable NaiveWhere(IEnumerable e, Func p) + { + foreach (var i in e) + { + if (p(i)) yield return i; + } + } + + [MethodImpl(MethodImplOptions.NoInlining)] + public int help_sum_positive_where_naive(IEnumerable e) + { + int sum = 0; + foreach (int i in NaiveWhere(e, x => x > 0)) + { + sum += i; + } + return sum; + } + + [MethodImpl(MethodImplOptions.NoInlining)] + public int help_sum_positive_where(IEnumerable e) + { + int sum = 0; + foreach (int i in e.Where(x => x > 0)) + { + sum += i; + } + return sum; + } + + [Benchmark] + public int sum_positive_where_naive() + { + return help_sum_positive_where_naive(s_array); + } + [Benchmark] + public int sum_positive_where() + { + return help_sum_positive_where(s_array); + } +``` + +| Method | Toolchain | Mean | Ratio | Allocated | +|------------------------- |---------- |-----------:|------:|----------:| +| sum_positive_where_naive | net10.0 | 1,243.4 ns | 1.02 | 104 B | +| sum_positive_where_naive | net9.0 | 1,224.9 ns | 1.00 | 104 B | +| | | | | | +| sum_positive_where | net10.0 | 783.2 ns | 1.28 | 48 B | +| sum_positive_where | net9.0 | 612.7 ns | 1.00 | 48 B | + +Two things stand out: the naive version is about 2x the actual, and the optimizations above have no impact (in fact .net 10 is currently slower). + +### Naive Where + +First consider the naive version. 
`NaiveWhere` uses `yield return` which causes the C# compiler to create a state machine and object holding the state. `NaiveWhere` allocates one of these objects and returns it. This object has a made-up type name; we'll call it `d__50`. It implements both `IEnumerable` and `IEnumerator`, and has fields referencing the base enumerable and the predicate. + +Our benchmark then calls `GetEnumerator()` on this object. Given that the object is already an enumerator this may seen a bit odd. This method checks various bits of the original object's state, and if things look good, just returns it. If not, another `d__50` is allocated and initialized with the same state as in the constructor, and that is returned instead. (The exact code here is in IL only, but is similar to [this C# version](https://github.com/dotnet/runtime/blob/5c6d1b3f7b63a3150ce6c737aeb4af03b3cce621/src/libraries/System.Linq/src/System/Linq/Iterator.cs#L64-L77) in Linq). + +So as with the "empty collection" case we see here that there are two different possible enumerator objects of the same type that may get enumerated—the one initially created, or a clone. But unlike those cases above the first allocation dominates the second, so the flow relationship is a bit more nuanced. + +In our benchmark the "initially created" path is always taken, so the "clone" path is cold, and because of that the constructor it invokes is not inlined, so the allocation done on the clone path escapes. That is likely ok because the escape happens "before" the clone path merges back into the "just created" path, and so we should still be able to prove the just created object does not escape. + +But, not so fast. The `MoveNext()` for `d__50` is also not inlined, despite being "hot," and that causes the original allocation to escape as well. Why? The primary reason is that the `MoveNext()` contains a `try/finally` and the JIT is currently unable to inline methods with EH. But the method is also quite complicated, so it's not clear if the JIT would inline it even if there was no EH, and it is not so easy to see what else might trip things up after all that, but it looks like the internal IEenumerable's `GetEnumerator()` is invoked conditionally and we'd at least need to be able to prove we knew that this always happens on the first "dominating" call to `MoveNext()`. (Footnote 17) + +In my initial examples the JIT also did not inline `get_Current` despite it being very simple: GDV claims it does not have a good class guess. It turns out it was just an artifact of my particular test case—no element passed the filter so `MoveNext` returned false on the first call, so `get_Current` was never called and so had no class profile. (Footnote 18) + +The upshot is that our inner loop still has two interface calls: +```asm +G_M8017_IG05: ;; offset=0x00AF + mov rcx, gword ptr [rbp-0x28] + mov r11, 0x7FF8046202B0 ; code for System.Collections.IEnumerator:MoveNext():ubyte:this + call [r11]System.Collections.IEnumerator:MoveNext():ubyte:this + test eax, eax + je SHORT G_M8017_IG07 + ;; size=21 bbWeight=1 PerfScore 5.50 +G_M8017_IG06: ;; offset=0x00C4 + mov rcx, gword ptr [rbp-0x28] + mov r11, 0x7FF8046202B8 ; code for System.Collections.Generic.IEnumerator`1[int]:get_Current():int:this + call [r11]System.Collections.Generic.IEnumerator`1[int]:get_Current():int:this + add esi, eax + jmp SHORT G_M8017_IG05 +``` +(todo: take a closer look at `MoveNext`). 
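+
+For reference, the "return this or a clone" logic described earlier in this section has roughly the following shape. This is a simplified sketch modeled on the linked `Iterator.cs` pattern; the type and member names here are invented and most of the state machine is omitted.
+```C#
+using System;
+using System.Collections;
+using System.Collections.Generic;
+
+// Sketch of the iterator base pattern referenced above. The key point for
+// escape analysis: GetEnumerator can hand back either this object or a
+// freshly allocated clone, so two objects of the same type can serve as the
+// enumerator, with the clone path expected to be cold.
+abstract class IteratorSketch<T> : IEnumerable<T>, IEnumerator<T>
+{
+    private readonly int _threadId = Environment.CurrentManagedThreadId;
+    protected int _state;
+    protected T _current = default!;
+
+    protected abstract IteratorSketch<T> Clone();
+    public abstract bool MoveNext();
+
+    public IEnumerator<T> GetEnumerator()
+    {
+        // Reuse this instance if it is fresh and on the creating thread;
+        // otherwise allocate a clone (the cold path discussed above).
+        IteratorSketch<T> enumerator =
+            _state == 0 && _threadId == Environment.CurrentManagedThreadId
+                ? this
+                : Clone();
+        enumerator._state = 1;
+        return enumerator;
+    }
+
+    public T Current => _current;
+    object IEnumerator.Current => _current!;
+    public virtual void Dispose() { }
+    void IEnumerator.Reset() => throw new NotSupportedException();
+    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
+}
+```
+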
+ +### Linq Where + +Now let's briefly look at the `Linq` case ([link to source](https://github.com/dotnet/runtime/blob/4951e38fc5882ddf7df68c956acb6f5586ef47dd/src/libraries/System.Linq/src/System/Linq/Where.cs#L12)). Here things are similar but different. `Where` orchestrates handling for arrays and lists via special helper objects, and for other collection types relies on something like our `NaiveWhere`. + +For arrays the `ArrayWhereIterator` accesses the array directly rather than creating and using the standard array enumerator; this is presumably the reason for less allocation and faster iteration. The `MoveNext` does not contain EH and gets inlined, but again for the same reason as above `Current` is cold and so not inlined. (Footnote 19) + +The JIT also does GDV on the delegate invocation, so the inner "filter" loop is not too bad (given lack of promotion): +```asm + +G_M3835_IG11: ;; offset=0x012B + mov edx, edi + mov r13d, dword ptr [r14+4*rdx+0x10] + mov edi, dword ptr [rbx+0x0C] + lea edx, [rdi+0x01] + mov dword ptr [rbx+0x0C], edx + mov r12, gword ptr [rbx+0x20] + mov rdx, 0x7FF8050EAD90 ; code for ArrayDeabstraction+<>c:b__52_0(int):ubyte:this + cmp qword ptr [r12+0x18], rdx + jne G_M3835_IG17 + test r13d, r13d + jg G_M3835_IG18 + ;; size=50 bbWeight=505.85 PerfScore 6702.47 +G_M3835_IG12: ;; offset=0x015D + cmp r15d, edi + ja SHORT G_M3835_IG11 +``` +However we don't clone for the delegate and in the outer loop there is still a ton of enumerator state manipulation. + +Cloning for the delegate is blocked by two issues: +* The JIT loses track of an exact type and so the code in loop cloning doesn't recognize the delegate GDV. This was fixed by [dotnet/runtime#110675](https://github.com/dotnet/runtime/pull/110675). But this suggests that object stack allocation (or similar) might also be able to do type refinement, if we fleshed out the points-to side a bit more. Currently this phase tracks which locals can refer to which other locals, and since there is a single-def local associated with each allocation, this also tracks which locals can refer to each allocation. But the analysis doesn't track what other things a local might refer to (yet). +* With that fixed, the analysis now doesn't realize the delegate field in the enumerator is a loop invariant. I think this is just a limitation of the simplistic invariance analysis done by cloning -- the delegate local is assigned in the loop, but the value being assigned is invariant. I also need to look at this case more closely, because the `Linq` sources often hoist the fetch of the delegate (`_predicate`) out of the enumeration loop, but not always (a simple local experiment to "fix" all of `Linq` to consistently hoist these invariants in the C# sources didn't yield interesting benchmark wins). + +## Footnotes + +(1) We generally don't want enumeration to modify `o`, so the enumeration state `e` (which generally is mutable) must be kept somewhere else. Most of this note is really trying to figure out how to arrange it so that in the important cases "somewhere else" is CPU registers and not stack slots or heap fields. + +Note if the collection `o` is empty then it is possible that `e` not be mutable (`e.MoveNext()` can simply return `false`). This will come up again later. + +(2) These benchmarks represent a "worst-case" assessment of enumeration overhead. In more realistic code there would be more work done per enumerated `t` and so the relative overhead would be smaller. But there is still an allocation cost. 
+ +(3) PGO can never prove that these are the only types, but it can say with reasonable confidence that other types are unlikely. + +(4) Arrays were generic types before there were true generic types, so they were handled specially. + +(5) The `GetEnumerator` method on arrays can be a bit too complex to be inlined normally, so we added `AggressiveInlining`. + +(6) It is interesting to think how the JIT might be able to prove this on its own without some oracular knowledge, but it seems hard. And arrays are important enough collections to warrant some built-in special casing. + +(7) You will note a general theme here between cloning and GDV. Both duplicate code and ask questions and then optimize one of the copies when the answers to those questions are favorable. If you look at it properly, inlining is like this, too. + +(8) In runtime systems that support de-opt (eg Java) one can get by with only creating the optimized code variant, and bail out via a complex de-opt transition to unoptimized code if things ever go wrong. This is not something we've seriously considered for >NET, though you can see the appeal. Note if the optimized version stack allocates objects, these must also be undone. + +(9) In the current implementation the JIT is given some upfront hints about which allocations and variables to track, so that the number of possible pseudo-variables is known up front. So in the example above the JIT knows that the `e` and `new E()` are conditional escape candidates. While this seems expedient, it also does not seem to be strictly necessary. + +(10) If it turns out that there are frequently multiple conditionally escaping objects (under different conditions) we will of course reconsider. + +(11) Note we don't know what type of enumerator gets produced if the upper GDV test fails, since we don't know what kind of collection we're enumerating. It could well produce an enumerator of type `E`. So it's possible that those original lower GDVs could succeed even if the upper one fails. + +(12) There is a general benefit to reducing heap allocation, but the code paths for a heap allocated object and a stack allocated but not promoted object will often be quite similar (save perhaps for lack of write barriers), so there may not be a clear performance benefit for the method itself. But if we somehow knew this allocation site was responsible for a lot of heap allocation, we might reconsider. + +(13) Given the `_version` checks it seems plausible the enumerator could safely cache the `_items` array; it would increase the enumerator size a bit but make enumeration a bit faster, so may or may not be a good tradeoff. (sub-footnote: I tried this and did not see any benefit, though it could be once the loop is optimized further this might pay off.) + +(14) Interestingly there are [vestiges of an empty-list static enumerator](https://github.com/dotnet/runtime/blame/5ae9467d6ab7b31bf1cda35ca15ca5a2d21d3046/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/List.cs#L1187), so presumably at one point there was a static enumerator instance like we hae for the empty array case, but we realized that empty list enumeration could share the empty array instance. See [dotnet/runtime#82499)](https://github.com/dotnet/runtime/pull/82499). + +(15) Aside from explicit looping, the JIT is able to transform recursive tail calls into loops, though this can only be done when the live part of the method state is the arguments to the call. 
Note thanks to inlining and GDV this can kick in in cases where the method is part of a recursive cycle of calls. + +(16) We could (and likely should) try to handle both, but we need to be thoughtful about how much cloning this entails. A naive version would end up creating 4 copies of the inner foreach, likely we don't want to have the version where we have the fast outer path and slow inner path, unless saving a heap allocation ends up being super-valuable. + +(17) Inlining methods with EH is on the .NET 10 roadmap, and perhaps this is one case to use as motivation. + +More broadly, we should consider having the inliner aggressively inline if a call site can be passed a locally allocated object. Here it is tricky because the connection between object allocation and call is indirect, and we don't do data flow during inlining. Now, if we did escape analysis before inlining, we could perhaps target the methods that cause the allocations to escape, but we'd then need to update / redo the analysis for each inline so we can handle the transitive cases... (this is where some kind of up-front hinting would be very useful, so the JIT would know which inlines could end up preventing escape). + +(18) Overall this seems like something we could improve on. The `MoveNext` call site must dominate the `get_Current` site, so surely the class profile for the one likely can be applied to the other. But we have limited abilities to make such inferences today, as all this is done very early in the jit. And (as here) if `get_Current` does not postdominate `MoveNext` we can't be sure that the conditional logic in between is doing some computation that might alter the profile. But when the dominating profile is monomorphic then projecting it to the dominated site is probably a safe bet. + +(19) Note this is an instance where inlining a "cold" method is beneficial, as it touches the same state as a hot path. We have seen these before in other contexts. \ No newline at end of file diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h index a5472a3be8de06..310e46ef3a95bb 100644 --- a/src/coreclr/jit/compiler.h +++ b/src/coreclr/jit/compiler.h @@ -667,6 +667,9 @@ class LclVarDsc unsigned char lvRedefinedInEmbeddedStatement : 1; // Local has redefinitions inside embedded statements that // disqualify it from local copy prop. + + unsigned char lvIsEnumerator : 1; // Local is assigned exact class where : IEnumerable via GDV + private: unsigned char lvIsNeverNegative : 1; // The local is known to be never negative @@ -3711,6 +3714,8 @@ class Compiler return gtNewStmt(exprClone, stmt->GetDebugInfo()); } + Statement* gtLatestStatement(Statement* stmt1, Statement* stmt2); + // Internal helper for cloning a call GenTreeCall* gtCloneExprCallHelper(GenTreeCall* call); @@ -4604,6 +4609,26 @@ class Compiler unsigned impStkSize; // Size of the full stack + // Enumerator de-abstraction support + // + typedef JitHashTable, unsigned> NodeToUnsignedMap; + + // Map is only set on the root instance. 
+ // + NodeToUnsignedMap* impEnumeratorGdvLocalMap = nullptr; + bool hasImpEnumeratorGdvLocalMap() { return impInlineRoot()->impEnumeratorGdvLocalMap != nullptr; } + NodeToUnsignedMap* getImpEnumeratorGdvLocalMap() + { + Compiler* compiler = impInlineRoot(); + if (compiler->impEnumeratorGdvLocalMap == nullptr) + { + CompAllocator alloc(compiler->getAllocator(CMK_Generic)); + compiler->impEnumeratorGdvLocalMap = new (alloc) NodeToUnsignedMap(alloc); + } + + return compiler->impEnumeratorGdvLocalMap; + } + #define SMALL_STACK_SIZE 16 // number of elements in impSmallStack struct SavedStack // used to save/restore stack contents. @@ -11674,8 +11699,6 @@ class Compiler return compRoot->m_fieldSeqStore; } - typedef JitHashTable, unsigned> NodeToUnsignedMap; - NodeToUnsignedMap* m_memorySsaMap[MemoryKindCount]; // In some cases, we want to assign intermediate SSA #'s to memory states, and know what nodes create those memory diff --git a/src/coreclr/jit/fgdiagnostic.cpp b/src/coreclr/jit/fgdiagnostic.cpp index 811511217cca9d..1794c6b57954c7 100644 --- a/src/coreclr/jit/fgdiagnostic.cpp +++ b/src/coreclr/jit/fgdiagnostic.cpp @@ -2860,7 +2860,14 @@ bool BBPredsChecker::CheckEHFinallyRet(BasicBlock* blockPred, BasicBlock* block) } } - assert(found && "BBJ_EHFINALLYRET predecessor of block that doesn't follow a BBJ_CALLFINALLY!"); + if (!found) + { + JITDUMP(FMT_BB " is successor of finallyret " FMT_BB " but prev block is not a callfinally to " FMT_BB + " (search range was [" FMT_BB "..." FMT_BB "]\n", + block->bbNum, blockPred->bbNum, finBeg->bbNum, firstBlock->bbNum, lastBlock->bbNum); + assert(!"BBJ_EHFINALLYRET predecessor of block that doesn't follow a BBJ_CALLFINALLY!"); + } + return found; } diff --git a/src/coreclr/jit/fgehopt.cpp b/src/coreclr/jit/fgehopt.cpp index 940077dc6d2d62..7ea85224a863bc 100644 --- a/src/coreclr/jit/fgehopt.cpp +++ b/src/coreclr/jit/fgehopt.cpp @@ -2552,6 +2552,7 @@ BasicBlock* Compiler::fgCloneTryRegion(BasicBlock* tryEntry, CloneTryInfo& info, auto addBlockToClone = [=, &blocks, &visited, &numberOfBlocksToClone](BasicBlock* block, const char* msg) { if (!BitVecOps::TryAddElemD(traits, visited, block->bbID)) { + JITDUMP("[already seen] %s block " FMT_BB "\n", msg, block->bbNum); return false; } @@ -2781,6 +2782,8 @@ BasicBlock* Compiler::fgCloneTryRegion(BasicBlock* tryEntry, CloneTryInfo& info, // all the EH indicies at or above insertBeforeIndex will shift, // and the EH table may reallocate. // + // This addition may also fail, if the table would become too large... 
+ // EHblkDsc* const clonedOutermostEbd = fgTryAddEHTableEntries(insertBeforeIndex, regionCount, /* deferAdding */ deferCloning); @@ -3000,8 +3003,6 @@ BasicBlock* Compiler::fgCloneTryRegion(BasicBlock* tryEntry, CloneTryInfo& info, cloneTryIndex += indexShift; } - EHblkDsc* const originalEbd = ehGetDsc(originalTryIndex); - EHblkDsc* const clonedEbd = ehGetDsc(cloneTryIndex); newBlock->setTryIndex(cloneTryIndex); updateBlockReferences(cloneTryIndex); } @@ -3016,8 +3017,6 @@ BasicBlock* Compiler::fgCloneTryRegion(BasicBlock* tryEntry, CloneTryInfo& info, cloneHndIndex += indexShift; } - EHblkDsc* const originalEbd = ehGetDsc(originalHndIndex); - EHblkDsc* const clonedEbd = ehGetDsc(cloneHndIndex); newBlock->setHndIndex(cloneHndIndex); updateBlockReferences(cloneHndIndex); diff --git a/src/coreclr/jit/gentree.cpp b/src/coreclr/jit/gentree.cpp index 7c86705f00dd3e..e8ff8998ba388d 100644 --- a/src/coreclr/jit/gentree.cpp +++ b/src/coreclr/jit/gentree.cpp @@ -32668,3 +32668,42 @@ void GenTree::SetMorphed(Compiler* compiler, bool doChildren /* = false */) } } #endif + +//------------------------------------------------------------------------ +// gtLatestStmt: determine which of two statements happens later +// +// Arguments: +// stmt1 - first statement to consider +// stmt2 - second statement to consider +// +// Returns: +// either stmt1 or stmt2, whichever happens later in the block +// +Statement* Compiler::gtLatestStatement(Statement* stmt1, Statement* stmt2) +{ + if (stmt1 == stmt2) + { + return stmt1; + } + + Statement* cursor1 = stmt1->GetNextStmt(); + Statement* cursor2 = stmt2->GetNextStmt(); + + while (true) + { + if ((cursor1 == stmt2) || (cursor2 == nullptr)) + { + return stmt2; + } + + if ((cursor2 == stmt1) || (cursor1 == nullptr)) + { + return stmt1; + } + + cursor1 = cursor1->GetNextStmt(); + cursor2 = cursor2->GetNextStmt(); + } + + assert(!"could not determine latest stmt"); +} diff --git a/src/coreclr/jit/importer.cpp b/src/coreclr/jit/importer.cpp index 0435ecf0b514e4..1f06372b3a14ff 100644 --- a/src/coreclr/jit/importer.cpp +++ b/src/coreclr/jit/importer.cpp @@ -3445,6 +3445,21 @@ void Compiler::impImportAndPushBox(CORINFO_RESOLVED_TOKEN* pResolvedToken) { GenTreeCall* const call = exprToBox->AsRetExpr()->gtInlineCandidate->AsCall(); + // If the call was flagged for possible enumerator cloning, flag the allocation as well. + // + if (compIsForInlining() && hasImpEnumeratorGdvLocalMap()) + { + NodeToUnsignedMap* const map = getImpEnumeratorGdvLocalMap(); + unsigned enumeratorLcl = BAD_VAR_NUM; + GenTreeCall* const call = impInlineInfo->iciCall; + if (map->Lookup(call, &enumeratorLcl)) + { + JITDUMP("Flagging [%06u] for enumerator cloning via V%02u\n", dspTreeID(op1), enumeratorLcl); + map->Remove(call); + map->Set(op1, enumeratorLcl); + } + } + if (call->ShouldHaveRetBufArg()) { JITDUMP("Must insert newobj stmts for box before call [%06u]\n", dspTreeID(call)); @@ -6740,6 +6755,26 @@ void Compiler::impImportBlockCode(BasicBlock* block) { lvaUpdateClass(lclNum, op1, tiRetVal.GetClassHandleForObjRef()); } + + // If we see a local being assigned the result of a GDV-inlineable + // IEnumerable.GetEnumerator, keep track of both the local and the call. + // + if (op1->OperIs(GT_RET_EXPR)) + { + JITDUMP(".... 
checking for GDV of IEnumerable...\n"); + + GenTreeCall* const call = op1->AsRetExpr()->gtInlineCandidate; + NamedIntrinsic const ni = lookupNamedIntrinsic(call->gtCallMethHnd); + + if (ni == NI_System_Collections_Generic_IEnumerable_GetEnumerator) + { + JITDUMP("V%02u value is GDV of IEnumerable.GetEnumerator\n", lclNum); + lvaTable[lclNum].lvIsEnumerator = true; + JITDUMP("Flagging [%06u] for enumerator cloning via V%02u\n", dspTreeID(call), lclNum); + getImpEnumeratorGdvLocalMap()->Set(call, lclNum); + Metrics.EnumeratorGDV++; + } + } } /* Filter out simple stores to itself */ @@ -8838,6 +8873,23 @@ void Compiler::impImportBlockCode(BasicBlock* block) op1->gtFlags |= GTF_ALLOCOBJ_EMPTY_STATIC; } + // If the method being imported is an inlinee, and the original call was flagged + // for possible enumerator cloning, flag the allocation as well. + // + if (compIsForInlining() && hasImpEnumeratorGdvLocalMap()) + { + NodeToUnsignedMap* const map = getImpEnumeratorGdvLocalMap(); + unsigned enumeratorLcl = BAD_VAR_NUM; + GenTreeCall* const call = impInlineInfo->iciCall; + if (map->Lookup(call, &enumeratorLcl)) + { + JITDUMP("Flagging [%06u] for enumerator cloning via V%02u\n", dspTreeID(op1), + enumeratorLcl); + map->Remove(call); + map->Set(op1, enumeratorLcl); + } + } + // Remember that this basic block contains 'new' of an object block->SetFlags(BBF_HAS_NEWOBJ); optMethodFlags |= OMF_HAS_NEWOBJ; diff --git a/src/coreclr/jit/importercalls.cpp b/src/coreclr/jit/importercalls.cpp index 61021c74f4ea37..bd9da5da7a6dcd 100644 --- a/src/coreclr/jit/importercalls.cpp +++ b/src/coreclr/jit/importercalls.cpp @@ -10553,6 +10553,13 @@ NamedIntrinsic Compiler::lookupNamedIntrinsic(CORINFO_METHOD_HANDLE method) result = NI_System_Collections_Generic_EqualityComparer_get_Default; } } + else if (strcmp(className, "IEnumerable`1") == 0) + { + if (strcmp(methodName, "GetEnumerator") == 0) + { + result = NI_System_Collections_Generic_IEnumerable_GetEnumerator; + } + } } else if (strcmp(namespaceName, "Numerics") == 0) { diff --git a/src/coreclr/jit/indirectcalltransformer.cpp b/src/coreclr/jit/indirectcalltransformer.cpp index f2149f62546b87..02042f68c0b65f 100644 --- a/src/coreclr/jit/indirectcalltransformer.cpp +++ b/src/coreclr/jit/indirectcalltransformer.cpp @@ -907,6 +907,22 @@ class IndirectCallTransformer GenTreeCall* call = compiler->gtCloneCandidateCall(origCall); call->gtArgs.GetThisArg()->SetEarlyNode(compiler->gtNewLclvNode(thisTemp, TYP_REF)); + // If the original call was flagged as one that might inspire enumerator de-abstraction + // cloning, move the flag to the devirtualized call. 
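+        // (The importer will later re-key this entry from the call to the enumerator allocation
+        // when the callee is inlined.)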
+ // + if (compiler->hasImpEnumeratorGdvLocalMap()) + { + Compiler::NodeToUnsignedMap* const map = compiler->getImpEnumeratorGdvLocalMap(); + unsigned enumeratorLcl = BAD_VAR_NUM; + if (map->Lookup(origCall, &enumeratorLcl)) + { + JITDUMP("Flagging [%06u] for enumerator cloning via V%02u\n", compiler->dspTreeID(call), + enumeratorLcl); + map->Remove(origCall); + map->Set(call, enumeratorLcl); + } + } + INDEBUG(call->SetIsGuarded()); JITDUMP("Direct call [%06u] in block " FMT_BB "\n", compiler->dspTreeID(call), block->bbNum); diff --git a/src/coreclr/jit/inductionvariableopts.cpp b/src/coreclr/jit/inductionvariableopts.cpp index c96bf8036efe31..77f15adfb07a12 100644 --- a/src/coreclr/jit/inductionvariableopts.cpp +++ b/src/coreclr/jit/inductionvariableopts.cpp @@ -1392,7 +1392,6 @@ class StrengthReductionContext BasicBlock* FindPostUseUpdateInsertionPoint(ArrayStack* cursors, BasicBlock* backEdgeDominator, Statement** afterStmt); - Statement* LatestStatement(Statement* stmt1, Statement* stmt2); bool InsertionPointPostDominatesUses(BasicBlock* insertionPoint, ArrayStack* cursors); bool StressProfitability() @@ -2615,7 +2614,7 @@ BasicBlock* StrengthReductionContext::FindPostUseUpdateInsertionPoint(ArrayStack } else { - latestStmt = LatestStatement(latestStmt, cursor.Stmt); + latestStmt = m_comp->gtLatestStatement(latestStmt, cursor.Stmt); } } @@ -2632,44 +2631,6 @@ BasicBlock* StrengthReductionContext::FindPostUseUpdateInsertionPoint(ArrayStack return nullptr; } -//------------------------------------------------------------------------ -// LatestStatement: Given two statements in the same basic block, return the -// latter of the two. -// -// Parameters: -// stmt1 - First statement -// stmt2 - Second statement -// -// Returns: -// Latter of the statements. -// -Statement* StrengthReductionContext::LatestStatement(Statement* stmt1, Statement* stmt2) -{ - if (stmt1 == stmt2) - { - return stmt1; - } - - Statement* cursor1 = stmt1->GetNextStmt(); - Statement* cursor2 = stmt2->GetNextStmt(); - - while (true) - { - if ((cursor1 == stmt2) || (cursor2 == nullptr)) - { - return stmt2; - } - - if ((cursor2 == stmt1) || (cursor1 == nullptr)) - { - return stmt1; - } - - cursor1 = cursor1->GetNextStmt(); - cursor2 = cursor2->GetNextStmt(); - } -} - //------------------------------------------------------------------------ // InsertionPointPostDominatesUses: Check if a basic block post-dominates all // locations specified by the cursors. 
diff --git a/src/coreclr/jit/jitconfigvalues.h b/src/coreclr/jit/jitconfigvalues.h index ab8c6495027003..2d622cc33b7f5b 100644 --- a/src/coreclr/jit/jitconfigvalues.h +++ b/src/coreclr/jit/jitconfigvalues.h @@ -670,6 +670,8 @@ CONFIG_STRING(JitObjectStackAllocationRange, "JitObjectStackAllocationRange") RELEASE_CONFIG_INTEGER(JitObjectStackAllocation, "JitObjectStackAllocation", 1) RELEASE_CONFIG_INTEGER(JitObjectStackAllocationRefClass, "JitObjectStackAllocationRefClass", 1) RELEASE_CONFIG_INTEGER(JitObjectStackAllocationBoxedValueClass, "JitObjectStackAllocationBoxedValueClass", 1) +RELEASE_CONFIG_INTEGER(JitObjectStackAllocationConditionalEscape, "JitObjectStackAllocationConditionalEscape", 1) +CONFIG_STRING(JitObjectStackAllocationConditionalEscapeRange, "JitObjectStackAllocationConditionalEscapeRange") RELEASE_CONFIG_INTEGER(JitObjectStackAllocationArray, "JitObjectStackAllocationArray", 1) RELEASE_CONFIG_INTEGER(JitObjectStackAllocationSize, "JitObjectStackAllocationSize", 528) diff --git a/src/coreclr/jit/jitmetadatalist.h b/src/coreclr/jit/jitmetadatalist.h index 1918c5ade2b3ed..215ea82ead5606 100644 --- a/src/coreclr/jit/jitmetadatalist.h +++ b/src/coreclr/jit/jitmetadatalist.h @@ -63,6 +63,7 @@ JITMETADATAMETRIC(ClassGDV, int, 0) JITMETADATAMETRIC(MethodGDV, int, 0) JITMETADATAMETRIC(MultiGuessGDV, int, 0) JITMETADATAMETRIC(ChainedGDV, int, 0) +JITMETADATAMETRIC(EnumeratorGDV, int, 0) JITMETADATAMETRIC(InlinerBranchFold, int, 0) JITMETADATAMETRIC(InlineAttempt, int, 0) JITMETADATAMETRIC(InlineCount, int, 0) @@ -92,6 +93,8 @@ JITMETADATAMETRIC(LocalAssertionCount, int, 0) JITMETADATAMETRIC(LocalAssertionOverflow, int, 0) JITMETADATAMETRIC(MorphTrackedLocals, int, 0) JITMETADATAMETRIC(MorphLocals, int, 0) +JITMETADATAMETRIC(EnumeratorGDVProvisionalNoEscape, int, 0) +JITMETADATAMETRIC(EnumeratorGDVCanCloneToEnsureNoEscape, int, 0) #undef JITMETADATA #undef JITMETADATAINFO diff --git a/src/coreclr/jit/namedintrinsiclist.h b/src/coreclr/jit/namedintrinsiclist.h index b4de4cc920e9ad..baf060aaec2e77 100644 --- a/src/coreclr/jit/namedintrinsiclist.h +++ b/src/coreclr/jit/namedintrinsiclist.h @@ -248,6 +248,7 @@ enum NamedIntrinsic : unsigned short // NI_System_SZArrayHelper_GetEnumerator, NI_System_Array_T_GetEnumerator, + NI_System_Collections_Generic_IEnumerable_GetEnumerator, }; #endif // _NAMEDINTRINSICLIST_H_ diff --git a/src/coreclr/jit/objectalloc.cpp b/src/coreclr/jit/objectalloc.cpp index 7736f0b9aaf161..8c84d47cf75536 100644 --- a/src/coreclr/jit/objectalloc.cpp +++ b/src/coreclr/jit/objectalloc.cpp @@ -16,6 +16,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX #endif #include "gentree.h" +#include "jitstd/algorithm.h" //------------------------------------------------------------------------ // DoPhase: Run analysis (if object stack allocation is enabled) and then @@ -65,6 +66,10 @@ PhaseStatus ObjectAllocator::DoPhase() { JITDUMP("enabled, analyzing...\n"); DoAnalysis(); + + // If we have to clone some code to guarantee non-escape, do it now. 
+ // + CloneAndSpecialize(); } else { @@ -149,8 +154,18 @@ void ObjectAllocator::DoAnalysis() if (comp->lvaCount > 0) { - m_EscapingPointers = BitVecOps::MakeEmpty(&m_bitVecTraits); - m_ConnGraphAdjacencyMatrix = new (comp->getAllocator(CMK_ObjectAllocator)) BitSetShortLongRep[comp->lvaCount]; + m_EscapingPointers = BitVecOps::MakeEmpty(&m_bitVecTraits); + m_ConnGraphAdjacencyMatrix = + new (comp->getAllocator(CMK_ObjectAllocator)) BitSetShortLongRep[comp->lvaCount + m_maxPseudoLocals + 1]; + + // If we are doing conditional escape analysis, we also need to compute dominance. + // + if (CanHavePseudoLocals()) + { + assert(comp->m_dfsTree != nullptr); + assert(comp->m_domTree == nullptr); + comp->m_domTree = FlowGraphDominatorTree::Build(comp->m_dfsTree); + } MarkEscapingVarsAndBuildConnGraph(); ComputeEscapingNodes(&m_bitVecTraits, m_EscapingPointers); @@ -180,6 +195,8 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() class BuildConnGraphVisitor final : public GenTreeVisitor { ObjectAllocator* m_allocator; + BasicBlock* m_block; + Statement* m_stmt; public: enum @@ -189,9 +206,11 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() ComputeStack = true, }; - BuildConnGraphVisitor(ObjectAllocator* allocator) + BuildConnGraphVisitor(ObjectAllocator* allocator, BasicBlock* block, Statement* stmt) : GenTreeVisitor(allocator->comp) , m_allocator(allocator) + , m_block(block) + , m_stmt(stmt) { } @@ -212,12 +231,12 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() if (tree->OperIsLocalStore()) { lclEscapes = false; + m_allocator->CheckForGuardedAllocationOrCopy(m_block, m_stmt, use, lclNum); } else if (tree->OperIs(GT_LCL_VAR) && tree->TypeIs(TYP_REF, TYP_BYREF, TYP_I_IMPL)) { assert(tree == m_ancestors.Top()); - - if (!m_allocator->CanLclVarEscapeViaParentStack(&m_ancestors, lclNum)) + if (!m_allocator->CanLclVarEscapeViaParentStack(&m_ancestors, lclNum, m_block)) { lclEscapes = false; } @@ -231,6 +250,12 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() } m_allocator->MarkLclVarAsEscaping(lclNum); } + else if (!tree->OperIsLocalStore()) + { + // Note uses of variables of interest to conditional escape analysis. + // + m_allocator->RecordAppearance(lclNum, m_block, m_stmt, use); + } return Compiler::fgWalkResult::WALK_CONTINUE; } @@ -257,6 +282,11 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() } } + for (unsigned int p = 0; p < m_maxPseudoLocals; p++) + { + m_ConnGraphAdjacencyMatrix[p + comp->lvaCount] = BitVecOps::MakeEmpty(&m_bitVecTraits); + } + // We should have computed the DFS tree already. 
// FlowGraphDfsTree* const dfs = comp->m_dfsTree; @@ -267,10 +297,9 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() for (unsigned i = dfs->GetPostOrderCount(); i != 0; i--) { BasicBlock* const block = dfs->GetPostOrder(i - 1); - for (Statement* const stmt : block->Statements()) { - BuildConnGraphVisitor buildConnGraphVisitor(this); + BuildConnGraphVisitor buildConnGraphVisitor(this, block, stmt); buildConnGraphVisitor.WalkTree(stmt->GetRootNodePointer(), nullptr); } } @@ -286,35 +315,65 @@ void ObjectAllocator::MarkEscapingVarsAndBuildConnGraph() void ObjectAllocator::ComputeEscapingNodes(BitVecTraits* bitVecTraits, BitVec& escapingNodes) { - BitSetShortLongRep escapingNodesToProcess = BitVecOps::MakeCopy(bitVecTraits, escapingNodes); - BitSetShortLongRep newEscapingNodes = BitVecOps::UninitVal(); + BitVec escapingNodesToProcess = BitVecOps::MakeCopy(bitVecTraits, escapingNodes); - unsigned int lclNum; - - bool doOneMoreIteration = true; - while (doOneMoreIteration) - { - BitVecOps::Iter iterator(bitVecTraits, escapingNodesToProcess); - doOneMoreIteration = false; + auto computeClosure = [&]() { + JITDUMP("\nComputing escape closure\n\n"); + bool doOneMoreIteration = true; + BitSetShortLongRep newEscapingNodes = BitVecOps::UninitVal(); + unsigned int lclNum; - while (iterator.NextElem(&lclNum)) + while (doOneMoreIteration) { - if (m_ConnGraphAdjacencyMatrix[lclNum] != nullptr) + BitVecOps::Iter iterator(bitVecTraits, escapingNodesToProcess); + doOneMoreIteration = false; + + while (iterator.NextElem(&lclNum)) { - doOneMoreIteration = true; - - // newEscapingNodes = adjacentNodes[lclNum] - BitVecOps::Assign(bitVecTraits, newEscapingNodes, m_ConnGraphAdjacencyMatrix[lclNum]); - // newEscapingNodes = newEscapingNodes \ escapingNodes - BitVecOps::DiffD(bitVecTraits, newEscapingNodes, escapingNodes); - // escapingNodesToProcess = escapingNodesToProcess U newEscapingNodes - BitVecOps::UnionD(bitVecTraits, escapingNodesToProcess, newEscapingNodes); - // escapingNodes = escapingNodes U newEscapingNodes - BitVecOps::UnionD(bitVecTraits, escapingNodes, newEscapingNodes); - // escapingNodesToProcess = escapingNodesToProcess \ { lclNum } - BitVecOps::RemoveElemD(bitVecTraits, escapingNodesToProcess, lclNum); + if (m_ConnGraphAdjacencyMatrix[lclNum] != nullptr) + { + doOneMoreIteration = true; + + // newEscapingNodes = adjacentNodes[lclNum] + BitVecOps::Assign(bitVecTraits, newEscapingNodes, m_ConnGraphAdjacencyMatrix[lclNum]); + // newEscapingNodes = newEscapingNodes \ escapingNodes + BitVecOps::DiffD(bitVecTraits, newEscapingNodes, escapingNodes); + // escapingNodesToProcess = escapingNodesToProcess U newEscapingNodes + BitVecOps::UnionD(bitVecTraits, escapingNodesToProcess, newEscapingNodes); + // escapingNodes = escapingNodes U newEscapingNodes + BitVecOps::UnionD(bitVecTraits, escapingNodes, newEscapingNodes); + // escapingNodesToProcess = escapingNodesToProcess \ { lclNum } + BitVecOps::RemoveElemD(bitVecTraits, escapingNodesToProcess, lclNum); + +#ifdef DEBUG + // Print the first witness to new escapes. + // + if (!BitVecOps::IsEmpty(bitVecTraits, newEscapingNodes)) + { + BitVecOps::Iter iterator(bitVecTraits, newEscapingNodes); + unsigned int newLclNum; + while (iterator.NextElem(&newLclNum)) + { + // Note P's never are sources of assignments... + JITDUMP("%c%02u causes V%02u to escape\n", lclNum >= comp->lvaCount ? 
'P' : 'V', lclNum, + newLclNum); + } + } +#endif + } } } + }; + + computeClosure(); + + if (m_numPseudoLocals > 0) + { + bool newEscapes = AnalyzeIfCloningCanPreventEscape(bitVecTraits, escapingNodes, escapingNodesToProcess); + if (newEscapes) + { + computeClosure(); + } } } @@ -375,6 +434,31 @@ void ObjectAllocator::ComputeStackObjectPointers(BitVecTraits* bitVecTraits) } } } + + JITDUMP("Definitely stack-pointing locals:"); + { + BitVecOps::Iter iter(bitVecTraits, m_DefinitelyStackPointingPointers); + unsigned lclNum = 0; + while (iter.NextElem(&lclNum)) + { + JITDUMP(" V%02u", lclNum); + } + JITDUMP("\n"); + } + + JITDUMP("Possibly stack-pointing locals:"); + { + BitVecOps::Iter iter(bitVecTraits, m_PossiblyStackPointingPointers); + unsigned lclNum = 0; + while (iter.NextElem(&lclNum)) + { + if (!BitVecOps::IsMember(bitVecTraits, m_DefinitelyStackPointingPointers, lclNum)) + { + JITDUMP(" V%02u", lclNum); + } + } + JITDUMP("\n"); + } } //------------------------------------------------------------------------ @@ -598,6 +682,27 @@ bool ObjectAllocator::MorphAllocObjNodes() MarkLclVarAsDefinitelyStackPointing(lclNum); MarkLclVarAsPossiblyStackPointing(lclNum); + // If this was conditionally escaping enumerator, establish a connection between this local + // and the enumeratorLocal we already allocated. This is needed because we do early rewriting + // in the conditional clone. + // + unsigned pseudoLocal = BAD_VAR_NUM; + if (m_EnumeratorLocalToPseudoLocalMap.TryGetValue(lclNum, &pseudoLocal)) + { + CloneInfo* info = nullptr; + if (m_CloneMap.Lookup(pseudoLocal, &info)) + { + if (info->m_willClone) + { + JITDUMP("Connecting stack allocated enumerator V%02u to its address var V%02u\n", + lclNum, info->m_enumeratorLocal); + AddConnGraphEdge(lclNum, info->m_enumeratorLocal); + MarkLclVarAsPossiblyStackPointing(info->m_enumeratorLocal); + MarkLclVarAsDefinitelyStackPointing(info->m_enumeratorLocal); + } + } + } + if (bashCall) { stmt->GetRootNode()->gtBashToNOP(); @@ -880,13 +985,17 @@ unsigned int ObjectAllocator::MorphAllocObjNodeIntoStackAlloc( // Just lop off the JTRUE, the rest can clean up later // (eg may have side effects) // - controllingStmt->SetRootNode(controllingNode->AsOp()->gtOp1); + controllingStmt->SetRootNode(controllingNode->gtGetOp1()); // We must remove the empty static block now too. assert(removedBlock->bbRefs == 0); assert(removedBlock->KindIs(BBJ_ALWAYS)); comp->fgRemoveBlock(removedBlock, /* unreachable */ true); } + else + { + JITDUMP("ALLOCOBJ [%06u] is not part of an empty static\n", comp->dspTreeID(allocObj)); + } return lclNum; } @@ -898,6 +1007,7 @@ unsigned int ObjectAllocator::MorphAllocObjNodeIntoStackAlloc( // Arguments: // parentStack - Parent stack of the current visit // lclNum - Local variable number +// block - basic block holding the trees // // Return Value: // true if the local can escape via the parent stack; false otherwise @@ -906,13 +1016,17 @@ unsigned int ObjectAllocator::MorphAllocObjNodeIntoStackAlloc( // The method currently treats all locals assigned to a field as escaping. // The can potentially be tracked by special field edges in the connection graph. 
// -bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack* parentStack, unsigned int lclNum) +bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack* parentStack, + unsigned int lclNum, + BasicBlock* block) { assert(parentStack != nullptr); int parentIndex = 1; bool keepChecking = true; bool canLclVarEscapeViaParentStack = true; + bool isCopy = true; + bool isEnumeratorLocal = comp->lvaGetDesc(lclNum)->lvIsEnumerator; while (keepChecking) { @@ -922,12 +1036,15 @@ bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack* parent break; } + GenTree* tree = parentStack->Top(parentIndex - 1); + GenTree* parent = parentStack->Top(parentIndex); + bool wasCopy = isCopy; + + isCopy = false; canLclVarEscapeViaParentStack = true; - GenTree* tree = parentStack->Top(parentIndex - 1); - GenTree* parent = parentStack->Top(parentIndex); keepChecking = false; - JITDUMP("... L%02u ... checking [%06u]\n", lclNum, comp->dspTreeID(parent)); + JITDUMP("... V%02u ... checking [%06u]\n", lclNum, comp->dspTreeID(parent)); switch (parent->OperGet()) { @@ -941,6 +1058,14 @@ bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack* parent AddConnGraphEdge(dstLclNum, srcLclNum); canLclVarEscapeViaParentStack = false; + + // If the source of this store is an enumerator local, + // then the dest also becomes an enumerator local. + // + if (isCopy) + { + CheckForEnumeratorUse(srcLclNum, dstLclNum); + } } break; @@ -967,13 +1092,18 @@ bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack* parent case GT_QMARK: case GT_ADD: case GT_SUB: - case GT_BOX: case GT_FIELD_ADDR: // Check whether the local escapes via its grandparent. ++parentIndex; keepChecking = true; break; + case GT_BOX: + isCopy = wasCopy; + ++parentIndex; + keepChecking = true; + break; + case GT_INDEX_ADDR: if (tree == parent->AsIndexAddr()->Index()) { @@ -1002,13 +1132,30 @@ bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack* parent case GT_CALL: { - GenTreeCall* asCall = parent->AsCall(); + GenTreeCall* const asCall = parent->AsCall(); if (asCall->IsHelperCall()) { canLclVarEscapeViaParentStack = !Compiler::s_helperCallProperties.IsNoEscape(comp->eeGetHelperNum(asCall->gtCallMethHnd)); } + + // Note there is nothing special here about the parent being a call. We could move all this processing + // up to the caller and handle any sort of tree that could lead to escapes this way. + // + // We have it this way because we currently don't expect to see other escaping references on failed + // GDV paths, though perhaps with multi-guess GDV that might change? + // + // In particular it might be tempting to look for references in uncatchable BBJ_THROWs or similar + // and enable a kind of "partial escape analysis" where we copy from stack to heap just before the + // point of escape. We would have to add pseudo-locals for this like we do for GDV, but we wouldn't + // necessarily need to do the predicate analysis or cloning. + // + if (isEnumeratorLocal) + { + JITDUMP("Enumerator V%02u passed to call...\n", lclNum); + canLclVarEscapeViaParentStack = !CheckForGuardedUse(block, parent, lclNum); + } break; } @@ -1335,3 +1482,1699 @@ void ObjectAllocator::RewriteUses() } } } + +//------------------------------------------------------------------------------ +// AnalyzeIfCloningCanPreventEscape: see if by cloning we can ensure an object +// does not escape. 
+// +// Arguments: +// bitVecTraits - Bit vector traits +// escapingNodes [in] - current set of escaping nodes +// escapingNodesToProcess [in/out] - set of newly escaping nodes +// +// Returns: +// true, if there are any newly escaping nodes +// +// Notes: +// During our analysis we have may have noted conditionally escaping objects +// and var references and connected them to a pseduolocal, along with information +// about how we could clone blocks to ensure that the object could be stack allocated. +// +// The current assumption is that these nodes do not escape, but to ensure +// that we must be able to clone the code and remove the potential for escape +// +// So, we verify for each case that we can clone; if not, mark we the pseudolocal +// as escaping. If any pseudlocal now escapes, we return true so that the main +// analysis can update its closure. +// +// We may choose not to clone a candiate for several reasons: +// * too much EH already in the method, or some other reason cloning is infeasible +// * two different candidates have overlapping clone regions +// * the cost/benefit analysis does not look favorable +// +bool ObjectAllocator::AnalyzeIfCloningCanPreventEscape(BitVecTraits* bitVecTraits, + BitVec& escapingNodes, + BitVec& escapingNodesToProcess) +{ + bool newEscapes = false; + + for (unsigned p = 0; p < m_numPseudoLocals; p++) + { + unsigned const pseudoLocal = p + comp->lvaCount; + bool canClone = true; + CloneInfo* info = nullptr; + + const bool hasInfo = m_CloneMap.Lookup(pseudoLocal, &info); + if (!hasInfo) + { + // We never found any conditional allocation attached to this pseudoLocal. + // + JITDUMP(" P%02u has no guard info\n", pseudoLocal); + canClone = false; + break; + } + + unsigned lclNum = BAD_VAR_NUM; + BitVec pseudoLocalAdjacencies = m_ConnGraphAdjacencyMatrix[pseudoLocal]; + + // If we found an allocation but didn't find any conditionally escaping uses, then cloning is of no use + // + if (BitVecOps::IsEmpty(bitVecTraits, pseudoLocalAdjacencies)) + { + JITDUMP(" No conditionally escaping uses under P%02u, so no reason to clone\n", pseudoLocal); + canClone = false; + break; + } + + // Check if each conditionally escaping local escapes on its own; if so cloning is of no use + // + BitVecOps::Iter iterator(bitVecTraits, pseudoLocalAdjacencies); + while (canClone && iterator.NextElem(&lclNum)) + { + if (BitVecOps::IsMember(bitVecTraits, escapingNodes, lclNum)) + { + // The enumerator var or a related var had escaping uses somewhere in the method, + // not under a failing GDV or any GDV. + // + JITDUMP(" V%02u escapes independently of P%02u\n", lclNum, pseudoLocal); + canClone = false; + break; + } + } + + // Also check the alloc temps + // + if (info->m_allocTemps != nullptr) + { + for (unsigned v : *(info->m_allocTemps)) + { + if (BitVecOps::IsMember(bitVecTraits, escapingNodes, v)) + { + JITDUMP(" alloc temp V%02u escapes independently of P%02u\n", v, pseudoLocal) + canClone = false; + break; + } + } + } + + if (canClone) + { + // We may be able to clone and specialize the enumerator uses to ensure + // that the allocated enumerator does not escape. 
+ // + JITDUMP(" P%02u is guarding the escape of V%02u\n", pseudoLocal, lclNum); + if (info->m_allocTemps != nullptr) + { + JITDUMP(" along with "); + for (unsigned v : *(info->m_allocTemps)) + { + JITDUMP("V%02u ", v); + } + JITDUMP("\n"); + } + JITDUMP(" they escape only when V%02u.Type NE %s\n", info->m_local, comp->eeGetClassName(info->m_type)); + JITDUMP(" V%02u + secondary vars have %u appearances\n", info->m_local, info->m_appearanceCount); + + comp->Metrics.EnumeratorGDVProvisionalNoEscape++; + } + + // See if cloning is actually viable. + // + if (canClone) + { + canClone = CanClone(info); + } + + // See if this clone would overlap with othr clones + // + if (canClone) + { + canClone = !CloneOverlaps(info); + } + + // See if cloning is a good idea. + // + if (canClone) + { + canClone = ShouldClone(info); + } + + // All checks are done + // + if (canClone) + { + JITDUMP("\n*** Can prevent escape under P%02u via cloning ***\n", pseudoLocal); + info->m_willClone = true; + m_regionsToClone++; + } + else + { + JITDUMP(" not optimizing, so will mark P%02u as escaping\n", pseudoLocal); + MarkLclVarAsEscaping(pseudoLocal); + BitVecOps::AddElemD(bitVecTraits, escapingNodesToProcess, pseudoLocal); + newEscapes = true; + } + } + + return newEscapes; +} + +//------------------------------------------------------------------------------ +// NewPseudoLocal: return index of a new pseudo local. +// +// Returns: +// index to use, or BAD_VAR_NUM if no more indices are available. +// +unsigned ObjectAllocator::NewPseudoLocal() +{ + unsigned result = BAD_VAR_NUM; + if (m_numPseudoLocals < m_maxPseudoLocals) + { + result = comp->lvaCount + m_numPseudoLocals; + m_numPseudoLocals++; + } + return result; +} + +//------------------------------------------------------------------------------ +// IsGuarded: does evaluation of `tree` depend on a GDV type test? +// +// Arguments: +// block -- block containing tree +// tree -- tree in question +// info -- [out] closest enclosing guard info, if method returns true +// testOutcome -- outcome of GDV test (true ==> type matches the specific one in the test) +// +// Returns: +// true if tree is only evaluated under a GDV check, where the check result is testOutcome. +// Returns closest such check (in terms of dominators), along with info on the check. +// +// Notes: +// * There may be other checks higher in the tree, consider returning all +// checks rather than just the closest. +// * Possibly try and recognize user-written type checks...? +// * Consider bailing out at some point, for deep dominator trees. +// * R2R/NAOT cases where compile time and runtime handles diverge +// +bool ObjectAllocator::IsGuarded(BasicBlock* block, GenTree* tree, GuardInfo* info, bool testOutcome) +{ + JITDUMP("Checking if [%06u] in " FMT_BB " executes under a %s GDV type test\n", comp->dspTreeID(tree), block->bbNum, + testOutcome ? "successful" : "failing"); + + // Walk up the dominator tree.... + // + for (BasicBlock* idomBlock = block->bbIDom; idomBlock != nullptr; idomBlock = idomBlock->bbIDom) + { + JITDUMP("... examining dominator " FMT_BB "\n", idomBlock->bbNum); + if (!idomBlock->KindIs(BBJ_COND)) + { + JITDUMP("... not BBJ_COND\n"); + continue; + } + + // We require that one idomBlock successor *not* dominate. + // (otherwise idomBlock could be the top of a diamond where both outcomes reach block). 
+ // + const bool trueSuccessorDominates = comp->m_domTree->Dominates(idomBlock->GetTrueTarget(), block); + const bool falseSuccessorDominates = comp->m_domTree->Dominates(idomBlock->GetFalseTarget(), block); + + if (trueSuccessorDominates && falseSuccessorDominates) + { + JITDUMP("... both successors dominate?\n"); + continue; + } + + if (!(trueSuccessorDominates || falseSuccessorDominates)) + { + JITDUMP("... neither successor dominates\n"); + continue; + } + + GenTree* const guardingRelop = IsGuard(idomBlock, info); + if (guardingRelop == nullptr) + { + continue; + } + + // We found a dominating GDV test, see if the condition is the one we're looking for. + // + if (testOutcome) + { + bool const isReachableOnGDVSuccess = (trueSuccessorDominates && guardingRelop->OperIs(GT_EQ)) || + (falseSuccessorDominates && guardingRelop->OperIs(GT_NE)); + if (isReachableOnGDVSuccess) + { + info->m_block = idomBlock; + return true; + } + JITDUMP("... guarded by failing GDV\n"); + continue; + } + else + { + bool const isReachableOnGDVFailure = (trueSuccessorDominates && guardingRelop->OperIs(GT_NE)) || + (falseSuccessorDominates && guardingRelop->OperIs(GT_EQ)); + if (isReachableOnGDVFailure) + { + info->m_block = idomBlock; + return true; + } + JITDUMP("... guarded by successful GDV\n"); + continue; + } + } + + JITDUMP("... no more doms\n"); + return false; +} + +//------------------------------------------------------------------------------ +// IsGuard: does block look like a GDV guard +// +// Arguments: +// block -- block in question +// info -- [out] guard info +// +// Returns: +// Comparison tree if this is a guard, or nullptr +// +GenTree* ObjectAllocator::IsGuard(BasicBlock* block, GuardInfo* info) +{ + if (!block->KindIs(BBJ_COND)) + { + JITDUMP("... not BBJ_COND\n"); + return nullptr; + } + + Statement* const stmt = block->lastStmt(); + if (stmt == nullptr) + { + JITDUMP("... no stmt\n"); + return nullptr; + } + + GenTree* const jumpTree = stmt->GetRootNode(); + if (!jumpTree->OperIs(GT_JTRUE)) + { + JITDUMP("... no JTRUE\n"); + return nullptr; + } + + GenTree* const tree = jumpTree->gtGetOp1(); + + // Must be an equality or inequality + // + if (!tree->OperIs(GT_NE, GT_EQ)) + { + JITDUMP("... not NE/EQ\n"); + return nullptr; + } + + GenTree* op1 = tree->gtGetOp1(); + GenTree* op2 = tree->gtGetOp2(); + bool swapped = false; + + // gdv creates NE(hnd, indir(locl)) + // but let's not rely on that + // + if (!op1->OperIs(GT_IND)) + { + swapped = true; + std::swap(op1, op2); + } + + if (!op1->OperIs(GT_IND)) + { + JITDUMP("... no JTRUE(cmp(ind, ...))\n"); + return nullptr; + } + + if (!op1->TypeIs(TYP_I_IMPL)) + { + JITDUMP("... no JTRUE(cmp(ind:int, ...))\n"); + return nullptr; + } + + GenTree* const addr = op1->AsIndir()->Addr(); + + if (!addr->TypeIs(TYP_REF)) + { + JITDUMP("... no JTRUE(cmp(ind:int(*:ref), ...))\n"); + return nullptr; + } + + if (!addr->OperIs(GT_LCL_VAR)) + { + JITDUMP("... no JTRUE(cmp(ind:int(lcl:ref), ...))\n"); + return nullptr; + } + + if (!op2->IsIconHandle(GTF_ICON_CLASS_HDL)) + { + JITDUMP("... no JTRUE(cmp(ind:int(lcl:ref), clsHnd))\n"); + return nullptr; + } + + // Passed the checks... fill in the info. + // + info->m_local = addr->AsLclVar()->GetLclNum(); + bool isNonNull = false; + bool isExact = false; + info->m_type = (CORINFO_CLASS_HANDLE)op2->AsIntCon()->gtIconVal; + + JITDUMP("... 
" FMT_BB " is guard for V%02u\n", block->bbNum, info->m_local); + return tree; +} + +//------------------------------------------------------------------------------ +// CheckForGuardedUse - see if this use of lclNum is controlled by a failing +// GDV check that we're tracking as part of conditional escape. +// +// Arguments: +// block - block containing tree +// tree - parent tree using the local +// lclNum - local being read +// +// Returns: +// true if this use is a conditionally escaping use. +// +bool ObjectAllocator::CheckForGuardedUse(BasicBlock* block, GenTree* tree, unsigned lclNum) +{ + // Find pseudo local... + // + unsigned pseudoLocal = BAD_VAR_NUM; + if (!m_EnumeratorLocalToPseudoLocalMap.TryGetValue(lclNum, &pseudoLocal)) + { + JITDUMP("... no pseudo local?\n"); + return false; + } + + // Verify that this call is made under a **failing** GDV test + // + GuardInfo info; + if (!IsGuarded(block, tree, &info, /* testOutcome */ false)) + { + JITDUMP("... not guarded?\n"); + return false; + } + + // Find the GDV guard for the pseudo-local + // + CloneInfo* pseudoGuardInfo; + if (!m_CloneMap.Lookup(pseudoLocal, &pseudoGuardInfo)) + { + JITDUMP("... under non-gdv guard?\n"); + return false; + } + + // Verify this appearance is under the same guard + // + if ((info.m_local == lclNum) && (pseudoGuardInfo->m_local == lclNum) && (info.m_type == pseudoGuardInfo->m_type)) + { + // If so, track this as an assignment pseudoLocal = ... + // + // Later if we don't clone and split off the failing GDV paths, + // we will mark pseudoLocal as escaped, and that will lead + // to lclNum escaping as well. + // + JITDUMP("... under GDV; tracking via pseudo-local P%02u\n", pseudoLocal); + AddConnGraphEdge(pseudoLocal, lclNum); + return true; + } + + JITDUMP("... under different guard?\n"); + return false; +} + +//------------------------------------------------------------------------------ +// CheckForGuardedAllocationOrCopy - see if this store is guarded by GDV and is +// the store of a newly allocated object, or a copy of a local known to hold +// references to such an object +// +// Arguments: +// block - block containing tree +// stmt - statement containing tree +// use - pointer to local store node +// lclNum - local being stored to +// +// Notes: +// Also keeps track of temporaries that convey the new object to its +// final GDV "destination" local. +// +void ObjectAllocator::CheckForGuardedAllocationOrCopy(BasicBlock* block, + Statement* stmt, + GenTree** use, + unsigned lclNum) +{ + GenTree* const tree = *use; + assert(tree->OperIsLocalStore()); + + if (!CanHavePseudoLocals()) + { + // We didn't flag any allocations of interest during importation, + // so there is nothing to do here. + return; + } + + // If we did flag allocations, we should have built dominators + // (needed by calls to IsGuarded, below). + // + assert(comp->m_domTree != nullptr); + GenTree* const data = tree->AsLclVarCommon()->Data(); + + // This may be a conditional allocation. We will try and track the conditions + // under which it escapes. GDVs are a nice subset because the conditions are stylized, + // and the condition analysis seems tractable, and we expect the un-inlined failed + // GDVs to be the main causes of escapes. + // + if (data->OperIs(GT_ALLOCOBJ)) + { + // See if this store is made under a successful GDV test. + // + GuardInfo controllingGDV; + if (IsGuarded(block, tree, &controllingGDV, /* testOutcome */ true)) + { + // This is the allocation of concrete class under GDV. 
+ // + // Find the local that will ultimately represent its uses (we have kept track of + // this during importation and GDV expansion). Note it is usually *not* lclNum. + // + // We will keep special track of all accesses to this local. + // + Compiler::NodeToUnsignedMap* const map = comp->getImpEnumeratorGdvLocalMap(); + unsigned enumeratorLocal = BAD_VAR_NUM; + if (map->Lookup(data, &enumeratorLocal)) + { + // If it turns out we can't stack allocate this new object even if it does not escape, + // then don't bother setting up tracking. Note length here is just set to a nominal + // value that won't cause failure. We will do the real length check later if we decide to allocate. + // + CORINFO_CLASS_HANDLE clsHnd = data->AsAllocObj()->gtAllocObjClsHnd; + const char* reason = nullptr; + unsigned size = 0; + unsigned length = TARGET_POINTER_SIZE; + if (CanAllocateLclVarOnStack(enumeratorLocal, clsHnd, OAT_NEWOBJ, length, &size, &reason, + /* preliminaryCheck */ true)) + { + // We are going to conditionally track accesses to the enumerator local via a pseudo local. + // + const unsigned pseudoLocal = NewPseudoLocal(); + assert(pseudoLocal != BAD_VAR_NUM); + bool added = m_EnumeratorLocalToPseudoLocalMap.AddOrUpdate(enumeratorLocal, pseudoLocal); + + if (!added) + { + // Seems like we have multiple GDVs that can define this local. + // Carry on for now, but later we may see these collide + // and end up not cloning any of them. + // + // Since we are walking in RPO we may also be able to see that + // they are properly disjoint and things will work out just fine. + // + JITDUMP("Looks like enumerator var re-use (multiple defining GDVs)\n"); + } + + // We will query this info if we see CALL(enumeratorLocal) + // during subsequent analysis, to verify that access is + // under the same guard conditions. + // + CompAllocator alloc(comp->getAllocator(CMK_ObjectAllocator)); + CloneInfo* info = new (alloc) CloneInfo(); + info->m_local = enumeratorLocal; + info->m_type = clsHnd; + info->m_pseudoLocal = pseudoLocal; + info->m_appearanceMap = new (alloc) EnumeratorVarMap(alloc); + info->m_allocBlock = block; + info->m_allocStmt = stmt; + info->m_allocTree = data; + info->m_domBlock = controllingGDV.m_block; + m_CloneMap.Set(pseudoLocal, info); + + JITDUMP("Enumerator allocation [%06u]: will track accesses to V%02u guarded by type %s via P%02u\n", + comp->dspTreeID(data), enumeratorLocal, comp->eeGetClassName(clsHnd), pseudoLocal); + + // If this is not a direct assignment to the enumerator var we also need to + // track the temps that will appear in between. Later we will rewrite these + // to a fresh set of temps. + // + if (lclNum != enumeratorLocal) + { + CheckForEnumeratorUse(enumeratorLocal, lclNum); + RecordAppearance(lclNum, block, stmt, use); + } + } + else + { + JITDUMP( + "Enumerator allocation [%06u]: enumerator type %s cannot be stack allocated, so not tracking enumerator local V%02u\n", + comp->dspTreeID(data), comp->eeGetClassName(clsHnd), enumeratorLocal); + } + } + else + { + // This allocation is not currently of interest + // + JITDUMP("Allocation [%06u] was not flagged for conditional escape tracking\n", comp->dspTreeID(data)); + } + } + else + { + // This allocation was not done under a GDV guard + // + JITDUMP("Allocation [%06u] is not under a GDV guard\n", comp->dspTreeID(data)); + } + } + else if (data->OperIs(GT_LCL_VAR, GT_BOX)) + { + // See if we are copying from one enumerator-referring local to another. + // This need not be under any guard. 
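+        // If it is such a copy, the destination local is tracked under the same pseudo-local as the source.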
+ // + unsigned srcLclNum = BAD_VAR_NUM; + if (data->OperIs(GT_BOX)) + { + srcLclNum = data->AsBox()->BoxOp()->AsLclVarCommon()->GetLclNum(); + } + else + { + srcLclNum = data->AsLclVarCommon()->GetLclNum(); + } + + const bool isEnumeratorUse = CheckForEnumeratorUse(srcLclNum, lclNum); + + if (isEnumeratorUse) + { + RecordAppearance(lclNum, block, stmt, use); + } + } +} + +//------------------------------------------------------------------------------ +// CheckForEnumeratorUse - see if this is a use of an enumerator var that is +// copied to another var. +// +// Arguments: +// lclNum - source of the copy +// dstLclNum - destination of the copy +// +// Returns: +// true if this is a copy +// +bool ObjectAllocator::CheckForEnumeratorUse(unsigned lclNum, unsigned dstLclNum) +{ + unsigned pseudoLocal = BAD_VAR_NUM; + + if (m_EnumeratorLocalToPseudoLocalMap.TryGetValue(dstLclNum, &pseudoLocal)) + { + // We already knew dstLclNum was a potential copy + // + return true; + } + + if (!m_EnumeratorLocalToPseudoLocalMap.TryGetValue(lclNum, &pseudoLocal)) + { + // lclNum is not a potential source + // + return false; + } + + CloneInfo* info = nullptr; + if (!m_CloneMap.Lookup(pseudoLocal, &info)) + { + // We aren't interested in locals under this guard + // + return false; + } + + // lclNum is an interesting enumerator var, so now so is dstLclNum. + // + const bool added = m_EnumeratorLocalToPseudoLocalMap.AddOrUpdate(dstLclNum, pseudoLocal); + + assert(added); + + JITDUMP("Enumerator allocation: will also track accesses to V%02u via P%02u\n", dstLclNum, pseudoLocal); + + if (info->m_allocTemps == nullptr) + { + CompAllocator alloc(comp->getAllocator(CMK_ObjectAllocator)); + info->m_allocTemps = new (alloc) jitstd::vector(alloc); + } + + info->m_allocTemps->push_back(dstLclNum); + + return true; +} + +//------------------------------------------------------------------------------ +// RecordAppearance: note info about an enumerator var appearance +// +// Arguments: +// lclNum -- enumerator var +// block -- block holding the stmt +// stmt -- stmt holding the use +// use -- local var reference +// +void ObjectAllocator::RecordAppearance(unsigned lclNum, BasicBlock* block, Statement* stmt, GenTree** use) +{ + unsigned pseudoLocal = BAD_VAR_NUM; + if (!m_EnumeratorLocalToPseudoLocalMap.TryGetValue(lclNum, &pseudoLocal)) + { + return; + } + + CloneInfo* info; + if (!m_CloneMap.Lookup(pseudoLocal, &info)) + { + return; + } + + GenTree* const tree = *use; + bool const isDef = tree->OperIsLocalStore(); + + JITDUMP("Found enumerator V%02u %s at [%06u]\n", lclNum, isDef ? 
"def" : "use", comp->dspTreeID(tree)); + + CompAllocator alloc(comp->getAllocator(CMK_ObjectAllocator)); + EnumeratorVarMap* const varMap = info->m_appearanceMap; + assert(varMap != nullptr); + + EnumeratorVar* v = nullptr; + if (!varMap->Lookup(lclNum, &v)) + { + v = new (alloc) EnumeratorVar(); + v->m_appearances = new (alloc) jitstd::vector(alloc); + varMap->Set(lclNum, v); + } + + EnumeratorVarAppearance* const a = new (alloc) EnumeratorVarAppearance(block, stmt, use, lclNum, isDef); + + if (isDef) + { + if (v->m_def != nullptr) + { + if (!v->m_hasMultipleDefs) + { + JITDUMP("Enumerator V%02u has multiple defs\n"); + v->m_hasMultipleDefs = true; + } + } + else + { + v->m_def = a; + } + + if (stmt == info->m_allocStmt) + { + v->m_isInitialAllocTemp = true; + } + } + + v->m_appearances->push_back(a); + + info->m_appearanceCount++; +} + +//------------------------------------------------------------------------------ +// CloneOverlaps: check if this cloning would overlap with other clonings +// +// Arguments: +// info -- [in, out] info about the cloning opportunity +// +// Returns: +// true if cloning overlaps with some other cloning +// +bool ObjectAllocator::CloneOverlaps(CloneInfo* info) +{ + bool overlaps = false; + BitVecTraits traits(comp->compBasicBlockID, comp); + + for (CloneInfo* const c : CloneMap::ValueIteration(&m_CloneMap)) + { + if (c == info) + { + continue; + } + + if (!c->m_willClone) + { + continue; + } + + if (BitVecOps::IsEmptyIntersection(&traits, info->m_blocks, c->m_blocks)) + { + continue; + } + + JITDUMP("Cloned blocks for P%02u overlap with those for P%02u; unable to clone\n", info->m_pseudoLocal, + c->m_pseudoLocal); + + overlaps = true; + break; + } + + return overlaps; +} + +//------------------------------------------------------------------------------ +// ShouldClone: check if this cloning looks profitable +// +// Arguments: +// info -- info about a cloning opportunity +// +// Returns: +// true if cloning looks profitable +// +bool ObjectAllocator::ShouldClone(CloneInfo* info) +{ + // For now, use the same cloning size limit we use for loop cloning + // + int const sizeConfig = JitConfig.JitCloneLoopsSizeLimit(); + unsigned const sizeLimit = (sizeConfig >= 0) ? (unsigned)sizeConfig : UINT_MAX; + unsigned size = 0; + bool shouldClone = true; + + for (BasicBlock* const block : *info->m_blocksToClone) + { + // Note this overstates the size a bit since we'll resolve GDVs + // in the clone and the original... + // + unsigned const slack = sizeLimit - size; + unsigned blockSize = 0; + if (block->ComplexityExceeds(comp, slack, &blockSize)) + { + JITDUMP("Rejecting P%02u cloning: exceeds size limit %u\n", info->m_pseudoLocal, sizeLimit); + return false; + } + size += blockSize; + } + + // TODO: some kind of profile check... + // + JITDUMP("Accepting P%02u cloning: size %u does not exceed size limit %u\n", info->m_pseudoLocal, size, sizeLimit); + return true; +} + +//------------------------------------------------------------------------------ +// CanClone: check that cloning can remove all escaping references and +// is a reasonble thing to do +// +// Arguments: +// info -- [in, out] info about the cloning opportunity +// +// Returns: +// true if cloning can remove all escaping references +// +bool ObjectAllocator::CanClone(CloneInfo* info) +{ + // If we already analyzed this case, return what we learned before. 
+ // + if (!info->m_checkedCanClone) + { + CheckCanClone(info); + info->m_checkedCanClone = true; + } + + return info->m_canClone; +} + +//------------------------------------------------------------------------------ +// CheckCanClone: check that cloning can remove all escaping references and +// is a reasonble thing to do +// +// Arguments: +// info -- [in, out] info about the cloning opportunity +// +// Returns: +// true if cloning can remove all escaping references +// +bool ObjectAllocator::CheckCanClone(CloneInfo* info) +{ + assert(!info->m_checkedCanClone); + JITDUMP("** Seeing if we can clone to guarantee non-escape under V%02u\n", info->m_local); + BasicBlock* const allocBlock = info->m_allocBlock; + + // The allocation site must not be in a loop (stack allocation limitation) + // + // Note if we can prove non-escape but can't stack allocate, we might be + // able to light up an "object is thread exclusive" mode and effectively + // promote the fields anyways. + // + if (allocBlock->HasFlag(BBF_BACKWARD_JUMP)) + { + JITDUMP("allocation block " FMT_BB " is (possibly) in a loop\n", allocBlock->bbNum); + return false; + } + + // Heuristic: if the allocation block was not profiled, or is hit less than 10% of + // the time this method is called, bail... (note we should really look at weight of uses, + // not the weight of the allocation). + // + if (!allocBlock->hasProfileWeight()) + { + JITDUMP("alloc block " FMT_BB " was not profiled\n", allocBlock->bbNum); + return false; + } + + const weight_t thinProfile = 0.1; + const weight_t allocWeight = allocBlock->getBBWeight(comp); + + if (allocWeight < thinProfile) + { + JITDUMP("alloc block " FMT_BB " relative profile weight too low: " FMT_WT " < " FMT_WT "\n", allocBlock->bbNum, + allocWeight, thinProfile); + return false; + } + + // We should know the full set of locals that can refer to the newly allocated + // object in the blocks we intend to clone (and beyond). Verify those have the + // expected def-use behavior. + // + // The goal of all this is to try and ensure that if we rewrite all the T,V,U appeances + // to new locals in the cloned code we get proper behavior. + // + // There is one distingushed local V (info.m_local) that holds the result of the + // initial GDV and is the local tested in subsequent GDVs. It must have a single def. + // + // The other locals are either temps T that refer to the allocated object between + // allocation site and the def of V, or temps U that are copies of V in the code + // dominated by the def of V. + // + // For the T's, there is a "first" T0 that is the destination of the ALLOCOBJ + // and a "last" Tv that is the source of the assignment to V, and some intermediates Ti. + // + // T0 & all Ti should be single def, all uses dominated by their defs. + // All Ti def sources should be another T. + // + // All Ti appearances should be postdominated by the def of V, but we don't explicitly + // check this -- instead we have verified the path from the alloc block to the def of V. + // + // Tv may have two defs (the other thanks to the "empty static" pattern). If so, we can + // ignore the def not dominated by the allocation block, since we are selectively + // cloning along a path from this block down to the def of V. + // + // Tv's use should be at the def of V. + // + // For the U's: all Ui appearances should be dominated by the def of V; all Ui defs + // should have another Ui or V as their source. 
(We should also verfy each Ui is + // single-def and the def dominates all the Ui uses, but this may not work out...?) + // + // Also we do not expect any Ti or Ui use to be a GDV guard. U's typically arise from + // inlining under a successful GDV of V, and should have exact types, resolving any + // potential GDV in the inlinee. + // + // First, find the one and only def of V (aka `defBlock`). + // + EnumeratorVar* v = nullptr; + bool const foundV = info->m_appearanceMap->Lookup(info->m_local, &v); + + if (!foundV) + { + JITDUMP("Unexpected: no appearance info for V%02u\n", info->m_local); + return false; + } + + if (v->m_hasMultipleDefs) + { + JITDUMP("Unexpected: V%02u multiply defined\n", info->m_local); + return false; + } + + BasicBlock* const defBlock = v->m_def->m_block; + Statement* const defStmt = v->m_def->m_stmt; + + JITDUMP("V%02u has single def in " FMT_BB " at [%06u]\n", info->m_local, defBlock->bbNum, + comp->dspTreeID(defStmt->GetRootNode())); + + // We expect to be able to follow all paths from alloc block to defBlock + // without reaching "beyond" defBlock. + // + // Because we are inside a GDV hammock, we do not expect to see a normal + // flow path from allocBlock that can bypass defBlock. For now we trust + // that is the case. + // + // toVisit: blocks we need to visit to determine extent of cloning + // visited: block we will need to clone + // toVisitTryEntry: subset of above that are try entries. + // + CompAllocator alloc(comp->getAllocator(CMK_ObjectAllocator)); + ArrayStack toVisit(alloc); + jitstd::vector* visited = new (alloc) jitstd::vector(alloc); + jitstd::vector* toVisitTryEntry = new (alloc) jitstd::vector(alloc); + + BitVecTraits traits(comp->compBasicBlockID, comp); + BitVec visitedBlocks(BitVecOps::MakeEmpty(&traits)); + toVisit.Push(allocBlock); + + // todo -- some kind of runaway size limit + // + while (toVisit.Height() > 0) + { + BasicBlock* const visitBlock = toVisit.Pop(); + if (!BitVecOps::TryAddElemD(&traits, visitedBlocks, visitBlock->bbID)) + { + continue; + } + + if (visitBlock != allocBlock) + { + visited->push_back(visitBlock); + } + + // We expect this stretch of blocks to all be in the same EH region. + // + if (!BasicBlock::sameEHRegion(allocBlock, visitBlock)) + { + JITDUMP("Unexpected: new EH region at " FMT_BB "\n", visitBlock->bbNum); + return false; + } + + if (visitBlock == defBlock) + { + continue; + } + + JITDUMP("walking through " FMT_BB "\n", visitBlock->bbNum); + + for (BasicBlock* const succ : visitBlock->Succs()) + { + if (BitVecOps::IsMember(&traits, visitedBlocks, succ->bbID)) + { + continue; + } + toVisit.Push(succ); + } + } + + JITDUMP("def block " FMT_BB " post-dominates allocation site " FMT_BB "\n", defBlock->bbNum, allocBlock->bbNum); + + // -1 here since we won't need to clone the allocation site itself. + // + JITDUMP("allocation side cloning: %u blocks\n", visited->size() - 1); + + // The allocationBlock should not dominate the defBlock. + // (if it does, optimization does not require cloning, as + // there should be only one reaching def...) + // + if (comp->m_domTree->Dominates(allocBlock, defBlock)) + { + JITDUMP("Unexpected, alloc site " FMT_BB " dominates def block " FMT_BB "\n", allocBlock->bbNum, + defBlock->bbNum); + + return false; + } + // Classify the other local appearances + // as Ts (allocTemps) or Us (useTemps), and look for guard appearances. 
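+    // (V itself is skipped here; temps are classified by comparing their appearance blocks
+    // against defBlock using postorder numbers.)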
+ // + for (unsigned lclNum : EnumeratorVarMap::KeyIteration(info->m_appearanceMap)) + { + EnumeratorVar* ev = nullptr; + info->m_appearanceMap->Lookup(lclNum, &ev); + assert(ev != nullptr); + + for (EnumeratorVarAppearance* const a : *(ev->m_appearances)) + { + // If this is a use, see if it's part of a GDV guard. + // + if (!a->m_isDef && (a->m_stmt == a->m_block->lastStmt())) + { + GuardInfo tempInfo; + GenTree* const guardingRelop = IsGuard(a->m_block, &tempInfo); + a->m_isGuard = (guardingRelop != nullptr); + } + + // If this is V, we're done + // + if (lclNum == info->m_local) + { + continue; + } + + // Since defBlock postdominates all Ts and dominates all Us, + // we can use dfs numbers to sort which temps are Ts and which are Us. + // + if (defBlock->bbPostorderNum < a->m_block->bbPostorderNum) + { + ev->m_isAllocTemp = true; + } + else if (defBlock->bbPostorderNum == a->m_block->bbPostorderNum) + { + if (defStmt == a->m_stmt) + { + ev->m_isAllocTemp = true; + ev->m_isFinalAllocTemp = true; + } + else if (comp->gtLatestStatement(defStmt, a->m_stmt) == a->m_stmt) + { + ev->m_isUseTemp = true; + } + else + { + ev->m_isAllocTemp = true; + } + } + else + { + ev->m_isUseTemp = true; + } + + // We don't expect to see allocTemps or useTemps in guards + // + if (a->m_isGuard) + { + JITDUMP("Unexpected: %s temp V%02u is GDV guard at " FMT_BB "\n", ev->m_isAllocTemp ? "alloc" : "use", + a->m_lclNum, a->m_block->bbNum); + return false; + } + } + + // We don't expect a temp to be both an alloc temp and a use temp. + // + if (ev->m_isAllocTemp && ev->m_isUseTemp) + { + JITDUMP("Unexpected: temp V%02u has appearances both before and after main var assignment in " FMT_BB "\n", + lclNum, defBlock->bbNum); + + return false; + } + + if (ev->m_isAllocTemp || ev->m_isUseTemp) + { + JITDUMP("Temp V%02u is a %s temp", lclNum, ev->m_isAllocTemp ? "alloc" : "use"); + if (ev->m_isInitialAllocTemp) + { + JITDUMP(" [initial]"); + } + if (ev->m_isFinalAllocTemp) + { + JITDUMP(" [final]"); + } + JITDUMP("\n"); + } + } + + // The allocation block must dominate all T appearances, save for the final T use. + // + for (unsigned lclNum : EnumeratorVarMap::KeyIteration(info->m_appearanceMap)) + { + EnumeratorVar* ev = nullptr; + info->m_appearanceMap->Lookup(lclNum, &ev); + assert(ev != nullptr); + + if (!ev->m_isAllocTemp) + { + continue; + } + + for (EnumeratorVarAppearance* const a : *(ev->m_appearances)) + { + if (ev->m_isFinalAllocTemp && (a->m_block == defBlock) && !a->m_isDef) + { + continue; + } + + if (!comp->m_domTree->Dominates(allocBlock, a->m_block)) + { + JITDUMP("Alloc temp V%02u %s in " FMT_BB " not dominated by alloc " FMT_BB "\n", a->m_lclNum, + a->m_isDef ? "def" : "use", a->m_block->bbNum, allocBlock->bbNum); + return false; + } + } + } + + // The definition block must dominate all the V and U uses. + // + // Also collect up all blocks with U appearances; these will help + // us figure out the full extent of cloning. + // + for (unsigned lclNum : EnumeratorVarMap::KeyIteration(info->m_appearanceMap)) + { + EnumeratorVar* ev = nullptr; + info->m_appearanceMap->Lookup(lclNum, &ev); + assert(ev != nullptr); + + if (ev->m_isAllocTemp) + { + continue; + } + + for (EnumeratorVarAppearance* const a : *(ev->m_appearances)) + { + if (!comp->m_domTree->Dominates(defBlock, a->m_block)) + { + JITDUMP("%sV%02u %s in " FMT_BB " not dominated by def " FMT_BB "\n", ev->m_isUseTemp ? "Use temp" : "", + lclNum, a->m_isDef ? 
"def" : "use", a->m_block->bbNum, defBlock->bbNum); + return false; + } + + toVisit.Push(a->m_block); + } + } + + JITDUMP("The defBlock dominates the right set of enumerator var uses\n"); + + // Determine the initial extent of the cloned region dominated by + // the def block. + // + // Walk back from each use block until we hit closure. + // + while (toVisit.Height() > 0) + { + BasicBlock* const visitBlock = toVisit.Pop(); + if (!BitVecOps::TryAddElemD(&traits, visitedBlocks, visitBlock->bbID)) + { + continue; + } + visited->push_back(visitBlock); + + // If we see try region entries here, we will handle them below. + // + if (comp->bbIsTryBeg(visitBlock)) + { + toVisitTryEntry->push_back(visitBlock); + } + + JITDUMP("walking back through " FMT_BB "\n", visitBlock->bbNum); + + for (FlowEdge* predEdge = comp->BlockPredsWithEH(visitBlock); predEdge != nullptr; + predEdge = predEdge->getNextPredEdge()) + { + BasicBlock* const predBlock = predEdge->getSourceBlock(); + + // We should not be able to reach an un-dominated block. + // (consider eh paths?) + // + assert(comp->m_domTree->Dominates(defBlock, predBlock)); + if (BitVecOps::IsMember(&traits, visitedBlocks, predBlock->bbID)) + { + continue; + } + toVisit.Push(predBlock); + } + } + + JITDUMP("total cloning including all enumerator uses: %u blocks\n", visited->size() - 1); + unsigned numberOfEHRegionsToClone = 0; + + // Now expand the clone block set to include any try regions that need cloning. + // + unsigned numberOfTryRegionsToClone = 0; + CloneTryInfo cloneInfo(traits); + jitstd::vector tryBlocks(alloc); + cloneInfo.BlocksToClone = &tryBlocks; + + // Order the try regions to visit from outer to inner + // + struct bbTryIndexCmp + { + bool operator()(const BasicBlock* bb1, const BasicBlock* bb2) + { + return bb1->getTryIndex() > bb2->getTryIndex(); + } + }; + jitstd::sort(toVisitTryEntry->begin(), toVisitTryEntry->end(), bbTryIndexCmp()); + + for (BasicBlock* const block : *toVisitTryEntry) + { + if (BitVecOps::IsMember(&traits, cloneInfo.Visited, block->bbID)) + { + // nested region + continue; + } + + // This will not clone, but will check if cloning is possible + // + BasicBlock* const result = comp->fgCloneTryRegion(block, cloneInfo); + + if (result == nullptr) + { + return false; + } + + numberOfTryRegionsToClone++; + } + + // Merge visited and cloneInfo visited + // + for (BasicBlock* const block : tryBlocks) + { + if (BitVecOps::TryAddElemD(&traits, visitedBlocks, block->bbID)) + { + visited->push_back(block); + } + } + + // Sort blocks to visit in RPO + // + struct bbRpoCmp + { + bool operator()(const BasicBlock* bb1, const BasicBlock* bb2) + { + return bb1->bbPostorderNum > bb2->bbPostorderNum; + } + }; + + jitstd::sort(visited->begin(), visited->end(), bbRpoCmp()); + + assert(defBlock->hasProfileWeight()); + + // Determine the profile scale factor. + // + weight_t weightForClone = 0.0; + + for (FlowEdge* const predEdge : defBlock->PredEdges()) + { + if (BitVecOps::IsMember(&traits, visitedBlocks, predEdge->getSourceBlock()->bbID)) + { + weightForClone += predEdge->getLikelyWeight(); + } + } + + weight_t scaleFactor = max(1.0, weightForClone / defBlock->bbWeight); + info->m_profileScale = scaleFactor; + JITDUMP("Profile weight for clone " FMT_WT " overall " FMT_WT ", will scale clone at " FMT_WT "\n", weightForClone, + defBlock->bbWeight, scaleFactor); + + // Save off blocks that we need to clone + // + // TODO: use postorder nums to keeping the vector and bitvec? 
+ // + info->m_blocksToClone = visited; + info->m_blocks = visitedBlocks; + info->m_canClone = true; + + JITDUMP("total cloning including all uses and subsequent EH: %u blocks\n", + BitVecOps::Count(&traits, visitedBlocks)); + + comp->Metrics.EnumeratorGDVCanCloneToEnsureNoEscape++; + return true; +} + +//------------------------------------------------------------------------------ +// CloneAndSpecialize: clone and specialize blocks and statements so that +// an enumerator allocation does not escape +// +// Arguments: +// info -- info about the cloning opportunity +// +// +// Notes: +// The cloned blocks become the "fast path" where the enumerator object allocated +// in info.m_allocBlock is used. The original blocks are the slow path where it +// is unclear which object (and which type of object) is used. +// +// In the cloned blocks, the enumerator local is updated to a new local. +// +void ObjectAllocator::CloneAndSpecialize(CloneInfo* info) +{ + assert(info->m_canClone); + assert(info->m_willClone); + JITDUMP("\nCloning to ensure allocation at " FMT_BB " does not escape\n", info->m_allocBlock->bbNum); + + // Clone blocks in RPO order. If we find a try entry, clone that as a whole, + // and skip over those blocks subsequently. + // + BlockToBlockMap map(comp->getAllocator(CMK_ObjectAllocator)); + + // If there is an enclosing EH region, insert the new blocks at the end of this + // region. Otherwise insert the new blocks just after the allocation site + // + BasicBlock* insertionPoint = info->m_allocBlock; + + bool inTry = false; + unsigned enclosingEHRegion = comp->ehGetMostNestedRegionIndex(insertionPoint, &inTry); + + if (enclosingEHRegion != 0) + { + EHblkDsc* const ebd = comp->ehGetDsc(enclosingEHRegion - 1); + + if (inTry) + { + insertionPoint = ebd->ebdTryLast; + } + else + { + insertionPoint = ebd->ebdHndLast; + } + + JITDUMP("Will insert new blocks at end of enclosing EH#%u %s region " FMT_BB "\n", enclosingEHRegion - 1, + inTry ? "try" : "handler", insertionPoint->bbNum); + } + else + { + JITDUMP("Will insert new blocks after allocation block " FMT_BB "\n", insertionPoint->bbNum); + } + BasicBlock** insertAfter = &insertionPoint; + BasicBlock* const oldLast = insertionPoint; + + // Compute profile scale for the original blocks. + // + weight_t originalScale = max(0.0, 1.0 - info->m_profileScale); + + // Seems like if the region exits the try the RPO could mix + // try and non-try blocks... hmm. + // + for (BasicBlock* const block : *info->m_blocksToClone) + { + BasicBlock* newBlock = nullptr; + const bool isCloned = map.Lookup(block, &newBlock); + + if (isCloned) + { + assert(newBlock != nullptr); + continue; + } + + if (comp->bbIsTryBeg(block)) + { + BitVecTraits traits(comp->compBasicBlockID, comp); + CloneTryInfo cloneTryInfo(traits); + cloneTryInfo.Map = ↦ + cloneTryInfo.AddEdges = false; + cloneTryInfo.ProfileScale = info->m_profileScale; + cloneTryInfo.ScaleOriginalBlockProfile = true; + comp->fgCloneTryRegion(block, cloneTryInfo, insertAfter); + continue; + } + + newBlock = comp->fgNewBBafter(BBJ_ALWAYS, *insertAfter, /* extendRegion */ false); + JITDUMP("Adding " FMT_BB " (copy of " FMT_BB ") after " FMT_BB "\n", newBlock->bbNum, block->bbNum, + (*insertAfter)->bbNum); + BasicBlock::CloneBlockState(comp, newBlock, block); + + assert(newBlock->bbRefs == 0); + newBlock->scaleBBWeight(info->m_profileScale); + block->scaleBBWeight(originalScale); + map.Set(block, newBlock, BlockToBlockMap::Overwrite); + *insertAfter = newBlock; + } + + // Fix up flow.. 
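+    //
+    // (The cloning pass above created the new blocks without wiring up their
+    // successors; the pass below redirects each clone's targets through the
+    // old-block -> new-block map.)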
+    //
+    for (BasicBlock* const block : *info->m_blocksToClone)
+    {
+        BasicBlock* newBlock = nullptr;
+        const bool  isCloned = map.Lookup(block, &newBlock);
+        assert(isCloned && (newBlock != nullptr));
+        assert(!newBlock->HasInitializedTarget());
+        JITDUMP("Updating targets: " FMT_BB " mapped to " FMT_BB "\n", block->bbNum, newBlock->bbNum);
+        comp->optSetMappedBlockTargets(block, newBlock, &map);
+    }
+
+    // Fix up any enclosing EH extents
+    //
+    if (enclosingEHRegion != 0)
+    {
+        // Note the enclosing region index may shift because of EH cloning, so refetch it.
+        //
+        bool     postCloneInTry             = false;
+        unsigned postCloneEnclosingEHRegion = comp->ehGetMostNestedRegionIndex(info->m_allocBlock, &postCloneInTry);
+        assert(postCloneEnclosingEHRegion >= enclosingEHRegion);
+        assert(inTry == postCloneInTry);
+
+        // Now update the extents
+        //
+        BasicBlock* const newLast = *insertAfter;
+        EHblkDsc* const   ebd     = comp->ehGetDsc(enclosingEHRegion - 1);
+        for (EHblkDsc* const HBtab : EHClauses(comp, ebd))
+        {
+            if (HBtab->ebdTryLast == oldLast)
+            {
+                comp->fgSetTryEnd(HBtab, newLast);
+            }
+            if (HBtab->ebdHndLast == oldLast)
+            {
+                comp->fgSetHndEnd(HBtab, newLast);
+            }
+        }
+    }
+
+    // Create a new local for the enumerator uses in the cloned code.
+    //
+    // Note: we will map all the allocTemp and useTemp appearances to
+    // this variable as well.
+    //
+    unsigned const newEnumeratorLocal = comp->lvaGrabTemp(/* shortLifetime */ false DEBUGARG("fast-path enumerator"));
+    info->m_enumeratorLocal           = newEnumeratorLocal;
+
+    // Type this as TYP_REF for now; it will get rewritten later during RewriteUses.
+    //
+    comp->lvaTable[newEnumeratorLocal].lvType      = TYP_REF;
+    comp->lvaTable[newEnumeratorLocal].lvSingleDef = 1;
+    comp->lvaSetClass(newEnumeratorLocal, info->m_type, /* isExact */ true);
+
+    class ReplaceVisitor final : public GenTreeVisitor<ReplaceVisitor>
+    {
+        CloneInfo* m_info;
+        unsigned   m_newLclNum;
+
+    public:
+        enum
+        {
+            DoPreOrder    = true,
+            DoLclVarsOnly = true,
+        };
+
+        bool MadeChanges = false;
+
+        ReplaceVisitor(Compiler* comp, CloneInfo* info, unsigned newLclNum)
+            : GenTreeVisitor(comp)
+            , m_info(info)
+            , m_newLclNum(newLclNum)
+        {
+        }
+
+        fgWalkResult PreOrderVisit(GenTree** use, GenTree* user)
+        {
+            // We could just bash all defs save for the initial temp def,
+            // but this would make validating substitutions a bit harder,
+            // as some defs or uses would vanish.
+            //
+            GenTreeLclVarCommon* const node = (*use)->AsLclVarCommon();
+            if (node->OperIs(GT_LCL_VAR, GT_STORE_LCL_VAR))
+            {
+                EnumeratorVar* ev = nullptr;
+                if (m_info->m_appearanceMap->Lookup(node->GetLclNum(), &ev))
+                {
+                    // We leave the initial alloc temp as is; the
+                    // object allocator rewriting will take care of it.
+                    //
+                    if (!ev->m_isInitialAllocTemp)
+                    {
+                        node->SetLclNum(m_newLclNum);
+                    }
+                }
+                MadeChanges = true;
+            }
+
+            return fgWalkResult::WALK_CONTINUE;
+        }
+    };
+
+    ReplaceVisitor visitor(comp, info, newEnumeratorLocal);
+
+    // Rewrite enumerator var uses in the cloned (fast) blocks to reference
+    // the new enumerator local.
+    //
+    // Specialize GDV tests in the cloned blocks to always return true.
+    //
+    // Note we'd figure this out eventually anyways (since the new enumerator
+    // var has an exact type), but we want to accelerate this so that our new
+    // enumerator var does not appear to be exposed.
+    //
+    // Specialize GDV tests in the original code to always return false.
+    //
+    // We would not figure this out on our own, as the unknown
+    // enumerator may well have the right type.
The main goal here is + // to block GDV-inspired cloning of the "slow" loop. + // + // (This is inefficient/odd, because for copies or trees with multiple + // uses we will process each one twice or more). + // + // Also we are leaving self-copies around (eg V20 = V20) and this seems + // to confuse local morph. We can't fix these on the fly because + // the tree visitor won't do post-order local only traversals. Grr. + // + CompAllocator alloc(comp->getAllocator(CMK_ObjectAllocator)); + ArrayStack defStmts(alloc); + + for (unsigned lclNum : EnumeratorVarMap::KeyIteration(info->m_appearanceMap)) + { + EnumeratorVar* ev = nullptr; + info->m_appearanceMap->Lookup(lclNum, &ev); + assert(ev != nullptr); + + for (EnumeratorVarAppearance* const a : *(ev->m_appearances)) + { + // We do not rewrite the initial temp appearances. These will be rewritten + // when the ALLOCOBJ is turned into a stack allocation. + // + if (ev->m_isInitialAllocTemp) + { + continue; + } + + // Also, we do not clone the allocBlock, but we may need to rewrite + // some appearances there. + // + BasicBlock* newBlock = nullptr; + + if (a->m_block == info->m_allocBlock) + { + newBlock = info->m_allocBlock; + + JITDUMP("Updating V%02u %s in " FMT_BB " (allocation block) to V%02u\n", a->m_lclNum, + a->m_isDef ? "def" : "use", newBlock->bbNum, newEnumeratorLocal); + } + else + { + const bool isCloned = map.Lookup(a->m_block, &newBlock); + assert(isCloned && (newBlock != nullptr)); + + JITDUMP("Updating V%02u %s in " FMT_BB " (clone of " FMT_BB ") to V%02u\n", a->m_lclNum, + a->m_isDef ? "def" : "use", newBlock->bbNum, a->m_block->bbNum, newEnumeratorLocal); + } + + // Find matching stmt/tree in the clone, and update it + // ... note we could simplify this for the allocBlock case + // + Statement* clonedStmt = newBlock->firstStmt(); + for (Statement* const stmt : a->m_block->Statements()) + { + if (stmt == a->m_stmt) + { + JITDUMP("Before\n"); + DISPTREE(clonedStmt->GetRootNode()); + + // walk and replace + visitor.MadeChanges = false; + visitor.WalkTree(clonedStmt->GetRootNodePointer(), nullptr); + + JITDUMP("After\n"); + DISPTREE(clonedStmt->GetRootNode()); + + assert(visitor.MadeChanges); + + if (a->m_isDef) + { + defStmts.Push(clonedStmt); + } + break; + } + + clonedStmt = clonedStmt->GetNextStmt(); + } + + if (!a->m_isGuard) + { + continue; + } + + { + // Original/Slow path -- gdv will always fail + // + GuardInfo slowGuardInfo; + GenTree* const slowGuardRelop = IsGuard(a->m_block, &slowGuardInfo); + assert(slowGuardRelop != nullptr); + bool const keepTrueEdge = slowGuardRelop->OperIs(GT_NE); + FlowEdge* const retainedEdge = keepTrueEdge ? a->m_block->GetTrueEdge() : a->m_block->GetFalseEdge(); + FlowEdge* const removedEdge = keepTrueEdge ? a->m_block->GetFalseEdge() : a->m_block->GetTrueEdge(); + + JITDUMP("Modifying slow path GDV guard " FMT_BB " to always branch to " FMT_BB "\n", a->m_block->bbNum, + retainedEdge->getDestinationBlock()->bbNum); + comp->fgRemoveRefPred(removedEdge); + a->m_block->SetKindAndTargetEdge(BBJ_ALWAYS, retainedEdge); + a->m_block->lastStmt()->SetRootNode(slowGuardRelop); + comp->fgRepairProfileCondToUncond(a->m_block, retainedEdge, removedEdge); + } + + { + // Cloned/Fast path -- gdv will always succeed + // + GuardInfo fastGuardInfo; + GenTree* const fastGuardRelop = IsGuard(newBlock, &fastGuardInfo); + assert(fastGuardRelop != nullptr); + bool const keepTrueEdge = fastGuardRelop->OperIs(GT_EQ); + FlowEdge* const retainedEdge = keepTrueEdge ? 
newBlock->GetTrueEdge() : newBlock->GetFalseEdge(); + FlowEdge* const removedEdge = keepTrueEdge ? newBlock->GetFalseEdge() : newBlock->GetTrueEdge(); + + JITDUMP("Modifying fast path GDV guard " FMT_BB " to always branch to " FMT_BB "\n", newBlock->bbNum, + retainedEdge->getDestinationBlock()->bbNum); + comp->fgRemoveRefPred(removedEdge); + newBlock->SetKindAndTargetEdge(BBJ_ALWAYS, retainedEdge); + newBlock->lastStmt()->SetRootNode(fastGuardRelop); + comp->fgRepairProfileCondToUncond(newBlock, retainedEdge, removedEdge); + } + } + } + + // Now revisit all the cloned def stmts, and remove any that are self-assignments. + // + while (defStmts.Height() > 0) + { + Statement* const defStmt = defStmts.Pop(); + GenTree* const rootNode = defStmt->GetRootNode(); + + if (rootNode->OperIs(GT_STORE_LCL_VAR)) + { + GenTree* const data = rootNode->AsOp()->Data(); + if (data->OperIs(GT_LCL_VAR)) + { + if (rootNode->AsLclVarCommon()->GetLclNum() == data->AsLclVarCommon()->GetLclNum()) + { + JITDUMP("Bashing self-copy [%06u] to NOP\n", comp->dspTreeID(rootNode)); + rootNode->gtBashToNOP(); + } + } + } + } + + // Modify the allocation block to branch to the start of the fast path + // + BasicBlock* const firstBlock = (*info->m_blocksToClone)[0]; + BasicBlock* firstClonedBlock = nullptr; + bool const firstFound = map.Lookup(firstBlock, &firstClonedBlock); + assert(firstFound); + comp->fgRedirectTargetEdge(info->m_allocBlock, firstClonedBlock); + + // If we are subsequently going to do the "empty collection static enumerator" opt, + // then our profile is now consistent. + // + if ((info->m_allocTree->gtFlags & GTF_ALLOCOBJ_EMPTY_STATIC) != 0) + { + JITDUMP("Anticipating the empty-collection static enumerator opt for [%06u]," + " so not adjusting profile in the initial GDV region\n", + comp->dspTreeID(info->m_allocTree)); + return; + } + + // If not, we need to do more profile repair in the region from the + // allocation-dominating GDV down to the (now cloned) defBlock. + // + JITDUMP("Profile data needs more repair. Data %s inconsistent.\n", + comp->fgPgoConsistent ? 
"is now" : "was already"); + + if (comp->fgPgoConsistent) + { + comp->fgPgoConsistent = false; + } +} + +//------------------------------------------------------------------------------ +// CloneAndSpecializeAll: clone and specialize any regions needed to guarantee +// objects don't escape +// +void ObjectAllocator::CloneAndSpecialize() +{ + unsigned numberOfClonedRegions = 0; + + for (CloneInfo* const c : CloneMap::ValueIteration(&m_CloneMap)) + { + if (!c->m_willClone) + { + continue; + } + + CloneAndSpecialize(c); + numberOfClonedRegions++; + } + + assert(numberOfClonedRegions == m_regionsToClone); +} diff --git a/src/coreclr/jit/objectalloc.h b/src/coreclr/jit/objectalloc.h index 35e704974969ac..593e7b7915335c 100644 --- a/src/coreclr/jit/objectalloc.h +++ b/src/coreclr/jit/objectalloc.h @@ -18,6 +18,98 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX //=============================================================================== #include "phase.h" #include "smallhash.h" +#include "vector.h" + +// A use or def of an enumerator var in the code +// +struct EnumeratorVarAppearance +{ + EnumeratorVarAppearance(BasicBlock* block, Statement* stmt, GenTree** use, unsigned lclNum, bool isDef) + : m_block(block) + , m_stmt(stmt) + , m_use(use) + , m_lclNum(lclNum) + , m_isDef(isDef) + , m_isGuard(false) + { + } + + BasicBlock* m_block; + Statement* m_stmt; + GenTree** m_use; + unsigned m_lclNum; + bool m_isDef; + bool m_isGuard; +}; + +// Information about def and uses of enumerator vars, plus... +// +struct EnumeratorVar +{ + EnumeratorVarAppearance* m_def = nullptr; + jitstd::vector* m_appearances = nullptr; + bool m_hasMultipleDefs = false; + bool m_isAllocTemp = false; + bool m_isInitialAllocTemp = false; + bool m_isFinalAllocTemp = false; + bool m_isUseTemp = false; +}; + +typedef JitHashTable, EnumeratorVar*> EnumeratorVarMap; + +// Describes a GDV guard +// +struct GuardInfo +{ + unsigned m_local = BAD_VAR_NUM; + CORINFO_CLASS_HANDLE m_type = NO_CLASS_HANDLE; + BasicBlock* m_block = nullptr; +}; + +// Describes a guarded enumerator cloning candidate +// +struct CloneInfo : public GuardInfo +{ + CloneInfo() + { + m_blocks = BitVecOps::UninitVal(); + } + + // Pseudo-local tracking conditinal escapes + unsigned m_pseudoLocal = BAD_VAR_NUM; + + // Local allocated for the address of the enumerator + unsigned m_enumeratorLocal = BAD_VAR_NUM; + + // Locals that must be rewritten in the clone, and map + // to their appearances + EnumeratorVarMap* m_appearanceMap = nullptr; + unsigned m_appearanceCount = 0; + jitstd::vector* m_allocTemps = nullptr; + + // Where the enumerator allocation happens + GenTree* m_allocTree = nullptr; + Statement* m_allocStmt = nullptr; + BasicBlock* m_allocBlock = nullptr; + + // Block holding the GDV test that decides if the enumerator will be allocated + BasicBlock* m_domBlock = nullptr; + + // Blocks to clone (in order), and a set representation + // of the same + jitstd::vector* m_blocksToClone = nullptr; + BitVec m_blocks; + + // How to scale the profile in the cloned code + weight_t m_profileScale = 0.0; + + // Status of this candidate + bool m_checkedCanClone = false; + bool m_canClone = false; + bool m_willClone = false; +}; + +typedef JitHashTable, CloneInfo*> CloneMap; class ObjectAllocator final : public Phase { @@ -43,23 +135,31 @@ class ObjectAllocator final : public Phase BitSetShortLongRep* m_ConnGraphAdjacencyMatrix; unsigned int m_StackAllocMaxSize; + // Info for conditionally-escaping locals + LocalToLocalMap 
m_EnumeratorLocalToPseudoLocalMap; + CloneMap m_CloneMap; + unsigned m_maxPseudoLocals; + unsigned m_numPseudoLocals; + unsigned m_regionsToClone; + //=============================================================================== // Methods public: ObjectAllocator(Compiler* comp); bool IsObjectStackAllocationEnabled() const; void EnableObjectStackAllocation(); + bool CanAllocateLclVarOnStack(unsigned int lclNum, + CORINFO_CLASS_HANDLE clsHnd, + ObjectAllocationType allocType, + ssize_t length, + unsigned int* blockSize, + const char** reason, + bool preliminaryCheck = false); protected: virtual PhaseStatus DoPhase() override; private: - bool CanAllocateLclVarOnStack(unsigned int lclNum, - CORINFO_CLASS_HANDLE clsHnd, - ObjectAllocationType allocType, - ssize_t length, - unsigned int* blockSize, - const char** reason); bool CanLclVarEscape(unsigned int lclNum); void MarkLclVarAsPossiblyStackPointing(unsigned int lclNum); void MarkLclVarAsDefinitelyStackPointing(unsigned int lclNum); @@ -83,8 +183,35 @@ class ObjectAllocator final : public Phase BasicBlock* block, Statement* stmt); struct BuildConnGraphVisitorCallbackData; - bool CanLclVarEscapeViaParentStack(ArrayStack* parentStack, unsigned int lclNum); + bool CanLclVarEscapeViaParentStack(ArrayStack* parentStack, unsigned int lclNum, BasicBlock* block); void UpdateAncestorTypes(GenTree* tree, ArrayStack* parentStack, var_types newType); + + // Conditionally escaping allocation support + // + void CheckForGuardedAllocationOrCopy(BasicBlock* block, Statement* stmt, GenTree** use, unsigned lclNum); + bool CheckForGuardedUse(BasicBlock* block, GenTree* tree, unsigned lclNum); + bool CheckForEnumeratorUse(unsigned lclNum, unsigned dstLclNum); + bool IsGuarded(BasicBlock* block, GenTree* tree, GuardInfo* info, bool testOutcome); + GenTree* IsGuard(BasicBlock* block, GuardInfo* info); + unsigned NewPseudoLocal(); + + bool CanHavePseudoLocals() + { + return (m_maxPseudoLocals > 0); + } + + void RecordAppearance(unsigned lclNum, BasicBlock* block, Statement* stmt, GenTree** use); + bool AnalyzeIfCloningCanPreventEscape(BitVecTraits* bitVecTraits, + BitVec& escapingNodes, + BitVec& escapingNodesToProcess); + bool CanClone(CloneInfo* info); + bool CheckCanClone(CloneInfo* info); + bool CloneOverlaps(CloneInfo* info); + bool ShouldClone(CloneInfo* info); + void CloneAndSpecialize(CloneInfo* info); + void CloneAndSpecialize(); + + static const unsigned int s_StackAllocMaxSize = 0x2000U; }; //=============================================================================== @@ -93,9 +220,64 @@ inline ObjectAllocator::ObjectAllocator(Compiler* comp) : Phase(comp, PHASE_ALLOCATE_OBJECTS) , m_IsObjectStackAllocationEnabled(false) , m_AnalysisDone(false) - , m_bitVecTraits(comp->lvaCount, comp) - , m_HeapLocalToStackLocalMap(comp->getAllocator()) + , m_bitVecTraits(BitVecTraits(comp->lvaCount, comp)) + , m_HeapLocalToStackLocalMap(comp->getAllocator(CMK_ObjectAllocator)) + , m_EnumeratorLocalToPseudoLocalMap(comp->getAllocator(CMK_ObjectAllocator)) + , m_CloneMap(comp->getAllocator(CMK_ObjectAllocator)) + , m_maxPseudoLocals(0) + , m_numPseudoLocals(0) + , m_regionsToClone(0) + { + // If we are going to do any conditional escape analysis, allocate + // extra BV space for the "pseudo" locals we'll need. + // + // For now, disable conditional escape analysis with OSR + // since the dominance picture is muddled at this point. + // + // The conditionally escaping allocation sites will likely be in loops anyways. 
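+    //
+    // (Each candidate enumerator local gets an extra slot -- a "pseudo-local" --
+    // alongside the real locals in the escape bit vectors; hence the enlarged
+    // BitVecTraits sizing below.)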
+ // + bool const hasEnumeratorLocals = comp->hasImpEnumeratorGdvLocalMap(); + + if (hasEnumeratorLocals) + { + unsigned const enumeratorLocalCount = comp->getImpEnumeratorGdvLocalMap()->GetCount(); + assert(enumeratorLocalCount > 0); + + bool const enableConditionalEscape = JitConfig.JitObjectStackAllocationConditionalEscape() > 0; + bool const isOSR = comp->opts.IsOSR(); + + if (enableConditionalEscape && !isOSR) + { + +#ifdef DEBUG + static ConfigMethodRange JitObjectStackAllocationConditionalEscapeRange; + JitObjectStackAllocationConditionalEscapeRange.EnsureInit( + JitConfig.JitObjectStackAllocationConditionalEscapeRange()); + const unsigned hash = comp->info.compMethodHash(); + const bool inRange = JitObjectStackAllocationConditionalEscapeRange.Contains(hash); +#else + const bool inRange = true; +#endif + + if (inRange) + { + m_maxPseudoLocals = enumeratorLocalCount; + m_bitVecTraits = BitVecTraits(comp->lvaCount + enumeratorLocalCount + 1, comp); + JITDUMP("Enabling conditional escape analysis [%u pseudo-vars]\n", enumeratorLocalCount); + } + else + { + JITDUMP("Not enabling conditional escape analysis (disabled by range config)\n"); + } + } + else + { + JITDUMP("Not enabling conditional escape analysis [%u pseudo-vars]: %s\n", enumeratorLocalCount, + enableConditionalEscape ? "OSR" : "disabled by config"); + } + } + m_EscapingPointers = BitVecOps::UninitVal(); m_PossiblyStackPointingPointers = BitVecOps::UninitVal(); m_DefinitelyStackPointingPointers = BitVecOps::UninitVal(); @@ -128,12 +310,14 @@ inline void ObjectAllocator::EnableObjectStackAllocation() // allocated on the stack. // // Arguments: -// lclNum - Local variable number -// clsHnd - Class/struct handle of the variable class +// lclNum - Local variable number +// clsHnd - Class/struct handle of the variable class // allocType - Type of allocation (newobj or newarr) // length - Length of the array (for newarr) // blockSize - [out, optional] exact size of the object -// reason - [out, required] if result is false, reason why +// reason - [out, required] if result is false, reason why +// preliminaryCheck - if true, allow checking before analysis is done +// (for things that inherently disqualify the local) // // Return Value: // Returns true iff local variable can be allocated on the stack. 
@@ -143,9 +327,10 @@ inline bool ObjectAllocator::CanAllocateLclVarOnStack(unsigned int lclNu ObjectAllocationType allocType, ssize_t length, unsigned int* blockSize, - const char** reason) + const char** reason, + bool preliminaryCheck) { - assert(m_AnalysisDone); + assert(preliminaryCheck || m_AnalysisDone); bool enableBoxedValueClasses = true; bool enableRefClasses = true; @@ -246,6 +431,11 @@ inline bool ObjectAllocator::CanAllocateLclVarOnStack(unsigned int lclNu return false; } + if (preliminaryCheck) + { + return true; + } + const bool escapes = CanLclVarEscape(lclNum); if (escapes) diff --git a/src/coreclr/jit/ssabuilder.cpp b/src/coreclr/jit/ssabuilder.cpp index b3fc9cd6ed80e4..c260945b4a9f82 100644 --- a/src/coreclr/jit/ssabuilder.cpp +++ b/src/coreclr/jit/ssabuilder.cpp @@ -1555,7 +1555,7 @@ bool IncrementalSsaBuilder::FindReachingDefInBlock(const UseDefLocation& use, Ba } if ((candidate.Block == use.Block) && (use.Stmt != nullptr) && - (LatestStatement(use.Stmt, candidate.Stmt) != use.Stmt)) + (m_comp->gtLatestStatement(use.Stmt, candidate.Stmt) != use.Stmt)) { // Def is after use continue; @@ -1565,7 +1565,8 @@ bool IncrementalSsaBuilder::FindReachingDefInBlock(const UseDefLocation& use, Ba { latestTree = nullptr; } - else if ((latestDefStmt == nullptr) || (LatestStatement(candidate.Stmt, latestDefStmt) == candidate.Stmt)) + else if ((latestDefStmt == nullptr) || + (m_comp->gtLatestStatement(candidate.Stmt, latestDefStmt) == candidate.Stmt)) { latestDefStmt = candidate.Stmt; latestTree = candidate.Tree; @@ -1619,44 +1620,6 @@ bool IncrementalSsaBuilder::FindReachingDefInSameStatement(const UseDefLocation& return false; } -//------------------------------------------------------------------------ -// LatestStatement: Given two statements in the same block, find the latest one -// of them. -// -// Parameters: -// stmt1 - The first statement -// stmt2 - The second statement -// -// Returns: -// Latest of the two statements. -// -Statement* IncrementalSsaBuilder::LatestStatement(Statement* stmt1, Statement* stmt2) -{ - if (stmt1 == stmt2) - { - return stmt1; - } - - Statement* cursor1 = stmt1->GetNextStmt(); - Statement* cursor2 = stmt2->GetNextStmt(); - - while (true) - { - if ((cursor1 == stmt2) || (cursor2 == nullptr)) - { - return stmt2; - } - - if ((cursor2 == stmt1) || (cursor1 == nullptr)) - { - return stmt1; - } - - cursor1 = cursor1->GetNextStmt(); - cursor2 = cursor2->GetNextStmt(); - } -} - //------------------------------------------------------------------------ // InsertDef: Record a definition. // diff --git a/src/coreclr/jit/ssabuilder.h b/src/coreclr/jit/ssabuilder.h index 73157291a46e79..fd1ea275bb99cc 100644 --- a/src/coreclr/jit/ssabuilder.h +++ b/src/coreclr/jit/ssabuilder.h @@ -138,7 +138,6 @@ class IncrementalSsaBuilder UseDefLocation FindOrCreateReachingDef(const UseDefLocation& use); bool FindReachingDefInBlock(const UseDefLocation& use, BasicBlock* block, UseDefLocation* def); bool FindReachingDefInSameStatement(const UseDefLocation& use, UseDefLocation* def); - Statement* LatestStatement(Statement* stmt1, Statement* stmt2); public: IncrementalSsaBuilder(Compiler* comp, unsigned lclNum) : m_comp(comp)