This modifies fill_idx to operate with a trivial tcf "host"
program, and sets us up to integrate more deeply with the notion of a
"buffer" via the wasm-bound web interface.
in other words, it removes non-determinism from scheduling.
### Scheduling
TODO residency & threads
TODO worth making "setup" concurrent?
> There is no guarantee of concurrent execution between any number of different thread blocks on a device.
(I think they mean "parallel" execution here: there's no guarantee of parallelism, but concurrency is the property that's being expressed)
## defining operands
We need to either 1) offer a specific opcode for tail-launches, or 2) take a "stream" parameter here that must always be "tail" (for now).
As of now, we know we need to take at least:
1. function selector ("FILL" or "SERIES"); entry point string or %func_id ?
- is it fair to say that the difference between a "function" and an "entrypoint" is "who can call it?", i.e. a function that can be called _on_ the GPU from "outside" is an entrypoint, and an undecorated function cannot?
2. "group size", e.g. `16x1x1` (how does this map to CUDA's grid(s)/blocks/threads?); should that be an [execution mode] decoration (implies "entry point" above), or a literal, or an ID-based operand?
3. some mechanism for passing arguments (NB: pointers require special care)
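Pulling those three together, one hypothetical surface syntax (the opcode name, the operand order, and the literal-vs-id choices here are all placeholders, not decisions) might look like:

```
; entirely hypothetical -- every name and the operand order is made up
;                 stream  function  group size  fn params (ids only)
OpDispatchTALVOS  "tail"  %fill_fn  16 1 1      %arg0 %arg1
```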
- This magic string maps to a special stream that does something unique: it launches after the exit of the currently running kernel. Currently, that's the only supported target
- there are two "special" streams, `NULL`/default (which are subtly different configurations) and the "tail" stream (only supported on NVIDIA devices). Users may create streams (up to... ?), which are named and have different semantics
- The problem with `Tail` (a literal/enumerant) is that it would imply either 1) no ability to specify user streams, or 2) require overloading the operand, which can be quite confusing (i.e. if `Tail` or `NULL`/`Default` meant the special streams, then `"Tail"`/`"NULL"`/`"Default"` would all mean a user-created stream that had no special semantics)
- A string here is a faux pas in SPIR-V land; since we'll probably refer to the same stream more than once, the aesthetic is to be compact & reference a result id instead
- That would require a separate opcode to set up, something like: `%nn = OpXXXTALVOS "Tail"` and/or `%nn = OpTailStreamTALVOS`. If we want to express the type for `%nn` as well, that's another e.g. `%nn_t = OpStreamTypeTALVOS`
* `"FILL"` (an entrypoint name) / `%fill_fn`
- This string maps to an entrypoint name; nominally something in the user's control, but out-of-line (up at the "top" of the file where `OpEntryPoint ...`s are required to go)
- Again, a string is kind of a faux pas, instead we ought to use `%fn` from a `%fn = OpFunction` declaration (which may be externally linked if decorated with "Linkage Attributes")
* `<1 1 1; 16 1 1>` (or `%g_dim %b_dim`): this is invalid SPIR-V syntax, but it represents `<grid/group dim; block dim>`
- `group dim` is a "multiplier" on `blocks`; balancing out the three-way constraint triangle between work size and blocks/group dim is complicated. For small examples we'd prefer to only use `blocks` as a simplifying assumption, but the need for the second one comes up relatively quickly.
- so, the choices here are kinda rough: `EnqueueDispatch` opts for using result ids which are constructed by e.g. `%block_dim = OpBuildNDRange %ndrange_ty ...` which requires a correctly-set-up struct type (via the usual typing opcodes) that obeys a whole lotta rules, populated by a bunch of out-of-line setup
- Spending a bunch of opcodes _elsewhere_ to set things up is typical of a low-level operation-based language like SPIR-V; the short-term memory/inline hinting/symbolic manipulation demands are what make assembly programming so challenging; it's just extra unfortunate here, because the number of operations it takes to express this core concept is way too high. It's possible to learn to answer the "dimensionality?" question by scanning for/jumping to the approximate "vector setup block" and pattern matching, but the size of the ask is a mismatch with the frequency of the task, and how early it needs to be performed (~immediately).
Q: how to do vector constants in SPIR-V? Is there a more compact way than poking the values in one at a time?
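For the constant case, at least, SPIR-V has a single-instruction answer: `OpConstantComposite` builds the vector from scalar constants declared elsewhere (so "compact" is relative; the scalar declarations are still required). A minimal sketch:

```
%uint    = OpTypeInt 32 0
%v3uint  = OpTypeVector %uint 3
%uint_1  = OpConstant %uint 1
%uint_16 = OpConstant %uint 16
; <16, 1, 1> as a single constant id
%g_dim   = OpConstantComposite %v3uint %uint_16 %uint_1 %uint_1
```

For non-constant values there's `OpCompositeConstruct`, which must appear inside a function body.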
- an alternative is to decorate w/ `OpExecutionMode %fill_fn LocalSize 16 1 1` (and `GlobalSize` for the other dimension, only supported for `Kernel`s), and then this operand disappears entirely. This requires all dispatched functions to be an EntryPoint as well (but: we probably have that requirement since AMD doesn't support dynamic/nested parallelism)
- Using the decoration is a little funky, though: we'd be extending it in a very natural but also tons-of-work-to-get-working-right kind of way, and it's not at all clear that's something which can't be easily "lifted" out of Talvos
Q: is this ^ right? What happens if we use `Kernel` instead of `Shader`?
Seems to be fine, more or less—Talvos now knows about two kinds of compute-focused things to launch, but that's alright. The one big wrinkle is in figuring out how passing data into/back out of a `Kernel` is supposed to work?
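Concretely, the decoration route looks like this in the `Shader`/`GLCompute` flavor (`%fill_fn` and the `"FILL"` entry point name are this document's examples):

```
OpEntryPoint GLCompute %fill_fn "FILL"
OpExecutionMode %fill_fn LocalSize 16 1 1
```

The `Kernel` flavor swaps the execution model in `OpEntryPoint` (and the required capability) accordingly.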
Ugh, except for this: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_aliasing
The main thing the OpenCL memory model permits is aliasing-by-default. Hmmmm, time for a Talvos memory model?
- it's clear why the runtime would need this "as soon as possible," but it's not clear whether this ought to be a per-dispatch tune-able (would it ever make sense to dispatch the same kernel with different sizes? _maybe_ if you were doing different dimensions, right?)
* `<fn params...>`: the ids (no literals) of all the parameters to the function
- NB: any pointers passed here must be to the "global" storage class(es), since the dispatched kernel won't have access to any of the local/"shared" memory of the invoker
Does not cover:
- non-shared-memory configuration parameters; i.e. resizing limits L1 cache usage
- any (device-)global pointers
### ducking the "streams" parameter, for now
So, wrapping that up into the opcode (since it's a _special_ stream anyway)
we could even do length-extended overloading (which is "ok kind of overloading") to handle the other sizing too
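Sketched out, with the tail stream folded into a hypothetical opcode name and the length-extended overload carrying the group dim (every name and arity here is assumed):

```
; hypothetical short form: block dim only
OpTailDispatchTALVOS %fill_fn 16 1 1 %arg0
; hypothetical extended form: group/grid dim first, then block dim
OpTailDispatchTALVOS %fill_fn 1 1 1 16 1 1 %arg0
```

Note the trailing fn params make the arity ambiguous to a reader, which is part of why this is only an "ok kind of overloading".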
TODO is that just wrapping two opcodes in another opcode? is that worth doing?
TODO elsewise, write a test for `OpExecutionModeId` (dynamic parallelism lol)
### big oof
```
Initializer

Indicates that this entry point is a module initializer.
```

&

```
Finalizer

Indicates that this entry point is a module finalizer.
```
I wonder if that could take the place of the dispatch op.
## `OpExecutionGlobalSizeTALVOS` (stepping back from full dispatch for a moment)
Instead, let's try doing something smaller and adding a peer of `OpExecutionMode` for setting the global size, called ~~`OpGlobalSizeTalvos`~~`OpExecutionGlobalSizeTALVOS`
Did the same as above to add it to the spirv.core.grammar.json, but then the validation started failing. First it was something ~ ID has not yet been declared, then ~ must be in a block, then finally:
```
error: 7: Invalid use of function result id '1[%1]'.
  OpExecutionGlobalSizeTALVOS %1 16 1 1
```
For each, searching for the message (e.g. "Invalid use of function result id") would yield a block of code, like:
```c++
for (auto& pair : inst->uses()) {
  const auto* use = pair.first;
  if (std::find(acceptable.begin(), acceptable.end(), use->opcode()) ==
          acceptable.end() &&
      !use->IsNonSemantic() && !use->IsDebugInfo()) {
    return _.diag(SPV_ERROR_INVALID_ID, use)
           << "Invalid use of function result id " << _.getIdName(inst->id())
           << ".";
  }
}
```
and then it was just a matter of taking a different branch, i.e. adding `spv::Op::OpExecutionGlobalSizeTALVOS` to the end of the "acceptable" declaration.
so something needs to be an `OpTypePointer`, and it's probably not worth overloading the whole result type machinery to special case just `OpBufferTALVOS` to return a pointer-wrapped type.
We still might want an `OpBufferTypeTALVOS` and/or a special storage class; those would both restrict the type argument in about the same way, so it's not clear what the buffer type would give us.
The main benefits of being explicit here are:
1. We can invoke it with some capability other than `Shader`
2. It's less surprising than overloading `StorageBuffer` with dump behavior (?), and it's a trivial remapping to change to the `StorageBuffer` storage class to get it working outside Talvos.
And potentially:
3. We might add an optional flags parameter to control talvos-specific behaviors; too soon to say if that's really useful though.
- [ ] should we leave the `OpVariable` thing as-is ...
- [ ] and just decorate the buffer with a (mostly) non-semantic `OpBufferTALVOS` ?
- [ ] and just literally decorate with an entirely non-semantic `OpDecorate %buf0 BufferTALVOS` ?
Well, we had to fudge the order, at least, and will probably have to do the `_StorageBuffer_` type bits. Too bad, `StorageBuffer` requires `OpCapability Shader` & is kind of semantically redundant.
Perhaps instead, a `BufferTALVOS` _storage class_ w/ `OpName %... "a"` ?
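That storage-class variant might read something like this (hypothetical `BufferTALVOS` storage class; the element type is assumed):

```
        OpName %a "a"            ; the debug name stands in for a binding
%uint = OpTypeInt 32 0
%rta  = OpTypeRuntimeArray %uint
%ptr  = OpTypePointer BufferTALVOS %rta
%a    = OpVariable %ptr BufferTALVOS
```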
### (sort of) aside: what the heck is an `OpAccessChain` ?
```
; given %buf0 ty is `uint32_t[]*` (in StorageBuffer)
```

%4 is a ptr to a uint32_t, aka `uint32_t*`, offset into the _array_ by %3 "steps"? .... how?
Ok, so if `buf0` is `0x1000`, this breaks down to roughly:
```
  0x1000   ; "base"
+ (4       ; sizeof(uint32_t)
   * %3)   ; element-wise offset
---------
  0x103c   ; when %3 == 15
```
Which, when interpreted as a `uint32_t *` sure could be right...
why does this feel weird? because `uint32_t[]*` ought to be an alternate spelling of `uint32_t**`, which means we ought to have something like `0x1040` in `buf0`, which points to a 8-wide slot containing `0x1000`; so maybe OpAccessChain contains an implicit deref on its first argument? i.e. it's not `(base) + offset`, it's `*(base) + offset`?
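Close: per the spec, `OpAccessChain`'s Base operand is itself a pointer, and the indexes walk the _pointee_ type, so nothing is ever loaded through `%buf0`; it behaves like C's `&buf0[i]` rather than `*buf0 + i`. A plausible reconstruction of the types behind the walkthrough above (the original declarations aren't shown in these notes):

```
%uint     = OpTypeInt 32 0
%rta      = OpTypeRuntimeArray %uint           ; uint32_t[]
%ptr_rta  = OpTypePointer StorageBuffer %rta   ; uint32_t[]*  (type of %buf0)
%ptr_uint = OpTypePointer StorageBuffer %uint  ; uint32_t*    (type of %4)
%buf0     = OpVariable %ptr_rta StorageBuffer
; Base is a pointer; index %3 steps within the pointee (the runtime array),
; so the result is effectively &buf0[%3] -- no deref of %buf0 itself
%4        = OpAccessChain %ptr_uint %buf0 %3
```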