Commit a78578c

feat: fill_idx as stand-alone learning example
This modifies fill_idx to operate with a trivial tcf "host" program, and sets us up to integrate more deeply with the notion of a "buffer" via the wasm-bound web interface.
1 parent 6c64cb2 commit a78578c

12 files changed

+308
-38
lines changed

NOTES.md

+271-1
@@ -596,7 +596,7 @@ if we only implement so called "tail launches" for now, that defers the launches
596596
in other words, it removes non-determinism from scheduling.
597597

598598

599-
## Scheduling
599+
### Scheduling
600600

601601
TODO residency & threads
602602

@@ -658,3 +658,273 @@ TODO worth making "setup" concurrent?
658658
> There is no guarantee of concurrent execution between any number of different thread blocks on a device.
659659
660660
(I think they mean "parallel" execution here: there's no guarantee of parallelism, but concurrency is the property that's being expressed)
661+
662+
## defining operands
663+
664+
We need to offer either 1) a specific opcode for tail-launches, or 2) take a "stream" parameter here that must always be "tail" (for now).
665+
666+
As of now, we know we need to take at least:
667+
668+
1. function selector ("FILL" or "SERIES"); entry point string or %func_id ?
669+
- is it fair to say that the difference between a "function" and an "entrypoint" is "who can call it?", i.e. a function that can be called _on_ the GPU from "outside" is an entrypoint, and an undecorated function cannot?
670+
671+
2. "group size", e.g. `16x1x1` (how does this map to CUDA's grid(s)/blocks/threads?); should that be an [execution mode] decoration (implies "entry point" above), or a literal, or an ID-based operand?
672+
3. some mechanism for passing arguments (NB: pointers require special care)
673+
674+
[execution mode]: https://github.com/KhronosGroup/SPIRV-Guide/blob/main/chapters/entry_execution.md#execution-mode
675+
676+
examples:
677+
678+
### ~1:1 with the other launch APIs
679+
680+
```
681+
OpDispatchTALVOS "Tail" "FILL" <1 1 1; 16 1 1> 128kiB <fn params...>
682+
; or
683+
OpDispatchTALVOS Tail %fill_fn %g_dim %b_dim %shm_sz <fn params...>
684+
; or w/ `OpExecutionMode %fill_fn LocalSize 16 1 1` up at the "top"
685+
OpDispatchTALVOS Tail %fill_fn %shm_sz <fn params...>
686+
```
687+
688+
Breaks down as:
689+
690+
* `OpDispatchTALVOS`
691+
* `"Tail"` or `Tail`: the stream (queue) name
692+
- This magic string maps to a special stream that does something unique: it launches after the exit of the currently running kernel. Currently, that's the only supported target
693+
- there are two "special" streams, `NULL`/default (which are subtly different configurations) and the "tail" stream (only supported on NVIDIA devices). Users may create streams (up to... ?), which are named, and share different semantics
694+
- The problem with `Tail` (a literal/enumerant) is that it would imply either 1) no ability to specify user streams, or 2) require overloading the operand, which can be quite confusing (i.e. if `Tail` or `NULL`/`Default` meant the special streams, then `"Tail"`/`"NULL"`/`"Default"` would all mean a user-created stream that had no special semantics)
695+
- A string here is a faux pas in SPIR-V land; since we'll probably refer to the same stream more than once, the aesthetic is to be compact & reference a result id instead
696+
- That would require a separate opcode to set up, something like: `%nn = OpXXXTALVOS "Tail"` and/or `%nn = OpTailStreamTALVOS`. If we want to express the type for `%nn` as well, that's another e.g. `%nn_t = OpStreamTypeTALVOS`
697+
* `"FILL"` (an entrypoint name) / `%fill_fn`
698+
- This string maps to an entrypoint name; nominally something in the user's control, but out-of-line (up at the "top" of the file where `OpEntryPoint ...`s are required to go)
699+
- Again, a string is kind of a faux pas, instead we ought to use `%fn` from a `%fn = OpFunction` declaration (which may be externally linked if decorated with "Linkage Attributes")
700+
* `<1 1 1; 16 1 1>` (or `%g_dim %b_dim`): this is invalid SPIR-V syntax, but it represents `<grid/group dim; block dim>`
701+
- `group dim` is a "multiplier" on `blocks`; balancing out the three-way constraint triangle between work size and blocks/group dim is complicated. For small examples we'd prefer to only use `blocks` as a simplifying assumption, but the need for the second one comes up relatively quickly.
702+
- so, the choices here are kinda rough: `EnqueueDispatch` opts for using result ids which are constructed by e.g. `%block_dim = OpBuildNDRange %ndrange_ty ...` which requires a correctly-set-up struct type (via the usual typing opcodes) that obeys a whole lotta rules, populated by a bunch of out-of-line setup
703+
- Spending a bunch of opcodes _elsewhere_ to set things up is typical of a low-level operation-based language like SPIR-V; the short-term memory/inline hinting/symbolic manipulation demands are what makes assembly programming so challenging; it's just extra unfortunate here, because the number of operations it takes to express this core concept is way too high. It's possible to learn to answer the "dimensionality?" question by scanning for/jumping to the approximate "vector setup block" and pattern matching, but the size of the ask is a mismatch with the frequency of the task, and how early it needs to be performed (~immediately).
704+
705+
Q: how to do vector constants in SPIR-V? Is there a more compact way than poking the values in one at a time?
706+
707+
- an alternative is to decorate w/ `OpExecutionMode %fill_fn LocalSize 16 1 1` (and `GlobalSize` for the other dimension, only supported for `Kernel`s), and then this operand disappears entirely. This requires all dispatch'd functions to be an EntryPoint as well (but: we probably have that requirement since AMD doesn't support dynamic/nested parallelism)
708+
709+
- Using the decoration is a little funky, though: we'd be extending it in a very natural but also tons-of-work-to-get-working-right kind of way, and it's not at all clear that it's something which can be easily "lifted" out of Talvos
710+
Q: is this ^ right? What happens if we use `Kernel` instead of `Shader`?
711+
Seems to be fine, more or less—Talvos now knows about two kinds of compute-focused things to launch, but that's alright. The one big wrinkle is in figuring out how passing data into/back out of a `Kernel` is supposed to work?
712+
713+
Ugh, except for this: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_aliasing
714+
715+
The main thing the OpenCL memory model permits is aliasing-by-default. Hmmmm, time for a Talvos memory model?
716+
717+
Q: do intel's GPUs support dynamic parallelism?
718+
719+
* `128kiB` / `%shm_sz`: the shared memory size (also probably invalid SPIR-V syntax)
720+
- it's clear why the runtime would need this "as soon as possible," but it's not clear whether this ought to be a per-dispatch tune-able (would it ever make sense to dispatch the same kernel with different sizes? _maybe_ if you were doing different dimensions, right?)
721+
722+
* `<fn params...>`: the ids (no literals) of all the parameters to the function
723+
- NB: any pointers passed here must be to the "global" storage class(es), since the dispatched kernel won't have access to any of the local/"shared" memory of the invoker
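
To make the `<grid/group dim; block dim>` operand from the breakdown above concrete, here is a minimal C++ sketch of how the two triples are usually assumed to combine: the total invocation count is the product of all six components, and each global id component is `group_id * local_size + local_id`. The names `Dim3`, `total_invocations` and `global_id` are hypothetical helpers for illustration, not Talvos or SPIR-V API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: one x/y/z triple, e.g. <16 1 1>.
using Dim3 = std::array<std::uint32_t, 3>;

// Total number of invocations launched by <groups; local>.
inline std::uint64_t total_invocations(const Dim3& groups, const Dim3& local) {
  std::uint64_t n = 1;
  for (std::size_t i = 0; i < 3; ++i) {
    n *= static_cast<std::uint64_t>(groups[i]) * local[i];
  }
  return n;
}

// Global id of one invocation, per component:
// group_id * local_size + local_id.
inline Dim3 global_id(const Dim3& group_id, const Dim3& local_size,
                      const Dim3& local_id) {
  Dim3 out{};
  for (std::size_t i = 0; i < 3; ++i) {
    assert(local_id[i] < local_size[i] && "local id must fit in the group");
    out[i] = group_id[i] * local_size[i] + local_id[i];
  }
  return out;
}
```

With `<1 1 1; 16 1 1>` this gives 16 invocations, which is why the small examples can get away with only the block dimension; the group dimension starts to matter once the work size outgrows one block.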
724+
725+
Does not cover:
726+
- non-shared-memory configuration parameters; i.e. resizing limits L1 cache usage
727+
- any (device-)global pointers
728+
729+
### ducking the "streams" parameter, for now
730+
731+
So, wrapping that up into the opcode (since it's a _special_ stream anyway)
732+
733+
```
734+
; earlier: `OpExecutionMode %fill_fn LocalSize 16 1 1`
735+
736+
OpDispatchAtEndTALVOS %fill_fn <fn params...>
737+
; or
738+
OpDispatchDeferredTALVOS %fill_fn <fn params...>
739+
; or
740+
OpDispatchTailTALVOS %fill_fn <fn params...>
741+
; or
742+
OpDispatchOnExitTALVOS %fill_fn <fn params...>
743+
; or
744+
OpDispatchLaterTALVOS %fill_fn <fn params...>
745+
```
746+
747+
We're down to the question of "how do we signal that this is dispatching into the special stream that has the 'nesting' semantics?"
748+
749+
750+
### avoiding bad surprises with `OpExecutionMode`
751+
752+
_not_ to be confused with "execution model"
753+
754+
Rather than:
755+
756+
```
757+
OpEntryPoint GLCompute %fill_fn "FILL" %gl_GlobalInvocationID
758+
OpExecutionMode %fill_fn LocalSize 16 1 1
759+
```
760+
761+
We might have a better time with something like:
762+
763+
```
764+
OpKernelTALVOS %fill_fn "FILL" %gl_GlobalInvocationID 16 1 1
765+
```
766+
767+
we could even do length-extended overloading (which is "ok kind of overloading") to handle the other sizing too
768+
769+
TODO is that just wrapping two opcodes in another opcode? is that worth doing?
770+
TODO elsewise, write a test for `OpExecutionModeId` (dynamic parallelism lol)
771+
772+
773+
### big oof
774+
775+
```
776+
Initializer
777+
Indicates that this entry point is a module initializer.
778+
```
779+
780+
&
781+
782+
```
783+
Finalizer
784+
Indicates that this entry point is a module finalizer.
785+
```
786+
787+
I wonder if that could take the place of the dispatch op.
788+
789+
## `OpExecutionGlobalSizeTALVOS` (stepping back from full dispatch for a moment)
790+
791+
Instead, let's try doing something smaller and adding a peer of `OpExecutionMode` for setting the global size, called ~~`OpGlobalSizeTalvos`~~ `OpExecutionGlobalSizeTALVOS`
792+
793+
Did the same as above to add it to the spirv.core.grammar.json, but then the validation started failing. First it was something ~ ID has not yet been declared, then ~ must be in a block, then finally:
794+
795+
```
796+
error: 7: Invalid use of function result id '1[%1]'.
797+
OpExecutionGlobalSizeTALVOS %1 16 1 1
798+
```
799+
800+
For each, searching for the message (e.g. "Invalid use of function result id") would yield a block of code, like:
801+
802+
803+
```c++
804+
for (auto& pair : inst->uses()) {
805+
const auto* use = pair.first;
806+
if (std::find(acceptable.begin(), acceptable.end(), use->opcode()) ==
807+
acceptable.end() &&
808+
!use->IsNonSemantic() && !use->IsDebugInfo()) {
809+
return _.diag(SPV_ERROR_INVALID_ID, use)
810+
<< "Invalid use of function result id " << _.getIdName(inst->id())
811+
<< ".";
812+
}
813+
}
814+
```
815+
816+
and then it was just a matter of taking a different branch, i.e. adding `spv::Op::OpExecutionGlobalSizeTALVOS` to the end of the "acceptable" declaration:
817+
818+
```diff
819+
diff --git a/source/val/validate_function.cpp b/source/val/validate_function.cpp
820+
index 639817fe..9bd52993 100644
821+
--- a/source/val/validate_function.cpp
822+
+++ b/source/val/validate_function.cpp
823+
@@ -86,7 +86,8 @@ spv_result_t ValidateFunction(ValidationState_t& _, const Instruction* inst) {
824+
spv::Op::OpGetKernelPreferredWorkGroupSizeMultiple,
825+
spv::Op::OpGetKernelLocalSizeForSubgroupCount,
826+
spv::Op::OpGetKernelMaxNumSubgroups,
827+
- spv::Op::OpName};
828+
+ spv::Op::OpName,
829+
+ spv::Op::OpExecutionGlobalSizeTALVOS};
830+
for (auto& pair : inst->uses()) {
831+
const auto* use = pair.first;
832+
if (std::find(acceptable.begin(), acceptable.end(), use->opcode()) ==
833+
```
834+
835+
At this point we're back in the "Unimplemented ..." pipe (as above).
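
The pattern in that validator check is an allow-list lookup: a use of a function's result id passes only if the using opcode is in `acceptable`. A stripped-down C++ sketch of just that logic follows; the `Op` enum here is a small stand-in invented for illustration (the real one comes from the SPIR-V headers) and `is_acceptable_use` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in opcodes for illustration only; not the real spv::Op enum.
enum class Op {
  OpEntryPoint,
  OpExecutionMode,
  OpName,
  OpExecutionGlobalSizeTALVOS,
  OpLoad
};

// True when `use` appears in the allow-list, mirroring the std::find
// comparison against acceptable.end() in the validator loop.
inline bool is_acceptable_use(const std::vector<Op>& acceptable, Op use) {
  assert(!acceptable.empty() && "an empty allow-list rejects everything");
  return std::find(acceptable.begin(), acceptable.end(), use) !=
         acceptable.end();
}
```

Appending the new opcode to the list is the whole fix, which matches the two-line diff: nothing else about the check changes.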
836+
837+
## `OpBufferTALVOS`
838+
839+
Idea is to replace:
840+
841+
```tcf
842+
BUFFER a 64 UNINIT
843+
DESCRIPTOR_SET 0 0 0 a
844+
845+
# ...
846+
847+
DUMP UINT32 a
848+
```
849+
850+
and
851+
852+
```spirv
853+
OpDecorate %buf0 DescriptorSet 0
854+
OpDecorate %buf0 Binding 0
855+
856+
; ...
857+
858+
%_arr_uint32_t = OpTypeRuntimeArray %uint32_t
859+
%_arr_StorageBuffer_uint32_t = OpTypePointer StorageBuffer %_arr_uint32_t
860+
861+
; ...
862+
863+
%buf0 = OpVariable %_arr_StorageBuffer_uint32_t StorageBuffer
864+
```
865+
866+
with something like:
867+
868+
```spirv
869+
%_arr_uint32_t = OpTypeRuntimeArray %uint32_t
870+
871+
%buf0 = OpBufferTALVOS 64 %_arr_uint32_t StorageBuffer "a"
872+
```
873+
874+
and have Talvos automagically DUMP all (named) buffers after execution.
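
As a rough picture of what "DUMP all named buffers" could amount to on the host side, here is a C++ sketch that walks one named byte buffer and prints it as `uint32_t` values. Everything in it is an assumption for illustration: `NamedBuffer`, `dump_uint32` and the `name[i] = value` output format are made up here and are not Talvos's actual types or DUMP output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for a buffer declared with a name, e.g. "a".
struct NamedBuffer {
  std::string name;
  std::vector<std::uint8_t> bytes;
};

// Formats the buffer as one "name[i] = value" line per 32-bit element.
inline std::string dump_uint32(const NamedBuffer& buf) {
  assert(buf.bytes.size() % sizeof(std::uint32_t) == 0 &&
         "buffer size must be a whole number of uint32_t elements");
  std::ostringstream out;
  for (std::size_t off = 0; off < buf.bytes.size();
       off += sizeof(std::uint32_t)) {
    std::uint32_t value = 0;
    // memcpy avoids alignment and aliasing problems on the raw bytes.
    std::memcpy(&value, buf.bytes.data() + off, sizeof(value));
    out << buf.name << '[' << off / sizeof(std::uint32_t) << "] = " << value
        << '\n';
  }
  return out.str();
}
```

The point of the sketch is only that the name and the byte size are enough to drive the dump, which is what the `64` and `"a"` operands would carry.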
875+
876+
TBD:
877+
878+
- [x] Should we have the %_arr_StorageBuffer_uint32_t type as well?
879+
- [~] Or can we build that up from the `OpBufferTALVOS` arguments? (should we?)
880+
- [ ] maybe we have an `OpBufferTypeTALVOS` ? Or a `BufferTALVOS` storage class?
881+
882+
We do "need" it, because else we see:
883+
884+
```
885+
error: 25: The Base <id> '13[%13]' in OpAccessChain instruction must be a pointer.
886+
%17 = OpAccessChain %_ptr_StorageBuffer_uint %13 %16
887+
```
888+
889+
so something needs to be an `OpTypePointer`, and it's probably not worth overloading the whole result type machinery to special case just `OpBufferTALVOS` to return a pointer-wrapped type.
890+
891+
We still might want an `OpBufferTypeTALVOS` and/or a special storage class; those would both restrict the type argument in about the same way, so it's not clear what the buffer type would give us.
892+
893+
The main benefits of being explicit here are:
894+
1. We can invoke it with some capability other than `Shader`
895+
2. It's less surprising than overloading StorageBuffer with dump behavior (?), and it's a trivial remapping to change to the StorageBuffer storage class to get it working outside Talvos.
896+
897+
And potentially:
898+
899+
3. We might add an optional flags parameter to control talvos-specific behaviors; too soon to say if that's really useful though.
900+
901+
902+
- [ ] should we leave the `OpVariable` thing as-is ...
903+
- [ ] and just decorate the buffer with a (mostly) non-semantic `OpBufferTALVOS` ?
904+
- [ ] and just literally decorate with an entirely non-semantic `OpDecorate %buf0 BufferTALVOS` ?
905+
906+
Well, we had to fudge the order, at least, and will probably have to do the `_StorageBuffer_` type bits. Too bad, `StorageBuffer` requires `OpCapability Shader` & is kind of semantically redundant.
907+
908+
Perhaps instead, a `BufferTALVOS` _StorageClass_ w/ `OpName %... "a"` ?
909+
910+
### (sort of) aside: what the heck is an `OpAccessChain` ?
911+
912+
```
913+
; given %buf0 ty is `uint32_t[]*` (in StorageBuffer)
914+
; and %3 is a uint32_t offset
915+
%4 = OpAccessChain %_ptr_StorageBuffer_uint32_t %buf0 %3
916+
```
917+
918+
%4 is a ptr to a uint32_t, aka `uint32_t*`, offset into the _array_ by %3 "steps"? .... how?
919+
920+
Ok, so if `buf0` is `0x1000`, this breaks down to roughly:
921+
922+
0x1000 ; "base"
923+
+ (4 ; sizeof(uint32_t)
924+
* %3) ; element-wise offset
925+
---------
926+
0x103c ; when %3 == 15
927+
928+
Which, when interpreted as a `uint32_t *` sure could be right...
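
The same arithmetic as a small C++ sketch, treating a one-index access chain into an array as pure address math with no memory read. `access_chain_addr` is a hypothetical model written for these notes, and the `length` parameter exists only so the sketch can bounds-check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not Talvos code: base + stride * index.
// `array_stride` plays the role of an ArrayStride decoration (4 for uint32_t).
inline std::uint64_t access_chain_addr(std::uint64_t base,
                                       std::uint64_t array_stride,
                                       std::uint64_t index,
                                       std::uint64_t length) {
  assert(index < length && "index must stay inside the array");
  return base + array_stride * index;
}
```

Calling `access_chain_addr(0x1000, 4, 15, 16)` yields `0x103c`, the value worked out by hand above.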
929+
930+
why does this feel weird? because `uint32_t[]*` ought to be an alternate spelling of `uint32_t**`, which means we ought to have something like `0x1040` in `buf0`, which points to an 8-byte-wide slot containing `0x1000`; so maybe OpAccessChain contains an implicit deref on its first argument? i.e. it's not `(base) + offset`, it's `*(base) + offset`?
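
One way to poke at that intuition, at least in C++ terms: a pointer to an array is not a pointer to a pointer. It holds the array's own address, and "dereferencing" it to index an element is just address arithmetic, with no pointer-sized slot being loaded. The sketch below assumes a fixed 16-element array to match the 64-byte buffer; `Buf`, `element_ptr` and `byte_offset` are names invented for illustration. Whether Talvos models `OpAccessChain` exactly this way is an assumption here, though it is consistent with the instruction producing a pointer rather than a value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 16-element array type; `Buf*` is the C++ spelling of "pointer to array".
using Buf = std::uint32_t[16];

// Address of element `index`, reached through the pointer-to-array.
// `*base` names the array itself; nothing is read from memory.
inline std::uint32_t* element_ptr(Buf* base, std::size_t index) {
  assert(index < 16 && "index must stay inside the array");
  return &(*base)[index];
}

// Distance in bytes from the start of the array to element `index`.
inline std::ptrdiff_t byte_offset(Buf* base, std::size_t index) {
  return reinterpret_cast<char*>(element_ptr(base, index)) -
         reinterpret_cast<char*>(base);
}
```

Here `byte_offset(base, 15)` is 60 (`0x3c`), and the pointer-to-array compares equal to the address of element 0, so the `(base) + offset` reading holds without any hidden extra load.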

content/talvos/fill.test.ts

+1-1
@@ -18,7 +18,7 @@ async function getContents(c: Collection, f: Id) {
1818
return file!.data.contents;
1919
}
2020

21-
describe('fill_idx', async () => {
21+
describe('fill', async () => {
2222
let [stdout, stderr] = ['', ''];
2323

2424
beforeEach(() => {

content/talvos/fill_idx.spvasm

+25-30
@@ -1,52 +1,47 @@
11
; SPIR-V
2-
; Version: 1.3
3-
OpCapability Shader ; TODO we need this for descriptors? really?
2+
; Version: 1.5
43
OpCapability Kernel
4+
OpCapability BuffersTALVOS
5+
OpCapability ExecTALVOS
6+
OpCapability PhysicalStorageBufferAddresses
7+
OpExtension "SPV_TALVOS_buffers"
8+
OpExtension "SPV_TALVOS_exec"
59
OpMemoryModel Logical OpenCL
610

711
OpEntryPoint Kernel %main_fn "main" %gl_GlobalInvocationID
812

9-
; TODO instead of DISPATCH, one of these:
10-
;OpExecutionMode %main_fn LocalSize 16 1 1
11-
;OpExecutionMode %main_fn GlobalSize 16 1 1
13+
OpExecutionGlobalSizeTALVOS %main_fn 16 1 1
1214

13-
; TODO[seth]: is this a spec misread in talvos? Shouldn't the DISPATCH map to local size?
14-
; (or is it execution mode dependent?)
1515
OpDecorate %gl_GlobalInvocationID BuiltIn GlobalInvocationId
16+
OpDecorate %_arr_uint32_t ArrayStride 4
1617

17-
; TODO instead of descriptors, static allocation via `OpBufferTALVOS 64` (maybe w/ %ty? and/or name?)
18-
; TODO how do people pass data back & forth w/ OpenCL kernels for real?
19-
OpDecorate %buf0 DescriptorSet 0
20-
OpDecorate %buf0 Binding 0
18+
; types
19+
%void_t = OpTypeVoid
20+
%void_fn_t = OpTypeFunction %void_t
21+
%uint32_t = OpTypeInt 32 0
22+
%gbl_id_t = OpTypeVector %uint32_t 3
2123

22-
; TODO instead of `DUMP`, something like `OpDumpAtEndTALVOS %buffer_id` ?
23-
; Or perhaps that's just implied by `OpBufferTALVOS` ?
24+
%arr_len = OpConstant %uint32_t 16
25+
%_arr_uint32_t = OpTypeArray %uint32_t %arr_len
2426

25-
; types
26-
%void_t = OpTypeVoid
27-
%void_fn_t = OpTypeFunction %void_t
28-
%uint32_t = OpTypeInt 32 0
29-
%gbl_id_t = OpTypeVector %uint32_t 3
27+
%_ptr_Input_gbl_id_t = OpTypePointer Input %gbl_id_t
28+
%_ptr_Input_uint32_t = OpTypePointer Input %uint32_t
3029

31-
%_arr_uint32_t = OpTypeRuntimeArray %uint32_t
30+
%_ptr_PhysicalStorageBuffer_uint32_t = OpTypePointer PhysicalStorageBuffer %uint32_t
31+
%_arr_PhysicalStorageBuffer_uint32_t = OpTypePointer PhysicalStorageBuffer %_arr_uint32_t
3232

33-
%_ptr_StorageBuffer_uint32_t = OpTypePointer StorageBuffer %uint32_t
34-
%_arr_StorageBuffer_uint32_t = OpTypePointer StorageBuffer %_arr_uint32_t
35-
%_ptr_Input_gbl_id_t = OpTypePointer Input %gbl_id_t
36-
%_ptr_Input_uint32_t = OpTypePointer Input %uint32_t
3733

38-
39-
; global arguments & constants
34+
; global arguments & constants
4035
%n = OpConstant %uint32_t 0
4136
%gl_GlobalInvocationID = OpVariable %_ptr_Input_gbl_id_t Input
42-
%buf0 = OpVariable %_arr_StorageBuffer_uint32_t StorageBuffer
37+
%buf0 = OpBufferTALVOS %_arr_PhysicalStorageBuffer_uint32_t PhysicalStorageBuffer 64 "a"
4338

44-
; FILL_IDX entry point
39+
; FILL_IDX entry point
4540
%main_fn = OpFunction %void_t None %void_fn_t
4641
%0 = OpLabel
4742
%2 = OpAccessChain %_ptr_Input_uint32_t %gl_GlobalInvocationID %n
48-
%3 = OpLoad %uint32_t %2
49-
%4 = OpAccessChain %_ptr_StorageBuffer_uint32_t %buf0 %3
50-
OpStore %4 %3
43+
%3 = OpLoad %uint32_t %2 Aligned 4
44+
%4 = OpAccessChain %_ptr_PhysicalStorageBuffer_uint32_t %buf0 %3
45+
OpStore %4 %3 Aligned 4
5146
OpReturn
5247
OpFunctionEnd
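
For reference, a host-side C++ model of what this kernel is expected to leave in buffer `a`: each invocation stores its own x global id at that index. This is a sketch under stated assumptions, not Talvos code: `fill_idx_reference` is a hypothetical name, the loop stands in for the 16 invocations implied by `OpExecutionGlobalSizeTALVOS %main_fn 16 1 1`, and the buffer is zero-initialized here even though the original `.tcf` declared it `UNINIT`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential stand-in for the kernel: buf[gid] = gid for each invocation.
inline std::vector<std::uint32_t> fill_idx_reference(
    std::uint32_t global_size_x, std::size_t buffer_bytes) {
  std::vector<std::uint32_t> buf(buffer_bytes / sizeof(std::uint32_t), 0);
  assert(global_size_x <= buf.size() && "every invocation needs a slot");
  for (std::uint32_t gid = 0; gid < global_size_x; ++gid) {
    buf[gid] = gid;  // mirrors OpAccessChain + OpStore %4 %3
  }
  return buf;
}
```

`fill_idx_reference(16, 64)` produces the values 0 through 15, which is what a dump of the 64-byte buffer should show after `EXEC`.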

content/talvos/fill_idx.tcf

+5
@@ -0,0 +1,5 @@
1+
# MODULE fill_idx.spvasm
2+
# ENTRY main
3+
4+
EXEC
5+

content/talvos/fill_idx.test.ts

+1-1
@@ -56,7 +56,7 @@ describe('fill_idx', async () => {
5656
test_entry(
5757
await getContents('talvos', 'fill_idx.spvasm'),
5858
'main',
59-
await getContents('talvos', 'fill.tcf'),
59+
await getContents('talvos', 'fill_idx.tcf'),
6060
)
6161

6262
expect(stderr).to.be.empty;

lib/talvos.ts

+1-1
@@ -85,7 +85,7 @@ export class _EntryPoints {
8585
export class Talvos$$Module {
8686
constructor(public ptr: Ptr) { }
8787
static get SIZE() {
88-
return 104;
88+
return 128;
8989
}
9090

9191
get EntryPoints() {
