Improve Windows installation user experience #3386
Did you check whether UR headers are available on Windows? I checked with the UR team; they should be shipped with the toolkit. If they're available, using UR would align with the goal of using lower-level runtimes directly, potentially avoiding SYCL overheads, and would still be available on Windows.
The toolkit exposes the UR API, but AFAIK the goal is not to rely on the toolkit. There's no technical problem with exposing the headers through a package via pip (if that's not already done); it would be an additional dependency, though. I did not measure the overhead, so I'm not sure how different it is. We would still need to rely on the SYCL RT to wrap UR objects and still use it to submit kernels for execution, so I guessed it wouldn't add much. If that assumption is wrong, the option is worth a try.
Where's the SYCL implementation coming from if not from the toolkit?
Why? Can't we just use UR to launch the kernel?
The implementation comes from pip/conda/whatever else you use to install PyTorch.
The answer is no for Level Zero, and I'd assume the same for Unified Runtime. There is a synchronization issue: PyTorch needs to wait on an event pushed to a SYCL queue so as not to break the ordering, but a native queue is not drained before you submit the next job (maybe that's somehow handled by UR? @kbenzie, @pbalcer, could you comment?). And, of course, there's the dull fact that PyTorch passes a SYCL queue as the stream.
There's no special handling in UR; synchronization is done via event handles, so you'd need a way to access the event handles of commands enqueued as part of the PyTorch execution (via the SYCL RT?). If you can access those, you can synchronize, and enqueuing to UR should be doable.
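To make the event-handle idea concrete, here is a minimal sketch, assuming you already hold a `ur_event_handle_t` for the PyTorch-side work (obtaining it is the open question above) and that the UR queue is in-order:

```cpp
#include <ur_api.h>

// Minimal sketch: order subsequent UR submissions after PyTorch-side work.
// `pytorchDone` is a hypothetical event handle for the already-enqueued
// PyTorch commands; how to obtain it (via the SYCL RT?) is the open question.
void wait_on_pytorch(ur_queue_handle_t queue, ur_event_handle_t pytorchDone) {
    // Enqueues a command that waits on the given events; on an in-order
    // queue, commands submitted after this point execute only once
    // `pytorchDone` has signaled.
    urEnqueueEventsWait(queue, /*numEventsInWaitList=*/1, &pytorchDone,
                        /*phEvent=*/nullptr);
}
```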
Thanks, Benie! So this synchronization problem is the reason we do the wrapping today. From this answer and my original assumption, it looks like we would hit the same problem. For context, I'm not against doing this through UR; in fact, that is what I was advocating for a year ago. I also think it is doable, but it will require additional analysis (I have even more unanswered questions than the ones raised here) and potentially non-trivial changes to components other than Triton.
The problem we are trying to solve right now is that the L0 headers are not available through any reasonable installation channel on Windows. This prompted an investigation into whether we could use SYCL entirely without L0. However, recent events have left me (and I think @kurapov-peter agrees) less convinced that standalone SYCL is the right choice. PyTorch has some fairly complex code around managing SYCL queues, so unless the kernel launch overhead is really high, it's probably easier to keep the current system in place for kernel launching.

Back to the problem at hand: we use L0 to compile device code, which we do at runtime, so we need the L0 headers available at runtime. If the UR headers are available through the same installation channels we use for the toolkit (conda/pip/etc.), then UR may be a viable option. But we need all the functionality of L0: querying spilled memory and kernel properties, compiling SPIR-V or native code from a cached binary, etc. It would also be nice to know that UR will maintain feature parity with L0 going forward. For example, the memory spill info we get from L0 is currently pretty basic; I suspect it will be improved over time, and we won't want to wait too long to get access to those features.
SYCL does not expose UR native objects; there's no interop API for UR like there is for L0. It would not be possible to, e.g., submit a kernel to an existing SYCL queue through the UR API (bypassing the SYCL RT). Otherwise, as @kbenzie said, I don't see a problem with using events for synchronization between queues at different layers. SYCL also has an API for retrieving the last event on a queue: https://github.com/intel/llvm/blob/a8f30a9212d92068a7649c61a75432f4a66e3ab8/sycl/doc/extensions/experimental/sycl_ext_oneapi_in_order_queue_events.asciidoc#L99. Theoretically you can use this to retrieve a native event and then submit whatever L0 (or UR, provided that we add a way to get a native event) command you want, synchronized on that last event.
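A minimal sketch of that interop path, assuming `ext_oneapi_get_last_event()` returns a `sycl::event` as in the revision linked above (later revisions return `std::optional<sycl::event>`) and that the queue is in-order and backed by Level Zero:

```cpp
#include <level_zero/ze_api.h>
#include <sycl/ext/oneapi/backend/level_zero.hpp>
#include <sycl/sycl.hpp>

// Sketch: append an L0 kernel launch that is ordered after everything
// already submitted to an in-order, L0-backed SYCL queue.
void launch_after_sycl_work(sycl::queue &q,
                            ze_command_list_handle_t cmdList,
                            ze_kernel_handle_t kernel,
                            ze_group_count_t groupCount) {
    // sycl_ext_oneapi_in_order_queue_events: last event on the queue.
    sycl::event last = q.ext_oneapi_get_last_event();
    // L0 interop: the native event handle behind the SYCL event.
    ze_event_handle_t zeEvent =
        sycl::get_native<sycl::backend::ext_oneapi_level_zero>(last);
    // The launch waits on zeEvent, so it runs after the SYCL-side work.
    zeCommandListAppendLaunchKernel(cmdList, kernel, &groupCount,
                                    /*hSignalEvent=*/nullptr,
                                    /*numWaitEvents=*/1, &zeEvent);
}
```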
@kbenzie and @pbalcer can probably provide the official comment on support here, but I would think that anything in L0 could also be exposed in UR, potentially in extensions if other backends don't have that support, which is essentially what @kurapov-peter did in oneapi-src/unified-runtime#2614.
Yeah, I don't see any issues adding UR APIs needed by higher-level languages/frameworks other than SYCL. UR was always meant to be universal. |
The problem
On Windows, the system driver ships without the L0 development package (read: no L0 headers). Putting the SDK there is not a solution, since the driver store location is not searchable (the package contains LIBs and headers that will not be visible to applications). Currently, Triton uses this API in its driver (host stub) to load and build modules, compile kernels, retrieve compiled kernel metadata, and perform platform discovery.
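For context, a minimal sketch of the kind of L0 calls the host stub makes (module build from SPIR-V plus compiled-kernel metadata, including spill size); all of this is what requires the L0 headers at runtime. Handle creation and error handling are omitted:

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Sketch: build a SPIR-V module and read compiled-kernel metadata with L0.
ze_kernel_properties_t compile_and_query(ze_context_handle_t ctx,
                                         ze_device_handle_t dev,
                                         const std::vector<uint8_t> &spirv,
                                         const char *kernelName) {
    ze_module_desc_t mdesc = {ZE_STRUCTURE_TYPE_MODULE_DESC};
    mdesc.format = ZE_MODULE_FORMAT_IL_SPIRV;
    mdesc.inputSize = spirv.size();
    mdesc.pInputModule = spirv.data();
    ze_module_handle_t module = nullptr;
    zeModuleCreate(ctx, dev, &mdesc, &module, /*phBuildLog=*/nullptr);

    ze_kernel_desc_t kdesc = {ZE_STRUCTURE_TYPE_KERNEL_DESC};
    kdesc.pKernelName = kernelName;
    ze_kernel_handle_t kernel = nullptr;
    zeKernelCreate(module, &kdesc, &kernel);

    ze_kernel_properties_t props = {ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES};
    zeKernelGetProperties(kernel, &props);
    // props.spillMemSize / props.privateMemSize are the metadata Triton reads.
    return props;
}
```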
Solution
There are multiple approaches to resolving this. Here I summarize the explored solution of replacing L0 calls with SYCL equivalents. This contradicts the direction the runtime analysis points to (I wrote on the runtime choice quite some time ago: https://github.com/intel/intel-xpu-backend-for-triton/blob/main/docs/ARCHITECTURE.md#runtime; the analysis considers L0, UR, and SYCL), but it does solve the problem at hand. It is also not a great approach if
Most of the APIs required to replace L0 are already present in SYCL. Here is a prototype that removes the dependency on L0 in the default path of Triton's runtime: https://github.com/intel/intel-xpu-backend-for-triton/tree/pakurapo/sycl-runtime. It currently works for the SPIR-V path (native binary caching is not supported) and lacks spill querying. Compilation is performed via the experimental kernel compiler SYCL extension (see https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler_spirv.asciidoc); a sketch of that path follows below.
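For reference, here is roughly what the SPIR-V path looks like with that extension. This is a sketch following the extension document, not the prototype's exact code:

```cpp
#include <sycl/sycl.hpp>
#include <string>
#include <vector>

namespace syclex = sycl::ext::oneapi::experimental;

// Sketch of the SPIR-V path via sycl_ext_oneapi_kernel_compiler_spirv:
// wrap a SPIR-V blob in a source-state bundle, build it, fetch the kernel.
sycl::kernel compile_spirv(sycl::queue &q,
                           const std::vector<std::byte> &spirv,
                           const std::string &kernelName) {
    auto srcBundle = syclex::create_kernel_bundle_from_source(
        q.get_context(), syclex::source_language::spirv, spirv);
    // Builds (JITs/finalizes) the bundle to executable state.
    auto execBundle = syclex::build(srcBundle);
    return execBundle.ext_oneapi_get_kernel(kernelName);
}
```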
Current status
The two missing pieces to cover all the required capabilities are:

1. Native binary caching (only the SPIR-V path works in the prototype).
2. Spill querying (retrieving spilled-memory info for compiled kernels).