Improve Windows installation user experience #3386
Did you check whether UR headers are available on Windows? I checked with the UR team; they should be shipped with the toolkit. If they're available, using UR would align with the goal of using lower-level runtimes directly, potentially avoiding SYCL overheads, and would still be available on Windows.
The toolkit exposes the UR API, but AFAIK the goal is not to rely on the toolkit. There's no technical problem with exposing the headers through a package via pip (if that's not already done); it would be an additional dependency, though. I did not measure the overhead, so I'm not sure how different it is. We would still need to rely on the SYCL RT to wrap UR objects and still use it to submit kernels for execution, so I guessed it wouldn't add much. If that assumption is wrong, the option is worth a try.
Where's the SYCL implementation coming from if not from the toolkit?
Why? Can't we just use UR to launch the kernel?
The implementation comes from pip/conda/whatever else you use to install PyTorch.
The answer is no for Level Zero, and I'd assume the same for Unified Runtime. There is a synchronization issue: PyTorch needs to wait on an event pushed to a SYCL queue so as not to break the ordering, but a native queue is not drained before you submit the next job (maybe that's somehow handled by UR? @kbenzie, @pbalcer, could you comment?). And, of course, there's the dull fact that PyTorch passes a SYCL queue as the stream.
There's no special handling in UR; synchronization is done via event handles, so you'd need a way to access the event handles of commands enqueued as part of the PyTorch execution (via the SYCL RT?). If you can access those, you can synchronize, and enqueuing to UR should be doable.
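To make the event-handle idea concrete, here is a minimal sketch, assuming you already hold a `ur_event_handle_t` for the PyTorch-side work (obtaining it is the open question above) and that the UR queue is in-order:

```cpp
#include <ur_api.h>

// Minimal sketch: order subsequent UR submissions after PyTorch-side work.
// `pytorchDone` is a hypothetical event handle for the already-enqueued
// PyTorch commands; how to obtain it (via the SYCL RT?) is the open question.
void wait_on_pytorch(ur_queue_handle_t queue, ur_event_handle_t pytorchDone) {
    // Enqueues a command that waits on the given events; on an in-order
    // queue, commands submitted after this point execute only once
    // `pytorchDone` has signaled.
    urEnqueueEventsWait(queue, /*numEventsInWaitList=*/1, &pytorchDone,
                        /*phEvent=*/nullptr);
}
```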
Thanks, Benie! So this synchronization problem is the reason we do the wrapping today. From this answer and my original assumption, it looks like we would hit the same problem. For context, I'm not against doing this through UR; in fact, that is what I was advocating for a year ago. I also think it is doable, but it will require additional analysis (I have even more unanswered questions than the ones raised here) and potentially non-trivial changes to components other than Triton.
The problem we are trying to solve right now is that the L0 headers are not available through any reasonable installation channel on Windows. This prompted an investigation into whether we could use SYCL entirely without L0. However, recent events have left me (and I think @kurapov-peter agrees) less convinced that standalone SYCL is the right choice. PyTorch has some fairly complex code around managing SYCL queues, so unless the kernel launch overhead is really high, it's probably easier to keep the current system in place for kernel launching.

Back to the problem at hand: we use L0 to compile device code, which we do at runtime, so we need the L0 headers available at runtime. If the UR headers are available through the same installation channels we use for the toolkit (conda/pip/etc.), then UR may be a viable option. But we need all the functionality of L0: querying spilled memory and kernel properties, compiling SPIR-V or native code from a cached binary, etc. It would also be nice to know that UR will maintain feature parity with L0 going forward. For example, the memory spill info we get from L0 is currently pretty basic; I suspect it will be improved over time, and we won't want to wait too long to get access to those features.
SYCL does not expose UR native objects; there's no interop API for UR like there is for L0. It would not be possible to, e.g., submit a kernel to an existing SYCL queue through the UR API (bypassing the SYCL RT). Otherwise, as @kbenzie said, I don't see a problem with using events for synchronization between queues at different layers. SYCL also has an API for retrieving the last event on a queue: https://github.com/intel/llvm/blob/a8f30a9212d92068a7649c61a75432f4a66e3ab8/sycl/doc/extensions/experimental/sycl_ext_oneapi_in_order_queue_events.asciidoc#L99. Theoretically you can use this to retrieve a native event and then submit whatever L0 (or UR, provided that we add a way to get a native event) command you want, synchronized on that last event.
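A minimal sketch of that interop path, assuming `ext_oneapi_get_last_event()` returns a `sycl::event` as in the revision linked above (later revisions return `std::optional<sycl::event>`) and that the queue is in-order and backed by Level Zero:

```cpp
#include <level_zero/ze_api.h>
#include <sycl/ext/oneapi/backend/level_zero.hpp>
#include <sycl/sycl.hpp>

// Sketch: append an L0 kernel launch that is ordered after everything
// already submitted to an in-order, L0-backed SYCL queue.
void launch_after_sycl_work(sycl::queue &q,
                            ze_command_list_handle_t cmdList,
                            ze_kernel_handle_t kernel,
                            ze_group_count_t groupCount) {
    // sycl_ext_oneapi_in_order_queue_events: last event on the queue.
    sycl::event last = q.ext_oneapi_get_last_event();
    // L0 interop: the native event handle behind the SYCL event.
    ze_event_handle_t zeEvent =
        sycl::get_native<sycl::backend::ext_oneapi_level_zero>(last);
    // The launch waits on zeEvent, so it runs after the SYCL-side work.
    zeCommandListAppendLaunchKernel(cmdList, kernel, &groupCount,
                                    /*hSignalEvent=*/nullptr,
                                    /*numWaitEvents=*/1, &zeEvent);
}
```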
@kbenzie and @pbalcer can probably provide the official comment on support here, but I would think that anything in L0 could also be exposed in UR, potentially in extensions if other backends don't have that support, which is essentially what @kurapov-peter did in oneapi-src/unified-runtime#2614.
Yeah, I don't see any issues adding UR APIs needed by higher-level languages/frameworks other than SYCL. UR was always meant to be universal. |
The problem
On Windows, the system driver ships without the L0 development package (read: no L0 headers). Putting the SDK there is not a solution, since the driver store location is not searchable (the package contains LIBs and headers that will not be visible to applications). Currently, Triton uses this API in its driver (host stub) to load and build modules, compile kernels, retrieve compiled kernel metadata, and perform platform discovery.
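For context, a minimal sketch of the kind of L0 calls the host stub makes (module build from SPIR-V plus compiled-kernel metadata, including spill size); all of this is what requires the L0 headers at runtime. Handle creation and error handling are omitted:

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Sketch: build a SPIR-V module and read compiled-kernel metadata with L0.
ze_kernel_properties_t compile_and_query(ze_context_handle_t ctx,
                                         ze_device_handle_t dev,
                                         const std::vector<uint8_t> &spirv,
                                         const char *kernelName) {
    ze_module_desc_t mdesc = {ZE_STRUCTURE_TYPE_MODULE_DESC};
    mdesc.format = ZE_MODULE_FORMAT_IL_SPIRV;
    mdesc.inputSize = spirv.size();
    mdesc.pInputModule = spirv.data();
    ze_module_handle_t module = nullptr;
    zeModuleCreate(ctx, dev, &mdesc, &module, /*phBuildLog=*/nullptr);

    ze_kernel_desc_t kdesc = {ZE_STRUCTURE_TYPE_KERNEL_DESC};
    kdesc.pKernelName = kernelName;
    ze_kernel_handle_t kernel = nullptr;
    zeKernelCreate(module, &kdesc, &kernel);

    ze_kernel_properties_t props = {ZE_STRUCTURE_TYPE_KERNEL_PROPERTIES};
    zeKernelGetProperties(kernel, &props);
    // props.spillMemSize / props.privateMemSize are the metadata Triton reads.
    return props;
}
```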
Solution
There are multiple approaches to resolving this. Here I summarize the explored solution of replacing L0 calls with SYCL equivalents. This contradicts the direction the runtime analysis points to (I wrote on the runtime choice quite some time ago: https://github.com/intel/intel-xpu-backend-for-triton/blob/main/docs/ARCHITECTURE.md#runtime; the analysis considers L0, UR, and SYCL), but it does solve the problem at hand. It is also not a great approach if
Most of the APIs required to replace L0 are already present in SYCL. Here is a prototype that removes the dependency on L0 in the default path of Triton's runtime: https://github.com/intel/intel-xpu-backend-for-triton/tree/pakurapo/sycl-runtime. It currently works for the SPIR-V path (native binary caching is not supported) and lacks spill querying. Compilation is performed via the experimental kernel compiler SYCL extension (see https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler_spirv.asciidoc); a sketch of that path follows below.
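For reference, here is roughly what the SPIR-V path looks like with that extension. This is a sketch following the extension document, not the prototype's exact code:

```cpp
#include <sycl/sycl.hpp>
#include <string>
#include <vector>

namespace syclex = sycl::ext::oneapi::experimental;

// Sketch of the SPIR-V path via sycl_ext_oneapi_kernel_compiler_spirv:
// wrap a SPIR-V blob in a source-state bundle, build it, fetch the kernel.
sycl::kernel compile_spirv(sycl::queue &q,
                           const std::vector<std::byte> &spirv,
                           const std::string &kernelName) {
    auto srcBundle = syclex::create_kernel_bundle_from_source(
        q.get_context(), syclex::source_language::spirv, spirv);
    // Builds (JITs/finalizes) the bundle to executable state.
    auto execBundle = syclex::build(srcBundle);
    return execBundle.ext_oneapi_get_kernel(kernelName);
}
```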
Current status
The two missing pieces to cover all the required capabilities are:

1. Native binary caching (only the SPIR-V path works in the prototype).
2. Spill querying (retrieving spilled-memory info for compiled kernels).