bindless_images/sampling_3D.cpp and bindless_images/sampling_2D.cpp tests failing with UR_RESULT_ERROR_UNSUPPORTED_FEATURE on HIP/AMD #16933

Open
againull opened this issue Feb 7, 2025 · 11 comments
Labels
bug Something isn't working hip Issues related to execution on HIP backend.

@againull
Contributor

againull commented Feb 7, 2025

Describe the bug

These tests are failing on unrelated changes, see:
https://github.com/intel/llvm/actions/runs/13208597080/job/36880810856?pr=16882
https://github.com/intel/llvm/actions/runs/13209448038/job/36880861128?pr=16932

2025-02-07T23:33:28.1415303Z ninja: Entering directory `build-e2e'
2025-02-07T23:33:28.1415648Z [0/1] Running SYCL End-to-End tests
2025-02-07T23:33:28.1415932Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:516: note: Targeted devices: hip:gpu
2025-02-07T23:33:28.1416317Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:696: note: Found pre-installed AOT device compiler ocloc
2025-02-07T23:33:28.1416700Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:696: note: Found pre-installed AOT device compiler opencl-aot
2025-02-07T23:33:28.1417976Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:799: note: Aspects for hip:gpu: ext_intel_legacy_image, fp16, queue_profiling, ext_oneapi_bindless_images_1d_usm, ext_intel_free_memory, ext_oneapi_bindless_images_2d_usm, online_compiler, usm_atomic_host_allocations, ext_oneapi_bindless_images_shared_usm, ext_intel_memory_bus_width, ext_intel_pci_address, usm_host_allocations, usm_device_allocations, ext_oneapi_graph, ext_intel_memory_clock_rate, ext_oneapi_limited_graph, ext_intel_device_id, gpu, ext_oneapi_native_assert, online_linker, ext_intel_device_info_uuid, ext_oneapi_queue_profiling_tag, usm_shared_allocations, atomic64, fp64, usm_atomic_shared_allocations, ext_oneapi_bindless_images
2025-02-07T23:33:28.1419355Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:811: note: SG sizes for hip:gpu: 32
2025-02-07T23:33:28.1419780Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:820: note: Architectures for hip:gpu: amd_gpu_gfx1031
2025-02-07T23:33:28.1420053Z -- Testing: 2274 tests, 24 workers --
2025-02-07T23:33:28.1420402Z FAIL: SYCL :: bindless_images/sampling_3D.cpp (2201 of 2274)
2025-02-07T23:33:28.1420644Z ******************** TEST 'SYCL :: bindless_images/sampling_3D.cpp' FAILED ********************
2025-02-07T23:33:28.1420849Z Exit Code: 1
2025-02-07T23:33:28.1420920Z 
2025-02-07T23:33:28.1420975Z Command Output (stdout):
2025-02-07T23:33:28.1421108Z --
2025-02-07T23:33:28.1421217Z # RUN: at line 6
2025-02-07T23:33:28.1421816Z /__w/llvm/llvm/toolchain/bin//clang++  -Werror -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1031  /__w/llvm/llvm/llvm/sycl/test-e2e/bindless_images/sampling_3D.cpp -o /__w/llvm/llvm/build-e2e/bindless_images/Output/sampling_3D.cpp.tmp.out
2025-02-07T23:33:28.1422885Z # executed command: /__w/llvm/llvm/toolchain/bin//clang++ -Werror -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1031 /__w/llvm/llvm/llvm/sycl/test-e2e/bindless_images/sampling_3D.cpp -o /__w/llvm/llvm/build-e2e/bindless_images/Output/sampling_3D.cpp.tmp.out
2025-02-07T23:33:28.1423521Z # note: command had no output on stdout or stderr
2025-02-07T23:33:28.1423697Z # RUN: at line 7
2025-02-07T23:33:28.1424064Z env UR_HIP_ENABLE_IMAGE_SUPPORT=1  env NEOReadDebugKeys=1 UseBindlessMode=1 UseExternalAllocatorForSshAndDsh=1 /__w/llvm/llvm/build-e2e/bindless_images/Output/sampling_3D.cpp.tmp.out
2025-02-07T23:33:28.1424723Z # executed command: env UR_HIP_ENABLE_IMAGE_SUPPORT=1 env NEOReadDebugKeys=1 UseBindlessMode=1 UseExternalAllocatorForSshAndDsh=1 /__w/llvm/llvm/build-e2e/bindless_images/Output/sampling_3D.cpp.tmp.out
2025-02-07T23:33:28.1425348Z # .---command stderr------------
2025-02-07T23:33:28.1425505Z # | <HIP>[ERROR]: 
2025-02-07T23:33:28.1425631Z # | UR HIP ERROR:
2025-02-07T23:33:28.1425757Z # | 	Value:           801
2025-02-07T23:33:28.1425901Z # | 	Name:            hipErrorNotSupported
2025-02-07T23:33:28.1426077Z # | 	Description:     operation not supported
2025-02-07T23:33:28.1426329Z # | 	Function:        urTextureCreate
2025-02-07T23:33:28.1426588Z # | 	Source Location: /__w/llvm/llvm/build/_deps/unified-runtime-src/source/adapters/hip/image.cpp:241
2025-02-07T23:33:28.1426828Z # | 
2025-02-07T23:33:28.1426932Z # | <HIP>[ERROR]: 
2025-02-07T23:33:28.1427054Z # | UR HIP ERROR:
2025-02-07T23:33:28.1427211Z # | 	Value:           UR_RESULT_ERROR_UNSUPPORTED_FEATURE
2025-02-07T23:33:28.1427417Z # | 	Function:        urBindlessImagesSampledImageCreateExp
2025-02-07T23:33:28.1427698Z # | 	Source Location: /__w/llvm/llvm/build/_deps/unified-runtime-src/source/adapters/hip/image.cpp:560
2025-02-07T23:33:28.1427930Z # | 
2025-02-07T23:33:28.1428153Z # | SYCL exception caught! : Native API failed. Native API returns: 44 (UR_RESULT_ERROR_UNSUPPORTED_FEATURE)
2025-02-07T23:33:28.1428410Z # `-----------------------------
2025-02-07T23:33:28.1428562Z # error: command failed with exit status: 1
2025-02-07T23:33:28.1428667Z 
2025-02-07T23:33:28.1428714Z --
2025-02-07T23:33:28.1428775Z 
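For triage, the three facts that matter in a log like the one above are the reported architecture, the failing test, and the UR error name. A minimal sketch of pulling them out follows; the function name `summarize_lit_log` and the regular expressions are illustrative assumptions based only on the lines shown in this log, not part of lit or the DPC++ test suite.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this issue.
ARCH_RE = re.compile(r"note: Architectures for (\S+): (\S+)")
FAIL_RE = re.compile(r"FAIL: SYCL :: (\S+)")
UR_RE = re.compile(r"Native API returns: (\d+) \((\w+)\)")


def summarize_lit_log(text):
    """Collect architectures, failing tests, and UR error names from a lit log."""
    summary = {"archs": {}, "failed": [], "ur_errors": []}
    for line in text.splitlines():
        if m := ARCH_RE.search(line):
            summary["archs"][m.group(1)] = m.group(2)
        if m := FAIL_RE.search(line):
            summary["failed"].append(m.group(1))
        if m := UR_RE.search(line):
            summary["ur_errors"].append(m.group(2))
    return summary


# Three lines copied from the log above.
sample = """\
2025-02-07T23:33:28.1419780Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:820: note: Architectures for hip:gpu: amd_gpu_gfx1031
2025-02-07T23:33:28.1420402Z FAIL: SYCL :: bindless_images/sampling_3D.cpp (2201 of 2274)
2025-02-07T23:33:28.1428153Z # | SYCL exception caught! : Native API failed. Native API returns: 44 (UR_RESULT_ERROR_UNSUPPORTED_FEATURE)
"""
print(summarize_lit_log(sample))
```

Seeing the architecture next to the failing test makes it easier to tell which runner produced a given failure.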

@againull againull added bug Something isn't working hip Issues related to execution on HIP backend. labels Feb 7, 2025
@JackAKirk
Contributor

JackAKirk commented Feb 10, 2025

This pre-commit job is still using an unsupported ROCm GPU: gfx1031.
The test passes on an officially supported ROCm GPU, gfx1030, which is used in another HIP runner job. See e.g. https://github.com/intel/llvm/actions/runs/13182032752/job/36798886329?pr=16439

I suggest that the pre-commit HIP runner using gfx1031 is simply removed if it is not possible to run it on an officially supported GPU.
Otherwise this is going to happen a lot.

@aelovikov-intel
Contributor

https://github.com/intel/llvm/actions/runs/13252932583/job/36996542695?pr=16954

has

Runner name: 'cp-amd-runner'
...
Failed Tests (1):
  SYCL :: bindless_images/sampling_3D.cpp

@sarnex
Contributor

sarnex commented Feb 11, 2025

If all the mentioned tests still fail (sporadically) on the supported GPU, then the problem was obviously not the unsupported GPU on the other runner. So if we have no evidence that the behavior on the unsupported GPU differs from the behavior on the supported GPU, we should re-enable the runner IMO.

@JackAKirk
Contributor

https://github.com/intel/llvm/actions/runs/13252932583/job/36996542695?pr=16954

has

Runner name: 'cp-amd-runner'
...
Failed Tests (1):
  SYCL :: bindless_images/sampling_3D.cpp

Thanks, this is a legit failure; we can XFAIL this for gfx1030. It is an officially supported card, but it is an old card, so this unsupported feature error isn't a surprise, I think.

@aelovikov-intel
Contributor

How is this flaky if the HW is old and the feature is unsupported?

@JackAKirk
Contributor

JackAKirk commented Feb 11, 2025

If all the mentioned tests still fail (sporadically) on the supported GPU, then the problem was obviously not the unsupported GPU on the other runner. So if we have no evidence that the behavior on the unsupported GPU differs from the behavior on the supported GPU, we should re-enable the runner IMO.

This is the case for this test; however, there are masses of other tests that fail on gfx1031 (unsupported) and pass on gfx1030.

It is true that gfx1030 (RDNA2 architecture) is not an ideal card to test on, since it is very low on AMD support priority for bug fixes, and we are already aware that it does have more failures than e.g. the CDNA series cards (or I imagine RDNA3). But it is the only officially supported card available, and is much more reliable than unsupported cards.
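The comparison described here (which failures reproduce on the supported card and which appear only on the unsupported one) amounts to a set difference over the two runners' failure lists. A rough sketch, assuming the lists have already been collected from CI; the function name is hypothetical, and the second entry in the gfx1031 list is a made-up placeholder, not a real test.

```python
def compare_runner_failures(failed_on_unsupported, failed_on_supported):
    """Split failures by whether they reproduce on the supported GPU."""
    unsupported = set(failed_on_unsupported)
    supported = set(failed_on_supported)
    return {
        # Fails on both runners: likely a real issue or a missing HW feature.
        "both": sorted(unsupported & supported),
        # Fails only on the unsupported GPU: likely environmental.
        "unsupported_only": sorted(unsupported - supported),
        # Fails only on the supported GPU: worth a closer look.
        "supported_only": sorted(supported - unsupported),
    }


# sampling_3D.cpp is from this thread; the other name is a HYPOTHETICAL placeholder.
gfx1031_failures = ["bindless_images/sampling_3D.cpp", "hypothetical/other_test.cpp"]
gfx1030_failures = ["bindless_images/sampling_3D.cpp"]
print(compare_runner_failures(gfx1031_failures, gfx1030_failures))
```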

@sarnex
Contributor

sarnex commented Feb 11, 2025

This is the case for this test; however, there are masses of other tests that fail on gfx1031 (unsupported) and pass on gfx1030.

I didn't know this, that solves my concern. Thanks.

@bader
Contributor

bader commented Feb 11, 2025

@JackAKirk, could you please give more information on why gfx1031 (newer HW) is not supported? Is this due to some SW issues in ROCm drivers? Do we have plans to support this family of AMD GPUs in the future?

DPC++ users should be able to find this information in the product documentation. Right?

It seems like we don't have any diagnostics in our product or in our testing environment. It might be useful to have some checks in lit.cfg.py or the DPC++ runtime for unsupported platforms to give users meaningful diagnostics. It's hard for DPC++ developers to identify whether a test failure is a real product issue or an environmental issue.
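One possible shape for such a check is sketched below. It is a standalone illustration, not code from lit.cfg.py: the allow-list and function name are hypothetical, and gfx1030 appears in the list only because this thread describes it as officially supported. A real list would have to come from AMD's ROCm support matrix.

```python
# HYPOTHETICAL allow-list; the real set must be taken from the ROCm support matrix.
SUPPORTED_AMD_ARCHS = {"amd_gpu_gfx1030"}


def unsupported_arch_notes(archs_by_device, supported=SUPPORTED_AMD_ARCHS):
    """Return one warning per HIP device whose architecture is not in the allow-list."""
    notes = []
    for device, arch in sorted(archs_by_device.items()):
        if device.startswith("hip:") and arch not in supported:
            notes.append(
                f"warning: {device} reports {arch}, which is not in the "
                "supported list; test failures may be environmental"
            )
    return notes


# Architecture string as printed by the CI log in this issue.
for note in unsupported_arch_notes({"hip:gpu": "amd_gpu_gfx1031"}):
    print(note)
```

A warning rather than a hard error keeps the runner usable while still flagging that failures on it may not be product bugs.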

@JackAKirk
Contributor

JackAKirk commented Feb 11, 2025

@JackAKirk, could you please give more information on why gfx1031 (newer HW) is not supported? Is this due to some SW issues in ROCm drivers? Do we have plans to support this family of AMD GPUs in the future?

For a full list of AMD GPUs supported by ROCm drivers (and therefore by the HIP backend of DPC++), you can refer to the link that I referred to in e.g. this comment (maybe the surrounding conversation is useful): #7634 (comment). Note that the support matrix depends on the platform (Linux/Windows).
On the reasoning behind AMD's choice of GPUs, you could consult the ROCm issues board (see link below), where this is discussed quite extensively. I think the short answer is that, depending on the platform (Windows/Linux), AMD prioritizes ROCm support for GPUs suited to target GPGPU applications, such as deep learning (RDNA3+/CDNA), image processing (RDNA2+), and double-precision HPC (CDNA architecture).

DPC++ users should be able to find this information in the product documentation. Right?

I think that the plugin documentation refers users to ROCm information / informs them about supported AMD devices, @npmiller?

It seems like we don't have any diagnostics in our product or in our testing environment. It might be useful to have some checks in lit.cfg.py or the DPC++ runtime for unsupported platforms to give users meaningful diagnostics. It's hard for DPC++ developers to identify whether a test failure is a real product issue or an environmental issue.

Indeed, this is a massive headache for AMD developers. A short survey of the issues on the https://github.com/ROCm/ROCm/issues board will give you a flavour of this.

@JackAKirk
Contributor

I've marked gfx1030 unsupported for the 3D sampling failure here: #16971

@JackAKirk
Contributor

It is true that gfx1030 (RDNA2 architecture) is not an ideal card to test on, since it is very low on AMD support priority for bug fixes, and we are already aware that it does have more failures than e.g. the CDNA series cards (or I imagine RDNA3).

e.g. see ROCm/HIP#3368

aelovikov-intel pushed a commit that referenced this issue Feb 12, 2025
This fixes the gfx1030 3D sampling failure mentioned here:
#16933
by marking this device unsupported in the test

---------

Signed-off-by: JackAKirk <[email protected]>
5 participants