[UR][PERF] add L0, UR, and SYCL Sin Kernel Graph benchmark #17087

Merged: 1 commit, Feb 21, 2025

pbalcer commented Feb 20, 2025

running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5: 28.533 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5: 45.884 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100: 353.814 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100: 386.979 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5: 25.278 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5: 20.202 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100: 235.466 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100: 222.501 μs).
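
To compare the graphs:0 and graphs:1 timings in the log above, the lines can be parsed with a small script. This is a minimal sketch: the regular expression and the helper names `parse_log` and `graph_ratio` are illustrative assumptions and not part of the benchmark tooling.

```python
import re

# Log lines copied verbatim from the PR description above.
LOG = """\
running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5: 28.533 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5: 45.884 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100: 353.814 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100: 386.979 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5: 25.278 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5: 20.202 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100: 235.466 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100: 222.501 μs).
"""

# Matches the "complete (...)" summary at the end of each log line.
LINE_RE = re.compile(
    r"complete \((?P<binary>\S+) SinKernelGraph "
    r"graphs:(?P<graphs>\d+), numKernels:(?P<kernels>\d+): "
    r"(?P<time>[\d.]+) μs\)"
)


def parse_log(text):
    """Map (binary, numKernels, graphs) to the reported time in microseconds."""
    results = {}
    for m in LINE_RE.finditer(text):
        key = (m["binary"], int(m["kernels"]), int(m["graphs"]))
        results[key] = float(m["time"])
    return results


def graph_ratio(results):
    """Return time(graphs:0) / time(graphs:1); above 1.0 means graphs:1 was faster."""
    ratios = {}
    for (binary, kernels, graphs), t in results.items():
        if graphs == 0 and (binary, kernels, 1) in results:
            ratios[(binary, kernels)] = t / results[(binary, kernels, 1)]
    return ratios


if __name__ == "__main__":
    for (binary, kernels), r in sorted(graph_ratio(parse_log(LOG)).items()):
        print(f"{binary} numKernels:{kernels}: {r:.2f}x")
```

Each configuration in the log reports a single iteration, so these ratios are only a rough indication and may not match the CI results further down.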

Compute Benchmarks level_zero run (with params: --filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13451723888

Benchmarks level_zero run (--filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13451723888
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 2 (threshold 2.00%)
| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 278.788000 μs | 353070.778 μs | 126544.90% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 317.578000 μs | 353311.985 μs | 111152.03% |
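
The Change values in the table above appear consistent with (baseline / this PR - 1) × 100 for these lower-is-better timings. That formula is inferred from the reported numbers, not taken from the benchmark scripts, and the function name below is hypothetical. A minimal sketch that reproduces both rows:

```python
def relative_change_percent(this_pr_us: float, baseline_us: float) -> float:
    """Percent improvement for a lower-is-better metric.

    Assumed formula (inferred from the table, not from the CI scripts):
    (baseline / this_pr - 1) * 100
    """
    return (baseline_us / this_pr_us - 1.0) * 100.0


# (this PR, baseline) in microseconds, copied from the "Improved 2" table.
ROWS = {
    "graphs:1, numKernels:100": (278.788, 353070.778),
    "graphs:0, numKernels:100": (317.578, 353311.985),
}

if __name__ == "__main__":
    for name, (this_pr, baseline) in ROWS.items():
        print(f"{name}: {relative_change_percent(this_pr, baseline):.2f}%")
```

The size of these percentages follows from the baseline values being roughly three orders of magnitude larger than the values measured for this PR.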

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SinKernelGraph 5 (6)
| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5 | 28.805000 μs | - | |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5 | 39.309000 μs | - | |
| graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5 | 26.120000 μs | - | |
| graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5 | 28.314000 μs | - | |
| graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5 | 33.229000 μs | - | |
| graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:5 | 43.357000 μs | - | |

Relative perf in group SinKernelGraph 100 (6)
| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 278.788000 μs | 353070.778 μs | 126544.90% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 317.578000 μs | 353311.985 μs | 111152.03% |
| graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100 | 247.620000 μs | - | |
| graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100 | 247.960000 μs | - | |
| graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100 | 271.047000 μs | - | |
| graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:100 | 283.033000 μs | - | |

Relative perf in group SubmitKernel (7)
Benchmark This PR baseline Change
api_overhead_benchmark_l0 SubmitKernel out of order - 11.594000 μs
api_overhead_benchmark_l0 SubmitKernel in order - 11.545000 μs
api_overhead_benchmark_sycl SubmitKernel out of order - 23.412000 μs
api_overhead_benchmark_sycl SubmitKernel in order - 24.864000 μs
api_overhead_benchmark_ur SubmitKernel out of order - 15.620000 μs
api_overhead_benchmark_ur SubmitKernel in order - 16.478000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion - 21.210000 μs
Relative perf in group Other (17)
Benchmark This PR baseline Change
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 - 258.560000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 - 135.766000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 - 5.693000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 - 3.163000 GB/s
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 - 2.184000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 - 1.753000 μs
miscellaneous_benchmark_sycl VectorSum - 860.959000 bw GB/s
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6942.349000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17133.149000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 47235.674000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 2118.735000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7497.452000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 8816.540000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 26140.651000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1209.662000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events - 41245.348000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events - 110960.467000 μs
Relative perf in group SinKernelGraph (2)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 - 71722.137000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 - 72509.435000 μs
Relative perf in group SubmitGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 - 55.647000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 - 63.862000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 - 684.854000 μs
Relative perf in group ExecGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 - 5589.786000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 - 5595.030000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 - 56452.586000 μs
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel out of order CPU count - 105303.000000 instr
api_overhead_benchmark_ur SubmitKernel in order CPU count - 110655.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count - 123544.000000 instr
Velocity Bench
Relative perf in group Other (5)
Benchmark This PR baseline Change
Velocity-Bench Hashtable - 358.469128 M keys/sec
Velocity-Bench Bitcracker - 38.110900 s
Velocity-Bench CudaSift - 206.696000 ms
Velocity-Bench QuickSilver - 116.820000 MMS/CTT
Velocity-Bench Sobel Filter - 613.551000 ms
SYCL-Bench
Relative perf in group Other (53)
Benchmark This PR baseline Change
Runtime_IndependentDAGTaskThroughput_SingleTask - 254.598000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 270.118000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 271.340000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 271.937000 ms
Runtime_DAGTaskThroughput_SingleTask - 1658.826000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1710.018000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1698.279000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1680.010000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 5.268000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.727000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.618000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.668000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 617.473000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 617.510000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.685000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 4.890000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 4.977000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 5.116000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 616.902000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 616.873000 ms
MicroBench_LocalMem_int32_4096 - 29.544000 ms
MicroBench_LocalMem_fp32_4096 - 29.698000 ms
Pattern_Reduction_NDRange_int32 - 15.853000 ms
Pattern_Reduction_Hierarchical_int32 - 15.597000 ms
ScalarProduct_NDRange_int32 - 3.907000 ms
ScalarProduct_NDRange_int64 - 5.579000 ms
ScalarProduct_NDRange_fp32 - 3.844000 ms
ScalarProduct_Hierarchical_int32 - 11.337000 ms
ScalarProduct_Hierarchical_int64 - 12.306000 ms
ScalarProduct_Hierarchical_fp32 - 11.003000 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.406000 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.315000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.507000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.315000 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 12.411000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 12.300000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 12.487000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 12.298000 ms
USM_Allocation_latency_fp32_device - 0.058000 ms
USM_Allocation_latency_fp32_host - 37.228000 ms
USM_Allocation_latency_fp32_shared - 0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.645000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.041000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.808000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.199000 ms
VectorAddition_int32 - 1.524000 ms
VectorAddition_int64 - 3.103000 ms
VectorAddition_fp32 - 1.446000 ms
Polybench_2mm - 1.268000 ms
Polybench_3mm - 1.793000 ms
Polybench_Atax - 7.006000 ms
Kmeans_fp32 - 16.165000 ms
MolecularDynamics - 0.030000 ms

Details

Benchmark details - environment, command...
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0
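
The twelve commands above differ only in the binary suffix, `--numKernels`, and `--withGraphs`. As a sketch, the same matrix can be generated programmatically; the function name `sin_kernel_graph_commands` is hypothetical, and the binary directory is the CI path shown above, which would need adjusting for a local build.

```python
from itertools import product

# Directory taken from the commands above; adjust for a local build.
BIN_DIR = "/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin"


def sin_kernel_graph_commands(bin_dir=BIN_DIR, iterations=10000):
    """Return one argument list per (API, numKernels, withGraphs) combination."""
    commands = []
    for api, num_kernels, with_graphs in product(("sycl", "l0", "ur"), (5, 100), (0, 1)):
        commands.append([
            f"{bin_dir}/graph_api_benchmark_{api}",
            "--test=SinKernelGraph",
            "--csv",
            "--noHeaders",
            f"--iterations={iterations}",
            f"--numKernels={num_kernels}",
            f"--withGraphs={with_graphs}",
            "--withCopyOffload=1",
            "--immediateAppendCmdList=0",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute them.
    for cmd in sin_kernel_graph_commands():
        print(" ".join(cmd))
```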

Compute Benchmarks level_zero_v2 run (with params: --filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13452231406

Benchmarks level_zero_v2 run (--filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13452231406
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 2 (threshold 2.00%)
| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 249.158000 μs | 353070.778 μs | 141605.58% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 275.341000 μs | 353311.985 μs | 128217.97% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SinKernelGraph 5 (5)
| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5 | 27.199000 μs | - | |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5 | 26.265000 μs | - | |
| graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5 | 25.321000 μs | - | |
| graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5 | 28.319000 μs | - | |
| graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5 | 25.563000 μs | - | |

Relative perf in group SinKernelGraph 100 (5)
| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 249.158000 μs | 353070.778 μs | 141605.58% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 275.341000 μs | 353311.985 μs | 128217.97% |
| graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100 | 246.261000 μs | - | |
| graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100 | 251.641000 μs | - | |
| graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100 | 244.341000 μs | - | |

Relative perf in group SubmitKernel (7)
Benchmark This PR baseline Change
api_overhead_benchmark_l0 SubmitKernel out of order - 11.594000 μs
api_overhead_benchmark_l0 SubmitKernel in order - 11.545000 μs
api_overhead_benchmark_sycl SubmitKernel out of order - 23.412000 μs
api_overhead_benchmark_sycl SubmitKernel in order - 24.864000 μs
api_overhead_benchmark_ur SubmitKernel out of order - 15.620000 μs
api_overhead_benchmark_ur SubmitKernel in order - 16.478000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion - 21.210000 μs
Relative perf in group Other (17)
Benchmark This PR baseline Change
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 - 258.560000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 - 135.766000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 - 5.693000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 - 3.163000 GB/s
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 - 2.184000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 - 1.753000 μs
miscellaneous_benchmark_sycl VectorSum - 860.959000 bw GB/s
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6942.349000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17133.149000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 47235.674000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 2118.735000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7497.452000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 8816.540000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 26140.651000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1209.662000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events - 41245.348000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events - 110960.467000 μs
Relative perf in group SinKernelGraph (2)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 - 71722.137000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 - 72509.435000 μs
Relative perf in group SubmitGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 - 55.647000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 - 63.862000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 - 684.854000 μs
Relative perf in group ExecGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 - 5589.786000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 - 5595.030000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 - 56452.586000 μs
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel out of order CPU count - 105303.000000 instr
api_overhead_benchmark_ur SubmitKernel in order CPU count - 110655.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count - 123544.000000 instr
Velocity Bench
Relative perf in group Other (5)
Benchmark This PR baseline Change
Velocity-Bench Hashtable - 358.469128 M keys/sec
Velocity-Bench Bitcracker - 38.110900 s
Velocity-Bench CudaSift - 206.696000 ms
Velocity-Bench QuickSilver - 116.820000 MMS/CTT
Velocity-Bench Sobel Filter - 613.551000 ms
SYCL-Bench
Relative perf in group Other (53)
Benchmark This PR baseline Change
Runtime_IndependentDAGTaskThroughput_SingleTask - 254.598000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 270.118000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 271.340000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 271.937000 ms
Runtime_DAGTaskThroughput_SingleTask - 1658.826000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1710.018000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1698.279000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1680.010000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 5.268000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.727000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.618000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.668000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 617.473000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 617.510000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.685000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 4.890000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 4.977000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 5.116000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 616.902000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 616.873000 ms
MicroBench_LocalMem_int32_4096 - 29.544000 ms
MicroBench_LocalMem_fp32_4096 - 29.698000 ms
Pattern_Reduction_NDRange_int32 - 15.853000 ms
Pattern_Reduction_Hierarchical_int32 - 15.597000 ms
ScalarProduct_NDRange_int32 - 3.907000 ms
ScalarProduct_NDRange_int64 - 5.579000 ms
ScalarProduct_NDRange_fp32 - 3.844000 ms
ScalarProduct_Hierarchical_int32 - 11.337000 ms
ScalarProduct_Hierarchical_int64 - 12.306000 ms
ScalarProduct_Hierarchical_fp32 - 11.003000 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.406000 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.315000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.507000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.315000 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 12.411000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 12.300000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 12.487000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 12.298000 ms
USM_Allocation_latency_fp32_device - 0.058000 ms
USM_Allocation_latency_fp32_host - 37.228000 ms
USM_Allocation_latency_fp32_shared - 0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.645000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.041000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.808000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.199000 ms
VectorAddition_int32 - 1.524000 ms
VectorAddition_int64 - 3.103000 ms
VectorAddition_fp32 - 1.446000 ms
Polybench_2mm - 1.268000 ms
Polybench_3mm - 1.793000 ms
Polybench_Atax - 7.006000 ms
Kmeans_fp32 - 16.165000 ms
MolecularDynamics - 0.030000 ms

Details

Benchmark details - environment, command...
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

pbalcer commented Feb 21, 2025

@intel/unified-runtime-reviewers please review. The failures are unrelated, since this PR only adds new scripts. The new tests work, as evidenced by the benchmark runs above.

pbalcer commented Feb 21, 2025

@intel/llvm-gatekeepers please merge

The failures in SYCL :: ESIMD/matrix_transpose_glb.cpp are unrelated, since this PR doesn't touch tests or any of the SYCL implementation.

@sarnex sarnex merged commit 8a9e847 into intel:sycl Feb 21, 2025
26 of 28 checks passed