[UR][PERF] add L0, UR, and SYCL Sin Kernel Graph benchmark #17087

pbalcer · 2025-02-20T11:28:33Z

running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5: 28.533 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5: 45.884 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100: 353.814 μs).
running graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100, iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100: 386.979 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5: 25.278 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5: 20.202 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100: 235.466 μs).
running graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100, iteration 0... complete (graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100: 222.501 μs).
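
The result lines above follow a regular pattern: the benchmark name and its parameters, then a colon, a value, and a unit inside the trailing parentheses. A minimal sketch of extracting those fields is shown below. It assumes the line format stays exactly as printed here; `parse_result` is a hypothetical helper and is not part of the benchmark scripts.

```python
import re

# Matches the trailing "complete (<name>: <value> <unit>)." part of a log line.
RESULT_RE = re.compile(
    r"complete \((?P<name>.+): (?P<value>[0-9.]+) (?P<unit>[^\s)]+)\)\.$"
)

def parse_result(line):
    """Return (name, value, unit) for one result line, or None if it does not match."""
    match = RESULT_RE.search(line.strip())
    if match is None:
        return None
    return match.group("name"), float(match.group("value")), match.group("unit")

if __name__ == "__main__":
    sample = (
        "running graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5, "
        "iteration 0... complete (graph_api_benchmark_sycl SinKernelGraph "
        "graphs:0, numKernels:5: 28.533 μs)."
    )
    print(parse_result(sample))
```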

github-actions · 2025-02-21T07:00:18Z

Compute Benchmarks level_zero run (with params: --filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13451723888

github-actions · 2025-02-21T07:12:03Z

Benchmarks level_zero run (--filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13451723888
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 2 (threshold 2.00%)

Benchmark	This PR	baseline	Change
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	278.788000 μs	353070.778 μs	126544.90%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	317.578000 μs	353311.985 μs	111152.03%
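
The Change column is consistent with the formula (baseline / This PR - 1) × 100. That formula is inferred from the numbers in the table, not taken from the reporting script, so treat the sketch below as an assumption; `change_percent` is a hypothetical name. Note also that the baseline values are roughly 1000 times larger than the times measured in this PR (the local runs at the top of the thread report around 350 μs for 100 kernels), so the percentages may reflect a unit or configuration mismatch in the stored baseline rather than an actual speedup.

```python
def change_percent(this_pr, baseline):
    """Relative change in percent, assuming (baseline / this_pr - 1) * 100."""
    return (baseline / this_pr - 1.0) * 100.0

# Values copied from the "Improved 2" table above (both in the reported μs).
rows = [
    ("graphs:1, numKernels:100", 278.788, 353070.778),
    ("graphs:0, numKernels:100", 317.578, 353311.985),
]

if __name__ == "__main__":
    for name, this_pr, baseline in rows:
        print(f"{name}: {change_percent(this_pr, baseline):.2f}%")
```

With the two rows above, the computed values round to the percentages shown in the Change column.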

Performance change in benchmark groups

Compute Benchmarks

Relative perf in group SinKernelGraph 5 (6)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5	28.805000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5	39.309000 μs	-
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5	26.120000 μs	-
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5	28.314000 μs	-
graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5	33.229000 μs	-
graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:5	43.357000 μs	-

Relative perf in group SinKernelGraph 100 (6)

Benchmark	This PR	baseline	Change
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	278.788000 μs	353070.778 μs	126544.90%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	317.578000 μs	353311.985 μs	111152.03%
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100	247.620000 μs	-
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100	247.960000 μs	-
graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100	271.047000 μs	-
graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:100	283.033000 μs	-

Relative perf in group SubmitKernel (7)

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.594000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.545000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.412000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.864000 μs
api_overhead_benchmark_ur SubmitKernel out of order	-	15.620000 μs
api_overhead_benchmark_ur SubmitKernel in order	-	16.478000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.210000 μs

Relative perf in group Other (17)

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	258.560000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	135.766000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.693000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.163000 GB/s
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.184000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.753000 μs
miscellaneous_benchmark_sycl VectorSum	-	860.959000 bw GB/s
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6942.349000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17133.149000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	47235.674000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2118.735000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7497.452000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8816.540000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	26140.651000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1209.662000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	41245.348000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	110960.467000 μs

Relative perf in group SinKernelGraph (2)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71722.137000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72509.435000 μs

Relative perf in group SubmitGraph (3)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	55.647000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	63.862000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	684.854000 μs

Relative perf in group ExecGraph (3)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5589.786000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5595.030000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	56452.586000 μs

Relative perf in group SubmitKernel CPU count (3)

Benchmark	This PR	baseline
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	105303.000000 instr
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110655.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	123544.000000 instr

Velocity Bench

Relative perf in group Other (5)

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.469128 M keys/sec
Velocity-Bench Bitcracker	-	38.110900 s
Velocity-Bench CudaSift	-	206.696000 ms
Velocity-Bench QuickSilver	-	116.820000 MMS/CTT
Velocity-Bench Sobel Filter	-	613.551000 ms

SYCL-Bench

Relative perf in group Other (53)

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	254.598000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	270.118000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	271.340000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	271.937000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1658.826000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1710.018000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1698.279000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1680.010000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.268000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.727000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.618000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.668000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	617.473000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	617.510000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.685000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	4.890000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	4.977000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.116000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	616.902000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	616.873000 ms
MicroBench_LocalMem_int32_4096	-	29.544000 ms
MicroBench_LocalMem_fp32_4096	-	29.698000 ms
Pattern_Reduction_NDRange_int32	-	15.853000 ms
Pattern_Reduction_Hierarchical_int32	-	15.597000 ms
ScalarProduct_NDRange_int32	-	3.907000 ms
ScalarProduct_NDRange_int64	-	5.579000 ms
ScalarProduct_NDRange_fp32	-	3.844000 ms
ScalarProduct_Hierarchical_int32	-	11.337000 ms
ScalarProduct_Hierarchical_int64	-	12.306000 ms
ScalarProduct_Hierarchical_fp32	-	11.003000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.406000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.315000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.507000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.315000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	12.411000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	12.300000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	12.487000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	12.298000 ms
USM_Allocation_latency_fp32_device	-	0.058000 ms
USM_Allocation_latency_fp32_host	-	37.228000 ms
USM_Allocation_latency_fp32_shared	-	0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.645000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.041000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.808000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.199000 ms
VectorAddition_int32	-	1.524000 ms
VectorAddition_int64	-	3.103000 ms
VectorAddition_fp32	-	1.446000 ms
Polybench_2mm	-	1.268000 ms
Polybench_3mm	-	1.793000 ms
Polybench_Atax	-	7.006000 ms
Kmeans_fp32	-	16.165000 ms
MolecularDynamics	-	0.030000 ms

Details

Benchmark details - environment, command...

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0
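
The twelve commands above differ only in the binary suffix (`sycl`, `l0`, `ur`), `--numKernels`, and `--withGraphs`. A minimal sketch of generating the same command lines follows. It uses only the flags and the path that appear in the listing; `build_command` and `all_commands` are hypothetical helpers, not part of the benchmark scripts.

```python
from itertools import product

# Directory copied from the commands listed above; adjust for a local build.
BIN_DIR = "/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin"

def build_command(backend, num_kernels, with_graphs, bin_dir=BIN_DIR):
    """Return the argv list for one SinKernelGraph run (hypothetical helper)."""
    return [
        f"{bin_dir}/graph_api_benchmark_{backend}",
        "--test=SinKernelGraph",
        "--csv",
        "--noHeaders",
        "--iterations=10000",
        f"--numKernels={num_kernels}",
        f"--withGraphs={int(with_graphs)}",
        "--withCopyOffload=1",
        "--immediateAppendCmdList=0",
    ]

def all_commands():
    # Same order as the listing: backend, then kernel count, then graphs off/on.
    return [
        build_command(backend, num_kernels, with_graphs)
        for backend, num_kernels, with_graphs in product(
            ("sycl", "l0", "ur"), (5, 100), (0, 1)
        )
    ]

if __name__ == "__main__":
    for argv in all_commands():
        print(" ".join(argv))
```

Returning an argv list rather than a single string keeps the result usable with `subprocess.run` without shell quoting.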

github-actions · 2025-02-21T07:37:59Z

Compute Benchmarks level_zero_v2 run (with params: --filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13452231406

github-actions · 2025-02-21T07:42:34Z

Benchmarks level_zero_v2 run (--filter "SinKernelGraph"):
https://github.com/intel/llvm/actions/runs/13452231406
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 2 (threshold 2.00%)

Benchmark	This PR	baseline	Change
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	249.158000 μs	353070.778 μs	141605.58%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	275.341000 μs	353311.985 μs	128217.97%

Performance change in benchmark groups

Compute Benchmarks

Relative perf in group SinKernelGraph 5 (5)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5	27.199000 μs	-
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5	26.265000 μs	-
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5	25.321000 μs	-
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5	28.319000 μs	-
graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5	25.563000 μs	-

Relative perf in group SinKernelGraph 100 (5)

Benchmark	This PR	baseline	Change
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	249.158000 μs	353070.778 μs	141605.58%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	275.341000 μs	353311.985 μs	128217.97%
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100	246.261000 μs	-
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100	251.641000 μs	-
graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100	244.341000 μs	-

Relative perf in group SubmitKernel (7)

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.594000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.545000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.412000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.864000 μs
api_overhead_benchmark_ur SubmitKernel out of order	-	15.620000 μs
api_overhead_benchmark_ur SubmitKernel in order	-	16.478000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.210000 μs

Relative perf in group Other (17)

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	258.560000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	135.766000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.693000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.163000 GB/s
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.184000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.753000 μs
miscellaneous_benchmark_sycl VectorSum	-	860.959000 bw GB/s
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6942.349000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17133.149000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	47235.674000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2118.735000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7497.452000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8816.540000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	26140.651000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1209.662000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	41245.348000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	110960.467000 μs

Relative perf in group SinKernelGraph (2)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71722.137000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72509.435000 μs

Relative perf in group SubmitGraph (3)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	55.647000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	63.862000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	684.854000 μs

Relative perf in group ExecGraph (3)

Benchmark	This PR	baseline
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5589.786000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5595.030000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	56452.586000 μs

Relative perf in group SubmitKernel CPU count (3)

Benchmark	This PR	baseline
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	105303.000000 instr
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110655.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	123544.000000 instr

Velocity Bench

Relative perf in group Other (5)

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.469128 M keys/sec
Velocity-Bench Bitcracker	-	38.110900 s
Velocity-Bench CudaSift	-	206.696000 ms
Velocity-Bench QuickSilver	-	116.820000 MMS/CTT
Velocity-Bench Sobel Filter	-	613.551000 ms

SYCL-Bench

Relative perf in group Other (53)

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	254.598000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	270.118000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	271.340000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	271.937000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1658.826000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1710.018000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1698.279000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1680.010000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.268000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.727000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.618000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.668000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	617.473000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	617.510000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.685000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	4.890000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	4.977000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.116000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	616.902000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	616.873000 ms
MicroBench_LocalMem_int32_4096	-	29.544000 ms
MicroBench_LocalMem_fp32_4096	-	29.698000 ms
Pattern_Reduction_NDRange_int32	-	15.853000 ms
Pattern_Reduction_Hierarchical_int32	-	15.597000 ms
ScalarProduct_NDRange_int32	-	3.907000 ms
ScalarProduct_NDRange_int64	-	5.579000 ms
ScalarProduct_NDRange_fp32	-	3.844000 ms
ScalarProduct_Hierarchical_int32	-	11.337000 ms
ScalarProduct_Hierarchical_int64	-	12.306000 ms
ScalarProduct_Hierarchical_fp32	-	11.003000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.406000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.315000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.507000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.315000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	12.411000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	12.300000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	12.487000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	12.298000 ms
USM_Allocation_latency_fp32_device	-	0.058000 ms
USM_Allocation_latency_fp32_host	-	37.228000 ms
USM_Allocation_latency_fp32_shared	-	0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.645000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.041000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.808000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.199000 ms
VectorAddition_int32	-	1.524000 ms
VectorAddition_int64	-	3.103000 ms
VectorAddition_fp32	-	1.446000 ms
Polybench_2mm	-	1.268000 ms
Polybench_3mm	-	1.793000 ms
Polybench_Atax	-	7.006000 ms
Kmeans_fp32	-	16.165000 ms
MolecularDynamics	-	0.030000 ms

Details

Benchmark details - environment, command...

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_ur SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_ur --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

pbalcer · 2025-02-21T11:32:27Z

@intel/unified-runtime-reviewers please review. The failures are unrelated, since this PR only adds new scripts. The new tests work, as evidenced by the benchmark runs above.

pbalcer · 2025-02-21T15:29:11Z

@intel/llvm-gatekeepers please merge

The failures in SYCL :: ESIMD/matrix_transpose_glb.cpp are unrelated, since this PR doesn't touch tests or any of the SYCL implementation.

pbalcer requested a review from a team as a code owner February 20, 2025 11:28

pbalcer temporarily deployed to WindowsCILock February 20, 2025 11:29 — with GitHub Actions Inactive

pbalcer had a problem deploying to WindowsCILock February 20, 2025 11:44 — with GitHub Actions Error

pbalcer force-pushed the add-sycl-graphs-benches branch from 1681da8 to dd85b9a Compare February 20, 2025 12:08

pbalcer temporarily deployed to WindowsCILock February 20, 2025 12:10 — with GitHub Actions Inactive

pbalcer had a problem deploying to WindowsCILock February 20, 2025 12:36 — with GitHub Actions Error

pbalcer force-pushed the add-sycl-graphs-benches branch from dd85b9a to 636fc2d Compare February 20, 2025 12:37

pbalcer temporarily deployed to WindowsCILock February 20, 2025 12:38 — with GitHub Actions Inactive

pbalcer temporarily deployed to WindowsCILock February 20, 2025 13:20 — with GitHub Actions Inactive

[UR][PERF] add L0, UR, and SYCL Sin Kernel Graph benchmark

636fc2d

kbenzie approved these changes Feb 21, 2025

sarnex merged commit 8a9e847 into intel:sycl Feb 21, 2025
26 of 28 checks passed
