PR #23654: [ROCm] Enable mfma instructions by passing the correct arch name

steeve · Google-ML-Automation · commit ef1a142d7e4f · 2025-03-13T08:51:00.000-07:00
Imported from GitHub PR #23654

This PR enables the Triton pipeline to emit `#triton_gpu.amd_mfma` annotations during the Triton to TritonGPU lowering. The annotations are produced by the `TritonAMDGPUAccelerateMatmulPass`, which checks the GFX version to decide whether to emit them.

Correctly passing the `gfx_version` reduces our kernel runtime from **~5ms** to **~620us** on MI300X, matching the performance of the Python Triton Compiler used in Torch.

We expect this change to have a broad effect, since this pipeline is shared by the IR Fusion Emitter, which is used widely across XLA, whenever `tt.dot` ops are emitted.

Closes #23574

Copybara import of the project:

--
c256551 by Steeve Morin <steeve@zml.ai>:

[ROCm] Enable mfma instructions by passing the correct arch name

Without this commit, mfma instructions would not be emitted by this pass.

Merging this change closes #23654

COPYBARA_INTEGRATE_REVIEW=#23654 from zml:zml/rocm/mfma c256551
PiperOrigin-RevId: 736519517
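To illustrate why an empty arch name suppresses mfma emission, here is a minimal sketch of an arch-gated check. It is not the code of `TritonAMDGPUAccelerateMatmulPass`: the helper names `MfmaVersionForArch` and `ShouldEmitMfma` are hypothetical, and the arch-to-generation table is an assumption made for illustration only.

```cpp
#include <string_view>

// Hypothetical helper (not Triton's API): maps an AMD arch name to an MFMA
// generation. A return value of 0 means "no MFMA support known".
int MfmaVersionForArch(std::string_view arch) {
  if (arch == "gfx908") return 1;  // MI100
  if (arch == "gfx90a") return 2;  // MI200 series
  if (arch == "gfx942") return 3;  // MI300 series, e.g. MI300X
  return 0;  // Empty or unrecognized arch name: fall back to non-MFMA layouts.
}

// Hypothetical gate: a pass built with a default (empty) arch name would
// never take the MFMA path, whatever GPU the kernel later runs on.
bool ShouldEmitMfma(std::string_view arch) {
  return MfmaVersionForArch(arch) > 0;
}
```

Under these assumptions, `ShouldEmitMfma("")` is false while `ShouldEmitMfma("gfx942")` is true, which mirrors the before/after behavior described in this PR.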
diff --git a/xla/backends/gpu/codegen/triton/compilation_pipeline_rocm.cc b/xla/backends/gpu/codegen/triton/compilation_pipeline_rocm.cc
@@ -90,7 +90,7 @@ absl::Status CreateTritonPipeline(mlir::OpPassManager* pm,
   pm->addPass(mt::gpu::createTritonGPUCoalesce());
   pm->addPass(mt::gpu::createTritonGPURemoveLayoutConversions());
   pm->addPass(mt::gpu::createTritonGPUOptimizeThreadLocality());
-  pm->addPass(mlir::createTritonAMDGPUAccelerateMatmulPass());
+  pm->addPass(mlir::createTritonAMDGPUAccelerateMatmulPass(cc.gfx_version()));
   pm->addPass(mt::gpu::createTritonGPURemoveLayoutConversions());
   // TODO ROCm Check if we want to compare MI100 and greater
   pm->addPass(mlir::createTritonAMDGPUOptimizeEpiloguePass());