
[Host ir] support for set reduce and binary op #4146

Open — wants to merge 1 commit into base: host_irs/refactor_lowering_and_segmentation
Conversation

@samnordmann (Collaborator) commented on Mar 26, 2025

This PR belongs to a series of stacked PRs:

  1. [Host irs] alias and preallocated output support #4144
  2. [Host Ir] refactor and cleanup lowering and segmentation #4145
  3. => You are here: [Host ir] support for set reduce and binary op #4146
  4. [Host irs] Stream lowering of single device fusions #4147

Add support for LoadStoreOp, BinaryOp, and ReductionOp, including support for pre-allocated outputs, which ExprEvaluator does not provide.
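As a rough illustration of the "pre-allocated output" contract described above (a sketch only — NumPy's `out=` parameter stands in for the evaluator writing into a caller-provided buffer; this is not the nvFuser API):

```python
import numpy as np

# Evaluation with a pre-allocated output: the caller owns the buffer,
# and the op writes into it instead of allocating a fresh tensor.
lhs = np.arange(4, dtype=np.float32)
rhs = np.full(4, 2.0, dtype=np.float32)
out = np.empty(4, dtype=np.float32)   # pre-allocated by the caller

result = np.add(lhs, rhs, out=out)    # BinaryOp-style: write into `out`
assert result is out                  # no new allocation happened

# ReductionOp-style: sum over an axis into a caller-provided buffer.
x = np.ones((8, 64), dtype=np.float32)
red = np.empty(8, dtype=np.float32)
x.sum(axis=1, out=red)
assert np.array_equal(red, np.full(8, 64.0, dtype=np.float32))
```

A plain `ExprEvaluator`-style evaluation would instead return a freshly allocated result, which is why the pre-allocated path needs dedicated handling.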


github-actions bot commented Mar 26, 2025

Review updated until commit 10daa92

Description

  • Added support for LoadStoreOp in HostIrEvaluator

  • Implemented BinaryOp handling in HostIrEvaluator

  • Implemented ReductionOp handling in HostIrEvaluator

  • Updated HostIrLower to recognize new operations


Changes walkthrough 📝

Relevant files

Enhancement

executor.cpp (csrc/host_ir/executor.cpp) — Implement LoadStoreOp, BinaryOp, and ReductionOp handling

  • Added handle(LoadStoreOp*) method
  • Added handle(BinaryOp*) method
  • Added handle(ReductionOp*) method
  • +99/-0

lower.cpp (csrc/host_ir/lower.cpp) — Update HostIrLower to recognize new operations

  • Updated isLoweredAsStandaloneHostOp to include LoadStoreOp, BinaryOp, and ReductionOp
  • +3/-0

executor.h (csrc/host_ir/executor.h) — Add declarations for new operation handlers

  • Added declarations for handle(LoadStoreOp*), handle(BinaryOp*), and handle(ReductionOp*)
  • +3/-0

Tests

test_host_irs.cpp (tests/cpp/test_host_irs.cpp) — Add tests for LoadStoreOp, BinaryOp, and ReductionOp

  • Added tests for LoadStoreOp
  • Added parameterized tests for BinaryOp
  • Added tests for ReductionOp
  • +180/-0

Cleanup

test_multidevice_pipeline.cpp (tests/cpp/test_multidevice_pipeline.cpp) — Remove outdated staged reduction tests

  • Removed outdated staged reduction tests
  • +0/-131

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The handle(ReductionOp* reduction_op) function does not handle cases where the output tensor already has data. It should check whether the output tensor is known and handle it accordingly to avoid overwriting existing data.

    auto input_tv = reduction_op->in()->as<TensorView>();
    auto output_tv = reduction_op->out()->as<TensorView>();
    if (!isKnown(output_tv)) {
      return unhandled(reduction_op);
    }
    
    NVF_ERROR(
        !output_tv->hasRoot(),
        "Evaluation for rFactored reductions is not supported.");
    auto input = getKnownConcreteData(input_tv).as<at::Tensor>();
Test Coverage

The tests for LoadStoreOp, BinaryOp, and ReductionOp target CUDA devices only. Consider adding tests for CPU or other devices to ensure broader compatibility.

    using HirSetTest = NVFuserTest;
    
    TEST_F(HirSetTest, HostIr) {
      const std::vector<int64_t> sizes = {8, 64};
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* in = makeConcreteTensor(sizes);
      auto* out = makeConcreteTensor(sizes);
      auto* set = IrBuilder::create<LoadStoreOp>(LoadStoreOpType::Set, out, in);
      hic->addInput(in);
      hic->addInput(out);
      hic->pushBackTopLevelExprs(set);
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto in_aten = at::randn(sizes, options);
      auto out_aten = at::empty(sizes, options);
    
      hie.runWithInput({{in, in_aten}, {out, out_aten}});
    
      EXPECT_TRUE(out_aten.equal(in_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << in_aten;
    }
    
    class HirBinaryOpTest : public NVFuserFixtureParamTest<BinaryOpType> {
     protected:
      at::Tensor executeBinaryOp(at::Tensor lhs, at::Tensor rhs) {
        switch (GetParam()) {
          case BinaryOpType::Add:
            return lhs + rhs;
          case BinaryOpType::Sub:
            return lhs - rhs;
          case BinaryOpType::Mul:
            return lhs * rhs;
          case BinaryOpType::Div:
            return lhs / rhs;
          default:
            NVF_ERROR("Unsupported binary op type ", GetParam());
            return at::Tensor();
        }
      }
    };
    
    TEST_P(HirBinaryOpTest, PreAllocatedOutputs) {
      const std::vector<int64_t> sizes = {8, 64};
      const auto& binary_op_type = GetParam();
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* lhs = makeConcreteTensor(sizes);
      auto* rhs = makeConcreteTensor(sizes);
      auto* out = makeConcreteTensor(sizes);
      auto* binary_op = IrBuilder::create<BinaryOp>(binary_op_type, out, lhs, rhs);
      hic->addInput(lhs);
      hic->addInput(rhs);
      hic->addInput(out);
      hic->pushBackTopLevelExprs(binary_op);
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto lhs_aten = at::randn(sizes, options);
      auto rhs_aten = at::randn(sizes, options);
      auto out_aten = at::empty(sizes, options);
    
      hie.runWithInput({{lhs, lhs_aten}, {rhs, rhs_aten}, {out, out_aten}});
    
      at::Tensor expected_out = executeBinaryOp(lhs_aten, rhs_aten);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    TEST_P(HirBinaryOpTest, NonPreAllocatedOutputs) {
      const std::vector<int64_t> sizes = {8, 64};
      const auto& binary_op_type = GetParam();
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* lhs = makeConcreteTensor(sizes);
      auto* rhs = makeConcreteTensor(sizes);
      auto* out = binaryOp(binary_op_type, lhs, rhs);
      hic->addInput(lhs);
      hic->addInput(rhs);
      hic->addOutput(out);
      hic->pushBackTopLevelExprs(out->definition());
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto lhs_aten = at::randn(sizes, options);
      auto rhs_aten = at::randn(sizes, options);
    
      auto out_aten =
          hie.runWithInput({{lhs, lhs_aten}, {rhs, rhs_aten}})[0].as<at::Tensor>();
    
      at::Tensor expected_out = executeBinaryOp(lhs_aten, rhs_aten);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    INSTANTIATE_TEST_SUITE_P(
        ,
        HirBinaryOpTest,
        testing::Values(
            BinaryOpType::Add,
            BinaryOpType::Sub,
            BinaryOpType::Mul,
            BinaryOpType::Div),
        [](const testing::TestParamInfo<BinaryOpType>& info) -> std::string {
          std::stringstream ss;
          ss << "BinaryOpType_" << info.param;
          return ss.str();
        });
    
    using HirReductionOpTest = NVFuserTest;
    
    TEST_F(HirReductionOpTest, PreAllocatedOutputs) {
      constexpr int64_t size0 = 8, size1 = 64;
      constexpr int64_t reduction_axis = 1;
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* in = makeConcreteTensor({size0, size1});
      auto* out = newForReduction(in, {reduction_axis}, in->dtype());
      auto* reduction_op = IrBuilder::create<ReductionOp>(
          BinaryOpType::Add, hic->zeroVal(), out, in);
      hic->addInput(in);
      hic->addOutput(out);
      hic->pushBackTopLevelExprs(reduction_op);
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto in_aten = at::randn({size0, size1}, options);
      auto out_aten = at::empty({size0}, options);
    
      hie.runWithInput({{in, in_aten}, {out, out_aten}});
    
      at::Tensor expected_out = in_aten.sum(reduction_axis);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    TEST_F(HirReductionOpTest, NonPreAllocatedOutputs) {
      constexpr int64_t size0 = 8, size1 = 64;
      constexpr int64_t reduction_axis = 1;
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* in = makeConcreteTensor({size0, size1});
      auto* out = sum(in, {reduction_axis});
      hic->addInput(in);
      hic->addOutput(out);
      hic->pushBackTopLevelExprs(out->definition());
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto in_aten = at::randn({size0, size1}, options);
      auto out_aten = at::empty({size0}, options);
    
      hie.runWithInput({{in, in_aten}, {out, out_aten}});
    
      at::Tensor expected_out = in_aten.sum(reduction_axis);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    } // namespace hir
Error Handling

The handle(BinaryOp* binary_op) function returns early if the output is not known, which might lead to unhandled cases. Ensure that all possible scenarios are covered and appropriate error messages are provided.

    void HostIrEvaluator::handle(BinaryOp* binary_op) {
      if (!isKnown(binary_op->outputs().at(0))) {
        return unhandled(binary_op);
      }
    
      auto lhs = getKnownConcreteData(binary_op->inputs().at(0)).as<at::Tensor>();
      auto rhs = getKnownConcreteData(binary_op->inputs().at(1)).as<at::Tensor>();
      auto output =
          getKnownConcreteData(binary_op->outputs().at(0)).as<at::Tensor>();
    
switch (binary_op->getBinaryOpType()) {
  // ... (excerpt truncated)

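The dispatch the excerpt above leads into can be sketched with a hypothetical NumPy analogue (illustrative names only — this is not the actual switch body in executor.cpp): each supported op type maps to an out-parameter variant so the pre-allocated output buffer is written in place.

```python
import numpy as np

# Hypothetical analogue of the handler's switch: dispatch each binary
# op type to its out-parameter variant so the pre-allocated output
# buffer is reused rather than replaced.
OUT_VARIANTS = {
    "Add": np.add,
    "Sub": np.subtract,
    "Mul": np.multiply,
    "Div": np.divide,
}

def handle_binary_op(op_type, lhs, rhs, out):
    try:
        fn = OUT_VARIANTS[op_type]
    except KeyError:
        # Mirrors the handler's error path for unsupported op types.
        raise ValueError(f"Unsupported binary op type: {op_type}")
    fn(lhs, rhs, out=out)
    return out

lhs = np.array([4.0, 9.0])
rhs = np.array([2.0, 3.0])
out = np.empty(2)
handle_binary_op("Div", lhs, rhs, out)
assert np.array_equal(out, np.array([2.0, 3.0]))
```

The table-plus-fallback shape mirrors the reviewer's concern: every op type is either dispatched or rejected with an explicit error, so no case falls through silently.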
@samnordmann force-pushed the host_irs/LoadStore_Reduction_binaryOp_support branch from 588e130 to 10daa92 on March 26, 2025 13:05
@samnordmann (Collaborator, Author) commented:

!test

NVF_ERROR(
    permutation.has_value(),
    "The logical domain of a Set.Permute is supposed to be a permutation of the root domain: ",
    out_tv->toString());
in_tensor = in_tensor.permute(*permutation).contiguous();
@samnordmann (Collaborator, Author) commented:

Note that the .contiguous() is necessary here. I think this is an unexposed bug in LoadStoreOp::evaluate() -- however, fixing it there incidentally causes another test failure.

The bug was not exposed before this PR because we never host-evaluated a Set.Permute op.
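Why the .contiguous() matters can be illustrated with a NumPy analogue (a sketch under the assumption that `np.transpose` behaves like `at::Tensor::permute`: both return a strided view over the original memory, not a reordered copy):

```python
import numpy as np

# transpose(), like permute(), returns a view with swapped strides;
# the underlying memory is untouched and no longer row-major.
t = np.arange(6, dtype=np.float32).reshape(2, 3)
permuted = t.transpose(1, 0)          # shape (3, 2), non-contiguous view
assert not permuted.flags["C_CONTIGUOUS"]

# The analogue of .contiguous(): materialize the permuted layout into
# freshly ordered memory. A consumer that assumes row-major contiguity
# (e.g. a raw copy into a pre-allocated buffer) needs this step.
materialized = np.ascontiguousarray(permuted)
assert materialized.flags["C_CONTIGUOUS"]
assert np.array_equal(materialized, permuted)  # same values, new layout
```

Skipping the materialization step and copying the view's raw memory would produce element order from the original layout, which is the class of bug described above.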

    @@ -457,135 +457,4 @@ INSTANTIATE_TEST_SUITE_P(
    testing::Values(0, 1),
    testing::Values(true)));

    // Different scheduling modes used in
@samnordmann (Collaborator, Author) commented:

This test is no longer relevant since we don't use generated kernels for now. We'll add it back later if it proves useful; in the meantime it is just technical debt.
