
[Host ir] support for set reduce and binary op #4146

Open — wants to merge 1 commit into base: host_irs/refactor_lowering_and_segmentation
Conversation

@samnordmann (Collaborator) commented on Mar 26, 2025

This PR belongs to a series of stacked PRs:

  1. [Host irs] alias and preallocated output support #4144
  2. [Host Ir] refactor and cleanup lowering and segmentation #4145
  3. => You are here: [Host ir] support for set reduce and binary op #4146
  4. [Host irs] Stream lowering of single device fusions #4147

Add support for LoadStoreOp, BinaryOp, and ReductionOp, including support for pre-allocated outputs, which ExprEvaluator does not provide.
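As a rough illustration of the "pre-allocated output" contract described above (a sketch only — NumPy's `out=` parameter stands in for the evaluator writing into a caller-provided buffer; this is not the nvFuser API):

```python
import numpy as np

# Evaluation with a pre-allocated output: the caller owns the buffer,
# and the op writes into it instead of allocating a fresh tensor.
lhs = np.arange(4, dtype=np.float32)
rhs = np.full(4, 2.0, dtype=np.float32)
out = np.empty(4, dtype=np.float32)   # pre-allocated by the caller

result = np.add(lhs, rhs, out=out)    # BinaryOp-style: write into `out`
assert result is out                  # no new allocation happened

# ReductionOp-style: sum over an axis into a caller-provided buffer.
x = np.ones((8, 64), dtype=np.float32)
red = np.empty(8, dtype=np.float32)
x.sum(axis=1, out=red)
assert np.array_equal(red, np.full(8, 64.0, dtype=np.float32))
```

A plain `ExprEvaluator`-style evaluation would instead return a freshly allocated result, which is why the pre-allocated path needs dedicated handling.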


github-actions bot commented Mar 26, 2025

Review updated until commit 10daa92

Description

  • Added support for LoadStoreOp in HostIrEvaluator

  • Implemented BinaryOp handling in HostIrEvaluator

  • Implemented ReductionOp handling in HostIrEvaluator

  • Updated HostIrLower to recognize new operations


Changes walkthrough 📝

Relevant files

Enhancement

executor.cpp (csrc/host_ir/executor.cpp) — Implement LoadStoreOp, BinaryOp, and ReductionOp handling

  • Added handle(LoadStoreOp*) method
  • Added handle(BinaryOp*) method
  • Added handle(ReductionOp*) method
  • +99/-0

lower.cpp (csrc/host_ir/lower.cpp) — Update HostIrLower to recognize new operations

  • Updated isLoweredAsStandaloneHostOp to include LoadStoreOp, BinaryOp, and ReductionOp
  • +3/-0

executor.h (csrc/host_ir/executor.h) — Add declarations for new operation handlers

  • Added declarations for handle(LoadStoreOp*), handle(BinaryOp*), and handle(ReductionOp*)
  • +3/-0

Tests

test_host_irs.cpp (tests/cpp/test_host_irs.cpp) — Add tests for LoadStoreOp, BinaryOp, and ReductionOp

  • Added tests for LoadStoreOp
  • Added parameterized tests for BinaryOp
  • Added tests for ReductionOp
  • +180/-0

Cleanup

test_multidevice_pipeline.cpp (tests/cpp/test_multidevice_pipeline.cpp) — Remove outdated staged reduction tests

  • Removed outdated staged reduction tests
  • +0/-131

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The handle(ReductionOp* reduction_op) function does not handle cases where the output tensor already has data. It should check whether the output tensor is known and handle it accordingly to avoid overwriting existing data.

    auto input_tv = reduction_op->in()->as<TensorView>();
    auto output_tv = reduction_op->out()->as<TensorView>();
    if (!isKnown(output_tv)) {
      return unhandled(reduction_op);
    }
    
    NVF_ERROR(
        !output_tv->hasRoot(),
        "Evaluation for rFactored reductions is not supported.");
    auto input = getKnownConcreteData(input_tv).as<at::Tensor>();
Test Coverage

The tests for LoadStoreOp, BinaryOp, and ReductionOp target CUDA devices only. Consider adding tests for CPU or other devices to ensure broader compatibility.

    using HirSetTest = NVFuserTest;
    
    TEST_F(HirSetTest, HostIr) {
      const std::vector<int64_t> sizes = {8, 64};
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* in = makeConcreteTensor(sizes);
      auto* out = makeConcreteTensor(sizes);
      auto* set = IrBuilder::create<LoadStoreOp>(LoadStoreOpType::Set, out, in);
      hic->addInput(in);
      hic->addInput(out);
      hic->pushBackTopLevelExprs(set);
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto in_aten = at::randn(sizes, options);
      auto out_aten = at::empty(sizes, options);
    
      hie.runWithInput({{in, in_aten}, {out, out_aten}});
    
      EXPECT_TRUE(out_aten.equal(in_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << in_aten;
    }
    
    class HirBinaryOpTest : public NVFuserFixtureParamTest<BinaryOpType> {
     protected:
      at::Tensor executeBinaryOp(at::Tensor lhs, at::Tensor rhs) {
        switch (GetParam()) {
          case BinaryOpType::Add:
            return lhs + rhs;
          case BinaryOpType::Sub:
            return lhs - rhs;
          case BinaryOpType::Mul:
            return lhs * rhs;
          case BinaryOpType::Div:
            return lhs / rhs;
          default:
            NVF_ERROR("Unsupported binary op type ", GetParam());
            return at::Tensor();
        }
      }
    };
    
    TEST_P(HirBinaryOpTest, PreAllocatedOutputs) {
      const std::vector<int64_t> sizes = {8, 64};
      const auto& binary_op_type = GetParam();
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* lhs = makeConcreteTensor(sizes);
      auto* rhs = makeConcreteTensor(sizes);
      auto* out = makeConcreteTensor(sizes);
      auto* binary_op = IrBuilder::create<BinaryOp>(binary_op_type, out, lhs, rhs);
      hic->addInput(lhs);
      hic->addInput(rhs);
      hic->addInput(out);
      hic->pushBackTopLevelExprs(binary_op);
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto lhs_aten = at::randn(sizes, options);
      auto rhs_aten = at::randn(sizes, options);
      auto out_aten = at::empty(sizes, options);
    
      hie.runWithInput({{lhs, lhs_aten}, {rhs, rhs_aten}, {out, out_aten}});
    
      at::Tensor expected_out = executeBinaryOp(lhs_aten, rhs_aten);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    TEST_P(HirBinaryOpTest, NonPreAllocatedOutputs) {
      const std::vector<int64_t> sizes = {8, 64};
      const auto& binary_op_type = GetParam();
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* lhs = makeConcreteTensor(sizes);
      auto* rhs = makeConcreteTensor(sizes);
      auto* out = binaryOp(binary_op_type, lhs, rhs);
      hic->addInput(lhs);
      hic->addInput(rhs);
      hic->addOutput(out);
      hic->pushBackTopLevelExprs(out->definition());
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto lhs_aten = at::randn(sizes, options);
      auto rhs_aten = at::randn(sizes, options);
    
      auto out_aten =
          hie.runWithInput({{lhs, lhs_aten}, {rhs, rhs_aten}})[0].as<at::Tensor>();
    
      at::Tensor expected_out = executeBinaryOp(lhs_aten, rhs_aten);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    INSTANTIATE_TEST_SUITE_P(
        ,
        HirBinaryOpTest,
        testing::Values(
            BinaryOpType::Add,
            BinaryOpType::Sub,
            BinaryOpType::Mul,
            BinaryOpType::Div),
        [](const testing::TestParamInfo<BinaryOpType>& info) -> std::string {
          std::stringstream ss;
          ss << "BinaryOpType_" << info.param;
          return ss.str();
        });
    
    using HirReductionOpTest = NVFuserTest;
    
    TEST_F(HirReductionOpTest, PreAllocatedOutputs) {
      constexpr int64_t size0 = 8, size1 = 64;
      constexpr int64_t reduction_axis = 1;
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* in = makeConcreteTensor({size0, size1});
      auto* out = newForReduction(in, {reduction_axis}, in->dtype());
      auto* reduction_op = IrBuilder::create<ReductionOp>(
          BinaryOpType::Add, hic->zeroVal(), out, in);
      hic->addInput(in);
      hic->addOutput(out);
      hic->pushBackTopLevelExprs(reduction_op);
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto in_aten = at::randn({size0, size1}, options);
      auto out_aten = at::empty({size0}, options);
    
      hie.runWithInput({{in, in_aten}, {out, out_aten}});
    
      at::Tensor expected_out = in_aten.sum(reduction_axis);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    TEST_F(HirReductionOpTest, NonPreAllocatedOutputs) {
      constexpr int64_t size0 = 8, size1 = 64;
      constexpr int64_t reduction_axis = 1;
    
      auto hic = std::make_unique<HostIrContainer>();
      FusionGuard fg(hic.get());
    
      auto* in = makeConcreteTensor({size0, size1});
      auto* out = sum(in, {reduction_axis});
      hic->addInput(in);
      hic->addOutput(out);
      hic->pushBackTopLevelExprs(out->definition());
    
      HostIrEvaluator hie(std::move(hic));
    
      auto options = at::TensorOptions().device(at::kCUDA, 0);
      auto in_aten = at::randn({size0, size1}, options);
      auto out_aten = at::empty({size0}, options);
    
      hie.runWithInput({{in, in_aten}, {out, out_aten}});
    
      at::Tensor expected_out = in_aten.sum(reduction_axis);
      EXPECT_TRUE(expected_out.equal(out_aten))
          << "Obtained output: " << out_aten << "\n"
          << "Expected output: " << expected_out;
    }
    
    } // namespace hir
Error Handling

The handle(BinaryOp* binary_op) function returns early if the output is not known, which might lead to unhandled cases. Ensure that all possible scenarios are covered and appropriate error messages are provided.

    void HostIrEvaluator::handle(BinaryOp* binary_op) {
      if (!isKnown(binary_op->outputs().at(0))) {
        return unhandled(binary_op);
      }
    
      auto lhs = getKnownConcreteData(binary_op->inputs().at(0)).as<at::Tensor>();
      auto rhs = getKnownConcreteData(binary_op->inputs().at(1)).as<at::Tensor>();
      auto output =
          getKnownConcreteData(binary_op->outputs().at(0)).as<at::Tensor>();
    
switch (binary_op->getBinaryOpType()) {
  // ... (excerpt truncated)

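The dispatch the excerpt above leads into can be sketched with a hypothetical NumPy analogue (illustrative names only — this is not the actual switch body in executor.cpp): each supported op type maps to an out-parameter variant so the pre-allocated output buffer is written in place.

```python
import numpy as np

# Hypothetical analogue of the handler's switch: dispatch each binary
# op type to its out-parameter variant so the pre-allocated output
# buffer is reused rather than replaced.
OUT_VARIANTS = {
    "Add": np.add,
    "Sub": np.subtract,
    "Mul": np.multiply,
    "Div": np.divide,
}

def handle_binary_op(op_type, lhs, rhs, out):
    try:
        fn = OUT_VARIANTS[op_type]
    except KeyError:
        # Mirrors the handler's error path for unsupported op types.
        raise ValueError(f"Unsupported binary op type: {op_type}")
    fn(lhs, rhs, out=out)
    return out

lhs = np.array([4.0, 9.0])
rhs = np.array([2.0, 3.0])
out = np.empty(2)
handle_binary_op("Div", lhs, rhs, out)
assert np.array_equal(out, np.array([2.0, 3.0]))
```

The table-plus-fallback shape mirrors the reviewer's concern: every op type is either dispatched or rejected with an explicit error, so no case falls through silently.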
@samnordmann force-pushed the host_irs/LoadStore_Reduction_binaryOp_support branch from 588e130 to 10daa92 on March 26, 2025 13:05
@samnordmann (Collaborator, Author) commented:

!test

NVF_ERROR(
    permutation.has_value(),
    "The logical domain of a Set.Permute is supposed to be a permutation of the root domain: ",
    out_tv->toString());
in_tensor = in_tensor.permute(*permutation).contiguous();
@samnordmann (Collaborator, Author) commented:

Note that the .contiguous() is necessary here. I think this is an unexposed bug in LoadStoreOp::evaluate() -- however, fixing it there incidentally causes another test failure.

The bug was not exposed before this PR because we never host-evaluated a Set.Permute op.
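Why the .contiguous() matters can be illustrated with a NumPy analogue (a sketch under the assumption that `np.transpose` behaves like `at::Tensor::permute`: both return a strided view over the original memory, not a reordered copy):

```python
import numpy as np

# transpose(), like permute(), returns a view with swapped strides;
# the underlying memory is untouched and no longer row-major.
t = np.arange(6, dtype=np.float32).reshape(2, 3)
permuted = t.transpose(1, 0)          # shape (3, 2), non-contiguous view
assert not permuted.flags["C_CONTIGUOUS"]

# The analogue of .contiguous(): materialize the permuted layout into
# freshly ordered memory. A consumer that assumes row-major contiguity
# (e.g. a raw copy into a pre-allocated buffer) needs this step.
materialized = np.ascontiguousarray(permuted)
assert materialized.flags["C_CONTIGUOUS"]
assert np.array_equal(materialized, permuted)  # same values, new layout
```

Skipping the materialization step and copying the view's raw memory would produce element order from the original layout, which is the class of bug described above.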

    @@ -457,135 +457,4 @@ INSTANTIATE_TEST_SUITE_P(
    testing::Values(0, 1),
    testing::Values(true)));

    // Different scheduling modes used in
@samnordmann (Collaborator, Author) commented:

This test is no longer relevant since we don't use generated kernels for now. We'll add it back later if it proves useful; in the meantime it is just technical debt.
