Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Fast CIOS for Montgomery modular multiplication #869

Open
chfast opened this issue Apr 17, 2024 · 4 comments
Open

Implement Fast CIOS for Montgomery modular multiplication #869

chfast opened this issue Apr 17, 2024 · 4 comments
Labels

Comments

@chfast
Copy link
Member

chfast commented Apr 17, 2024

EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication

Here is an example project that has applied the algorithm from the paper above: privacy-scaling-explorations/halo2curves#134.

@clementjuventin
Copy link

Hey @chfast, thanks for redirecting me from #777.

Since I can't open a PR, let me present my work here.

Here is the mul method from the evmmax.hpp file implementing the improvement of the CIOS variant of the Montgomery algorithm proposed in the following document: EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication

    /// Performs a Montgomery modular multiplication.
    ///
    /// Inputs must be in Montgomery form: x = aR, y = bR.
    /// This computes Montgomery multiplication xyR⁻¹ % mod what gives aRbRR⁻¹ % mod = abR % mod.
    /// The result (abR) is in Montgomery form.
    constexpr UintT mul(const UintT& x, const UintT& y) const noexcept
    {
        // Coarsely Integrated Operand Scanning (CIOS) Method
        // Based on 2.3.2 from
        // High-Speed Algorithms & Architectures For Number-Theoretic Cryptosystems
        // https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
        // and on 2.2 from
        // EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication
        // https://eprint.iacr.org/2022/1400.pdf

        constexpr auto S = UintT::num_words;  // TODO(C++23): Make it static

        intx::uint<UintT::num_bits + 64> t;
        if (mod[S - 1] < std::numeric_limits<uint64_t>::max() / 2)
        {
            for (size_t i = 0; i != S; ++i)
            {
                uint64_t c = 0;
                for (size_t j = 0; j != S; ++j)
                    std::tie(c, t[j]) = addmul(t[j], x[j], y[i], c);
                auto const c_2 = c;
                const auto m = t[0] * m_mod_inv;
                std::tie(c, std::ignore) = addmul(t[0], m, mod[0], 0);
                for (size_t j = 1; j != S; ++j)
                    std::tie(c, t[j - 1]) = addmul(t[j], m, mod[j], c);
                t[S - 1] = c_2 + c;
            }
        }
        else
        {
            for (size_t i = 0; i != S; ++i)
            {
                uint64_t c = 0;
                for (size_t j = 0; j != S; ++j)
                    std::tie(c, t[j]) = addmul(t[j], x[j], y[i], c);
                auto tmp = intx::addc(t[S], c);
                t[S] = tmp.value;
                const auto d = tmp.carry;  // TODO: Carry is 0 for sparse modulus.

                const auto m = t[0] * m_mod_inv;
                std::tie(c, std::ignore) = addmul(t[0], m, mod[0], 0);
                for (size_t j = 1; j != S; ++j)
                    std::tie(c, t[j - 1]) = addmul(t[j], m, mod[j], c);
                tmp = intx::addc(t[S], c);
                t[S - 1] = tmp.value;
                t[S] = d + tmp.carry;  // TODO: Carry is 0 for sparse modulus.
            }
        }

        if (t >= mod)
            t -= mod;

        return static_cast<UintT>(t);
    }

Here are the benchmarks performed after applying these changes starting from commit 01eca77.

Build: cmake --build build --parallel

Benchmarks: taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json

Comparison:

user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json 
Comparing cios_classic.json to cios_improved.json
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
evmmax_add<uint256, bn254>                           +0.0247         +0.0245             3             3             3             3
evmmax_add<uint256, bn254>                           +0.0402         +0.0400             3             4             3             4
evmmax_add<uint256, bn254>                           -0.0072         -0.0073             4             4             4             4
evmmax_add<uint256, bn254>                           +0.0150         +0.0149             4             4             4             4
evmmax_add<uint256, bn254>                           +0.0491         +0.0489             3             4             3             4
evmmax_add<uint256, bn254>                           +0.0137         +0.0136             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0018         -0.0021             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0073         -0.0074             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0039         -0.0040             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0287         -0.0290             3             3             3             3
evmmax_add<uint256, bn254>_pvalue                     0.6776          0.7337      U Test, Repetitions: 10 vs 10
evmmax_add<uint256, bn254>_mean                      +0.0092         +0.0091             3             3             3             3
evmmax_add<uint256, bn254>_median                    +0.0181         +0.0180             3             3             3             3
evmmax_add<uint256, bn254>_stddev                    +0.2974         +0.2984             0             0             0             0
evmmax_add<uint256, bn254>_cv                        +0.2855         +0.2868             0             0             0             0
evmmax_add<uint256, secp256k1>                       +0.0261         +0.0260             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0089         -0.0090             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0105         -0.0107             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0223         -0.0224             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0231         -0.0233             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0136         -0.0138             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0423         +0.0421             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0394         +0.0393             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0058         +0.0057             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0089         +0.0087             3             3             3             3
evmmax_add<uint256, secp256k1>_pvalue                 0.6776          0.6776      U Test, Repetitions: 10 vs 10
evmmax_add<uint256, secp256k1>_mean                  +0.0041         +0.0039             3             3             3             3
evmmax_add<uint256, secp256k1>_median                +0.0013         +0.0012             3             3             3             3
evmmax_add<uint256, secp256k1>_stddev                -0.2987         -0.2986             0             0             0             0
evmmax_add<uint256, secp256k1>_cv                    -0.3015         -0.3013             0             0             0             0
evmmax_sub<uint256, bn254>                           +0.0036         +0.0034             2             2             2             2
evmmax_sub<uint256, bn254>                           -0.0068         -0.0069             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0163         +0.0162             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0066         +0.0064             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0417         +0.0415             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0421         +0.0419             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0298         +0.0298             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0404         +0.0402             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0800         +0.0798             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0197         +0.0197             2             2             2             2
evmmax_sub<uint256, bn254>_pvalue                     0.0376          0.0452      U Test, Repetitions: 10 vs 10
evmmax_sub<uint256, bn254>_mean                      +0.0274         +0.0273             2             2             2             2
evmmax_sub<uint256, bn254>_median                    +0.0273         +0.0272             2             2             2             2
evmmax_sub<uint256, bn254>_stddev                    +0.9074         +0.9082             0             0             0             0
evmmax_sub<uint256, bn254>_cv                        +0.8565         +0.8576             0             0             0             0
evmmax_sub<uint256, secp256k1>                       +0.0005         +0.0003             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0062         -0.0062             2             2             2             2
evmmax_sub<uint256, secp256k1>                       +0.0020         +0.0018             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0151         -0.0152             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0151         -0.0151             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0508         -0.0509             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0428         -0.0430             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0583         -0.0584             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0616         -0.0617             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0423         -0.0425             2             2             2             2
evmmax_sub<uint256, secp256k1>_pvalue                 0.0091          0.0073      U Test, Repetitions: 10 vs 10
evmmax_sub<uint256, secp256k1>_mean                  -0.0295         -0.0296             2             2             2             2
evmmax_sub<uint256, secp256k1>_median                -0.0305         -0.0306             2             2             2             2
evmmax_sub<uint256, secp256k1>_stddev                -0.7974         -0.7983             0             0             0             0
evmmax_sub<uint256, secp256k1>_cv                    -0.7913         -0.7922             0             0             0             0
evmmax_mul<uint256, bn254>                           +7.9425         +7.9408            27           243            27           243
evmmax_mul<uint256, bn254>                           +8.0145         +8.0139            27           241            27           241
evmmax_mul<uint256, bn254>                           +7.7713         +7.7701            27           236            27           236
evmmax_mul<uint256, bn254>                           +7.8369         +7.8361            27           237            27           237
evmmax_mul<uint256, bn254>                           +7.8531         +7.8522            27           235            27           235
evmmax_mul<uint256, bn254>                           +7.8605         +7.8592            27           237            27           237
evmmax_mul<uint256, bn254>                           +7.7501         +7.7493            27           235            27           235
evmmax_mul<uint256, bn254>                           +7.9913         +7.9902            27           242            27           242
evmmax_mul<uint256, bn254>                           +7.8688         +7.8676            27           238            27           238
evmmax_mul<uint256, bn254>                           +7.9943         +7.9935            27           240            27           240
evmmax_mul<uint256, bn254>_pvalue                     0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean                      +7.8884         +7.8873            27           238            27           238
evmmax_mul<uint256, bn254>_median                    +7.8595         +7.8584            27           237            27           237
evmmax_mul<uint256, bn254>_stddev                   +17.5405        +17.5057             0             3             0             3
evmmax_mul<uint256, bn254>_cv                        +1.0859         +1.0823             0             0             0             0
evmmax_mul<uint256, secp256k1>                       +7.5847         +7.5839            30           256            30           256
evmmax_mul<uint256, secp256k1>                       +7.5657         +7.5643            30           253            30           253
evmmax_mul<uint256, secp256k1>                       +8.0037         +8.0028            28           256            28           256
evmmax_mul<uint256, secp256k1>                       +8.1311         +8.1302            28           259            28           259
evmmax_mul<uint256, secp256k1>                       +8.0206         +8.0189            28           255            28           255
evmmax_mul<uint256, secp256k1>                       +7.6288         +7.6287            28           246            28           246
evmmax_mul<uint256, secp256k1>                       +7.5113         +7.5106            28           242            28           242
evmmax_mul<uint256, secp256k1>                       +7.4145         +7.4133            29           243            29           243
evmmax_mul<uint256, secp256k1>                       +7.5676         +7.5671            28           243            28           243
evmmax_mul<uint256, secp256k1>                       +7.2458         +7.2454            29           243            29           243
evmmax_mul<uint256, secp256k1>_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean                  +7.6644         +7.6636            29           249            29           249
evmmax_mul<uint256, secp256k1>_median                +7.7730         +7.7721            28           249            28           249
evmmax_mul<uint256, secp256k1>_stddev               +10.6559        +10.6522             1             7             1             7
evmmax_mul<uint256, secp256k1>_cv                    +0.3453         +0.3450             0             0             0             0
OVERALL_GEOMEAN                                      +1.0661         +1.0658             0             0             0             0

As you can see, the results do not seem to indicate a performance improvement. Could you review the proposed code and make suggestions? If there are no suggestions, could you also run the benchmarks?

@chfast
Copy link
Member Author

chfast commented Sep 16, 2024

Since I can't open a PR, let me present my work here.

Why can't you?

@clementjuventin
Copy link

Since I can't open a PR, let me present my work here.

Why can't you?

My bad; it's my first open source contribution, I am discovering the processes 😢

@clementjuventin
Copy link

Hey @chfast, may I ask you to review the associated PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants