
Optimize RLP serialization #765

Open · wants to merge 36 commits into master
Conversation

Contributor
@chirag-parmar chirag-parmar commented Dec 11, 2024

Description

This PR improves the memory usage and runtime of the RLP serialisation algorithm (RlpWriter). The existing implementation extends a buffer and moves memory whenever required, which wastes a lot of time for large payloads. This PR, firstly, implements an RlpLengthTracker that dry-runs the serialisation to calculate the full length of the serialised data and the prefixes of lists. Secondly, the RlpTwoPassWriter uses the pre-calculated values to allocate memory in one shot and avoid moving memory to make space for prefixes.

Additionally, the PR implements an RlpHashWriter that updates the sponge construction of the keccak256 hash directly, avoiding almost all memory allocation (the few exceptions are negligible).
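
In terms of the public API the change is meant to be drop-in. A minimal usage sketch against this PR's branch (the Header value is illustrative; encode and encodeHash are the entry points this PR describes):

import eth/[rlp, common]

let header = Header(number: 1024.BlockNumber)
let bytes = rlp.encode(header)      # now serialises in two passes under the hood
let digest = rlp.encodeHash(header) # new: hashes without building the byte buffer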

Rationale

Naive implementations of RLP serialisation require moving memory because the prefix of a list can only be calculated after serialising the items within the list. Since complex types such as objects are represented as lists, serialising any data structure in Ethereum requires moving the encoded list items to make space for the prefix. In cases where the encoded buffer is hashed, the encoded list items cannot be absorbed until the prefix is calculated.

However, the encoded lengths of base types can be calculated without actually serialising them (or allocating memory). This is exploited to split the computation in two: first calculate the lengths of the base types, then actually serialise the data.
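
For reference, the RLP length of a byte string follows from its size alone. A self-contained sketch of that rule (the helper name is illustrative, not the PR's code):

# Encoded length of an RLP byte string, computed without serialising it.
func rlpStringLength(payloadLen: int; firstByte: byte): int =
  if payloadLen == 1 and firstByte < 0x80'u8:
    result = 1                # a single byte below 0x80 encodes as itself
  elif payloadLen < 56:
    result = 1 + payloadLen   # short string: one prefix byte (0x80 + len)
  else:
    result = 1 + payloadLen   # long string: one 0xb7 + lenOfLen byte...
    var n = payloadLen
    while n > 0:
      inc result              # ...plus one byte per big-endian length byte
      n = n shr 8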

By calculating the lengths of the base types we can find the length of each encoded list, and thereby the length of its prefix, without actually serialising anything. The total length of the serialised data also follows from this. This logic is implemented as the RlpLengthTracker, which records listLengths, prefixLengths and the totalLength.
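
The list case follows the same rule: once the summed payload length of a list's items is known, the prefix length is determined. A sketch (again illustrative, not the PR's code):

# Length of the prefix for a list whose encoded items total payloadLen bytes.
func rlpListPrefixLength(payloadLen: int): int =
  if payloadLen < 56:
    result = 1                # short list: single 0xc0 + len byte
  else:
    result = 1                # long list: one 0xf7 + lenOfLen byte...
    var n = payloadLen
    while n > 0:
      inc result              # ...plus the big-endian length bytes
      n = n shr 8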

In the second part of the computation we actually serialise the data (and the prefixes) without skipping over or moving memory. This is implemented as the RlpTwoPassWriter, "two-pass" because the data is iterated over twice. Because of this double iteration, the RlpTwoPassWriter can only be used within the encode function; in cases where users manually call append, the RlpTwoPassWriter would require them to run the RlpLengthTracker manually for the first pass before running the second pass.

The RlpHashWriter also uses two passes, but in the second pass it absorbs the data directly into keccak's sponge construction instead of allocating memory, keeping memory usage amortised constant. For the same reasons as above, a new API method, encodeHash, is exposed.
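
The sanity check for this is the obvious invariant. A sketch (assuming encodeHash returns the Hash32 digest; the Transaction value is illustrative):

import eth/[rlp, common]

let tx = Transaction(txType: TxLegacy, gasLimit: 21_000.GasInt)
assert rlp.encodeHash(tx) == keccak256(rlp.encode(tx))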

Changelog

  • introduces a new method, encodeHash, for constant-memory RLP encoding and hashing.
  • updates the encode method to serialise data in two passes.
  • isolates all writer implementations into individual files, with writer.nim handling only the dependency-injection architecture of the append method.
  • introduces two new tests, test_hashes.nim and test_rlp_profiler.nim. The first sanity-checks the hash writer; the second time-profiles the new writer implementations against the default implementation.

Benchmarks

using --mm=refc (refc GC)

CPU Time [Transaction serialization using two pass writer] 0.000001402s
CPU Time [Block Sequence serialization using two pass writer] 0.000002592s
CPU Time [Block serialization using two pass writer] 0.000007493s
CPU Time [Transaction serialization using default writer] 0.000001559s
CPU Time [Block Sequence serailization using default writer] 0.000003281s
CPU Time [Block serialization using default writer] 0.000009589s
CPU Time [Transaction hashing using hash writer] 0.000001810s
CPU Time [Block Sequence hashing using hash writer] 0.000006301s
CPU Time [Block hashing using hash writer] 0.000028100s
CPU Time [Transaction hashing using default writer and then hash] 0.000002291s
CPU Time [Block Sequence hashing using default writer and then hash] 0.000007532s
CPU Time [Block hashing using defualt writer and the hash] 0.000007491s

using --mm=orc (orc GC)

CPU Time [Transaction serialization using two pass writer] 0.000000901s
CPU Time [Block Sequence serialization using two pass writer] 0.000002089s
CPU Time [Block serialization using two pass writer] 0.000005357s
CPU Time [Transaction serialization using default writer] 0.000000973s
CPU Time [Block Sequence serailization using default writer] 0.000002513s
CPU Time [Block serialization using default writer] 0.000006726s
CPU Time [Transaction hashing using hash writer] 0.000001609s
CPU Time [Block Sequence hashing using hash writer] 0.000006161s
CPU Time [Block hashing using hash writer] 0.000026231s
CPU Time [Transaction hashing using default writer and then hash] 0.000001879s
CPU Time [Block Sequence hashing using default writer and then hash] 0.000006502s
CPU Time [Block hashing using defualt writer and the hash] 0.000006492s

@chirag-parmar chirag-parmar marked this pull request as ready for review December 12, 2024 17:04
@chirag-parmar chirag-parmar marked this pull request as draft December 12, 2024 17:06
@chirag-parmar chirag-parmar marked this pull request as ready for review December 15, 2024 10:53
@chirag-parmar chirag-parmar requested a review from jangko December 15, 2024 10:53
@chirag-parmar chirag-parmar linked an issue Jan 7, 2025 that may be closed by this pull request
when nestedListsDepth > 0:
var tracker = StaticRlpLengthTracker[nestedListsDepth]()
elif nestedListsDepth == 0:
var tracker = DynamicRlpLengthTracker()
Contributor

What kind of types need DynamicLengthTracker? a string/int? Can we use StaticRlpLengthTracker[1]()?

Contributor Author

Types that are recursive and only fully defined at runtime. By fully defined I mean that these types assume the full structure of the object they hold only at runtime; at compile time they are empty. For example, JsonNode.
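
For instance (an illustrative type, not from the PR):

type IntTree = ref object  # nesting depth is runtime data
  leaf: int
  kids: seq[IntTree]

A StaticRlpLengthTracker needs the list-nesting depth as a compile-time constant, which no constant can provide for such a type.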

Contributor Author

references too

Contributor

I see.

@jangko
Contributor

jangko commented Jan 7, 2025

I'm curious why block hashing using hash writer is slower than default writer + hash. And can you add a block header hashing measurement too? In nimbus-eth1 we use only transaction hashing and block header hashing; there is almost no need to hash a full block.

@chirag-parmar
Contributor Author

chirag-parmar commented Jan 7, 2025

I'm curious why block hashing using hash writer is slower than default writer + hash.

I have been wondering the same thing for almost two weeks now. I took a lot of measurements with the time profiler in macOS Instruments but couldn't pin it down. Then yesterday I switched the order of the tests, i.e. I ran the default writer + hash benchmark first and the hash writer second. This time the hash writer was faster, by a similar margin. I think it has something to do with caching or garbage collection; I'm not sure, but I'm trying to pin it down.

What is weird is that each benchmark is computed by collecting 100 measurements and averaging them, so caching shouldn't be an issue. Additionally, the encoding benchmarks that run before the hashing benchmarks use the same data, which means it should already be cached.

And can you add block header hashing measurement too? Because in nimbus-eth1 we only use transaction hashing and block header hashing only. Almost no need to hash a full block.

Yes, will add this

@jangko
Contributor

jangko commented Jan 7, 2025

Nim's generic instantiation cache (in the compiler) is known to have a bug, and looking at the bug tracker, I believe it is still there:
https://github.com/nim-lang/Nim/issues?q=is%3Aissue+is%3Aopen+generic+cache
This is only a hunch; the exact problem is yet to be found.

@chirag-parmar
Contributor Author

refc GC

CPU Time [Transaction serialization using two pass writer] 0.000001261s
CPU Time [Block Sequence serialization using two pass writer] 0.000002928s
CPU Time [Block serialization using two pass writer] 0.000006690s
CPU Time [Block header serialization using two pass writer] 0.000000858s

CPU Time [Transaction serialization using default writer] 0.000001452s
CPU Time [Block Sequence serailization using default writer] 0.000003362s
CPU Time [Block serialization using default writer] 0.000009260s
CPU Time [Block header serialization using default writer] 0.000001547s

CPU Time [Transaction hashing using hash writer] 0.000001419s
CPU Time [Block Sequence hashing using hash writer] 0.000006039s
CPU Time [Block hashing using hash writer] 0.000027683s
CPU Time [Block header hashing using hash writer] 0.000004401s

CPU Time [Transaction hashing using default writer and then hash] 0.000002267s
CPU Time [Block Sequence hashing using default writer and then hash] 0.000007470s
CPU Time [Block hashing using defualt writer and the hash] 0.000007329s
CPU Time [Block header hashing using hash writer] 0.000005271s

orc GC

CPU Time [Transaction serialization using two pass writer] 0.000000780s
CPU Time [Block Sequence serialization using two pass writer] 0.000001807s
CPU Time [Block serialization using two pass writer] 0.000005291s
CPU Time [Block header serialization using two pass writer] 0.000000830s

CPU Time [Transaction serialization using default writer] 0.000000930s
CPU Time [Block Sequence serailization using default writer] 0.000002420s
CPU Time [Block serialization using default writer] 0.000006576s
CPU Time [Block header serialization using default writer] 0.000001354s

CPU Time [Transaction hashing using hash writer] 0.000001478s
CPU Time [Block Sequence hashing using hash writer] 0.000005939s
CPU Time [Block hashing using hash writer] 0.000026274s
CPU Time [Block header hashing using hash writer] 0.000004191s

CPU Time [Transaction hashing using default writer and then hash] 0.000001850s
CPU Time [Block Sequence hashing using default writer and then hash] 0.000006511s
CPU Time [Block hashing using defualt writer and the hash] 0.000006459s
CPU Time [Block header hashing using hash writer] 0.000004640s

@chirag-parmar
Contributor Author

chirag-parmar commented Jan 13, 2025

The benchmarks collected earlier were biased by the CPU cache (I verified this), so I collected new ones, running each benchmark as a separate process to avoid cache hits skewing the data. The code for the benchmarks is attached below.

EDIT: the collected benchmarks are for RLP-hashing a block with a varying number of transactions.

[Graphs: memory usage and CPU time vs. transaction count]


import 
  eth/[rlp, common],
  times,
  std/[os, strutils],
  stew/io2,
  results

type
  Timeval {.importc: "timeval", header:"<sys/time.h>", bycopy.} = object
  
  Rusage* {.importc: "struct rusage", header:"<sys/resource.h>", bycopy.} = object
    ru_utime {.importc.}: Timeval
    ru_stime {.importc.}: Timeval
    ru_maxrss* {.importc.}: int32  # Maximum resident set size
    # ...
    ru_minflt* {.importc.}: int32  # page reclaims (soft page faults)
  
  RusageWho* {.size: sizeof(cint).} = enum
    RusageChildren = -1
    RusageSelf = 0
    RusageThread = 1

when defined(debug):
  var H_RUSAGE_SELF{.importc, header:"<sys/resource.h>".}: cint
  var H_RUSAGE_CHILDREN{.importc, header:"<sys/resource.h>".}: cint
  var H_RUSAGE_THREAD{.importc, header:"<sys/resource.h>".}: cint
  assert H_RUSAGE_SELF == ord(RusageSelf)
  assert H_RUSAGE_CHILDREN == ord(RusageChildren)
  assert H_RUSAGE_THREAD == ord(RusageThread)

proc getrusage*(who: RusageWho, usage: var Rusage) {.importc, header: "<sys/resource.h>".}

const 
  accesses  = @[AccessPair(
    address: address"0x0000000000000000000000000000000000000001", 
    storageKeys: @[default(Bytes32)]
  )]

  tx = Transaction( 
    txType: TxLegacy, 
    chainId: 7.ChainId, 
    nonce: 11253.AccountNonce, 
    gasPrice: 9.GasInt, 
    maxPriorityFeePerGas: 0.GasInt, 
    maxFeePerGas: 0.GasInt, 
    gasLimit: 88920.GasInt, 
    payload: @[99, 0, 0, 0, 25, 96, 1, 1, 56, 3, 128, 99, 0, 0, 0, 25, 96, 1, 1, 96, 0, 57, 96, 0, 243, 91, 78, 30, 176, 85, 200, 234, 14, 45, 97, 73, 65, 149, 199, 11, 118, 202, 83, 30, 211, 109, 119, 168, 184, 89, 6, 38, 132, 53, 2, 237, 54, 131, 30, 141, 225, 155, 174, 92, 96, 211, 133, 53, 218, 245, 132, 17, 173, 79, 95, 241, 197, 214, 244, 196, 37, 88, 27, 34, 51, 69, 116, 64, 170, 77, 95, 191, 152, 7, 214, 85, 249, 244, 167, 67, 76, 137, 136, 37, 169, 40, 20, 131, 165, 153, 120, 158, 20, 26, 114, 99, 129, 254, 172, 229, 99, 18, 178, 251, 40, 126, 210, 155, 108, 238, 127, 2, 156, 67, 61, 199, 191, 71, 215, 72, 23, 173, 131, 213, 35, 87, 54, 248, 41, 221, 119, 31, 223, 144], 
    accessList: @[], 
    maxFeePerBlobGas: 0.u256, 
    versionedHashes: @[], 
    authorizationList: @[], 
    V: uint64(49), 
    R: 2812423844.u256, 
    S: 532553168.u256
  )

  header = Header(
    parentHash: Hash32.fromHex("0x4d24f4dddde8ab09ea900bc87492f74e44e1c3982042d0fd01673a98e72c14a6"), 
    ommersHash: Hash32.fromHex("0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347"), 
    coinbase: Address.fromHex("0x0000000000000000000000000000000000000000"), 
    stateRoot: Root.fromHex("0xa06c17a5ebaeacef3b760293e8f2092e775448c038aa7361350cde9f27e6b218"), 
    transactionsRoot: Root.fromHex("0x773faa6ccc82788a4392df241b7c6e7175fd77648fed2aaad583a528293a80bc"), 
    receiptsRoot: Root.fromHex("0x8ccc9b260110dad8161d01e72913cd27baab66809aa106a4c28a7c7a3a418b65"), 
    difficulty: 131072.u256, 
    number: 1024.BlockNumber, 
    gasLimit: 3141592.GasInt, 
    gasUsed: 891237.GasInt, 
    timestamp: EthTime(15748660), 
    extraData: @[], 
    mixHash: Bytes32.fromHex("0x378c37bd4d8692035f9d03a619d7c44a579d3b1c1c52db3096bc92a71b74aeb5"), 
    nonce: Bytes8.fromHex("0x4ea813c0c15b4215"), 
    baseFeePerGas: Opt.some(9.u256), 
  )

  blk = EthBlock(
    header: header,
    transactions: @[
      tx, tx, tx, tx,
      tx, tx, tx, tx
    ],
    uncles: @[],
  )
  
  blk80 = EthBlock(
    header: header, 
    transactions: @[
      tx, tx, tx, tx, 
      # total of 80 txs
      tx, tx, tx, tx], 
    uncles: @[], 
  )

  blk320 = EthBlock(
    header: header, 
    transactions: @[
      tx, tx, tx, tx, tx, tx, tx, tx, 
      #.. total of 320 txs
      tx, tx, tx, tx, tx, tx, tx, tx], 
    uncles: @[], 
  )
  
  blk640 = EthBlock(
    header: header, 
    transactions: @[
      tx, tx, tx, tx, tx, tx, tx, tx, 
      #... total of 640 txs
      tx, tx, tx, tx, tx, tx, tx, tx], 
    uncles: @[], 
  )
  
  blk1280 = EthBlock(
    header: header, 
    transactions: @[
      tx, tx, tx, tx, tx, tx, tx, tx, 
      #... total of 1280 txs
      tx, tx, tx, tx, tx, tx, tx, tx], 
    uncles: @[], 
  )

proc encodeOnePass[T](v: T): seq[byte] =
  var writer = initRlpWriter()

  writer.append(v)
  move(writer.finish)

proc encodeAndHash[T](v: T): Hash32 =
  keccak256(encodeOnePass(v))

# source: https://forum.nim-lang.org/t/7238
{.push checks: off.}
template benchmark(msg: string, code: untyped) =
  when not defined(windows):
    var ru: Rusage
    getrusage(RusageSelf, ru)
    var
      rss = ru.ru_maxrss
      flt = ru.ru_minflt

  when not defined(windows):
    let start = cpuTime()

  code

  when not defined(windows):
    let stop = cpuTime()

  when not defined(windows):
    getrusage(RusageSelf, ru)
    rss = ru.ru_maxrss - rss
    flt = ru.ru_minflt - flt

  echo "Benchmark: " & msg
  when not defined(windows):
    echo "Time(s)              ", stop - start
    echo "Runtime RSS (KB):     ", rss
    echo "# of page faults:     ", flt
{.pop.}

when defined(opt):
  when defined(blk):
    benchmark "encodeHash with 8 transactions":
      let bytes2 = rlp.encodeHash(blk)
  elif defined(blk80):
    benchmark "encodeHash with 80 transactions":
      let bytes2 = rlp.encodeHash(blk80)
  elif defined(blk320):
    benchmark "encodeHash with 320 transactions":
      let bytes2 = rlp.encodeHash(blk320)
  elif defined(blk640):
    benchmark "encodeHash with 640 transactions":
      let bytes2 = rlp.encodeHash(blk640)
  elif defined(blk1280):
    benchmark "encodeHash with 1280 transactions":
      let bytes2 = rlp.encodeHash(blk1280)
else:
  when defined(blk):
    benchmark "encodeHash with 8 transactions":
      let bytes2 = encodeAndHash(blk)
  elif defined(blk80):
    benchmark "encodeHash with 80 transactions":
      let bytes2 = encodeAndHash(blk80)
  elif defined(blk320):
    benchmark "encodeHash with 320 transactions":
      let bytes2 = encodeAndHash(blk320)
  elif defined(blk640):
    benchmark "encodeHash with 640 transactions":
      let bytes2 = encodeAndHash(blk640)
  elif defined(blk1280):
    benchmark "encodeHash with 1280 transactions":
      let bytes2 = encodeAndHash(blk1280)

Compile commands:

nim compile --mm:orc -d:blk -d:release --out:bench_blk test.nim
nim compile --mm:orc -d:blk80 -d:release --out:bench_blk80 test.nim
nim compile --mm:orc -d:blk320 -d:release --out:bench_blk320 test.nim
nim compile --mm:orc -d:blk640 -d:release --out:bench_blk640 test.nim
nim compile --mm:orc -d:blk1280 -d:release --out:bench_blk1280 test.nim
nim compile --mm:orc -d:blk -d:opt -d:release --out:opt_bench_blk test.nim
nim compile --mm:orc -d:blk80 -d:opt -d:release --out:opt_bench_blk80 test.nim
nim compile --mm:orc -d:blk320 -d:opt -d:release --out:opt_bench_blk320 test.nim
nim compile --mm:orc -d:blk640 -d:opt -d:release --out:opt_bench_blk640 test.nim
nim compile --mm:orc -d:blk1280 -d:opt -d:release --out:opt_bench_blk1280 test.nim

@chirag-parmar
Contributor Author

Also want to point out that the timing improvements for encoding blocks (not hashing) with the new implementation are far more significant than those for hashing. This is because the keccak implementation adds a large constant cost. The next step in improving the RLP hash algorithm should be optimizing the keccak implementation before optimizing the writers again.

let bytes13 = encodeAndHash(myTx)
benchmark "Block Sequence hashing using default writer and then hash":
let bytes14 = encodeAndHash(blkSeq)
benchmark "Block hashing using defualt writer and the hash":
Contributor

defualt -> default

let bytes14 = encodeAndHash(blkSeq)
benchmark "Block hashing using defualt writer and the hash":
let bytes15 = encodeAndHash(blkSeq)
benchmark "Block header hashing using hash writer":
Contributor

Block header hashing using hash writer -> Block header hashing using default writer and the hash

@tersec
Contributor

tersec commented Jan 13, 2025

As a general note, refc performance is significantly more important to Nimbus, and to IFT in general, than ORC performance for the foreseeable future.

It's not wrong to benchmark both, but the operative benchmarks in the short/medium term are the refc ones.

@chirag-parmar
Contributor Author

Here are the refc benchmarks:

Type of benchmark    | Default implementation | New implementation
---------------------|------------------------|-------------------
Tx encoding (single) | 17 us                  | 14 us
Tx hashing (single)  | 30 us                  | 40 us
Header encoding      | 22 us                  | 15 us
Header hashing       | 34 us                  | 36 us
Block encoding       | 45 us                  | 25 us
Block hashing        | 72 us                  | 63 us

@chirag-parmar
Contributor Author

chirag-parmar commented Jan 15, 2025

The same graphs, but with refc as the garbage collector:

[Graphs: memory usage and CPU time, refc]

@tersec
Contributor

tersec commented Jan 15, 2025

No copyright linter will enforce this in this repository, but it is still good to check all the modified files in the PR and update their copyright years to 2025.

@arnetheduck
Member

Something to note about benchmarking garbage-collected languages: the cost of allocation and deallocation is often not charged on the spot, but happens "randomly", especially in a real-world application where you can't really micro-benchmark it in isolation (due to fragmentation etc).

Instead of benchmarking, the other way to reason about a change like this is to measure the number of allocations that happen during an operation, because this will ultimately determine how RLP contributes to the bigger picture in an application. The aim of the optimization is to reduce this number without compromising on other things. For the ideal case where you're hashing an object (i.e. a call to rlpHash), all allocations can be done on the stack / precomputed; if this PR hits that target, it is already a success (dynamic usage of RLP is a separate topic). The remaining unnecessary overhead will be found in zeroing the (now) stack-allocated memory, but that's a separate concern as well.

Another way to look at it is that the objective of this PR can be reduced to "rlpHash(myobject) must not perform any heap memory allocations". Once it fulfills this objective, it is done and we can merge / move on.
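
One way to check that objective directly is to watch the GC heap rather than the clock. A sketch using Nim's built-in GC statistics (net heap growth is only a proxy for allocation count; the Header value is illustrative):

import eth/[rlp, common]

let header = Header(number: 1024.BlockNumber)
let before = getOccupiedMem()
discard rlp.encodeHash(header)
echo "heap growth during encodeHash: ", getOccupiedMem() - before, " bytes"  # target: 0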

@chirag-parmar
Contributor Author

For the ideal case where you're hashing an object (i.e. a call to rlpHash), all allocations can be done on the stack / precomputed; if this PR hits that target, it is already a success (dynamic usage of RLP is a separate topic). The remaining unnecessary overhead will be found in zeroing the (now) stack-allocated memory, but that's a separate concern as well.

Noted

Another way to look at it is that the objective of this PR can be reduced to "rlpHash(myobject) must not perform any heap memory allocations". Once it fulfills this objective, it is done and we can merge / move on.

Makes sense. There are two edge cases where dynamic allocation is unavoidable. The first is lists (static or dynamic) within dynamically sized lists; in the RLP world this includes sequences/arrays within dynamically sized sequences and objects within dynamically sized sequences. An example is the transactions field within the block object.

The second is where a type doesn't fully assume the shape of the object it holds at compile time. Examples are recursive objects such as binary trees, and parsed data types like JSON.

Apart from these, all other cases achieve static memory allocation.
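
A concrete shape for the first edge case (an illustrative type, not from the PR):

type Example = object
  rows: seq[seq[byte]]  # lists nested inside a dynamically sized list

Because rows.len is unknown at compile time, the per-row list lengths cannot live in a fixed-size, stack-allocated tracker.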

Successfully merging this pull request may close these issues:

Implement direct RLP hashing