Import fast crc32 from Stephan Brumme #327

Open · wants to merge 1 commit into base: branch_libev

Conversation


@lsylsy2 lsylsy2 commented Jun 16, 2024

Description

Profiling UDPspeeder with perf shows the CRC32 function costing 10~20% of CPU time.
Replacing it with a faster open-source implementation yields a significant performance improvement.
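
For background on where this kind of speedup typically comes from: a plain table-driven CRC-32 does one table lookup per input byte, while Brumme's crc32fast uses slicing-by-8/16 tables so each loop iteration consumes 8 to 16 input bytes. A minimal sketch of the byte-wise baseline for comparison (my illustration, not UDPspeeder's crc32h nor the code imported by this PR):

```cpp
#include <cstdint>
#include <cstddef>

// Classic byte-at-a-time table-driven CRC-32 (reflected polynomial
// 0xEDB88320). Slicing-by-8/16 keeps the same recurrence but precomputes
// extra tables so it can fold 8-16 bytes per iteration.
static uint32_t crc_table[256];

static void init_crc_table() {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        crc_table[i] = c;
    }
}

uint32_t crc32_bytewise(const void* data, size_t len, uint32_t prev = 0) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = ~prev;
    while (len--)                       // one table lookup per input byte
        crc = (crc >> 8) ^ crc_table[(crc ^ *p++) & 0xFF];
    return ~crc;
}
```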

Performance Test

Setup

  1. UDPspeeder client: runs on test machine A, a 1-core 1 GiB VM running Debian 12 on my Proxmox VE NAS, which is mostly idle during the test and has a Ryzen 3 4350G CPU.
  2. UDPspeeder server and iperf3 server: run on server B, a Debian 12 VM on my Windows Hyper-V PC.
  3. iperf3 client: runs directly on my PC, which also hosts server B.

Test machine A (UDPspeeder client) runs UDPspeeder binaries built with "make" from the crc32 and branch_libev branches; server B (UDPspeeder server) runs a binary downloaded directly from GitHub, to ensure compatibility.

Script used

Simulating delay and loss (the delay and loss values below were adjusted per scenario):

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms loss 0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.133 flowid 1:3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.132 flowid 1:3

UDPspeeder command lines

./speederv2_amd64 -s -l0.0.0.0:5301 -r127.0.0.1:5201
time ./speederv2_branch_libev -c -l0.0.0.0:5201 -r192.168.1.133:5301
time ./speederv2_crc -c -l0.0.0.0:5201 -r192.168.1.133:5301

iperf command lines

.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 --length 1400
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 -R --length 1400

Test results

"real time" is the time before speederv2 client is ran and Ctrl+C is pressed, not meaningful in the comparision.

| Scenario (ping = delay*2) | Comments | Send: branch_libev | Send: crc32 PR | Receive (-R): branch_libev | Receive (-R): crc32 PR |
| --- | --- | --- | --- | --- | --- |
| delay0/loss0, 100M/length1400 | Ideal LAN or same city | real 1m4.002s, user 0m8.940s, sys 0m7.169s | real 1m12.985s, user 0m5.239s, sys 0m6.836s | real 1m4.562s, user 0m5.966s, sys 0m4.461s | real 1m10.889s, user 0m2.646s, sys 0m4.357s |
| delay10/loss1, 100M/length1400 | ping=20 with 1% loss; a pretty good China-HK/KR/JP network | real 1m7.122s, user 0m9.165s, sys 0m4.124s | real 1m10.473s, user 0m5.827s, sys 0m3.216s | real 1m2.657s, user 0m5.850s, sys 0m4.625s | real 1m2.515s, user 0m2.633s, sys 0m4.435s |
| delay10/loss1, 50M/length200 | Testing small packets | real 1m14.906s, user 0m7.325s, sys 0m3.939s | real 1m3.129s, user 0m5.751s, sys 0m3.155s | real 1m6.626s, user 0m3.138s, sys 0m9.732s | real 1m2.153s, user 0m2.458s, sys 0m9.062s |
| delay30/loss5, 100M/length1400 | ping=60 with 5% loss; a less ideal network within Asia | real 1m5.962s, user 0m9.277s, sys 0m7.061s | real 1m3.250s, user 0m5.494s, sys 0m4.875s | real 1m7.642s, user 0m6.329s, sys 0m4.158s | real 1m3.010s, user 0m2.457s, sys 0m4.704s |
| delay30/loss5, 50M/length200 | Testing small packets | real 1m33.371s, user 0m7.033s, sys 0m5.237s | real 1m7.538s, user 0m5.097s, sys 0m3.929s | real 1m3.274s, user 0m2.006s, sys 0m11.125s | real 1m3.346s, user 0m1.585s, sys 0m9.966s |
| delay80/loss10, 50M/length1400 | ping=160 with 10% loss; a pretty bad trans-Pacific network, usually aiming at low-cost web browsing | real 1m5.898s, user 0m4.867s, sys 0m2.093s | real 1m4.362s, user 0m3.054s, sys 0m1.760s | real 1m9.226s, user 0m3.642s, sys 0m1.938s | real 1m4.579s, user 0m1.189s, sys 0m2.671s |


BIG ENDIAN Validation

TODO

Flame Graph

TODO

lsylsy2 (Author) commented Jun 16, 2024

@wangyu- I was measuring CPU usage over the Internet while running iperf3; however, that may not be trustworthy enough for submitting PRs. Do you have any suggestions on the dataset and how to evaluate the performance?

lsylsy2 marked this pull request as ready for review June 16, 2024 15:57
wangyu- (Owner) commented Jun 17, 2024

@lsylsy2 Hi, thanks for the PR.

For performance measuring:

the best way is probably a flame graph; here is an example for udp2raw I did previously: [flame graph image]

you can send packets at the same rate with iperf3, then generate the flame graphs before and after the change.

wangyu- (Owner) commented Jun 17, 2024

IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not improve the overall speed a lot.

(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

wangyu- (Owner) commented Jun 17, 2024

> the best way is probably a flame graph:

If you cannot get the flame graph working, you can consider making a simple benchmark comparing crc32h and crc32fast. If the performance difference is big, it's still convincing enough that this is a useful PR.
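
A minimal sketch of such a benchmark (the declarations below are assumptions for illustration; link against the real crc32h and Crc32.cpp and adjust the signatures to the actual headers):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Assumed declarations; adapt to the real ones when linking.
uint32_t crc32h(const void* data, size_t len);
uint32_t crc32_fast(const void* data, size_t len, uint32_t prev);

// Time `iters` passes of `fn` over a fixed buffer.
template <typename F>
double bench(F fn, const std::vector<uint8_t>& buf, int iters) {
    volatile uint32_t sink = 0;  // keeps the calls from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
        sink ^= fn(buf.data(), buf.size());
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<uint8_t> buf(1400, 0xA5);  // about one full-size UDP payload
    const int iters = 1000000;
    std::printf("crc32h:    %.3f s\n", bench(crc32h, buf, iters));
    std::printf("crc32fast: %.3f s\n",
        bench([](const void* d, size_t n) { return crc32_fast(d, n, 0); },
              buf, iters));
}
```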

wangyu- (Owner) commented Jun 17, 2024

From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.

Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
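
(For reference, the endian sensitivity in this family of algorithms comes from reading several input bytes as one machine word; a common way to handle it, sketched below with my own naming rather than the library's actual code, is to normalize each loaded word to a fixed byte order before the table lookups.)

```cpp
#include <cstdint>
#include <cstring>

// Runtime byte-order probe (C++20 offers std::endian in <bit> for a
// compile-time answer).
static inline bool is_big_endian() {
    const uint16_t probe = 0x0102;
    uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 0x01;
}

static inline uint32_t swap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

// Load 4 input bytes as a little-endian word regardless of host byte
// order, so the same precomputed tables index correctly everywhere.
static inline uint32_t load_le32(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return is_big_endian() ? swap32(x) : x;
}
```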

lsylsy2 (Author) commented Jun 17, 2024

> @lsylsy2 Hi, thanks for the PR.
>
> For performance measuring:
>
> the best way is probably a flame graph; here is an example for udp2raw I did previously:
>
> you can send packets at the same rate with iperf3, then generate the flame graphs before and after the change.

> IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not improve the overall speed a lot.
>
> (I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

I found the performance issue using a flame graph: in higher-throughput scenarios (iperf with large UDP packets), crc32h was costing 20% of the time. However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN while introducing stable packet drops.
BTW, optimizing the XOR encryption can also improve performance by 3~10% by using 64-bit operations, but that code is written by me and has not been tested on multiple platforms (it also needs changes to support 32-bit systems, etc.), so I won't submit it very soon; see the sketch after the branch link below.
branch_libev...lsylsy2:UDPspeeder:2406_optimization
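
For illustration, a minimal sketch of that kind of 64-bit widening (my own sketch of the same idea, not the code in the linked branch): XOR eight bytes per iteration through uint64_t, then finish the tail byte-wise.

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// XOR `len` bytes of `buf` with a repeating 8-byte key, 64 bits at a time.
// memcpy handles unaligned access portably; compilers lower it to plain
// loads and stores on amd64.
void xor_buffer(uint8_t* buf, size_t len, const uint8_t key[8]) {
    uint64_t k;
    std::memcpy(&k, key, 8);

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {   // wide path: 8 bytes per iteration
        uint64_t w;
        std::memcpy(&w, buf + i, 8);
        w ^= k;
        std::memcpy(buf + i, &w, 8);
    }
    for (; i < len; i++)             // byte-wise tail
        buf[i] ^= key[i % 8];
}
```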

> From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
>
> Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?

I myself have not, but the library itself supports BIG ENDIAN and has been tested (and bug-fixed); stbrumme/crc32#8 is one example.
Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.

lsylsy2 (Author) commented Jun 17, 2024

[Flame graph of branch_libev]
This is a flame graph I generated while running branch_libev, using iperf to send and receive UDP packets. Although it ran over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.

wangyu- (Owner) commented Jun 17, 2024

> However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN while introducing stable packet drops.

I think your idea works.

Personally, for convenience, I would do it in VMs with virtualized LANs (I use Proxmox myself), simulate packet loss with iptables or something else, and send packets at a fixed rate with iperf3.

> Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.

Bochs can simulate BIG ENDIAN systems on a PC. The most commonly seen BIG ENDIAN systems nowadays are (big-endian) MIPS. A simple verification on (big-endian) MIPS with Bochs is sufficient, IMO.
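
One cheap way to do that verification (my suggestion; it assumes the crc32_fast entry point from Crc32.h, so adjust the declaration if it differs): a self-test against the standard CRC-32 check value, which must come out identical on the big-endian guest and on amd64.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Assumed declaration; link against Crc32.cpp and match its real signature.
uint32_t crc32_fast(const void* data, size_t len, uint32_t prev);

int main() {
    // "123456789" is the standard CRC-32 test vector; any correct
    // implementation returns 0xCBF43926 regardless of endianness.
    const char* vec = "123456789";
    uint32_t got = crc32_fast(vec, std::strlen(vec), 0);
    std::printf("crc32(\"123456789\") = 0x%08X (%s)\n",
                got, got == 0xCBF43926u ? "OK" : "MISMATCH");
    return got == 0xCBF43926u ? 0 : 1;
}
```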

wangyu- (Owner) commented Jun 17, 2024

> This is a flame graph I generated while running branch_libev, using iperf to send and receive UDP packets. Although it ran over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.

Interesting. Is this the sending end or the receiving end?

If it's the receiving end and packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.

lsylsy2 (Author) commented Jun 17, 2024

> IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not improve the overall speed a lot.
>
> (I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

FEC is more resource-consuming on the sender side. In a scenario where the server is a cloud virtual server, the client is a consumer router, and the download from server to client is usually much larger than the upload, FEC may play a less important role.
I'll try to make more tests and send the results later.

lsylsy2 (Author) commented Jun 17, 2024

> This is a flame graph I generated while running branch_libev, using iperf to send and receive UDP packets. Although it ran over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.

> Interesting. Is this the sending end or the receiving end?
>
> If it's the receiving end and packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.

Server: Oracle ARM VPS in Osaka.
Client (running perf and generating this graph): a single-core virtual machine on a mostly idle AMD Ryzen 3 PRO 4350G, whose single-thread performance should be similar to a Ryzen 5 3600 or i3-10100.
The test ran sending and receiving for 60 seconds each, at 100 Mbit/s throughput; however, I forget whether I set the packet size to 200/400 or the default ~1400.

wangyu- (Owner) commented Jun 17, 2024

> Personally, for convenience, I would do it in VMs with virtualized LANs (I use Proxmox myself), simulate packet loss with iptables or something else, and send packets at a fixed rate with iperf3.

Forgot to say: tc with netem is actually easier for simulating packet loss.

Here is an example snippet:

DEV=ens5

# turn driver optimizations off
sudo ethtool -K $DEV gro off
sudo ethtool -K $DEV tso off
sudo ethtool -K $DEV gso off

sudo tc qdisc del dev $DEV root
sudo tc qdisc add dev $DEV root netem loss 5.5%

(It's copied from a more complex file I wrote. It might work perfectly, or it might have a typo.)

lsylsy2 (Author) commented Jun 22, 2024

Hi, I've updated some performance tests. Overall it brings performance improvements in all scenarios tested (at least on amd64).
I'll add the flame graph comparison and BIG ENDIAN validation later.

tofurky commented Oct 10, 2024

hi, thanks for this PR. i have not done proper benchmarks, but my throughput (with 100% CPU server-side) jumped from ~14 Mbps to almost 90 Mbps on amd64 (Debian 12) in a simple speedtest.net test (via their Linux CLI client).

note that there is a small change needed to fix compilation with cmake (oh, also note i switched to -O3 in my tree, but that's unrelated):

diff --git a/CMakeLists.txt b/CMakeLists.txt
index d6b11ef..ca34fe0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,6 +23,7 @@ set(SOURCE_FILES
         tunnel_client.cpp
         tunnel_server.cpp
         my_ev.cpp
+       crc32/Crc32.cpp
 )
 set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")

edit: disabling the -fsanitize=address flag, which is enabled by default, further improves performance; apparently it adds about 2x runtime overhead.

lsylsy2 (Author) commented Oct 12, 2024 via email
