Pod to Pod Communication severely degraded in 4.11 on vSphere #1632

Closed
MattPOlson opened this issue Apr 6, 2023 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@MattPOlson

Describe the bug:

We run OKD in a vSphere environment with the configuration below:

vSphere:
ESXi version: 7.0 U3e
Separate vDS (on version 6.5) for Front End and iSCSI

Hardware:
UCS B200-M4 Blade
	BIOS - B200M4.4.1.2a.0.0202211902
	Xeon(R) CPU E5-2667
	2 x 20Gb Cisco UCS VIC 1340 network adapter for front end connectivity (Firmware 4.5(1a))
	2 x 20Gb Cisco UCS VIC 1340 network adapter for iSCSI connectivity (Firmware 4.5(1a))
	
Storage:
Compellent SC4020 over iSCSI
	2 controller array with dual iSCSI IP connectivity (2 paths per LUN)
All cluster nodes on same Datastore

After upgrading the cluster from 4.10.x to anything 4.11.x or above, pod-to-pod communication is severely degraded whenever the communicating pods run on nodes hosted on different ESXi hosts. We ran a benchmark test on the cluster before the upgrade, with the results below:


=========================================================
 Benchmark Results
=========================================================
 Name            : knb-2672
 Date            : 2023-03-29 15:26:01 UTC
 Generator       : knb
 Version         : 1.5.0
 Server          : k8s-se-internal-01-582st-worker-n2wtp
 Client          : k8s-se-internal-01-582st-worker-cv7cd
 UDP Socket size : auto
=========================================================
  Discovered CPU         : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
  Discovered Kernel      : 5.18.5-100.fc35.x86_64
  Discovered k8s version : v1.23.5-rc.0.2076+8cfebb1ce4a59f-dirty
  Discovered MTU         : 1400
  Idle :
    bandwidth = 0 Mbit/s
    client cpu = total 12.31% (user 9.41%, nice 0.00%, system 2.83%, iowait 0.07%, steal 0.00%)
    server cpu = total 9.04% (user 6.28%, nice 0.00%, system 2.74%, iowait 0.02%, steal 0.00%)
    client ram = 4440 MB
    server ram = 3828 MB
  Pod to pod :
    TCP :
      bandwidth = 6306 Mbit/s
      client cpu = total 26.15% (user 5.19%, nice 0.00%, system 20.96%, iowait 0.00%, steal 0.00%)
      server cpu = total 29.39% (user 8.13%, nice 0.00%, system 21.26%, iowait 0.00%, steal 0.00%)
      client ram = 4460 MB
      server ram = 3820 MB
    UDP :
      bandwidth = 1424 Mbit/s
      client cpu = total 26.08% (user 7.21%, nice 0.00%, system 18.82%, iowait 0.05%, steal 0.00%)
      server cpu = total 24.82% (user 6.72%, nice 0.00%, system 18.05%, iowait 0.05%, steal 0.00%)
      client ram = 4444 MB
      server ram = 3824 MB
  Pod to Service :
    TCP :
      bandwidth = 6227 Mbit/s
      client cpu = total 27.90% (user 5.12%, nice 0.00%, system 22.73%, iowait 0.05%, steal 0.00%)
      server cpu = total 29.85% (user 5.86%, nice 0.00%, system 23.99%, iowait 0.00%, steal 0.00%)
      client ram = 4439 MB
      server ram = 3811 MB
    UDP :
      bandwidth = 1576 Mbit/s
      client cpu = total 32.31% (user 6.41%, nice 0.00%, system 25.90%, iowait 0.00%, steal 0.00%)
      server cpu = total 26.12% (user 5.68%, nice 0.00%, system 20.39%, iowait 0.05%, steal 0.00%)
      client ram = 4449 MB
      server ram = 3818 MB
=========================================================

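For reference, the results above come from the knb script in InfraBuilder's k8s-bench-suite (the output header shows Generator : knb, Version : 1.5.0). A typical invocation looks like the line below; the node names are the workers from the run above, and the flag names are taken from the upstream README, so treat them as an assumption that may vary between knb versions:

./knb --verbose --client-node k8s-se-internal-01-582st-worker-cv7cd --server-node k8s-se-internal-01-582st-worker-n2wtp
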
After upgrading to version 4.11.0-0.okd-2023-01-14-152430, the latency between the pods is so high that the benchmark test, qperf test, and iperf test all time out and fail to run. This is the result of curling the network-check pod across nodes; it takes close to 30 seconds:


sh-4.4# time curl http://10.129.2.44:8080
Hello, 10.128.2.2. You have reached 10.129.2.44 on k8s-se-internal-01-582st-worker-cv7cd
real    0m26.496s
We have been able to reproduce this issue consistently on multiple different clusters.
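
To narrow it down outside of knb, a pair of throwaway iperf3 pods pinned to workers on different ESXi hosts exercises the same pod-to-pod path. The sketch below is illustrative; the pod names, image, and <...> placeholders are examples, not the exact commands used for the numbers above:

# server pod pinned to a worker on one ESXi host
kubectl run iperf-server --image=networkstatic/iperf3 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<worker-on-esx-a>"}}' -- iperf3 -s
# note the server pod IP from the IP column
kubectl get pod iperf-server -o wide
# client pod pinned to a worker on a different ESXi host
kubectl run iperf-client --rm -it --image=networkstatic/iperf3 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<worker-on-esx-b>"}}' -- iperf3 -c <server-pod-ip> -t 30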

Version:

4.11.0-0.okd-2023-01-14-152430
IPI on vSphere

How reproducible:

Upgrade or install a 4.11.x or higher version of OKD and observe the latency.
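
The degradation only appears when the two pods land on nodes backed by different ESXi hosts, so confirm placement before measuring. Standard kubectl output gives the node for each pod; mapping a node's VM to its ESXi host is done in vCenter:

kubectl get pods -o wide     # NODE column shows which worker each test pod runs on
kubectl get nodes -o wide    # node names to match against their VMs/hosts in vCenter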

@MattPOlson
Author

We re-deployed the cluster with version 4.10.0-0.okd-2022-07-09-073606 on the same hardware and the issue went away. There is clearly an issue with 4.11 and above. Benchmark results are below:

=========================================================
 Benchmark Results
=========================================================
 Name            : knb-17886
 Date            : 2023-04-10 19:46:01 UTC
 Generator       : knb
 Version         : 1.5.0
 Server          : k8s-se-platform-01-t4fb6-worker-vw2d9
 Client          : k8s-se-platform-01-t4fb6-worker-jk2wm
 UDP Socket size : auto
=========================================================
  Discovered CPU         : Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz
  Discovered Kernel      : 5.18.5-100.fc35.x86_64
  Discovered k8s version : v1.23.5-rc.0.2076+8cfebb1ce4a59f-dirty
  Discovered MTU         : 1400
  Idle :
    bandwidth = 0 Mbit/s
    client cpu = total 4.06% (user 2.17%, nice 0.00%, system 1.82%, iowait 0.07%, steal 0.00%)
    server cpu = total 2.96% (user 1.48%, nice 0.00%, system 1.48%, iowait 0.00%, steal 0.00%)
    client ram = 925 MB
    server ram = 1198 MB
  Pod to pod :
    TCP :
      bandwidth = 8348 Mbit/s
      client cpu = total 26.07% (user 1.78%, nice 0.00%, system 24.27%, iowait 0.02%, steal 0.00%)
      server cpu = total 26.59% (user 1.94%, nice 0.00%, system 24.63%, iowait 0.02%, steal 0.00%)
      client ram = 930 MB
      server ram = 1196 MB
    UDP :
      bandwidth = 1666 Mbit/s
      client cpu = total 19.21% (user 2.14%, nice 0.00%, system 17.02%, iowait 0.05%, steal 0.00%)
      server cpu = total 22.51% (user 2.91%, nice 0.00%, system 19.55%, iowait 0.05%, steal 0.00%)
      client ram = 924 MB
      server ram = 1201 MB
  Pod to Service :
    TCP :
      bandwidth = 8274 Mbit/s
      client cpu = total 26.55% (user 1.78%, nice 0.00%, system 24.77%, iowait 0.00%, steal 0.00%)
      server cpu = total 26.37% (user 2.67%, nice 0.00%, system 23.68%, iowait 0.02%, steal 0.00%)
      client ram = 922 MB
      server ram = 1191 MB
    UDP :
      bandwidth = 1635 Mbit/s
      client cpu = total 20.19% (user 1.60%, nice 0.00%, system 18.54%, iowait 0.05%, steal 0.00%)
      server cpu = total 21.80% (user 2.82%, nice 0.00%, system 18.98%, iowait 0.00%, steal 0.00%)
      client ram = 913 MB
      server ram = 1179 MB
=========================================================

=========================================================
 qperf
=========================================================

/ # qperf 10.130.2.15 tcp_bw tcp_lat
tcp_bw:
    bw  =  907 MB/sec
tcp_lat:
    latency  =  70.6 us
/ # qperf 10.130.2.15 tcp_bw tcp_lat
tcp_bw:
    bw  =  1 GB/sec
tcp_lat:
    latency  =  68.2 us

=========================================================
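
For anyone repeating the qperf check above: the client-side commands are shown as run; the server side is assumed to have been started in the peer pod with a bare qperf, which listens with its defaults:

/ # qperf                               # server pod: listen on the default port
/ # qperf 10.130.2.15 tcp_bw tcp_lat    # client pod: measure TCP bandwidth and latency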

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jul 11, 2023
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 10, 2023
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot closed this as completed on Sep 10, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 10, 2023

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
