[SOLVED] [iperf3 speed] Same Node vs 2 Nodes (Found a Bug)

Ramalama

Well-Known Member
Dec 26, 2020
848
192
53
35
I just stumbled over something very weird with LXC containers, but I bet it happens with VMs too:

I have 2 identical nodes in a cluster:
- both are connected over 2x25G in LACP (NIC: Intel E810)
- CPU: Genoa 9374F
- RAM: 12x64GB (All Channels 1DPC) 768GB
- Storage: ZFS Raid10 (8x Micron 7450 Max)

So those nodes are anything but slow in any regard. PVE is working great; this is actually the first issue I've had.
I have more than 10 PVE servers and I've been around here pretty much forever.
--> What I mean is: everyone should have this issue! But probably no one noticed, or even thought to test it.

Until today I assumed that an iperf3 test, or network speed in general, between two LXCs/VMs on the same node should be insanely fast,
because the packets don't leave the node (if both containers/VMs are on the same network), meaning they never actually leave the vmbr bridge.

But this is absolutely not the case...
With iperf3 (no special arguments, just -s and -c; see the minimal commands below) I get the following:
- Both LXC's on the same Node: 14.1 Gbits/sec
- Each LXC on separate Node: 20.5 Gbits/sec
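
For clarity, the test really is just the plain server/client pair, nothing fancy (replace the IP with the address of whichever container runs the server):

Code:
# on container 1 (server side)
iperf3 -s

# on container 2 (client side)
iperf3 -c <ip-of-container-1>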

It feels like my understanding is broken now: between nodes the packets leave the host, so there is at least the hard limit of 25G.
But on the same host you don't actually have any limit, so I expected to see at least something like 40 Gbit/s.

Does anyone have a clue, or has anyone already tried an iperf3 test themselves?
Please do at least 3-5 runs.

My problem is that these are the first servers with 25G links I have; all the others have 10G, and 10G was never an issue.
But I never thought of running iperf3 between 2 containers or VMs on the same node xD

Cheers
 
I just migrated the container back, so that both are on the same node again, and retested:

Code:
iperf3 -c 172.17.1.122
Connecting to host 172.17.1.122, port 5201
[  5] local 172.17.1.129 port 35156 connected to 172.17.1.122 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.24 GBytes  19.2 Gbits/sec    0    530 KBytes       
[  5]   1.00-2.00   sec  3.30 GBytes  28.3 Gbits/sec    0    530 KBytes       
[  5]   2.00-3.00   sec  3.70 GBytes  31.8 Gbits/sec    0    530 KBytes       
[  5]   3.00-4.00   sec  4.09 GBytes  35.1 Gbits/sec    0    530 KBytes       
[  5]   4.00-5.00   sec  3.90 GBytes  33.5 Gbits/sec    0    530 KBytes       
[  5]   5.00-6.00   sec  3.69 GBytes  31.7 Gbits/sec    0    617 KBytes       
[  5]   6.00-7.00   sec  3.71 GBytes  31.8 Gbits/sec    0    617 KBytes       
[  5]   7.00-8.00   sec  3.87 GBytes  33.2 Gbits/sec    0    617 KBytes       
[  5]   8.00-9.00   sec  4.11 GBytes  35.3 Gbits/sec    0    617 KBytes       
[  5]   9.00-10.00  sec  4.08 GBytes  35.1 Gbits/sec    0    617 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  39.4 GBytes  33.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  39.4 GBytes  33.8 Gbits/sec                  receiver

iperf Done.

Code:
iperf3 -c 172.17.1.122
Connecting to host 172.17.1.122, port 5201
[  5] local 172.17.1.129 port 60096 connected to 172.17.1.122 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.61 GBytes  13.8 Gbits/sec    0    396 KBytes       
[  5]   1.00-2.00   sec  1.51 GBytes  13.0 Gbits/sec    0    396 KBytes       
[  5]   2.00-3.00   sec  1.57 GBytes  13.5 Gbits/sec    0    396 KBytes       
[  5]   3.00-4.00   sec  1.55 GBytes  13.3 Gbits/sec    0    396 KBytes       
[  5]   4.00-5.00   sec  1.34 GBytes  11.5 Gbits/sec    0    788 KBytes       
[  5]   5.00-6.00   sec  1.57 GBytes  13.5 Gbits/sec    0    788 KBytes       
[  5]   6.00-7.00   sec  1.57 GBytes  13.5 Gbits/sec    0    788 KBytes       
[  5]   7.00-8.00   sec  1.56 GBytes  13.4 Gbits/sec    0    977 KBytes       
[  5]   8.00-9.00   sec  1.58 GBytes  13.5 Gbits/sec    0   1.39 MBytes       
[  5]   9.00-10.00  sec  1.53 GBytes  13.1 Gbits/sec    0   1.39 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.2 GBytes  13.9 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  16.2 GBytes  13.9 Gbits/sec                  receiver

iperf Done.

WTF is happening here?
It's exactly the same test both times, both containers on the same node.
I don't get why it's so extremely inconsistent.
 
Okay, I found the KEY issue.

When it runs at 13-14 Gbit/s, I'm hitting a single thread/core limit on the PVE host itself!
But I can't see which process is eating that one core, so it must be the kernel or a module.

When it runs at 34 Gbit/s, I'm not hitting a single thread/core limit; instead 2-4 cores are at around 80%.
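
If someone wants to check the same thing: I'm just watching the per-core load on the PVE host while the test runs, e.g. with mpstat from the sysstat package. Roughly like this (the interesting part is the %sys/%soft share, since it's the kernel doing the bridge work, not iperf3 itself):

Code:
# on the PVE host, while iperf3 runs between the two containers
apt install sysstat      # only needed if mpstat is missing
mpstat -P ALL 1
# slow case: one core pinned near 100% (mostly %sys/%soft)
# fast case: the load is spread over 2-4 cores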

Seems to me like some sort of bug in the kernel itself, since virtio/vmbr all lives directly in the kernel.
Nothing leaves the host, so it can't be the E810 NIC or any drivers related to it.

Is anyone aware of any "tuning" or a way to force multithreading for virtio or vmbr? It obviously does multithreading anyway when it hits 34 Gbit/s.
I'm not sure if it's a vmbr or a virtio issue.
But I believe everyone benefits a lot if we can rule this out.
Cheers
 
edited: iperf3 supports multiple streams, BUT it only uses multiple CPU threads since v3.16 (December 2023)
 
iperf3 is known to be a single CPU thread application.
iperf2 is multi-threaded.
There is no difference, and it has nothing to do with iperf itself anyway.
You can do parallel streams with iperf2 and 3, there is no difference, but I'm not talking about parallel streams.

I'm talking about something in the vmbridge, the kernel or the kernel's IP stack that is sometimes multithreaded and sometimes single-threaded.
What exactly it is, i.e. where in the kernel the issue sits, probably no one can tell, since you cannot see the task that utilizes the core, or sometimes multiple cores.

The issue is that iperf (not parallel, a single connection) sometimes runs at 40G speeds, and most of the time at 14-16G speeds, depending on whether the kernel decides to use only a single core or multiple cores.
 
iperf3 can use one thread per stream only since version 3.16
https://github.com/esnet/iperf/releases/tag/3.16
Before 3.16, iperf3 uses only one CPU thread even with parallel streams (-P).
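
So before comparing numbers, it's worth checking the version and whether multiple streams actually run on multiple CPU threads, roughly:

Code:
iperf3 --version              # 3.16 or newer uses one thread per stream
iperf3 -c <server-ip> -P 4    # 4 parallel streams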

edit: just tested a constant 50 Gbit/s between two CTs (default Alpine template) with iperf3 v3.14 (single stream and -P 4 streams give the same speed), CPU Xeon E-2386G
 
FWIW, 200 Gbit/s (!) between two default Alpine template CTs over a Linux bridge, with a static build of iperf3 v3.16, -P 8 streams and thus 8 CPU threads, on a Xeon E-2386G (with all 12 threads it's 177 Gbit/s, with 4 streams & threads 150 Gbit/s)
 
Single-socket Genoa 9374F / 12x DDR5 memory channels with 64GB DIMMs / ultra-fast RAID 10 of 8x Micron 7450 MAX:
Code:
8 Streams:
Ubuntu 24.04:
[SUM]   0.00-10.00  sec   123 GBytes   105 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec   123 GBytes   105 Gbits/sec                  receiver
Alpine:
[SUM]   0.00-10.00  sec   125 GBytes   107 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec   125 GBytes   107 Gbits/sec                  receiver

One stream:
Ubuntu 24.04:
[  5]   0.00-10.00  sec  16.4 GBytes  14.1 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  16.4 GBytes  14.1 Gbits/sec                  receiver
Alpine:
[  5]   0.00-10.00  sec  16.4 GBytes  14.1 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  16.4 GBytes  14.1 Gbits/sec                  receiver

Single Socket Xeon Silver 4210R / 4x DDR4 Memory Channels with 64GB-Dimms / 4x Samsung 870 EVO in Raid 10:
Code:
8 Streams:
Alpine:
[SUM]   0.00-10.00  sec   140 GBytes   120 Gbits/sec  7460             sender
[SUM]   0.00-10.00  sec   140 GBytes   120 Gbits/sec                  receiver

One Stream:
Alpine:
[  5]   0.00-10.00  sec  34.7 GBytes  29.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  34.7 GBytes  29.8 Gbits/sec                  receiver

Compared to the Genoa, the Xeon is literally crap.
But in a single stream it is 2x faster.

The Genoa runs at 4.3 GHz, at least the one core that is at 100% during iperf3.
The Xeon only goes up to 2.8 GHz and can't do more than 3.2 GHz anyway.
And retransmits are very high on the Xeon with parallel streams; looks to me like something is wrong. I probably need to test on an earlier Proxmox kernel. There is definitely something wrong, those speed tests make no sense to me.

It makes literally no sense at all.
 
I'm further along with my investigation.
If I set the LXC container to use only one CPU core:

Code:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.28 GBytes  36.7 Gbits/sec    0    508 KBytes      
[  5]   1.00-2.00   sec  4.27 GBytes  36.8 Gbits/sec    0    537 KBytes      
[  5]   2.00-3.00   sec  4.26 GBytes  36.6 Gbits/sec    0    598 KBytes      
[  5]   3.00-4.00   sec  4.23 GBytes  36.4 Gbits/sec    0    663 KBytes      
[  5]   4.00-5.00   sec  4.23 GBytes  36.4 Gbits/sec    0    663 KBytes      
[  5]   5.00-6.00   sec  4.24 GBytes  36.4 Gbits/sec    0    697 KBytes      
[  5]   6.00-7.00   sec  4.23 GBytes  36.3 Gbits/sec    0    799 KBytes      
[  5]   7.00-8.00   sec  4.22 GBytes  36.3 Gbits/sec    0    799 KBytes      
[  5]   8.00-9.00   sec  4.22 GBytes  36.3 Gbits/sec    0    799 KBytes      
[  5]   9.00-10.00  sec  4.22 GBytes  36.2 Gbits/sec    0    799 KBytes      
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  42.4 GBytes  36.4 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  42.4 GBytes  36.4 Gbits/sec                  receiver
Previously I used 6 cores for my LXC containers; this is getting very weird.
Setting them to only 1 core increases the throughput from 14 Gbit/s to 36 Gbit/s, and not even a hyperthreading core should be that slow.
There is definitely something broken.
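
(For anyone reproducing this: the core limit is just the normal cores setting of the container, e.g. via pct on the host; 101 is only an example CT ID, and I'd restart the container afterwards to be sure the new limit applies.)

Code:
pct set 101 --cores 1
pct reboot 101
pct config 101 | grep cores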

Seems like some sort of multithreading bug, or Proxmox has issues with hyperthreading (or whatever it's called on AMD); or maybe it's not Proxmox, but the kernel.

I have to disable hyperthreading on my Genoa servers and retest. 32 cores should hopefully be enough for my VMs.
Cheers

EDIT: More testing:
Setting 4 CPU cores per LXC and using "iperf3 -c xxxx -P 2" (2 parallel streams) is even worse xD
Code:
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  16.8 GBytes  14.4 Gbits/sec                  receiver
[  8]   0.00-10.00  sec  16.8 GBytes  14.5 Gbits/sec                  receiver
[SUM]   0.00-10.00  sec  33.7 GBytes  28.9 Gbits/sec                  receiver

- Setting 2 cores and running 2 parallel iperf tests starts to vary: sometimes I get 72 Gbit/s, most of the time 32.5 Gbit/s.
- Setting 1 core and 1 iperf is very consistent! Always 36.4 Gbit/s.
- 1 core and 2/3/4/6 parallel iperf tests reach 36.5 Gbit/s all together. Same as with only one iperf test.

So in conclusion: if I give a VM/LXC container more than 1 core (tested only with LXCs), on Genoa at least, something weird starts to happen.
 
Still debugging, but I found the best example:
LXC containers with 2 cores assigned and an iperf3 test with -P 2

Code:
[  5]   9.00-10.00  sec  1.61 GBytes  13.8 Gbits/sec    0   1.02 MBytes    
[  7]   9.00-10.00  sec  4.10 GBytes  35.2 Gbits/sec    0    513 KBytes    
[SUM]   9.00-10.00  sec  5.71 GBytes  49.0 Gbits/sec    0

Here it's really nicely visible: one connection is running at ~14 Gbit/s and the other at 35.2 Gbit/s.
36.4 Gbit/s is the limit of what one real core can do. 14 Gbit/s looks to me like it used a hyperthreading core.

But again, to be clear: it's not iperf3 itself that causes the load, there is some sort of bug in the PVE kernel itself or in a module.
Probably a scheduler bug on AMD systems.

- Intel is definitely not affected, since I tested this on all my Intel servers, and they all behave as expected!
- A Ryzen 5800X server (the only other AMD server I have) is definitely not affected either.
- Both 9374F servers are affected!

So it seems like a Genoa-specific issue. Shit, that means I'm on my own :-(
 
Tested on an AMD EPYC 7302 16c/32t (Rome / 2nd gen / 2020 era) running PVE 7.2 and kernel 5.15.35-1 (mitigations=off),
between 2 LXC containers (Alpine 3.18 default template from mid-2023), 2 cores assigned.
iperf3 is a constant 15 Gbit/s & iperf3 -P 2 is a constant 30 Gbit/s (2 x 15 Gbit/s)
 
15 Gbit/s is a bit low for 2 containers on the same node and same network.
Very low; I reach at least 30 Gbit/s per core/stream on low-end E5 v3, extremely old Xeons.
 
Okay, I'm further along in my research, and there really does seem to be some bug:

If I do an iperf3 test from the LXC to the node directly:
Code:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.44 GBytes  37.9 Gbits/sec    0    434 KBytes       
[  5]   1.00-2.00   sec  4.33 GBytes  37.3 Gbits/sec    0    458 KBytes       
[  5]   2.00-3.00   sec  4.38 GBytes  37.6 Gbits/sec    0    458 KBytes       
[  5]   3.00-4.00   sec  4.40 GBytes  37.8 Gbits/sec    0    458 KBytes       
[  5]   4.00-5.00   sec  4.38 GBytes  37.6 Gbits/sec    0    458 KBytes       
[  5]   5.00-6.00   sec  4.37 GBytes  37.6 Gbits/sec    0    458 KBytes       
[  5]   6.00-7.00   sec  4.34 GBytes  37.3 Gbits/sec    0    458 KBytes       
[  5]   7.00-8.00   sec  4.35 GBytes  37.4 Gbits/sec    0    458 KBytes       
[  5]   8.00-9.00   sec  4.35 GBytes  37.4 Gbits/sec    0    458 KBytes       
[  5]   9.00-10.00  sec  4.38 GBytes  37.6 Gbits/sec    0    458 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  43.7 GBytes  37.6 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  43.7 GBytes  37.6 Gbits/sec                  receiver
It is very consistent! No matter how many times I try.

If I go the other way, from the node to the LXC:
Code:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.49 GBytes  38.6 Gbits/sec    0    505 KBytes
[  5]   1.00-2.00   sec  4.48 GBytes  38.5 Gbits/sec    0    505 KBytes
[  5]   2.00-3.00   sec  4.48 GBytes  38.5 Gbits/sec    0    533 KBytes
[  5]   3.00-4.00   sec  1.94 GBytes  16.7 Gbits/sec    0    560 KBytes
[  5]   4.00-5.00   sec  1.72 GBytes  14.8 Gbits/sec    0    560 KBytes
[  5]   5.00-6.00   sec  1.73 GBytes  14.9 Gbits/sec    0    560 KBytes
[  5]   6.00-7.00   sec  1.73 GBytes  14.8 Gbits/sec    0    560 KBytes
[  5]   7.00-8.00   sec  1.73 GBytes  14.9 Gbits/sec    0    631 KBytes
[  5]   8.00-9.00   sec  1.73 GBytes  14.8 Gbits/sec    0    631 KBytes
[  5]   9.00-10.00  sec  1.73 GBytes  14.9 Gbits/sec    0    631 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  25.8 GBytes  22.1 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  25.8 GBytes  22.1 Gbits/sec                  receiver
It starts high, but then drops to the usual 14 Gbit/s.

But this doesn't happen on any Intel-based system. I'm so sick of it: first I found a limitation here, then another one in PBS in a different thread...
I'm getting tired of hunting for the root cause.
I tested whether it has something to do with hyperthreading and disabled SMT: no difference in the behaviour. Set the scaling governor to performance: no difference. Tried different "tuning profiles" in the Genoa BIOS, like Workload Optimized by Clockspeed or HPC: no difference.
Set the cTDP up to 400W and overclocked the 9374F to 4.5 GHz: no difference.
Basically tried all BIOS settings now, nothing helps.
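
(For reference, the governor change is nothing Proxmox-specific, just the usual cpufreq switch on the host, roughly:)

Code:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor                      # check current governor
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # set all cores (resets on reboot)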

I only got a slight speed bump from ~14 Gbit/s to almost 15 Gbit/s from the HPC profile and the overclocking, lol.

The best part is: when it runs at ~38 Gbit/s, 2-4 cores get utilized. When it runs at 14 Gbit/s, only one core gets utilized.
And I'm talking about a single-stream iperf3. It doesn't matter whether it's node/LXC or LXC/LXC testing, always the same behaviour. And I have no influence on that decision.

Cheers
 
Code:
iperf3 -c 172.17.1.131 -P4
Connecting to host 172.17.1.131, port 5201
[  5] local 172.17.1.132 port 48106 connected to 172.17.1.131 port 5201
[  7] local 172.17.1.132 port 48108 connected to 172.17.1.131 port 5201
[  9] local 172.17.1.132 port 48114 connected to 172.17.1.131 port 5201
[ 11] local 172.17.1.132 port 48130 connected to 172.17.1.131 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  5.46 GBytes  46.7 Gbits/sec    0    468 KBytes       
[  7]   0.00-1.00   sec  3.70 GBytes  31.6 Gbits/sec    0    512 KBytes       
[  9]   0.00-1.01   sec  5.38 GBytes  46.0 Gbits/sec    0    515 KBytes       
[ 11]   0.00-1.01   sec  3.68 GBytes  31.4 Gbits/sec    0    468 KBytes       
[SUM]   0.00-1.00   sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec  5.41 GBytes  46.7 Gbits/sec    0    468 KBytes       
[  7]   1.00-2.00   sec  3.62 GBytes  31.2 Gbits/sec    0    512 KBytes       
[  9]   1.01-2.00   sec  5.39 GBytes  46.5 Gbits/sec    0    515 KBytes       
[ 11]   1.01-2.00   sec  3.67 GBytes  31.7 Gbits/sec    0    468 KBytes       
[SUM]   1.00-2.00   sec  18.1 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec  5.44 GBytes  46.7 Gbits/sec    0    468 KBytes       
[  7]   2.00-3.00   sec  3.63 GBytes  31.2 Gbits/sec    0    512 KBytes       
[  9]   2.00-3.00   sec  5.42 GBytes  46.5 Gbits/sec    0    515 KBytes       
[ 11]   2.00-3.00   sec  3.68 GBytes  31.7 Gbits/sec    0    468 KBytes       
[SUM]   2.00-3.00   sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec  5.44 GBytes  46.7 Gbits/sec    0    468 KBytes       
[  7]   3.00-4.00   sec  3.63 GBytes  31.2 Gbits/sec    0    512 KBytes       
[  9]   3.00-4.00   sec  5.41 GBytes  46.5 Gbits/sec    0    515 KBytes       
[ 11]   3.00-4.00   sec  3.69 GBytes  31.7 Gbits/sec    0    468 KBytes       
[SUM]   3.00-4.00   sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-5.00   sec  5.43 GBytes  46.6 Gbits/sec    0    468 KBytes       
[  7]   4.00-5.00   sec  3.63 GBytes  31.2 Gbits/sec    0    512 KBytes       
[  9]   4.00-5.00   sec  5.41 GBytes  46.4 Gbits/sec    0    515 KBytes       
[ 11]   4.00-5.00   sec  3.69 GBytes  31.7 Gbits/sec    0    468 KBytes       
[SUM]   4.00-5.00   sec  18.1 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   5.00-6.00   sec  5.43 GBytes  46.6 Gbits/sec    0    468 KBytes       
[  7]   5.00-6.00   sec  3.63 GBytes  31.2 Gbits/sec    0    512 KBytes       
[  9]   5.00-6.00   sec  5.40 GBytes  46.4 Gbits/sec    0    515 KBytes       
[ 11]   5.00-6.00   sec  3.69 GBytes  31.7 Gbits/sec    0    496 KBytes       
[SUM]   5.00-6.00   sec  18.1 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   6.00-7.00   sec  5.40 GBytes  46.4 Gbits/sec    0    468 KBytes       
[  7]   6.00-7.00   sec  3.65 GBytes  31.3 Gbits/sec    0    512 KBytes       
[  9]   6.00-7.00   sec  5.40 GBytes  46.4 Gbits/sec    0    515 KBytes       
[ 11]   6.00-7.00   sec  3.71 GBytes  31.9 Gbits/sec    0    520 KBytes       
[SUM]   6.00-7.00   sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   7.00-8.00   sec  5.40 GBytes  46.3 Gbits/sec    0    468 KBytes       
[  7]   7.00-8.00   sec  3.65 GBytes  31.4 Gbits/sec    0    512 KBytes       
[  9]   7.00-8.00   sec  5.40 GBytes  46.4 Gbits/sec    0    515 KBytes       
[ 11]   7.00-8.00   sec  3.72 GBytes  32.0 Gbits/sec    0    520 KBytes       
[SUM]   7.00-8.00   sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   8.00-9.00   sec  5.40 GBytes  46.4 Gbits/sec    0    468 KBytes       
[  7]   8.00-9.00   sec  3.67 GBytes  31.5 Gbits/sec    0    512 KBytes       
[  9]   8.00-9.00   sec  5.39 GBytes  46.3 Gbits/sec    0    515 KBytes       
[ 11]   8.00-9.00   sec  3.73 GBytes  32.0 Gbits/sec    0    520 KBytes       
[SUM]   8.00-9.00   sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec  5.39 GBytes  46.3 Gbits/sec    0    468 KBytes       
[  7]   9.00-10.00  sec  3.66 GBytes  31.4 Gbits/sec    0    512 KBytes       
[  9]   9.00-10.00  sec  5.39 GBytes  46.3 Gbits/sec    0    515 KBytes       
[ 11]   9.00-10.00  sec  3.71 GBytes  31.9 Gbits/sec    0    520 KBytes       
[SUM]   9.00-10.00  sec  18.2 GBytes   156 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  54.2 GBytes  46.5 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  54.2 GBytes  46.5 Gbits/sec                  receiver
[  7]   0.00-10.00  sec  36.5 GBytes  31.3 Gbits/sec    0             sender
[  7]   0.00-10.00  sec  36.5 GBytes  31.3 Gbits/sec                  receiver
[  9]   0.00-10.00  sec  54.0 GBytes  46.4 Gbits/sec    0             sender
[  9]   0.00-10.00  sec  54.0 GBytes  46.4 Gbits/sec                  receiver
[ 11]   0.00-10.00  sec  37.0 GBytes  31.8 Gbits/sec    0             sender
[ 11]   0.00-10.00  sec  37.0 GBytes  31.8 Gbits/sec                  receiver
[SUM]   0.00-10.00  sec   182 GBytes   156 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec   182 GBytes   156 Gbits/sec                  receiver

Mystery Solved!
Cheers
 
Sorry, I was initially annoyed that no one tried to help, so I didn't want to post a solution. But that makes no sense, being an a.... just because no one has a clue.

However, the solution is both somewhat simple and not so simple.
The main issue is how the CPU cores access the cache, especially the L3 cache.
Let me illustrate that with a picture from AMD:

[Image: AMD diagram of the Genoa chiplet layout and L3 cache (Bildschirmfoto 2024-05-25 um 13.30.55.png)]

You can see in that picture a perfect example of the 9374F (the CPU I'm using): it has 32 cores, so 4 chiplets, each with 8 cores and one L3 cache.
Everything on one chiplet, i.e. those 8 cores, can access its L3 cache at the same time.

So there are 4 chiplets, let's call them A, B, C and D.
As far as I understand, chiplet A cannot access the L3 cache of chiplet B (or it is at least very slow).
So the data in the L3 cache has to go from chiplet A through memory so that chiplet B can access it, and that costs a lot of time and is relatively slow.

The Linux bridge/kernel runs into exactly that effect, or maybe it happens directly in hardware; I don't know exactly where this behaviour comes from.

However, I have the ability (as does anyone on Genoa, or maybe on any EPYC) to split the CPU into 4 NUMA nodes in the BIOS.
That's called NPS4 (NUMA nodes per socket = 4).
It splits those 4 chiplets into NUMA nodes, so you get the ability to pin VMs or containers to the same chiplet.

With SMT/Hyperthreading you have 16 threads on one node, 8 physical + 8 virtual.
So if I assign both VMs or both LXC containers to the same NUMA node, they get access to the same L3 cache and I get my desired 40 Gbit/s.

If I don't do that and Proxmox simply grabs, for example, cores 12+14+15+20 for container 1 and cores 2+3+5+8 for container 2, then you get weird things. Like I had...
Sometimes 14 Gbit/s and sometimes 40 Gbit/s, depending on whether it was lucky enough to pick cores on the same chiplet or not.
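
To make that concrete, here is roughly how I'd check the layout and pin a container to one node. The core ranges are only examples, check your own numbering with lscpu -e / numactl first:

Code:
# on the PVE host, after enabling NPS4 in the BIOS
numactl --hardware     # should now show 4 NUMA nodes
lscpu -e               # shows which CPU numbers (incl. SMT siblings) belong to which node

# pin a container to one node by adding a raw LXC cpuset line
# to /etc/pve/lxc/<ctid>.conf (example core range!):
# lxc.cgroup2.cpuset.cpus: 0-7,32-39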

That explains all the issues I had and why it was so weird.

Now, why doesn't that happen on Intel platforms? I think that's simply because all my Intel platforms are too old, from when the CPUs were still monolithic. On newer Intel platforms the chiplets may be designed differently, with much faster interconnects between them, I don't know...
And my Ryzen 5800X, where I didn't have the issue either, has only 2 chiplets and they share a single L3 cache.

So in short, NUMA nodes will become much more important in the future, on basically any CPU that has multiple chiplets which each have their own L3 cache.

In my case I explicitly wanted to build 2x single-socket Genoa servers so I would not have to mess with NUMA, but it seems I simply failed, since I now have to deal with NUMA on a single-socket CPU too, lol...
And pin VMs/containers that need really fast network communication between them to the same NUMA node.

I think assigning a VM/container with multiple cores to the same NUMA node (chiplet) will also hugely boost multi-core performance inside the VM/container. So it's not only about network speed.

The reason is the same: if you assign, for example, 6 cores to a VM and those 6 cores all have access to the same L3 cache (and probably even the L2 caches), the tasks inside your VM can share data via the L3/L2 caches with other tasks running on other cores of that VM.
Without NUMA, and without luck, they have to share the data through memory, or access another chiplet's L3 cache, which is insanely slower.

So yeah, this lesson was one of the most important ones I've learned here.
Cheers :)
 
(edit: if I'm not wrong): the OP was about inter-guest traffic, so over the bridge, which is CPU-bound.
 
(edit: if I'm not wrong): the OP was about inter-guest traffic, so over the bridge, which is CPU-bound.
That's correct. But that info is still good.

Between 2 nodes I would need a 100G connection to test that properly, which I don't have.
Just 2x 25G.
 
Just an addition: you can also check which CPU the NIC is connected to using lstopo (package hwloc), and pin the cores to the CPU whose bus the NIC is attached to.
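
Roughly like this (assuming the hwloc package is installed):

Code:
apt install hwloc
lstopo        # shows NUMA nodes, L3 groups and which PCI devices (NICs) hang off which node
              # (on a headless box: lstopo-no-graphics)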
Sorry, I didn't have time to reply in the Docker thread xD
But now I do.

What I said earlier in this thread may not be entirely correct, because that was when I had just learned about NUMA etc...
And at that time I didn't yet know that NUMA isn't working correctly. It was just the beginning.

There were a lot of other threads in June where I debugged/tested NUMA with other people, and we came to the conclusion that the only way to make things work properly is CPU pinning.

Then, after some discussions in multiple threads, one guy had a great suggestion (I sadly can't find the thread or remember the name... maybe @justinclift or some alex....something). Sorry...
It wasn't someone we all know, like Dunuin or the other highly active people...

He came up with the idea of an intelligent hook script that simply does CPU pinning when you start a VM.
It would check, either with numactl or, say, your own text file with L3/core definitions, which NUMA node or L3 CPU block is least used compared to the configured vCPUs of the starting VM, and simply pin the CPUs accordingly.

This solves the problem of having a ton of VMs with CPU pinning, because otherwise it's unmanageable.
Sure, this is not a solution for live migrations, and it can happen that 2 VMs use the same physical cores and are very busy while other cores do nothing...
But it is at least a starting point.
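
Just to illustrate the idea (this is NOT the script from that thread, only a minimal sketch: the cpuset is hard-coded as an example, a real version would pick the least-used node, e.g. via numactl/numastat):

Code:
#!/bin/bash
# /var/lib/vz/snippets/pin-numa.sh
# attach to a VM with:  qm set <vmid> --hookscript local:snippets/pin-numa.sh
# Proxmox calls hookscripts with two arguments: <vmid> <phase>

vmid="$1"
phase="$2"

# example cpuset: one NPS4 node of the 9374F incl. SMT siblings
# (adjust to your own lscpu -e output!)
cpuset="0-7,32-39"

if [ "$phase" = "post-start" ]; then
    pid=$(cat "/var/run/qemu-server/${vmid}.pid")
    # pin all QEMU threads of this VM to the chosen node's cores
    taskset -a -c -p "$cpuset" "$pid"
fi

exit 0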

Genoa has a ton of NUMA settings in the BIOS, a lot more than any other platform I've ever seen.
You can disable NUMA, split the CPU into 2/4 NUMA nodes, or even do one NUMA node per L3 cache, which leads to 8 NUMA nodes for my 9374F.
But I think we can agree that 8 NUMA nodes is a lot, and 8 threads per node is maybe a bit few.
(It's simply too hard for me to handle with pinning.)
So I stuck with 4 here.

On the Proxmox side of things, I don't know why the wiki entry claiming that NUMA works exists. It definitely doesn't.
In one of the mentioned threads where we discussed NUMA, a guy found out that NUMA actually only works on SLES or RHEL (not both, one of them),
because they have some special sauce.
QEMU itself doesn't properly support NUMA at the moment, and all the "NUMA option" in Proxmox's VM settings does is pass the NUMA info (which CPU belongs to which NUMA node) to the guest, to make the guest NUMA-aware. That's it.
But since the guest has no ability to select other physical cores, this option is pointless. It may only be useful if you give the guest almost all or all cores.
On the host itself that option doesn't have any effect.
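
(Side note: for plain host-side pinning without a hookscript, newer PVE versions also have an "affinity" setting in the VM options, which does taskset-style pinning for you; example range, adjust to your own NUMA layout:)

Code:
qm set 100 --affinity 0-7,32-39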

Cheers!
 