Dell AMD EPYC Slow Bandwidth Performance/throughput

Nov 8, 2024
Hi all. We are in deep trouble.
We use 3 x Dell PE 7625 servers, each with 2 x AMD EPYC 9374F (32-core) processors, and I am facing a bandwidth issue with VM-to-VM traffic as well as VM-to-host traffic within the same node.
The bandwidth is ~13 Gbps host-to-VM and ~8 Gbps VM-to-VM on a 50 Gbps bridge (2 x 25 Gbps ports bonded with LACP) with no other traffic (these are new nodes).

Countermeasures tested:

1) Configuring multiqueue (=8) in the Proxmox VM network device settings brought no improvement (see the example after this list).
2) My BIOS uses the performance profile with NUMA Nodes Per Socket (NPS) = 1, and on the host node numactl --hardware shows "available: 2 nodes" (i.e. 2 sockets with 1 NUMA node per socket).
As per the post https://forum.proxmox.com/threads/proxmox-8-4-1-on-amd-epyc-slow-virtio-net.167555/ I changed the BIOS to NPS=2 and NPS=4, but there was no improvement.

3) I have an old Intel cluster, and I know it reaches around 30 Gbps within a node (VM to VM).

So, to find the underlying cause, I installed the same Proxmox version on a new Intel Xeon 5410 (5th gen, 24-core) server (called N2) and ran iperf within the node (the host acting as both server and client). Please check the images: the speed is 68 Gbps without any parallelism (-P).
When I do the same on my new AMD 9374F node, to my shock it is only 38 Gbps (see the N1 images), almost half the performance.
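For reference, the multiqueue setting roughly looks like this in the VM config (the MAC address and bridge name below are only placeholders); depending on the guest kernel, the extra queues may also need to be activated inside the guest:
Code:
net0: virtio=BC:24:11:00:00:01,bridge=vmbr0,queues=8
Bash:
# inside the guest: enable all 8 queues on the virtio NIC (interface name is an example)
ethtool -L eth0 combined 8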

Now, this is why the VM-to-VM bandwidth inside a node is so low. These results are very scary, because the AMD processor is a beast with large caches, a 32 GT/s interconnect, etc. I am aware of its CCD architecture, but the speed is still far too low. I want to know whether there is any other method to increase the inter-core/inter-process bandwidth to maximum throughput.

If this is really the case, AMD for virtualization is a big NO for future buyers.


Note:

1) I have not added -P (parallel) in iperf because I want to reflect the real case: if you copy a big file or a backup to another node, there is only a single connection (see the comparison after this list).
2) As the tests are run within the same node, if I am right, no network interface is involved (that is why I get 30 Gbps with a 1G network card in my old server), so it is just the inter-core/inter-process bandwidth that we are measuring, and hence no network-level tuning should be required.
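Just to make the comparison explicit, this is the difference I mean (10.0.0.2 is only an example address):
Bash:
# single TCP stream -- roughly what a file copy or backup behaves like
iperf3 -c 10.0.0.2 -t 30

# multiple parallel streams, which would hide the per-stream limit
iperf3 -c 10.0.0.2 -t 30 -P 8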

We are struggling a lot; your guidance would be very helpful, as there is no other resource available for this strange issue.
Thanks.
 

Attachments:

  • N1 INFO.png
  • N1 IPERF.png
  • N2 INFO.png
  • N2 IPERF.png
Someone else on this forum had a similar problem, and disabling C-states in the BIOS seemed to help (see the German forum post).

Also check that your LACP settings are correct, including on the switches: if the settings on the switches and on the bridge don't match, that could be another possible issue. A quick way to check the bond is sketched below.
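To check the LACP side on the host, something like this should show whether 802.3ad is negotiated and both slaves are up (bond0 is an assumption, your bond may be named differently):
Bash:
# bond mode, LACP partner state and per-slave status
cat /proc/net/bonding/bond0

# bond details as the kernel sees them
ip -d link show bond0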
 
I know several people here who prefer AMD EPYC, so this doesn't seem to be so much a matter of performance as of personal preference. Beyond that I have no idea, sorry.
 
There are several things you could check/try.

Your node distance is typical for NUMA cross-traffic (Infinity Fabric). Your bandwidth test was not NUMA-locally isolated.

iperf on localhost can use any cores and any RAM, so your sockets get "mixed". You could try:
Bash:
# Server on NUMA node 0:
taskset -c 0-15 numactl -m 0 -N 0 iperf3 -s

# Client on NUMA node 0:
taskset -c 16-31 numactl -m 0 -N 0 iperf3 -c localhost

This should result in 80-100 Gbps locally.
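If you want to double-check the topology and the node distances first, something like this should show them (assuming the numactl package is installed):
Bash:
# NUMA nodes, their CPUs/RAM and the distance matrix
numactl --hardware

# which CPU ranges belong to which node
lscpu | grep -i numa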

Another thing: try to temporarily disable SMT and run the iperf tests again:
Bash:
echo off > /sys/devices/system/cpu/smt/control

To re-enable SMT use:
Bash:
echo on > /sys/devices/system/cpu/smt/control

To check SMT (on/off):
Bash:
cat /sys/devices/system/cpu/smt/control

In your VM configuration you can explicitly specify the NUMA node. Example:
Code:
numa: 1
sockets: 1
cores: 8
memory: 16384
numa0: cpus=0-7,hostnodes=1,memory=16384,policy=bind # << bind the guest's 8 vCPUs and 16 GB RAM to host NUMA node 1

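If you are on a Proxmox VE version that supports it (a sketch, assuming PVE 7.3 or newer), you can additionally pin the vCPU threads to specific host cores, e.g. cores 32-39, which on this system should belong to node 1 (please verify with numactl --hardware):
Code:
affinity: 32-39
That keeps the scheduler from migrating the vCPU threads to the other socket.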
If your VM runs Linux you can test the mapping with
Bash:
numactl --show

Within your BIOS a few options would be:

NPS (NUMA nodes per socket) --> 1 (fewer NUMA zones, max. RAM bandwidth) *

SMT mode --> Disabled (for testing)

Performance profile (if available) --> Maximum performance

CPPC (Collaborative Processor Performance Control) --> Enabled

P-States --> Disabled (or CPPC preferred)

Global C-States --> Disabled

If you have these options, try to change them also:

DRAM Interleaving --> Auto or Channel interleaving

NUMA-aware interleaving --> Disabled

I personally would change any BIOS setting step by step, followed by performance tests.

* Edit/notice: be sure to power off(!) your system after changing NPS. This is not a myth. If you only "Save, exit and reboot", parts of the older settings are kept in the firmware cache and the SRAT tables will not be fully rewritten.
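A quick way to verify the layout after the full power cycle (just a sketch; nothing vendor-specific assumed):
Bash:
# node count and distance matrix the kernel sees after the NPS change
numactl --hardware

# SRAT entries parsed from the ACPI tables at boot
dmesg | grep -i srat | head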
 
In addition to my previous post, you could also try to enable SR-IOV in your BIOS and kernel. For clarification: SR-IOV != full-NIC passthrough.

It helps even with VM <> VM traffic. Why? The typical virtio-net device on a bridge is not NUMA-aware. It uses a vhost-net thread which may be scheduled on a different NUMA node. Every packet between VMs (on the same host) goes through:

- virtio inside the VM
- vhost-net thread on the host
- the linux bridge
- another vhost-net thread
- virtio on the second VM

That means many context switches on different NUMA nodes. SR-IOV bypasses vhost-net, the bridge and most of the kernel stack. Every SR-IOV VF (virtual function) is a "real" PCIe device from the VM's perspective. If, for example, two VMs use VFs on the same host and physical NIC, most NICs can detect that both endpoints are local and use either zero-copy transfer or internal DMA. Mellanox ConnectX (with mlx5_core), for example, uses an internal loopback without PCIe bus traffic for VM<>VM traffic on the same host.
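A rough sketch of the host-side steps (the interface name, VF count, VMID and PCI address below are placeholders, and the required IOMMU kernel options depend on your platform):
Bash:
# kernel cmdline (example): keep the IOMMU in passthrough mode
#   ... iommu=pt

# create 4 virtual functions on the physical port (interface name is a placeholder)
echo 4 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs

# the new VFs show up as PCIe devices
lspci | grep -i "virtual function"

# pass one VF through to a VM (hypothetical VMID and PCI address)
qm set 101 -hostpci0 0000:41:00.1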
 
Running the NUMA-pinned iperf3 test suggested above (server pinned to cores 0-15, client to cores 16-31, both bound to NUMA node 0), I got the same ~38 Gbps. I even reduced the ranges to cores 0-7, but the speed stayed the same.
 
OK, you could also try to re-run the test with SMT disabled. If it helps, here's a little benchmark/test script I use for the same purpose on my hosts with EPYC/Threadripper:


Bash:
#!/bin/bash

# Configuration
IPERF_PORT=5001
DURATION=10
NUMA_NODE=0
CPU_RANGE_SERVER="0-15"
CPU_RANGE_CLIENT="16-31"
WINDOW_SIZES=("128k" "256k" "512k")
PARALLEL_STREAMS=(1 4 8 16)

# Disable SMT temporarily
echo "[INFO] Disabling SMT temporarily..."
echo off > /sys/devices/system/cpu/smt/control 2>/dev/null
cat /sys/devices/system/cpu/smt/control

# Start iperf3 server
echo "[INFO] Starting iperf3 server on CPU ${CPU_RANGE_SERVER}, NUMA ${NUMA_NODE}..."
taskset -c ${CPU_RANGE_SERVER} numactl -m ${NUMA_NODE} -N ${NUMA_NODE} iperf3 -s -p ${IPERF_PORT} > /tmp/iperf_server.log 2>&1 &
IPERF_PID=$!
sleep 2

# Test matrix execution
echo
echo "=== NUMA-local iperf3 Benchmark ==="
printf "%-10s %-10s %-10s\n" "Window" "Streams" "Gbit/sec"
echo "-----------------------------------"

for WIN in "${WINDOW_SIZES[@]}"; do
  for P in "${PARALLEL_STREAMS[@]}"; do
    RESULT=$(taskset -c ${CPU_RANGE_CLIENT} numactl -m ${NUMA_NODE} -N ${NUMA_NODE} iperf3 -c 127.0.0.1 -p ${IPERF_PORT} -t ${DURATION} -w ${WIN} -P ${P} 2>/dev/null)
    BW=$(echo "$RESULT" | grep SUM | grep -oP '[0-9.]+(?= Gbits/sec)' | tail -1)
    if [ -z "$BW" ]; then
      # fallback in case SUM is missing (single stream)
      BW=$(echo "$RESULT" | grep -oP '[0-9.]+(?= Gbits/sec)' | tail -1)
    fi
    printf "%-10s %-10s %-10s\n" "$WIN" "$P" "${BW:-n/a}"
  done
done

# Stop iperf3 server
kill $IPERF_PID 2>/dev/null
wait $IPERF_PID 2>/dev/null

# Re-enable SMT
echo "[INFO] Re-enabling SMT..."
echo on > /sys/devices/system/cpu/smt/control 2>/dev/null

echo -e "\n[INFO] Benchmark finished. SMT has been re-enabled."
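Assuming you save it as, say, numa_iperf_bench.sh (the filename is just an example), it needs root because of the SMT toggle:
Bash:
chmod +x numa_iperf_bench.sh
sudo ./numa_iperf_bench.sh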