Epyc Milan VM to VM Communication Slow

dizzydre21

Apr 10, 2023
Hello all,

I was recently testing VM-to-VM bandwidth on various machines in my homelab. I ran these tests with iperf3 from within several VMs because I was not quite getting 25Gbit/s across my network. I eventually got to the point of digging through the BIOS config of the Proxmox server that was slowest. It is by far the most powerful of the bunch (Epyc Milan), yet I cannot exceed ~20Gbit/s over the network or across VMs on the same machine with a dedicated vmbr1 bridge (no assigned physical NIC).
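For reference, the iperf3 pattern was roughly the following (a sketch; exact flags may have varied run to run, and the 10.x address is just a placeholder):

Bash:
# In the "server" VM:
iperf3 -s

# In the "client" VM (both VMs attached to the same vmbr1 bridge):
iperf3 -c 10.0.0.2 -t 30        # single stream
iperf3 -c 10.0.0.2 -t 30 -P 4   # parallel streams, to rule out a single-core cap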

I'm using the amd_pstate driver on both AMD machines, but all of them are typically set to the powersave governor with the balance_performance energy preference.
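For what it's worth, this is how I check the driver/governor/EPP on these machines (standard sysfs paths for amd_pstate, as far as I know):

Bash:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference

# Temporarily force the performance governor on all cores to rule out
# frequency scaling as the bottleneck:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor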

I'd be very grateful for advice on how to troubleshoot this.


Some of the hardware:
AMD Epyc 7443P
ASRock ROMED2-2T motherboard
256GB 3200MHz ECC RDIMMs (8x32GB)
Samsung 870 500GB (boot disk)
Samsung PM9A3 x2, ZFS mirror (VM OS disks)

Tiny machine with better VM-to-VM comms (~25Gbit/s):
Lenovo P330 Tiny
Intel i7-8700T
2x16GB RAM
WD CL SN720 x2, ZFS mirror (boot and VM disks)

Third tested machine (~45Gbit/s):
AMD Ryzen 7 7800X3D
Supermicro H13 motherboard
2x32GB 5200MHz ECC UDIMMs
 
Threadripper 5955WX (16c/32t) (NPS=1), 8x32GB 3200MT/s-CL22, VirtIO, amd_pstate, powersave/balance_performance (boosts up to 4.7GHz with PBO and +200MHz offset)

I am also getting 22Gbit/s VM to VM. Wonder if NPS=2 would help, but too lazy to reboot.
LXC to LXC 41Gbit/s
 
After reading AMD performance tuning guides, I would say the better you align L3 caches with NUMA domains, the better performance you will get in most cases. That way the OS knows the cache architecture and can take advantage of it.

I tested this years ago on 2x32 and 2x64 systems, and after setting a static high-performance governor and fixed frequencies, enabling "LLC/L3/CCX as NUMA domain" did the trick. It boosted the performance of memory-intensive workloads.

You can check with numastat before and after the change.
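For example (the counters are cumulative, so watch the deltas while the workload runs; the pgrep target is just an example):

Bash:
# System-wide per-node hit/miss counters, refreshed every 5s:
watch -n 5 numastat

# Per-process view for a running workload, e.g. fio:
numastat -p $(pgrep -n fio)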

Just let me know if you test it ;)
 
Thanks for the reply. I have my 7443P set up basically the same as your Threadripper. How would NPS2 help? I'm not super knowledgeable on how all the NUMA stuff works. Mine is just a single CPU.
 
I'm going to do some testing on this today.

You think that even on a single-CPU system, NUMA settings could cause slow transfers?
 
Okay, so the only BIOS parameter I found that looked like what you described was called "ACPI SRAT L3 Cache as NUMA Domain". I enabled it with NPS1 also set and booted back into my OS. I forgot to mention that I'm no longer using Proxmox on this particular machine, but the issue persists across several OSes that I've tried. FWIW, I moved the original ZFS pool to an LGA1700 machine, and it gets around 6,000MB/s sequential read/write on the same pool. So the issue here is definitely something related to the Epyc CPU, the motherboard, or the BIOS config.
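In case it matters, this is how I checked what the OS sees after the change (nothing board-specific; with L3-as-NUMA enabled on Milan I'd expect several small nodes instead of one):

Bash:
numactl --hardware    # lists each node with its CPUs and local memory
lscpu | grep -i numa  # quick summary of node count and CPU-to-node mapping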


Also, please note that I got much better speeds if I did not use ZFS but instead had a single drive formatted as EXT4. I could get basically full bandwidth on either PCIe 3 or PCIe 4 SSDs. I tried creating the ZFS pool manually and also within a TrueNAS VM so I had control via the GUI and could make damn sure volume size, async, and all that stuff was set the way I wanted. Tests from within the TrueNAS VM and on bare metal both resulted in bandwidth being limited to ~2,100MB/s.
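The pool-side settings I was double-checking looked roughly like this (pool/dataset names are placeholders for my actual ones):

Bash:
zfs get recordsize,compression,sync,atime tank/vmdata
zpool status tank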



I have no idea what I'm looking at with the following numactl and numastat commands, but I've pasted them in the hope that you might.

Ran in the directory where the ZFS pool was mounted while fio was running:

Code:
[redacted@redacted-server fio]$ numastat
                           node0           node1           node2           node3
numa_hit                 2228523         1427110         1398131         1799493
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit               791             785             787             781
local_node               2218553         1414927         1386431         1785785
other_node                  9970           12183           11700           13708
[redacted@redacted-server fio]$ numastat
                           node0           node1           node2           node3
numa_hit                 2923281         1750225         1798990         2279736
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit               791             785             787             781
local_node               2913205         1738038         1787274         2266010
other_node                 10076           12187           11716           13726
[redacted@redacted-server fio]$ numastat
                           node0           node1           node2           node3
numa_hit                 2950093         1784334         1819508         2303329
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit               791             785             787             781
local_node               2940010         1772147         1807792         2289603
other_node                 10083           12187           11716           13726


fio command:

Code:
[redacted@redacted-server fio]$ sudo  fio --ramp_time=5 --gtod_reduce=1 --numjobs=2 --bs=1M --size=100G --runtime=60s --readwrite=write --name=testfile
testfile: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
...
fio-3.41
Starting 2 processes
Jobs: 2 (f=2): [W(2)][100.0%][w=2026MiB/s][w=2026 IOPS][eta 00m:00s]
testfile: (groupid=0, jobs=1): err= 0: pid=7875: Sun Nov 30 11:30:49 2025
  write: IOPS=1024, BW=1025MiB/s (1075MB/s)(60.1GiB/60047msec); 0 zone resets
   bw (  MiB/s): min=  944, max= 1130, per=50.18%, avg=1024.97, stdev=33.45, samples=120
   iops        : min=  944, max= 1130, avg=1024.95, stdev=33.45, samples=120
  cpu          : usr=0.51%, sys=13.01%, ctx=61757, majf=0, minf=1087
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,61537,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
testfile: (groupid=0, jobs=1): err= 0: pid=7876: Sun Nov 30 11:30:49 2025
  write: IOPS=1017, BW=1018MiB/s (1067MB/s)(59.7GiB/60044msec); 0 zone resets
   bw (  KiB/s): min=882688, max=1144832, per=49.84%, avg=1042433.01, stdev=38599.93, samples=120
   iops        : min=  862, max= 1118, avg=1017.98, stdev=37.71, samples=120
  cpu          : usr=0.55%, sys=12.85%, ctx=61264, majf=0, minf=816
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,61114,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2043MiB/s (2142MB/s), 1018MiB/s-1025MiB/s (1067MB/s-1075MB/s), io=120GiB (129GB), run=60044-60047msec


Code:
[redacted@redacted-server fio]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
preferred:
 
Your NUMA settings (and others) in the BIOS are limiting the inter-VM communication.

The stats show 4 NUMA nodes (0-3), which means NPS=4 is still active. It's always recommended to completely power down AMD server boards after changing NUMA options.

Your output:

Code:
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3

With NPS=1 correctly set it should be:

Code:
cpubind: 0
nodebind: 0
membind: 0

"ACPI SRAT L3 Cache as NUMA Domain" -> wrong. This produces artifical NUMA nodes based on L3 cache. That kills ZFS performance.

Here are some recommended settings for your board:

NUMA / memory settings

Advanced → AMD CBS → DF Common Options → Memory → NUMA Nodes per Socket (NPS)

NUMA Per Socket (NPS): NPS1

CPU power management

Advanced → AMD CBS → CPU Common Options

Core Performance Boost: Enabled
CPPC (Collaborative Processor Performance Control): Enabled
CPPC Preferred Cores: Enabled
Global C-States: Disabled

AMD P-state / performance scaling

Advanced → AMD CBS → NBIO → SMU Common Options

P-State Mode: CPPC / Autonomous
Boost Mode: Enabled

PCIe / I/O Fabric Settings

Advanced → AMD CBS → NBIO → I/O Options

IOMMU: Enabled
SR-IOV Support: Enabled
PCIe Speed: Gen4
(or auto if it's unstable)
ACS Support: Enabled

Network related

Advanced → AMD CBS → NBIO → DF Cstates
DF C-States: Disabled

Memory settings

Advanced → AMD CBS → UMC Common Options

Memory Interleaving: Auto
Memory PowerDown: Disabled

Chipset config

Advanced → Chipset Configuration

SRAM Scrub: Disabled
DRAM Scrub: Disabled
Periodic / Background Scrub: Disabled

CPU config

Advanced → CPU Configuration

Simultaneous Multithreading (SMT): Enabled
Above 4G Decoding: Enabled

Be sure to power the system off completely once you've saved your settings.
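Once it's back up, a quick sanity check that NPS1 actually took effect:

Bash:
numactl --hardware    # should show exactly one node (node 0) holding all CPUs
lscpu | grep -i numa  # "NUMA node(s): 1" expected with NPS1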
 
Thank you so much for the detailed response.



I went back through all of the BIOS settings and got them as close as possible to your recommendations. A few of them either weren't there, or they were in different locations. Perhaps there is a difference in BIOS versions or something.

Regardless, my transfer speeds have not changed.


Code:
[redacted@redacted-server fio]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0
nodebind: 0
membind: 0
preferred:


Code:
[redacted@redacted-server fio]$ numastat
                           node0
numa_hit                 5917074
numa_miss                      0
numa_foreign                   0
interleave_hit              3298
local_node               5917074
other_node                     0
[redacted@redacted-server fio]$


Code:
[redacted@redacted-server fio]$ sudo  fio --ramp_time=5 --gtod_reduce=1 --numjobs=1 --bs=1M --size=100G --runtime=60s --readwrite=write --name=testfile
testfile: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [W(1)][89.1%][w=2056MiB/s][w=2056 IOPS][eta 00m:06s]
testfile: (groupid=0, jobs=1): err= 0: pid=8138: Sun Nov 30 15:47:08 2025
  write: IOPS=2046, BW=2046MiB/s (2146MB/s)(87.7GiB/43862msec); 0 zone resets
   bw (  MiB/s): min= 1910, max= 2194, per=100.00%, avg=2047.18, stdev=68.23, samples=87
   iops        : min= 1910, max= 2194, avg=2047.15, stdev=68.24, samples=87
  cpu          : usr=1.08%, sys=26.75%, ctx=92807, majf=0, minf=36
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,89759,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2046MiB/s (2146MB/s), 2046MiB/s-2046MiB/s (2146MB/s-2146MB/s), io=87.7GiB (94.1GB), run=43862-43862msec
 
Your test pattern "hits" Milan's limitation for a single thread on ZFS.

Can you please provide the output of these 2 test patterns:

Write test, many threads:

Bash:
fio \
 --name=write \
 --direct=1 \
 --rw=write \
 --bs=1M \
 --numjobs=4 \
 --iodepth=32 \
 --size=50G \
 --ioengine=libaio

Result should be 5-6GB/s

And this one (read):

Bash:
fio --name=readtest --rw=read --bs=1M --numjobs=8 --iodepth=32 --size=50G --ioengine=libaio

Should be between 8-14GB/s
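Side note on reading the output: without --group_reporting, fio prints a separate block per job and only the final "Run status" line is the aggregate. If the paste gets too long, the same read test with results summed up front would be:

Bash:
fio --name=readtest --rw=read --bs=1M --numjobs=8 --iodepth=32 --size=50G \
    --ioengine=libaio --group_reporting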
 
Yes, I will test shortly. I needed to power down the machine to remove some disks unrelated to this issue.

I'll report back hopefully within ~30 minutes.
 
Reporting back. I've shortened the pasted output because it exceeded the forum's character limit.

Results were worse on the write:

Code:
[redacted@redacted-server fio]$ sudo fio \
 --name=write \
 --direct=1 \
 --rw=write \
 --bs=1M \
 --numjobs=4 \
 --iodepth=32 \
 --size=50G \
 --ioengine=libaio
write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...

Run status group 0 (all jobs):
  WRITE: bw=1819MiB/s (1908MB/s), 455MiB/s-474MiB/s (477MB/s-497MB/s), io=200GiB (215GB), run=107936-112581msec




Reads looked like what I'd expect from writes:

Code:
[redacted@redacted-server fio]$ sudo fio --name=readtest --rw=read --bs=1M --numjobs=8 --iodepth=32 --size=50G --ioengine=libaio
                                                                                          
readtest: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...


Run status group 0 (all jobs):
   READ: bw=5428MiB/s (5691MB/s), 678MiB/s-2419MiB/s (711MB/s-2536MB/s), io=400GiB (429GB), run=21166-75466msec