Backups hang reliably when jumbo frames enabled on PBS VM

sstillwell

New Member
Dec 18, 2023
I currently have Proxmox Backup Server (current and up to date with the no-subscription repository) installed in a VM in Synology VMM, running on a Synology DS1819+ (32 GB RAM, mix of HDDs and SSDs), with the backup repository residing on NFS storage. The backup server VM has two vNICs: one connected to a 1 Gbps NIC on the NAS for management, the other connected to a 10 Gbps storage-network NIC. The Proxmox servers are similarly configured (1 Gbps management network, 1 Gbps VM network, 10 Gbps storage network), and MTU is set to 9000 everywhere, including the Unifi physical switches on the network. The backup repository and VM storage are both on NFS shares on separate volumes on the NAS (VM storage on Seagate IronWolf SSDs, backup repository on Seagate IronWolf traditional HDDs with Synology NVMe cache).

Backup is reliable and fairly fast (> 250 MB/s) as long as I do NOT enable jumbo frames on the backup server vNICs. If I enable jumbo frames (MTU 9000), the server appears to work properly until I run a backup, at which point the backup hangs until it eventually times out. I've tried lowering the MTU to 8000 and 4000 and it just doesn't work. I also tried switching from Linux network devices to OVS with OVSBridge, OVSPort, and OVSIntPort; that makes no difference either.
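For reference, this is roughly how the MTU is applied inside the PBS guest via /etc/network/interfaces (interface names and the management address here are placeholders rather than my exact config; the storage address is illustrative):

Code:
# management vNIC on the 1 Gbps network, standard MTU (placeholder address)
auto ens18
iface ens18 inet static
    address 192.168.1.50/24
    gateway 192.168.1.1

# storage vNIC on the 10 Gbps network, jumbo frames
auto ens19
iface ens19 inet static
    address 10.1.1.5/24
    mtu 9000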

I don't get any real error messages out of the process; it just hangs. I would hope that using jumbo frames would improve backup throughput, even if only marginally. Any ideas?

Scott
 
Sounds like there is an MTU problem in the path. Did you verify that you can transmit/receive at MTU 9000 at close to 10 Gbps?
I have been working on a different issue, but it also came down to an MTU problem in my path, and I used the following to trace it down. This will also show whether your throughput can get up to about 10 Gbps.

Test throughput
Server machine
Code:
iperf -s -e -i 1

Client machine
Code:
iperf -c SERVER_IP -e -i 1

Test ping with 9000 bytes
Code:
ping -s 9000 SERVER_IP
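Note that -s sets the ICMP payload size, so 9000 bytes of payload is really a 9028-byte packet, and plain ping may let the kernel fragment it. If you want to strictly test a 9000-byte path MTU, pin the don't-fragment bit and subtract the 28 bytes of IP + ICMP headers; something like:

Code:
# 8972 payload + 8 (ICMP) + 20 (IP) = 9000 bytes on the wire, DF set
ping -M do -s 8972 SERVER_IP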

If I understand your config correctly, you have PBS running on the Synology along with the NFS storage it serves? If so, you could be causing it to choke under load.
 
With MTU 1500 I get
Code:
root@pve-host01:~# iperf -c 10.1.1.5 -e -i 1
------------------------------------------------------------
Client connecting to 10.1.1.5, TCP port 5001 with pid 1434530 (1 flows)
Write buffer size: 131072 Byte
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.1.1.3%storport port 40792 connected with 10.1.1.5 port 5001 (sock=3) (icwnd/mss/irtt=14/1448/300) (ct=0.35 ms) on 2024-04-10 16:20:42 (CDT)
[ ID] Interval            Transfer    Bandwidth       Write/Err  Rtry     Cwnd/RTT(var)        NetPwr
[  1] 0.0000-1.0000 sec   855 MBytes  7.17 Gbits/sec  6841/0          1     4341K/798(179) us  1123639
[  1] 1.0000-2.0000 sec  1012 MBytes  8.49 Gbits/sec  8097/0          0     4341K/801(216) us  1324956
[  1] 2.0000-3.0000 sec   973 MBytes  8.16 Gbits/sec  7781/0          0     4341K/610(77) us  1671920
[  1] 3.0000-4.0000 sec   949 MBytes  7.96 Gbits/sec  7591/0          0     4341K/1415(534) us  703157
[  1] 4.0000-5.0000 sec   868 MBytes  7.28 Gbits/sec  6944/0         22     3038K/971(221) us  937347
[  1] 5.0000-6.0000 sec   913 MBytes  7.66 Gbits/sec  7302/0          7     2293K/2630(256) us  363912
[  1] 6.0000-7.0000 sec   862 MBytes  7.24 Gbits/sec  6900/0         84     1330K/750(173) us  1205862
[  1] 7.0000-8.0000 sec   829 MBytes  6.95 Gbits/sec  6630/0          0     1698K/1415(85) us  614139
[  1] 8.0000-9.0000 sec   844 MBytes  7.08 Gbits/sec  6753/0         90     1296K/1775(337) us  498664
[  1] 9.0000-10.0000 sec   829 MBytes  6.95 Gbits/sec  6629/0          0     1527K/1409(295) us  616662
[  1] 0.0000-10.0297 sec  8.72 GBytes  7.47 Gbits/sec  71469/0        204     1527K/1419(249) us  658196
but at MTU 9000 I get
Code:
root@pve-host01:~# iperf -c 10.1.1.5 -e -i 1
------------------------------------------------------------
Client connecting to 10.1.1.5, TCP port 5001 with pid 1436453 (1 flows)
Write buffer size: 131072 Byte
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.1.1.3%storport port 53480 connected with 10.1.1.5 port 5001 (sock=3) (icwnd/mss/irtt=87/8948/341) (ct=0.39 ms) on 2024-04-10 16:24:19 (CDT)
[ ID] Interval            Transfer    Bandwidth       Write/Err  Rtry     Cwnd/RTT(var)        NetPwr
[  1] 0.0000-1.0000 sec   551 KBytes  4.51 Mbits/sec  5/0         11        8K/11003(9064) us  51.24
[  1] 1.0000-2.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          1        8K/11003(9064) us  0.000000
[  1] 2.0000-3.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          0        8K/11003(9064) us  0.000000
[  1] 3.0000-4.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          1        8K/11003(9064) us  0.000000
[  1] 4.0000-5.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          0        8K/11003(9064) us  0.000000
[  1] 5.0000-6.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          0        8K/11003(9064) us  0.000000
[  1] 6.0000-7.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          1        8K/11003(9064) us  0.000000
[  1] 7.0000-8.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          0        8K/11003(9064) us  0.000000
[  1] 8.0000-9.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          0        8K/11003(9064) us  0.000000
[  1] 9.0000-10.0000 sec  0.000 Bytes  0.000 bits/sec  0/0          0        8K/11003(9064) us  0.000000
[  1] 0.0000-20.4494 sec   551 KBytes   221 Kbits/sec  5/0         15        8K/11003(9064) us  2.505655
root@pve-host01:~# ping -s 9000 10.1.1.5
PING 10.1.1.5 (10.1.1.5) 9000(9028) bytes of data.
^C
--- 10.1.1.5 ping statistics ---
32 packets transmitted, 0 received, 100% packet loss, time 31696ms

So SOMETHING is going wrong, obviously, but I'm not sure what. The networks and switches are all configured for MTU 9000 and the two host machines are just connected to a single managed switch.

Thanks for the input - I'll continue to dig.
 
Depending on the switch vendor, the MTU should be set higher than 9000 in the switch config. Our Dell switches are at 9216 to allow for packet overhead, if it applies in the environment (e.g. VLAN tagging).


Try decreasing the 9000 in your ping (start at 8972) and see if it starts working as you go smaller. If it does, then you need to set the MTU in the switch higher than 9000; I would use the maximum your switches support. This assumes the traffic is actually going through the switch and not staying local, given that PBS and the storage are all on the NAS.
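A quick-and-dirty sketch of that sweep, if you want to script it (adjust the sizes and IP to your setup):

Code:
# walk the DF-bit ping down from 8972; the largest payload that passes,
# plus 28 bytes of headers, is the effective path MTU
for size in 8972 8000 4000 1472; do
    if ping -M do -c 2 -W 1 -s $size SERVER_IP >/dev/null 2>&1; then
        echo "OK   at payload $size"
    else
        echo "FAIL at payload $size"
    fi
done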
 
Well, all of the switching and routing equipment is Ubiquiti Unifi, and their KBs say...

Code:
MTU Size and Exceptions
On UniFi Switches and Gateways, the typical MTU size is 9216 bytes when using Jumbo Frames. This applies to all devices with the exception of the following:
USG: 9000 bytes
USG-Pro: 9000 bytes
and I don't have either of those products, so the switches should support up to 9216; 9000 should be fine in that case.

The Synology NAS unit only allows you to choose a frame size up to 9000, and the PVE hosts are configured for 9000 and talk to the NFS shares on the Synology with no issue...and that traffic goes through the same Unifi switch that the problematic connection goes through.

I can't get a ping larger than 1472 to go through from the PBS server to the PVE server, or vice-versa, regardless of the MTU set on the PBS interface (yes, I'm restarting the network stack after each change and verifying that 'ip a' shows the correct MTU on the interfaces).
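For completeness, this is the sort of thing I'm doing after each change (the interface name here is a placeholder for whichever vNIC is on the storage network):

Code:
# apply the new MTU (or edit /etc/network/interfaces and restart networking)
ip link set dev ens19 mtu 9000
# confirm the kernel actually took it
ip a show dev ens19 | grep mtu
# re-test from PBS toward the PVE host with the DF bit set
ping -M do -s 8972 -c 3 10.1.1.3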

It's got me downright puzzled.
 
For the moment, I've just reverted the changes and disabled jumbo frames across everything. Monitoring PBS tasks such as datastore verify jobs, I'm getting CPU-bound on the PBS VM and the NAS long before I'm limited by network throughput. Standing up a separate physical, fire-breathing PBS server isn't practical for me right now, and performance is more than acceptable as is (I don't have a lot of serious IOPS going on in my home lab outside of backup tasks), so I'm falling back to standard frames and calling it done. I may revisit this later, but I'm getting roughly 6-7 Gbps minimum throughput even at the standard MTU.
 
For what it's worth, I've found the problem. It's Synology: specifically, if you're running their Virtual Machine Manager (and a VM on it), it installs and enables Open vSwitch, and while you can set MTU 9000 on the NAS's physical interfaces, the internal virtual bridges/switches that the VMs' vNICs connect to remain at 1500 with no effective way to adjust them. I could move the PBS VM back onto the PVE host, but that kind of defeats the purpose I was shooting for, and I doubt any gains from jumbo frames would exceed what's lost by moving PBS away from the storage repository. I guess someday I'll suck it up and buy a machine to host PBS with local storage.
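If anyone else wants to poke at this on their own Synology, the Open vSwitch side can at least be inspected from an SSH session; the bridge/interface name below is just what I'd expect (ovs_ethX), and while mtu_request is a standard OVS setting, I can't promise DSM honours it or keeps it across reboots:

Code:
# list the OVS bridges that Virtual Machine Manager created
ovs-vsctl list-br
# show the current and requested MTU on the bridge interface
ovs-vsctl list interface ovs_eth2 | grep -i mtu
# ask OVS for a 9000-byte MTU on that interface (DSM may revert this)
ovs-vsctl set interface ovs_eth2 mtu_request=9000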
 
