Greetings, all!
I’m having an issue with Proxmox 9.1.6 and a virtualized TrueNAS SCALE 25.10.2.1 that I’m hoping someone can shed some light on.
- Networking is 10GbE all around, to a MikroTik CRS309-1G-8S+IN.
- Proxmox (“Mithril”) is using the on-board 10GbE on a Supermicro ITX motherboard.
- The TrueNAS SCALE VM (“Castle”) is configured to use vmbr0 through Proxmox.
- The external test server (“Node”) is a second TrueNAS SCALE box on bare metal, and has an X520-DA2 PCIe 10gb NIC.
- All hops in the fabric (vmbr0, TrueNAS SCALE on both machines, the MikroTik switch) are already set to MTU 9000 / L2MTU 9216.
- All OS installs (Proxmox and both SCALE boxes) are fresh and, except for functionally important adjustments like IP assignment, NTP confirmation, etc., are stock and default, to inject as little bias into the troubleshooting as possible.
- Mithril is stand-alone, and NOT clustered.
Network Performance:
- I’m using “iperf3 -c Node -P 4” as my iperf3 config in all tests below.
- Mithril to Node is blisteringly fast and clean: 9.90Gbps with nary a retry.
- Node to Mithril is equally fast and clean: a solid 9.90Gbps with near-zero retries.
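For completeness, the invocations behind those numbers look like this (the server-side listener and the `-R` reverse-mode run aren't spelled out above, but they're the standard iperf3 way to test both directions):

```shell
# On the box receiving traffic (e.g. Node), start a listener:
iperf3 -s

# From the sender, run four parallel streams for the default 10 seconds:
iperf3 -c Node -P 4

# Same pairing, opposite direction: -R makes the server transmit instead.
iperf3 -c Node -P 4 -R
```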
However… Castle to Node blows up, and is the source of my query here. I get, on a good run, thousands of retries. On a bad one, tens of thousands:
[ 5] 6.00-7.00 sec 295 MBytes 2.47 Gbits/sec 86 1.68 MBytes
[ 7] 6.00-7.00 sec 292 MBytes 2.45 Gbits/sec 115 1.70 MBytes
[ 9] 6.00-7.00 sec 296 MBytes 2.49 Gbits/sec 204 1.66 MBytes
[ 11] 6.00-7.00 sec 294 MBytes 2.46 Gbits/sec 116 1.56 MBytes
[SUM] 6.00-7.00 sec 1.15 GBytes 9.88 Gbits/sec 521
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 7.00-8.00 sec 292 MBytes 2.45 Gbits/sec 155 1.67 MBytes
[ 7] 7.00-8.00 sec 294 MBytes 2.46 Gbits/sec 524 1.64 MBytes
[ 9] 7.00-8.00 sec 295 MBytes 2.47 Gbits/sec 174 1.77 MBytes
[ 11] 7.00-8.00 sec 290 MBytes 2.43 Gbits/sec 161 1.25 MBytes
[SUM] 7.00-8.00 sec 1.14 GBytes 9.82 Gbits/sec 1014
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 8.00-9.00 sec 295 MBytes 2.47 Gbits/sec 80 1.32 MBytes
[ 7] 8.00-9.00 sec 294 MBytes 2.46 Gbits/sec 159 1.26 MBytes
[ 9] 8.00-9.00 sec 294 MBytes 2.46 Gbits/sec 108 1.44 MBytes
[ 11] 8.00-9.00 sec 295 MBytes 2.47 Gbits/sec 50 1.53 MBytes
[SUM] 8.00-9.00 sec 1.15 GBytes 9.88 Gbits/sec 397
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 9.00-10.00 sec 294 MBytes 2.46 Gbits/sec 8 1.60 MBytes
[ 7] 9.00-10.00 sec 294 MBytes 2.46 Gbits/sec 154 1.48 MBytes
[ 9] 9.00-10.00 sec 296 MBytes 2.49 Gbits/sec 189 1.83 MBytes
[ 11] 9.00-10.00 sec 294 MBytes 2.46 Gbits/sec 70 1.76 MBytes
[SUM] 9.00-10.00 sec 1.15 GBytes 9.88 Gbits/sec 421
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.88 GBytes 2.47 Gbits/sec 1386 sender
[ 5] 0.00-10.00 sec 2.87 GBytes 2.47 Gbits/sec receiver
[ 7] 0.00-10.00 sec 2.86 GBytes 2.46 Gbits/sec 1977 sender
[ 7] 0.00-10.00 sec 2.86 GBytes 2.46 Gbits/sec receiver
[ 9] 0.00-10.00 sec 2.88 GBytes 2.47 Gbits/sec 1975 sender
[ 9] 0.00-10.00 sec 2.88 GBytes 2.47 Gbits/sec receiver
[ 11] 0.00-10.00 sec 2.87 GBytes 2.46 Gbits/sec 986 sender
[ 11] 0.00-10.00 sec 2.86 GBytes 2.46 Gbits/sec receiver
[SUM] 0.00-10.00 sec 11.5 GBytes 9.87 Gbits/sec 6324 sender
[SUM] 0.00-10.00 sec 11.5 GBytes 9.86 Gbits/sec receiver
Also, what's incredibly strange (at least to me) is that Node -> Castle is clean and clear:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 6.00-7.00 sec 295 MBytes 2.47 Gbits/sec 0 760 KBytes
[ 7] 6.00-7.00 sec 296 MBytes 2.48 Gbits/sec 0 795 KBytes
[ 9] 6.00-7.00 sec 295 MBytes 2.48 Gbits/sec 0 804 KBytes
[ 11] 6.00-7.00 sec 295 MBytes 2.47 Gbits/sec 0 795 KBytes
[SUM] 6.00-7.00 sec 1.15 GBytes 9.91 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 7.00-8.00 sec 295 MBytes 2.48 Gbits/sec 0 760 KBytes
[ 7] 7.00-8.00 sec 295 MBytes 2.47 Gbits/sec 0 795 KBytes
[ 9] 7.00-8.00 sec 295 MBytes 2.48 Gbits/sec 0 1.21 MBytes
[ 11] 7.00-8.00 sec 297 MBytes 2.49 Gbits/sec 0 1.13 MBytes
[SUM] 7.00-8.00 sec 1.15 GBytes 9.91 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 8.00-9.00 sec 295 MBytes 2.48 Gbits/sec 0 760 KBytes
[ 7] 8.00-9.00 sec 295 MBytes 2.48 Gbits/sec 0 795 KBytes
[ 9] 8.00-9.00 sec 295 MBytes 2.47 Gbits/sec 0 1.21 MBytes
[ 11] 8.00-9.00 sec 295 MBytes 2.47 Gbits/sec 0 1.13 MBytes
[SUM] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 9.00-10.00 sec 295 MBytes 2.47 Gbits/sec 0 760 KBytes
[ 7] 9.00-10.00 sec 295 MBytes 2.47 Gbits/sec 0 795 KBytes
[ 9] 9.00-10.00 sec 295 MBytes 2.47 Gbits/sec 0 1.21 MBytes
[ 11] 9.00-10.00 sec 295 MBytes 2.47 Gbits/sec 0 1.13 MBytes
[SUM] 9.00-10.00 sec 1.15 GBytes 9.89 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.88 GBytes 2.48 Gbits/sec 1 sender
[ 5] 0.00-10.00 sec 2.88 GBytes 2.48 Gbits/sec receiver
[ 7] 0.00-10.00 sec 2.88 GBytes 2.48 Gbits/sec 1 sender
[ 7] 0.00-10.00 sec 2.88 GBytes 2.47 Gbits/sec receiver
[ 9] 0.00-10.00 sec 2.88 GBytes 2.48 Gbits/sec 3 sender
[ 9] 0.00-10.00 sec 2.88 GBytes 2.47 Gbits/sec receiver
[ 11] 0.00-10.00 sec 2.89 GBytes 2.48 Gbits/sec 3 sender
[ 11] 0.00-10.00 sec 2.88 GBytes 2.48 Gbits/sec receiver
[SUM] 0.00-10.00 sec 11.5 GBytes 9.91 Gbits/sec 8 sender
[SUM] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
So VM NAS -> bare metal NAS blows up, but bare metal NAS -> VM NAS is nearly perfect.
Line rate stays pretty good: not the pegged 9.90Gbps of host ↔ bare-metal SCALE, but a respectable 9.88Gbps. However, 6,300 retries over ten one-second intervals is kind of awful. The only thing “in the way” of Castle’s performance is the fact that it’s virtualized on Proxmox. The host has no other VMs installed, and as a standalone node has no Corosync overhead. I’ve tried altering NIC offload parameters inside Castle (“ethtool -K ens18 tso off gso off lro off”), but that made the results worse, not better. I don’t think I can use PCIe passthrough to split the on-board 10GbE NICs on the Supermicro board, and I can’t add another PCIe 10GbE NIC because the single x16 slot is occupied by the HBA for Castle’s drive pools. Isolating and passing through a 10GbE NIC for Castle would obviously be a silver bullet, since it would bypass Proxmox’s networking overhead entirely, but that’s not in the cards.
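In case anyone wants to reproduce the offload experiment, these are the commands I used inside Castle, plus the revert (whether LRO can actually be toggled depends on the virtio feature set, so treat that flag as best-effort):

```shell
# Inside the Castle VM; the virtio NIC enumerates as ens18 here.
# Disabling segmentation/receive offloads -- this made the retries worse for me:
ethtool -K ens18 tso off gso off lro off

# Putting the offloads back to their defaults:
ethtool -K ens18 tso on gso on lro on

# Comparing the resulting offload state against a known-good box:
ethtool -k ens18 | grep -E 'segmentation|large-receive'
```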
Is there some sort of tuning I can do in Proxmox to alleviate the retries? There’s clearly enough bandwidth, as the throughput numbers show, yet the retries keep happening. For me, the smoking gun is that the host and the bare-metal NAS can pass iperf3 at blazing speed AND with stability, while VM to bare metal chokes badly over the exact same hardware.
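To give a concrete idea of the kind of Proxmox-side knob I mean: the per-VM NIC options in `qm set`, such as virtio multiqueue (`queues=`) and a per-NIC MTU. The VMID and MAC below are placeholders, not my actual config:

```shell
# Placeholder VMID (100) and MAC -- substitute the real values.
# queues= enables virtio-net multiqueue (commonly matched to vCPU count);
# mtu=9000 keeps the jumbo-frame setting consistent end to end.
qm set 100 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,queues=4,mtu=9000
```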
Any help in solving this would be greatly appreciated! I can, of course, provide additional information if that would help.