[SOLVED] Tuning

rogueangel2k

New Member
Dec 18, 2023
Good afternoon. I have a production HA setup with shared iSCSI. I fear I've misconfigured things and am stuck with the poor performance I'm experiencing.

I have 5 HA hosts: 4 active with guests and #5 for voting only.
- Each has a dual-port 802.3ad LACP bond, 10 Gig, MTU 9000.
- iSCSI configured for HA, with 2x LVMs over the top; the spinning disks and SSDs sit below on the QNAP.
- Each has dual-port gigabit to 2x HP 5120s for redundant host communication traffic. Each host's gigabit ports are set to "balance-alb".
- 256 Gigs of RAM each, with a mix of EPYC procs on the 4 active hosts and a Xeon with 16 Gigs of RAM on the voting member.
- Because of the mix of EPYCs (7542, 7351, 7302, and a 7401), each guest is set to x86-64-v2-AES as the most compatible setup for migration. I tried other CPU types, but when I migrated a guest it would reset, so I settled on that CPU config for the guests.
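For reference, the CLI equivalent of that CPU setting would be roughly the following (the VMID is just a placeholder):

# set the guest's CPU model to the x86-64-v2-AES baseline (100 is an example VMID)
qm set 100 --cpu x86-64-v2-AES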

QNAP TS-h1886XU-RP-R2-D1622-32G iSCSI
- Spinning LUN - SATA
- SSD LUN - SATA
- Dual-port 802.3ad LACP bond, 10 Gig, MTU 9000
- 128 Gigs of RAM with a 4-core, 8-thread Xeon

Netgear XS724EM 24-Port 10 Gig Switch
- Each machine's ports are configured for 802.3ad LACP, which the switch's documentation states it supports.
- Storage traffic and migration traffic only.
- Max MTU 9216

2x HP 5120 48 port gigabit switches for guests and redundant host communication.
- Guest traffic
- Host HA communication

I get poor performance even on the SSD side, which only has 6 guests at the moment (4 domain controllers and 2 database apps), and even poorer performance on the spinners. Either I've configured something wrong, the tuning is off, or I've hit my limit and won't get anything more out of it. Throwing cores at the guests doesn't help much; it's the I/O that's eating me up, I think.

I guess my question is: can I get more out of this configuration, or do I need to do something different, which just isn't in the budget right now? This is what I had to work with, so I made lemonade to get HA and some redundancy.

I've read that multipath is ideal, but everything I've found only discusses it as a redundancy option, not a performance one. I know LACP isn't true 20 Gig, but I thought I'd be seeing more out of this setup... again, unless I missed something.

Please be kind. I need advice. Thank you.
 
Hi @rogueangel2k ,

It is really hard to assess the situation without concrete numbers; "poor performance" is not a sufficient descriptor.

If you suspect that storage I/O is your bottleneck, you need to measure it independently of any VM traffic. Ideally, you'd do it on an idle system; perhaps you can migrate the VMs off one of your hosts for a test. Your first goal is to measure and understand performance between your hypervisor and the storage, before adding the virtualization latency on top.

One tool you can use is "fio". You can find some examples here: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/index.html#description.
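As a rough starting point, a latency run and an IOPS run against an unused test LV on the iSCSI-backed volume group could look something like this (the device path, job names, and runtimes are placeholders; do not point fio at a device that holds live data):

# queue depth 1, 4k random read: measures round-trip latency to the storage
fio --name=lat-qd1 --filename=/dev/<vg>/<test-lv> --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=1 --time_based --runtime=60

# higher queue depth, 4k random read: measures achievable IOPS
fio --name=iops-qd32 --filename=/dev/<vg>/<test-lv> --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=60 --group_reporting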

I would not spend much time testing the spinning rust. You should concentrate on your SSDs, although their performance also depends on whether they are enterprise or consumer grade.

LACP will not give you 20 Gig to a single host. The default balancing is a MAC XOR hash, and you only have one IP/MAC pair on each side, so all traffic between two endpoints lands on a single 10 Gig link.
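If you want to check what the bond is actually doing, the negotiated mode and transmit hash policy are visible on the host (bond0 is a placeholder for your bond name); layer3+4 hashing can spread multiple iSCSI sessions across the links, but any single session still tops out at one 10 Gig port:

# show the bonding mode and transmit hash policy currently in use
grep -iE 'mode|hash' /proc/net/bonding/bond0

# in /etc/network/interfaces the policy is set on the bond, e.g.:
#   bond-mode 802.3ad
#   bond-xmit-hash-policy layer3+4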

If you can get some data, perhaps there is more to do.

Do take a look at the tips in the above article.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I appreciate that. I did get iperf3 installed on the QNAP and hosts. Host to host shows 10 Gig... well, about 9.6. Host to QNAP is in the same range. I've never used fio, so I'll attempt something as soon as I can. Thank you, bbgeek17.
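(For reference, a basic test of that kind looks roughly like this; the IP is a placeholder.)

# on the target (QNAP or another host)
iperf3 -s

# on the client host: 30-second run with 4 parallel streams
iperf3 -c 10.0.0.10 -P 4 -t 30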
 
iperf is only testing the throughput of the network. If you are having performance issues below your network capabilities, your disks are the cause. You have 6 guests with enterprise loads on an SSD LUN - so what are the performance characteristics of that SSD, and do they match? You are using a QNAP (SOHO-grade) NAS; what exactly are its specs, and does it even have a CPU that can move data over iSCSI at 10 Gbps? Typically Drobo/QNAP etc. cheap out on an embedded-grade 2- or 4-core CPU and will sell that same model for a good decade. How many disks are you spreading the load over? 6 VMs on a single (or mirrored) SSD is pretty much the maximum for your average datacenter-grade QLC.
 
SSDs are 6x WD Red SA500 2.5-inch 4TB. QNAP's documentation speaks of creating a LUN for each core of the QNAP's processor. It is a 4-core, 8-thread proc, and I only have 1 LUN for the SSDs. Given how my cluster is set up right now, I do not know if I could create more LUNs, and I wouldn't know how to integrate them into the cluster while it is running without bringing everything offline in the process.
 
If it’s using a core per LUN, that tells me their iSCSI stack is likely single-threaded, and thus you are only using 1 core. From the benchmarks I’ve seen, you can get close to 10 Gbps throughput if you use multiple LUNs and spread the load evenly. You can create additional LUNs, expose them to Proxmox, and then migrate the disks while the system is online, provided the QNAP software can create LUNs without breaking anything. I don’t know whether the WD Reds are up to it either; their benchmarks are abysmal, with spinning-disk-like latency when they are under load.
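On the Proxmox side, the rough shape of it would be something like the following (portal IP, IQN, device path, VG/storage names, and VMID are all placeholders; the LUN itself is carved out in the QNAP UI, and every node in the cluster needs to see it, either via its own iSCSI login or by adding the target as a Proxmox iSCSI storage as with the existing LUN):

# discover and log in to the new target/LUN
iscsiadm -m discovery -t sendtargets -p 10.0.0.10
iscsiadm -m node -T iqn.2004-04.com.qnap:example-target -p 10.0.0.10 --login

# put a shared LVM volume group on the new LUN (check the device name first!)
pvcreate /dev/sdX
vgcreate vg_ssd2 /dev/sdX

# register it as shared LVM storage in Proxmox
pvesm add lvm qnap-ssd2 --vgname vg_ssd2 --shared 1 --content images

# move a guest's disk to the new storage while the VM is running
qm move_disk 100 scsi0 qnap-ssd2 --delete 1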
 
I'll have a small opportunity tomorrow and a larger one this weekend to attempt more LUNs. I can shut the stack down beforehand and keep the HA guests offline while I bring the cluster back online. When you say migrate the disks to a different LUN, wouldn't Proxmox handle that automatically? Will I have to manually spread the load and keep an eye on it myself?

I've never done this, so I'm hoping someone has experience and can point me in the right direction. Creating a new LUN would require an entirely new volume, no? I don't have any new volumes created, nor really any unprovisioned space in which to make a new volume. Please advise.

Thank you in advance. Your help means a lot to me.
 
Ok, so this is weird. When I set up the cluster nearly a year ago, system performance was meh. I had all traffic on spinners with an SSD cache. On each guest I installed the Red Hat VirtIO drivers. Performance was abysmal. So I went through every guest and disabled all offloading in the driver. Voila. Performance.

A couple of months later I brought an SSD LUN online and moved my more demanding guests over to it. Again... bam! Performance, on those guests.

Then I lost 2 RAM sticks out of 4 on the storage device. I triaged and troubleshot the current problem from that standpoint. After replacing the failed RAM, performance did not return.

I scratched my head, then came and posted here. As always, multiple minds together can trigger something.

I troubleshot from the guests' perspective. I went around to all the guests, updated the VirtIO drivers, and ensured each and every one still had checksum offloading disabled. They did, but one guest had a problem: I could copy a file to \\server\share, but it was going at something like 25 megabit. I didn't think one guest could cause trouble for the whole stack. Maybe it does, maybe it doesn't, but here's what I did.

On that Windows Server 2016 guest, I enabled a 2nd NIC as the E1000 device and moved my network traffic to it. Removed the VirtIO NIC driver. Rebooted. Reinstalled the VirtIO driver and QEMU guest packages. Rebooted. Moved my network traffic back to the VirtIO NIC with checksum offloading disabled. Removed the E1000 device.

Performance has returned to the entire stack. I simply do not understand it, but everything is back to normal. No restarting of the hosts. No restarting of the storage. Just simple guest troubleshooting of a problem that had caused the whole stack to lose its performance.

Odd. Thank you all for reaching out.
 