As the title says, we have one server where the GUI will hang / become unreachable by the cluster manager under certain specific workloads.
Server topology: this server has two Gen4 NVMe ports that are run by the chipset, and the Proxmox OS is installed on those. Since the chipset is a bit short on bandwidth for running VMs, we have an additional four Gen5 NVMe drives connected to the CPU in RAID10 that run all the VMs. These are no issue, and their I/O delay hovers around 0.
Everything here is ZFS.
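For context, this is roughly how we'd show which pool sits on which drives; the pool names rpool and vmdata below are placeholders for whatever the actual layout uses:

zpool list -v        # lists every pool with the vdevs/disks underneath it
zpool status rpool   # assumed name for the pool on the two chipset Gen4 NVMe drives (host OS + PBS VM)
zpool status vmdata  # assumed name for the RAID10 (striped mirrors) pool on the four CPU-attached Gen5 NVMe drives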
Since the two Gen4 NVMe drives on the chipset ports are 4 TB in size and hold nothing but the host OS, we have a PBS VM running on them. While I know that's technically not the best way to do it, this PBS instance only backs up the 25 VMs on this host, so it's not seeing a ton of activity. But this is what's causing us the issue :/ Every time we send a backup job towards that PBS instance, the host's I/O delay goes up to around 3-4%, hovers there a bit, and then the GUI loses communication. The graphs have empty spots and, in some instances, some of the VMs hang. This is at only 3-4% I/O delay! I can't say with certainty that the VMs are hanging because of this, but the GUI freezing on a clustered host is creepy enough.
Here's what we've tried:
We've set the bandwidth limit on the PBS VM's disk to 500 IOPS. No joy. Frankly, it doesn't even seem to obey this limit, because the backups keep heading towards it at something like 300 MB/s. The ashift is at the default of 12.
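For reference, roughly how that cap gets applied; the VMID 105, disk slot scsi0, and volume name are placeholders, not our exact config. One possible wrinkle: an IOPS cap only limits requests per second, so large sequential requests can still push hundreds of MB/s unless mbps caps are set as well.

qm config 105 | grep scsi0   # check the current drive options on the PBS VM (VMID assumed)
# qm set replaces the whole scsi0 string, so re-specify any existing drive options you want to keep;
# the mbps values here are just example numbers
qm set 105 --scsi0 local-zfs:vm-105-disk-0,iops_rd=500,iops_wr=500,mbps_rd=100,mbps_wr=100
zpool get ashift rpool       # confirms the pool ashift (ours reports the default of 12; pool name assumed)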
It has 8 vCPUs, and we've set cpulimit to 2 and cpuunits to 20 (this is the only VM where we've configured CPU limit parameters).
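Roughly what that looks like (same placeholder VMID):

qm set 105 --cpulimit 2 --cpuunits 20   # cap the VM to ~2 cores' worth of CPU time and give it a very low scheduling weight relative to the other VMs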
We've set the network adapter for the PBS VM to a rate limit of 50 MB/s.
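And the NIC cap; the bridge and NIC model are placeholders for whatever the VM actually uses:

qm set 105 --net0 virtio,bridge=vmbr0,rate=50   # rate= is in MB/s on the PVE side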
We've deployed a Windows VM on the chipset drives and ran stress tests, which drive the host's I/O delay to 15% or 20% without any hangs. So why does the GUI hang when a backup job is sent to the PBS VM and the I/O delay only goes to 3-4%? Are there logs that would tell me this? This is a Gen4 Xeon server with 28 cores / 56 threads. It doesn't seem to have any issues with anything except this PBS instance.
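In case it helps, the only places we know to look so far are the standard PVE service logs and live pool stats during a backup run; this is just where we'd start, not a diagnosis:

journalctl -u pveproxy -u pvedaemon -u pvestatd -u pve-cluster -u corosync --since "-2h"
# pvestatd feeds the GUI graphs, so gaps there would hopefully line up with errors or timeouts in its log
dmesg -T | grep -iE "nvme|timeout|hung"   # look for NVMe resets or hung-task warnings on the host
zpool iostat -vl 1                        # per-vdev throughput and latency, watched live while a backup job runs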
pve version: pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.8.12-5-pve)