Disk-IO and CPU-Load after Upgrade

chrwa

Nov 28, 2019
On a 7-node hyperconverged cluster I upgraded from 7.0-2 to 7.1-1, and for the first time I noticed heavy disk IO and CPU load on most VMs while the upgraded node was rebooting. This caused them to freeze temporarily (for about 4 minutes). An additional problem is that some of the guests don't work properly after live migration (I have to stop them before I can start them again). I have known about this issue for much longer, and every time I hoped it would be fixed with the next upgrade.

During the reboot I set the global Ceph flags nobackfill, noout, nodown and norebalance.
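Setting and clearing these flags around a node reboot can be sketched in Python. This is a minimal sketch, assuming the standard `ceph osd set <flag>` / `ceph osd unset <flag>` CLI is available on the node; the helper names `flag_commands` and `apply` are hypothetical, and the script only prints the commands unless `dry_run=False` is passed.

```python
import subprocess

# The global OSD flags named in the post, in the same order:
#   nobackfill  - backfill of placement groups is paused
#   noout       - down OSDs are not marked out, so data is not remapped away
#   nodown      - monitors ignore failure reports and do not mark OSDs down
#   norebalance - rebalancing of placement groups is paused
MAINTENANCE_FLAGS = ["nobackfill", "noout", "nodown", "norebalance"]


def flag_commands(action, flags=MAINTENANCE_FLAGS):
    """Build the `ceph osd set|unset <flag>` command lines (hypothetical helper)."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply(action, dry_run=True):
    """Print each command; only run it via subprocess when dry_run is False."""
    cmds = flag_commands(action)
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the ceph CLI reports an error
    return cmds


if __name__ == "__main__":
    apply("set")    # dry run: what would be executed before the reboot
    apply("unset")  # dry run: what would be executed once the node is back
```

Calling `apply("set", dry_run=False)` before the reboot and `apply("unset", dry_run=False)` once the node's OSDs are back up mirrors the procedure described above. Clearing the flags afterwards matters, since otherwise they stay in effect for the whole cluster.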
 
There seem to be some issues on Proxmox 7 when there is heavy IO on storage. Our cluster with multiple NFS storages locks up VMs whenever a VM on the same storage generates heavy IO. We never had this issue on Proxmox 6.
 
Perhaps another issue points in the same direction. Since I upgraded to 7.x, the ZFS pools are very often degraded with multiple read and write errors. After clearing and resilvering, the pools work fine until another disk fails. I have replaced some of the "faulty" disks and ran a surface scan and some other tests on them, which confirmed my suspicion that the disks were all okay. After I updated the BIOS and the controller firmware, the errors became a little less frequent, but they are still there.
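To check whether the read and write errors keep landing on the same disks or are spread across everything behind one controller, the per-device counters from `zpool status` can be filtered. This is a minimal sketch, assuming the usual `NAME STATE READ WRITE CKSUM` table layout; `devices_with_errors` is a hypothetical helper, and the counters are compared as strings rather than parsed as numbers so that abbreviated values are still reported.

```python
def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for every pool, vdev or disk
    in `zpool status` output whose READ, WRITE or CKSUM counter is not 0."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # start of a pool's device table
            continue
        if not in_config:
            continue
        if not fields:
            in_config = False  # a blank line ends the device table
            continue
        if len(fields) >= 5:  # skips section lines such as "cache" or spares
            name, state, read, write, cksum = fields[:5]
            if any(count != "0" for count in (read, write, cksum)):
                rows.append((name, state, read, write, cksum))
    return rows


# Usage sketch (requires ZFS on the host):
#   import subprocess
#   out = subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout
#   for row in devices_with_errors(out):
#       print(*row)
```

If the listed devices change from one incident to the next while the disks themselves test fine, that would be consistent with the suspicion above that the disks are not the cause.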