Hi,
we are running a fairly large Proxmox cluster here (23 nodes) and recently updated all nodes to Proxmox 8.0 following the official documentation. Since then we have been having severe issues with all our QEMU-driven Windows Server 2019 VMs, which we use to provide users with a connection-broker-driven RDP terminal-server cluster. All other VMs (Linux, Windows 10, Windows XP, etc.) run smoothly.
The issue we are seeing is that, all of a sudden, only the Windows Server 2019 VMs end up at 100% CPU usage (the CPU usage jumps up and down constantly, but especially as more and more users log on to a given terminal-server VM). The same is visible in the resource statistics of the Proxmox host itself: at some point the kvm process starts to consume almost all free CPU time on the host, until the VM becomes unresponsive, even on the network. Running a constant ICMP ping against such a Windows 2019 VM, we normally see round-trip times below 1 ms; as soon as the VM's CPU usage goes wild, the ping times rise to 80-100 seconds, or for periods of 20-30 seconds we get no reply at all (the VM is unresponsive). When this happens, even on the Proxmox console the mouse pointer can no longer be moved, so the VM stalls completely. After some time (20-30 seconds, sometimes minutes) the VM returns to almost normal, until the same thing happens again after a while.
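For anyone who wants to observe this themselves, a timestamped ping from another machine makes both the huge round-trip times and the reply gaps visible (the address below is a placeholder for the affected VM):

```shell
# Continuous ping against an affected Windows Server 2019 VM
# (192.0.2.10 is a placeholder address; substitute your VM's IP).
# -D prefixes each line with a Unix timestamp,
# -O reports each request that received no reply in time,
# so both the multi-second RTTs and the 20-30 s gaps show up in the log.
ping -D -O 192.0.2.10
```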
After some more investigation into the matter we found:
- We see the same issue on all our Windows Server 2019 VMs, on different underlying Proxmox host hardware.
- We can easily reproduce the issue using the JetStream2 browser benchmark suite under Google Chrome (https://browserbench.org/JetStream/). Most of the time, as soon as the "crypto-sha" test starts running, CPU usage jumps to 100% and the issue starts to manifest, until the whole VM suddenly stalls and becomes unresponsive.
- Changing the processor/CPU type does not solve the issue (not even 'host').
- Disabling memory ballooning does not solve the issue.
- Changing the machine type (i440fx vs. q35) or version does not solve the issue.
- Updating the virtio tools, or trying older versions, does not solve the issue.
- Performing a fresh Windows Server 2019 installation on an affected Proxmox 8 host runs into the same 100% CPU usage problem during the installation process itself.
- The affected VMs run smoothly under the old Proxmox 7 environment, which we verified by downgrading one node to Proxmox 7.
We have therefore used the proxmox-boot-tool command to pin the hosts running Windows Server 2019 VMs to kernel 5.15 for the time being. So while the other Proxmox 8 hosts in our cluster run smoothly on kernel 6.2 (because they only host non-Windows-2019 VMs), we keep the affected nodes on kernel 5.15 until the issue is understood and hopefully fixed.

Nevertheless, I am very curious whether others running Windows Server 2019 VMs are seeing the same issue or can reproduce it. Furthermore, it would be nice if someone from the Proxmox staff could assist in further investigation, simply because the issue seems to be kernel 6.2 related rather than QEMU or Debian Bookworm related: simply booting into kernel 5.15 appears to solve it completely, without having to downgrade to Proxmox 7.
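For anyone wanting to do the same, this is roughly how we pinned the kernel (the 5.15 version string below is only an example; use one from the list on your own host):

```shell
# Show the kernels installed on this node
proxmox-boot-tool kernel list

# Pin a 5.15 kernel so the node keeps booting it across updates
# (version string is an example taken from the list above)
proxmox-boot-tool kernel pin 5.15.108-1-pve

# Once the issue is fixed, remove the pin again
proxmox-boot-tool kernel unpin
```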
Any help would be highly appreciated. If I should provide more technical details or run further tests, please let me know; we have a perfectly reproducible case here. All I need to do is reboot into kernel 6.2 and the issue manifests itself immediately.