The cluster consists of several compute nodes and an NFS server (also based on PVE) used as shared storage. The disk format of the VM is qcow2.
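For reference, the NFS storage is defined roughly like this in /etc/pve/storage.cfg (the storage name, server address, export path and mount options below are placeholders, not the exact values):

nfs: vm-storage
        export /export/vmstore
        path /mnt/pve/vm-storage
        server 10.0.0.10
        content images
        options vers=4.2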
Symptoms:
Under certain disk loads within the VM, it gradually starts to hang. In the web interface, the "Summary" tab and the console display with a significant delay, eventually stopping completely with "Connection Error/Timeout." A hard stop works, but with a long delay.
Opening the storage tab on this node also results in "Connection Error/Timeout." On all other nodes in the cluster the storage is accessible, and there are no issues with other VMs or with the NFS server itself; everything is healthy and operational. Restarting the node was the only solution.
After this, I migrated the VM to an empty node. Both nodes and the NFS server were originally on kernel 6.8.12-5; I updated the empty node and the NFS server to 6.8.12-7 beforehand. The behavior remained exactly the same. Changing VM settings (CPU type, disk controller, disk type, essentially trying every possible combination) did not affect the situation.
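To give an idea of what was tried, the test VM currently looks roughly like this (qm config output; the VMID, MAC, disk size and storage name are placeholders, and the CPU/controller values shown are just one of the combinations tested):

agent: 1
balloon: 0
boot: order=virtio0
cores: 2
cpu: host
memory: 4096
name: win7-test
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
ostype: win7
scsihw: virtio-scsi-pci
virtio0: vm-storage:100/vm-100-disk-0.qcow2,size=64G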
Since the VM is quite old, I deployed a test VM with a completely clean Windows 7 installation and installed all missing virtio drivers. This did not have a positive effect.
The issue is consistently and quickly triggered by running disk defragmentation inside the guest OS. If the VM is simply left running, then depending on which applications are active inside it, it partially or completely hangs (as does the node) within 5-10 minutes.
Previously, the cluster did not have any Windows 7 VMs. There are Windows 8 VMs, ancient CentOS 6, and even Ubuntu 12, none of which exhibit similar issues.
For testing, I created local storage on the compute node’s boot disk and migrated the VM there. In this case, the issue does NOT occur.
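Roughly what I did for that test (the storage name, path and VMID are placeholders):

# create a directory storage on the node's boot disk
pvesm add dir local-test --path /var/lib/vz/local-test --content images
# move the VM's disk from the NFS storage to the local one, keeping qcow2
qm move-disk 100 virtio0 local-test --format qcow2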
The dmesg log from the compute node after the hang occurred is attached (there is absolutely nothing interesting in the dmesg logs of the NFS server). The node and the NFS server were rebooted beforehand.
If needed, I can keep the NFS server and the node empty for some time for further experiments. But I need guidance on what to look for and where to investigate. =)
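In case it helps, this is what I was planning to collect on the node during the next reproduction (just my guess at the useful data points; happy to gather anything else instead):

# follow kernel messages live for hung-task / NFS timeout reports
dmesg -wT
# list processes stuck in uninterruptible sleep (D state) and what they wait on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# NFS client RPC statistics and retransmission counters
nfsstat -c
# actual mount options negotiated for the NFS storage
grep nfs /proc/mounts
# dump blocked-task stack traces into dmesg (requires sysrq to be enabled)
echo w > /proc/sysrq-trigger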