Hi,
I just got hit by an unavailable (unreachable) Windows Server VM. I looked at journalctl -e and saw qmp timeout errors (a lot of them, over quite some time). I read some threads about this, and for the record: I use no Backup Server, no Ceph, no cluster, just a single Proxmox node (pve-manager/7.4-17).
I have the impression that the errors correlate well with the I/O load of the Windows VM.
Let me describe it in detail. Here is the end of the journal:
Code:
Jan 14 01:10:05 pve-3-4 pvestatd[4502]: VM 214 qmp command failed - VM 214 qmp command 'query-proxmox-support' failed - unable to connect to VM 214 qmp socket - timeout after 51 retries
Jan 14 01:10:05 pve-3-4 pvestatd[4502]: status update time (8.240 seconds)
Jan 14 01:10:09 pve-3-4 pvedaemon[1583106]: VM 214 qmp command failed - VM 214 qmp command 'query-proxmox-support' failed - unable to connect to VM 214 qmp socket - timeout after 51 retries
Jan 14 01:10:15 pve-3-4 pvestatd[4502]: VM 214 qmp command failed - VM 214 qmp command 'query-proxmox-support' failed - unable to connect to VM 214 qmp socket - timeout after 51 retries
Jan 14 01:10:15 pve-3-4 pvestatd[4502]: status update time (8.226 seconds)
Jan 14 01:10:16 pve-3-4 pvedaemon[249561]: VM 214 qmp command failed - VM 214 qmp command 'guest-ping' failed - got timeout
Jan 14 01:10:23 pve-3-4 pvestatd[4502]: status update time (5.304 seconds)
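In case someone wants to check for the same pattern: this is roughly how I pull these messages out of the journal (assuming default journald logging; the service names are the ones from the log above):
Code:
journalctl --since today -u pvestatd -u pvedaemon | grep 'qmp command'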
I noticed that the problems started when a source server (bare metal) began pushing big backup files to this VM as the destination. According to robocopy's statistics output, the copy was very slow even before it stopped working entirely.
I reset the VM via the PVE GUI and checked that it looked good. I ran some simple tests, such as using 7-Zip to store a few GB of data, saw several hundred MB/s, and thought all would be fine.
Then I restarted the backup push - and seconds later the VM was unreachable again, with the same error messages as above.
I noticed that a scrub was running on the pool and paused it:
Code:
Jan 14 01:11:23 pve-3-4 zed[733763]: eid=1356 class=scrub_paused pool='dpool'
With that, the VM became responsive again (without resetting it).
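For completeness, pausing and later resuming the scrub is just (dpool being my pool, as in the zed line above):
Code:
zpool scrub -p dpool   # pause the running scrub
zpool scrub dpool      # resume it later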
Then I restarted the backup push and got the impression that the speed was not constant (judging only from the robocopy output, which is far from a real measurement, but the transfer times for different 1 GB files seemed to vary a lot).
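To get a slightly better picture than the robocopy output, one could watch the pool while the push runs; just a generic sketch, nothing specific to my setup:
Code:
zpool iostat -v dpool 5   # print per-vdev throughput every 5 seconds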
I copied the same data to a Linux SMB container on the same host and the same ZFS pool, and it was fast and caused no issues, at a constant high speed. (Today I copied several terabytes from the same bare-metal Windows server into that container, on the same Proxmox host and the same ZFS pool, without any issues.)
I let it push to the Windows Server again and noticed that every now and then the VM became unresponsive for some seconds, and I also got a very few qmp command timeouts.
So I pushed a different file set, which mostly contains small files (a few KB each), with robocopy ...copyall+sec... /J /MT. This is much slower (well, Windows), but the VM stayed responsive. Every now and then some bigger files were copied, and I think this correlated well with the (now rare, a few per hour) qmp command timeouts. Whenever RDP reconnected (it makes a sound, so I noticed it), I checked, and robocopy was always working on larger files at that moment.
So apparently, the higher the (Windows) VM I/O (write?) load, the more (or longer) the qmp timeout errors, and the more likely (and longer) the VM becomes unresponsive.
Load on the (Linux) container, in contrast, does not seem to cause such issues (I don't see any qmp timeouts in the hours when I pushed to the Linux CT instead of the Windows VM).
Maybe QEMU's response times simply get very long under long I/O waits, or something like that.
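If someone wants to reproduce the correlation, a crude check I can think of (assuming the qemu-guest-agent is installed in the VM; VM ID 214 as in my logs) would be to time the agent ping on the host while the push runs:
Code:
# print a timestamp and time the guest-agent ping every 10 seconds
while true; do
    date
    time qm guest cmd 214 ping
    sleep 10
done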
I read about similar issues in other threads, so let me share my observations here in case they help somehow.