ERROR: VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout

These timeouts occur on my cluster as well, roughly once per month: qmp command 'guest-fsfreeze-thaw' failed - got timeout, followed by qmp command 'backup' failed - got timeout. Initially I restarted all affected VMs, but there seems to be a workaround; apparently it's possible to un-freeze the VMs like this:
Code:
qm agent <vmid> fsfreeze-status
It answers with "thawed", and processes stuck on I/O requests continue immediately. (fsfreeze-thaw might work as well; I haven't tried it so far.)
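To apply this check across a whole node during such an incident, it can be scripted — a sketch, assuming a PVE node shell where `qm` is available, and relying on the observation above that querying `fsfreeze-status` (or issuing a thaw when the status is not "thawed") un-sticks the guest:

```shell
#!/bin/sh
# Decide whether a VM needs an explicit thaw, based on the
# fsfreeze-status output ("thawed" means nothing to do).
needs_thaw() {
    [ "$1" != "thawed" ]
}

# Walk all running VMs on this node and thaw any that report
# a non-thawed filesystem (requires a working guest agent).
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    status=$(qm agent "$vmid" fsfreeze-status 2>/dev/null)
    if needs_thaw "$status"; then
        echo "VM $vmid reports '$status', thawing"
        qm agent "$vmid" fsfreeze-thaw
    fi
done
```

This would have to be run on each node of the cluster, since `qm` only sees local guests.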
In another thread (https://forum.proxmox.com/threads/q...ze-thaw-failed-got-timeout.77195/#post-344117) someone claimed "there is a bug in Qemu Guest Agent or in kernel", which could very well be the case; however, does anyone know why both qmp commands appear to time out at the same time? I took a quick look at the source, but it's not clear to me.
 
If it happens once a month, it might be an issue with the storage backend, e.g. scrubbing causing bad I/O. Bad I/O can also make PBS backups problematic.
 
Indeed, it happens while the PBS storage is scrubbing; however, the backup itself runs on a different node with independent storage. So the timeout of the backup command is perhaps understandable, but thaw failing afterwards is not expected.

I tried to reproduce this by starting the backup manually while hacking the source (/usr/share/perl5/PVE/QMPClient.pm) to not send the backup command, in order to force a timeout. The backup command times out as expected, but now thaw works; I'm not sure what the difference is. I also added some debugging output, so maybe I have to wait until next month. :/

(If someone else is digging deeper here and is confused that the timing changed: The timeout for the backup command was recently increased from 60s to 125s: https://git.proxmox.com/?p=qemu-ser...pm;h=46b676c0b127028d057f82c47b18df830fa26a49)

What do you mean by "Bad I/O can also cause PBS backups to be problematic."?
 
Hi,

we have the same behavior on multiple PVE servers, versions 6.4 and 7.x.
Backups are configured partially on NAS devices and partially on PBS 1.x.

We think the error affects only Windows VMs (Windows Server 2012-2019).

Log from one Backup

Code:
root@srv-vm11:/var/log/vzdump# cat qemu-102.log
2021-09-14 22:13:59 INFO: Starting Backup of VM 102 (qemu)
2021-09-14 22:13:59 INFO: status = running
2021-09-14 22:13:59 INFO: VM Name: winserver
2021-09-14 22:13:59 INFO: include disk 'scsi0' 'zfs01:vm-102-disk-0' 300G
2021-09-14 22:13:59 INFO: backup mode: snapshot
2021-09-14 22:13:59 INFO: ionice priority: 7
2021-09-14 22:13:59 INFO: HOOK: backup-start snapshot 102
2021-09-14 22:13:59 INFO: HOOK-ENV: vmtype=qemu;dumpdir=/mnt/pve/nas/dump;storeid=nas;hostname=winserver;target=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.vma.zst;logfile=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.log
2021-09-14 22:13:59 INFO: HOOK: pre-stop snapshot 102
2021-09-14 22:13:59 INFO: HOOK-ENV: vmtype=qemu;dumpdir=/mnt/pve/nas/dump;storeid=nas;hostname=winserver;target=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.vma.zst;logfile=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.log
2021-09-14 22:13:59 INFO: HOOK: pre-restart snapshot 102
2021-09-14 22:13:59 INFO: HOOK-ENV: vmtype=qemu;dumpdir=/mnt/pve/nas/dump;storeid=nas;hostname=winserver;target=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.vma.zst;logfile=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.log
2021-09-14 22:13:59 INFO: HOOK: post-restart snapshot 102
2021-09-14 22:13:59 INFO: HOOK-ENV: vmtype=qemu;dumpdir=/mnt/pve/nas/dump;storeid=nas;hostname=winserver;target=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.vma.zst;logfile=/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.log
2021-09-14 22:13:59 INFO: creating vzdump archive '/mnt/pve/nas/dump/vzdump-qemu-102-2021_09_14-22_13_59.vma.zst'
2021-09-14 22:13:59 INFO: issuing guest-agent 'fs-freeze' command
2021-09-14 22:14:29 INFO: issuing guest-agent 'fs-thaw' command
2021-09-14 22:14:39 ERROR: VM 102 qmp command 'guest-fsfreeze-thaw' failed - got timeout
2021-09-14 22:14:39 INFO: started backup task '2bc20f42-6962-479d-8c1f-e39b345da410'
2021-09-14 22:14:39 INFO: resuming VM again
 
Hi All

we are seeing the same behavior on both PVE 6.4 & 7.x
performing a manual snapshot works fine and doesn't crash the VM; only a PBS backup, which also performs a snapshot freeze, crashes the VM, and it's unrecoverable.

for the time being we have had to disable the QEMU Guest Agent option in PVE > VM > Options so that a backup can be taken without crashing the VM.
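As a side note, the same toggle is also available from the node's CLI — a sketch, with `<vmid>` as a placeholder for the VM in question; the change takes effect once the option is re-read, typically at the next VM start:

```shell
# Disable the QEMU guest agent option for this VM, so vzdump
# skips the fs-freeze/fs-thaw cycle during snapshot backups.
qm set <vmid> --agent 0

# Re-enable later, once the underlying issue is resolved:
# qm set <vmid> --agent 1
```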

has anyone had any success in resolving this issue?

We have opened a ticket with Proxmox support to investigate further.

At present we are only seeing this on a cPanel VM that has the following:

- CentOS 7.9
- CloudLinux
- KernelCare

installed.

All other VMs are backing up fine without any issues.

VM Config:

16 GB Ram
6 vCPU, 1 Socket
320 GB SSD Storage - VirtIO
Controller - VirtIO_SCSI
Nic - VirtIO

Cheers
G
 
You'll need to disable securetmp.

https://bugs.launchpad.net/qemu/+bug/1813045
 
Sorry, I'm not seeing "securetmp" mentioned in that linked bug report or in the bug report on the nested GitLab thread.

Can you please provide some clarity on this?

Ta

Securetmp enables /dev/loop mounts

You'll need to disable it if you want to run snapshots/backups with CloudLinux

https://support.cpanel.net/hc/en-us/articles/360058525333-How-to-disable-scripts-securetmp

This is what CloudLinux sent me

Hello,

The issue is not related to CloudLinux directly, but to the QEMU agent, which does not freeze the file system(s) correctly. What actually happens:

When a VM backup is invoked, the QEMU agent freezes the file systems, so no change will be made during the backup. But the QEMU agent does not respect the loop* devices in the freezing order (we have checked its sources), which leads to the following situation:
1) freeze loopback fs
---> send async reqs to loopback thread
2) freeze main fs
3) the loopback thread wakes up and tries to write data to the main fs, which is still frozen, and this finally leads to a hung task and a kernel crash.

I'm afraid we have no further recommendations at this point.

Thank you.
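Given that explanation, a quick way to check whether a guest is even exposed to this freeze-ordering problem is to look for loop-backed mounts (such as cPanel's securetmp /tmp) inside the VM — a sketch, to be run in the guest:

```shell
#!/bin/sh
# Print the mount points whose backing device is a loop device.
# If any show up, the guest is a candidate for the loopback
# freeze-ordering problem described above.
loop_mounts() {
    # expects "SOURCE TARGET" lines on stdin (findmnt -rn format)
    awk '$1 ~ /^\/dev\/loop/ {print $2}'
}

findmnt -rn -o SOURCE,TARGET | loop_mounts
```

No output would suggest the guest has no loop mounts and should not be affected by this particular bug.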
 
Ok awesome thank you.

I’ll look into this.

Silly question: have you tested this yourself as well?

Any side effects to be aware of?

Ta
 
Yes, I have tested on 10+ cPanel servers. No side effects apart from the lack of "securetmp".
Sweet that’s good info thank you.

Have you noticed an issue with temp space filling up due to no "securetmp", or is it not really a concern?

We see tmp space filling up even with it on by default, so I'm not sure what the benefit is meant to be.

Ta
 
hello, I am having the same issue with QEMU freezing on backups in snapshot mode.
I don't have CloudLinux, but AlmaLinux 8.5 and cPanel on my VM, and that's the one giving me an issue.
If I disable /scripts/securetmp, what issues can happen? I do not have the expertise to notice if there ever is an issue.
What can I expect by disabling this, and is it a security risk to disable it? I only host my own domains on this VM, just FYI.

Thanks for your help in advance. I want to try this and see if it is causing my issue here > https://forum.proxmox.com/threads/backup-scheduler-stop-every-day-the-vm.103539/page-2

Kind Regards,
Spiro
 
OK, I did exactly that: disabled /scripts/securetmp, and no more issues, as stated by @Brad22. Also, thank you Brad for this, I was going insane trying to figure this out.

Another temporary solution would be to disable the qemu-guest-agent in Proxmox; then it works as well.

I wish it would work without having to disable /scripts/securetmp, but for now this works.

I am Thankful for your post
Kind Regards,
Spiro
 
Hi all,

I hope I can get some help from all you smart people here.

I run a server with OVH hosting, and I am running CentOS 7 with cPanel on a VPS. I cannot use their auto backup or snapshot feature, as it causes the kernel to halt and freeze up when the backup is done. They said it is because of the cPanel script and the QEMU agent, which are not compatible. They provided some workarounds, but nothing worked, and I had to disable the backup feature and do all backups manually, which is a pain.

Here are a few responses from them:

"
Resolution comment:

Hello,

Our specialists verified and found that the issue is being caused by VirtFS in cPanel. You will need to disable it in order to prevent the issue from repeating.

https://docs.cpanel.net/knowledge-base/accounts/virtfs-jailed-shell/#:~:text=Disable the Use cPanel®,>> Manage Shell Access).

You can accept this solution by closing the ticket in your customer control panel."


AND...

" After reviewing what happened, I understand how this situation can be frustrating for you.

I've reviewed the case and also reviewed this feature of cPanel (VirtFS) to see if there are any known issues regarding it. Something important to note is that VirtFS is a paravirtualization technology used to secure apps that are installed on the machine, so in theory there is no risk of malware or something along those lines jumping from app to app to eventually take over the whole system.

In this use case of an OVHcloud VPS, this would be referred to as nested virtualization, as you are hosting apps on a type of virtualization technology and the VPS itself is on a QEMU-type hypervisor.

The issue is that VirtFS and QEMU are unaware of each other, and this is where your problem lies. When QEMU is performing a snapshot of the VPS, the filesystem must be locked, and all changes to the disk will be handled by the hypervisor, so when the snapshot of the VPS is complete, all the new data will get imported into the snapshot as if nothing happened. This is usually referred to as a hot snapshot or a live snapshot.

After checking multiple QEMU hypervisor documentations, VirtFS (also called virtio-9p) is not compatible with how QEMU performs its live snapshot by putting new data into RAM instead of going to the drive directly. So it appears that VirtFS isn't compatible.

Apologies for the long explanation, but I did want to make the issue very clear. With all that said, my main recommendation would be to no longer use our snapshot feature for the VPS. "



Does any of you have a fix for this or a workaround? I would really like to change hosting, but I am not sure this would fix the issue, as others are experiencing the same problem with other hosting providers.

Thank you!

Steve
 
Do you rent space out on the VPS you use with cPanel? And do you need to have jailed shell on? Is it just your own websites you use it for?

Is the QEMU guest agent enabled in Proxmox for your VM?

I have an issue with the QEMU agent enabled and snapshot backups in Proxmox with a cPanel VM. It freezes and can take over 1 hour to back up, or I even have to stop and unlock the VM and then reboot it. I found that turning off the QEMU guest agent solved that issue while using snapshot mode for scheduled or manual backups in Proxmox.

Try disabling the QEMU guest agent in Proxmox for your cPanel VM, reboot the VM, and see if you can turn on backups in cPanel.

Also:
Do you have WHM access? Are you the administrator of your VPS's WHM?
If so, have you tried going to Tweak Settings and disabling jailed shell?

I also have Proxmox on a server (not with OVH, but from another company) and have 2 VMs with WHM/cPanel installed. I do backups from the Proxmox server.

I have not turned on backups in cPanel, but I will try this, see if I get the same issue, and let you know.

I do not think I have jailed shell on, because I do not share my server with anyone. It's just my own websites.

I also use the CSF ConfigServer firewall that's installed in cPanel and use that as my firewall.

Let me try and see if I get the same error as you.
Regards,
SPIRO
 
Hi, I have a server with some clients on it; however, I don't give anyone shell access, as it's just a few websites. I have access to WHM, and I am also the admin of this server.
 
Hi there,
I just started having this same issue after updating my cluster to the most recent version. I am dumping backups to an NFS share, and it randomly happens with Windows VMs only.

Is anybody else having this same issue?

[backup-job]
INFO: Starting Backup of VM 120 (qemu)
INFO: Backup started at 2022-10-12 01:57:09
INFO: status = running
INFO: VM Name: fs-dc-001
INFO: include disk 'ide0' 'local-lvm:vm-120-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/zl-stg-001-proxmox-backup/dump/vzdump-qemu-120-2022_10_12-01_57_09.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 120 qmp command 'guest-fsfreeze-thaw' failed - got timeout
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 120 qmp command 'backup-cancel' failed - unable to connect to VM 120 qmp socket - timeout after 5982 retries
INFO: resuming VM again
ERROR: Backup of VM 120 failed - VM 120 qmp command 'cont' failed - unable to connect to VM 120 qmp socket - timeout after 450 retries
INFO: Failed at 2022-10-12 02:11:26
INFO: Backup job finished with errors

TASK ERROR: job errors

[pve]
proxmox-ve: 7.2-1 (running kernel: 5.15.60-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-11
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 
Hi,
Using IDE for Windows is a bad idea.
Change to SCSI or at least SATA.
That should help...
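Roughly, that switch can be done from the CLI — a sketch for VM 120 from the log above (disk name and storage taken from its config); note that the VirtIO SCSI driver must already be installed inside Windows before the change, or the guest won't boot:

```shell
# Use the VirtIO SCSI controller for this VM.
qm set 120 --scsihw virtio-scsi-pci

# With the VM stopped, detach the disk from ide0 (it becomes an
# "unused" volume) and reattach the same volume as scsi0.
qm set 120 --delete ide0
qm set 120 --scsi0 local-lvm:vm-120-disk-0

# Make sure the VM still boots from the reassigned disk.
qm set 120 --boot order=scsi0
```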
 
