Backup crashes VM - 'guest-fsfreeze-freeze' failed - got timeout

Razva

Hello,

The fsfreeze-freeze command works as expected when executed from the CLI, but it crashes the VM when executed by the backup scheduler.

Here's some useful info.

Guest:
  • Debian 11
  • Linux server.transud.ro 5.10.0-11-cloud-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
  • qemu-guest-agent is already the newest version (1:5.2+dfsg-11+deb11u1)
Guest log:
Code:
Feb 10 09:29:31 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:17:38 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:17:50 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:18:01 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:18:12 server.transud.ro qemu-ga: info: guest-ping called
Feb 11 00:23:59 server.transud.ro qemu-ga: info: guest-ping called
Feb 11 00:23:59 server.transud.ro qemu-ga: info: guest-fsfreeze called
<---------------- VM stopped working and I reset it
Feb 11 08:42:23 server.transud.ro kernel: [    0.000000] Linux version 5.10.0-11-cloud-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.>
Feb 11 08:42:23 server.transud.ro kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-11-cloud-amd64 root=UUID=19d572a8-1cf6-4b9>
Feb 11 08:42:23 server.transud.ro kernel: [    0.000000] x86/fpu: x87 FPU will use FXSAVE

Host:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-4-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-9
pve-kernel-5.13: 7.1-7
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-4-pve: 5.13.19-8
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

Host log:
Code:
INFO: Starting Backup of VM 124 (qemu)
INFO: Backup started at 2022-02-11 00:23:59
INFO: status = running
INFO: VM Name: server.transud.ro
INFO: include disk 'scsi0' 'local-zfs:vm-124-disk-0' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/124/2022-02-10T22:23:59Z'
INFO: issuing guest-agent 'fs-freeze' command
ERROR: VM 124 qmp command 'guest-fsfreeze-freeze' failed - got timeout
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 124 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '9f7dde1a-6693-4fe0-bbf9-9cb87410c462'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO:  42% (34.1 GiB of 80.0 GiB) in 3s, read: 11.4 GiB/s, write: 278.7 MiB/s
INFO:  59% (47.6 GiB of 80.0 GiB) in 7s, read: 3.4 GiB/s, write: 0 B/s
INFO:  76% (61.4 GiB of 80.0 GiB) in 10s, read: 4.6 GiB/s, write: 0 B/s
INFO:  94% (75.6 GiB of 80.0 GiB) in 13s, read: 4.7 GiB/s, write: 0 B/s
INFO: 100% (80.0 GiB of 80.0 GiB) in 15s, read: 2.2 GiB/s, write: 0 B/s
INFO: backup is sparse: 76.36 GiB (95%) total zero data
INFO: backup was done incrementally, reused 79.18 GiB (98%)
INFO: transferred 80.00 GiB in 15 seconds (5.3 GiB/s)
INFO: Finished Backup of VM 124 (01:00:26)
INFO: Backup finished at 2022-02-11 01:24:25

Testing fsfreeze:
Code:
# qm guest cmd 124 fsfreeze-status
thawed
# qm guest cmd 124 fsfreeze-freeze
1
# qm guest cmd 124 fsfreeze-status
frozen
# qm guest cmd 124 fsfreeze-thaw
1
# qm guest cmd 124 fsfreeze-status
thawed

While executing the above commands I was logged into the VM, running a ping to google.com. The VM didn't hang or crash; everything worked as expected:
Code:
cncted@server:~$ ping google.com
PING google.com (216.58.212.174) 56(84) bytes of data.
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=1 ttl=60 time=5.49 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=2 ttl=60 time=5.61 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=3 ttl=60 time=5.55 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=4 ttl=60 time=5.57 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=5 ttl=60 time=5.63 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=6 ttl=60 time=5.57 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=7 ttl=60 time=5.57 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=8 ttl=60 time=5.62 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=9 ttl=60 time=5.56 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=10 ttl=60 time=5.62 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=11 ttl=60 time=5.58 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=12 ttl=60 time=5.65 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=13 ttl=60 time=5.55 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=14 ttl=60 time=5.61 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=15 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=16 ttl=60 time=5.59 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=17 ttl=60 time=5.63 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=18 ttl=60 time=5.58 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=19 ttl=60 time=5.58 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=20 ttl=60 time=5.59 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=21 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=22 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=23 ttl=60 time=5.59 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=24 ttl=60 time=5.64 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=25 ttl=60 time=5.60 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=26 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=27 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=28 ttl=60 time=5.58 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=29 ttl=60 time=5.64 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=30 ttl=60 time=5.57 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=31 ttl=60 time=5.58 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=32 ttl=60 time=5.59 ms
^C
--- google.com ping statistics ---
32 packets transmitted, 32 received, 0% packet loss, time 31037ms
rtt min/avg/max/mdev = 5.492/5.592/5.652/0.030 ms

Also I could ping the VM from outside without any packet loss.

Any hints? Thank you!
 
I have had to completely disable all backup jobs because they all hit a QMP timeout or an fs-freeze failure, halting the entire VM for several minutes, which is unacceptable.
 
If you find any solutions please let me know. For me the only solution was to disable the backups until I find a fix.
 
I've had to do the same, as a VM seems to have crashed unexpectedly yesterday:
"The computer has rebooted from a bugcheck. The bugcheck was: 0x000000d1 (0x0000000000000228, 0x0000000000000006, 0x0000000000000000, 0xfffff8008943f715). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: ae6788b6-b50d-4b33-8d79-3253e39c2481."

Which essentially results in this:
Code:
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 0000000000000228, memory referenced
Arg2: 0000000000000006, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff8008943f715, address which referenced memory


I noticed a lot of "storahci" errors in the Windows event log every time the backup runs. It has done so for a few months. This time it appears to have killed the driver and the VM.
 
I found yet another VM that behaves very similarly; the message is this:

Code:
INFO: Backup started at 2022-02-17 00:22:26
INFO: status = running
INFO: VM Name: server.plantemania.ro
INFO: include disk 'scsi0' 'local-zfs:vm-125-disk-0' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/125/2022-02-16T22:22:26Z'
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: started backup task '9d616c2b-739b-46df-803f-619b034b62ca'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (drive clean)
INFO: using fast incremental mode (dirty-bitmap), 0.0 B dirty of 80.0 GiB total
INFO: 100% (0.0 B of 0.0 B) in 1s, read: 0 B/s, write: 0 B/s
INFO: backup was done incrementally, reused 80.00 GiB (100%)
INFO: Finished Backup of VM 125 (00:00:05)
INFO: Backup finished at 2022-02-17 00:22:31
INFO: Backup job finished successfully

All these VMs are running Debian 11.
 
Hey,

I have the same issue, but at the moment only with a single Debian 11 VM out of many. The VM is running a basic web server (Nginx, MariaDB, PHP 7.4).
I already tried reinstalling qemu-guest-agent, ran filesystem checks, and tried qemu-guest-agent from bullseye-backports. I even updated my Proxmox to 7.1 (I already had the issue on 7.0).
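In case it helps anyone trying the same thing, the reinstall/backports steps were roughly the following (this is just what I did; it assumes bullseye-backports is already enabled in the guest's APT sources, so adjust to your setup):
Code:
# inside the Debian 11 guest
apt install --reinstall qemu-guest-agent
# or pull the newer build from backports
apt install -t bullseye-backports qemu-guest-agent
systemctl restart qemu-guest-agent
systemctl status qemu-guest-agent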

I found this issue here: https://gitlab.com/qemu-project/qemu/-/issues/520 but I don't use cPanel or virtfs.
Btw, I have the same issue when backing up to a Proxmox Backup Server and locally to the host itself.

It would be interesting to know what others run in the affected VMs, as I have no issues with a Debian 11 VM where Keycloak + MariaDB is running.

Guest:
Linux 5.10.0-11-cloud-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
Package: qemu-guest-agent
Version: 1:5.2+dfsg-11+deb11u1
 
All our Windows production VMs are affected :/
For the time being, just un-check "Use QEMU Guest Agent" in the VM Options and shut down/start the VM. It'll solve the issue, but note that it might generate corrupted backups, as the VM's filesystems will not be frozen. It's just a workaround, but it's better than nothing.
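Roughly the same workaround from the CLI, in case you have many VMs (VMID 124 is just an example; the change only takes effect after a full shutdown and start, not a reboot from inside the guest):
Code:
# on the host: disable the guest agent option for the VM
qm set 124 --agent 0
# power-cycle the VM so the new setting is applied
qm shutdown 124
qm start 124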
 
Negative. What I could find via an strace of qemu-guest-agent while freezing the FS from the host is the following:
The ioctl call for the FSFREEZE of "/" (https://github.com/qemu/qemu/blob/master/qga/commands-posix.c#L1684), at least in my case, never returns. As this is a syscall, I guess it's more a problem at the kernel level than in QEMU's agent.

But I haven't dug deeper yet, as I don't really have a plan for how to debug it at that level; the system just freezes and does not crash the kernel.
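If anyone wants to reproduce this without waiting for a backup job, this is roughly the procedure I used. VMID 124 and the mountpoint / are only examples, and freezing / on a live system is risky (the unfreeze may never get to run), so only try it on a VM you can afford to reset:
Code:
# inside the guest: trace the agent's syscalls
strace -f -T -p "$(pidof qemu-ga)"

# on the host: trigger the freeze/thaw through the agent
qm guest cmd 124 fsfreeze-freeze
qm guest cmd 124 fsfreeze-thaw

# inside the guest: exercise the freeze ioctl directly via util-linux,
# bypassing the agent (writes will block while / is frozen)
fsfreeze -f / && sleep 5 && fsfreeze -u /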
 
We have the same problem on multiple Debian 11 machines on different Proxmox servers!
Any ideas?
 
Nope. We disabled the QEMU Guest Agent on the guest VMs and we're making backups without it. Yes, the backups can be corrupted (as it doesn't freeze the filesystems), but it's better than nothing.
 
I'm getting the same issue on Windows and Linux VMs. The VM hangs at the end (usually at 100%) and the Proxmox Backup Server hangs. The problem with disabling the QEMU agent is that the VMs won't shut down automatically via the backup schedule in "stop" mode.

This is not conclusive, but it seems that when a backup to my Proxmox Backup Server fails, the second attempt will work after I unlock and stop the VM and restart my Proxmox Backup Server. Also, after getting a successful backup on the Proxmox Backup Server, a restore will eventually hang the restore process and also hang the Proxmox Backup Server.
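For reference, "unlock and stop the VM" is roughly the following on the host; <vmid> is a placeholder for your VM ID, and the lock is usually the stale backup lock left behind by the failed job:
Code:
# clear the stale backup lock, then stop the VM
qm unlock <vmid>
qm stop <vmid>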

Backing up to my NFS share on my other server seems to have no problems.

Here are the messages during a backup:

[screenshot attachment: error.jpeg]

Moving back to backups on my NFS share until I find a fix!
 
I am having the same problem as well. Thank you all for the insight on alternatives to get a backup.
 
