Backup crashes VM - 'guest-fsfreeze-freeze' failed - got timeout

Razva

Hello,

The fsfreeze-freeze command works as expected when executed from the CLI, but it crashes the VM when executed by the backup scheduler.

Here's some useful info.

Guest:
  • Debian 11
  • Linux server.transud.ro 5.10.0-11-cloud-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
  • qemu-guest-agent is already the newest version (1:5.2+dfsg-11+deb11u1)
Guest log:
Code:
Feb 10 09:29:31 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:17:38 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:17:50 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:18:01 server.transud.ro qemu-ga: info: guest-ping called
Feb 10 10:18:12 server.transud.ro qemu-ga: info: guest-ping called
Feb 11 00:23:59 server.transud.ro qemu-ga: info: guest-ping called
Feb 11 00:23:59 server.transud.ro qemu-ga: info: guest-fsfreeze called
<---------------- VM stopped working and I reset it
Feb 11 08:42:23 server.transud.ro kernel: [    0.000000] Linux version 5.10.0-11-cloud-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.>
Feb 11 08:42:23 server.transud.ro kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-11-cloud-amd64 root=UUID=19d572a8-1cf6-4b9>
Feb 11 08:42:23 server.transud.ro kernel: [    0.000000] x86/fpu: x87 FPU will use FXSAVE

Host:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-4-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-9
pve-kernel-5.13: 7.1-7
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-4-pve: 5.13.19-8
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

Host log:
Code:
INFO: Starting Backup of VM 124 (qemu)
INFO: Backup started at 2022-02-11 00:23:59
INFO: status = running
INFO: VM Name: server.transud.ro
INFO: include disk 'scsi0' 'local-zfs:vm-124-disk-0' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/124/2022-02-10T22:23:59Z'
INFO: issuing guest-agent 'fs-freeze' command
ERROR: VM 124 qmp command 'guest-fsfreeze-freeze' failed - got timeout
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 124 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '9f7dde1a-6693-4fe0-bbf9-9cb87410c462'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO:  42% (34.1 GiB of 80.0 GiB) in 3s, read: 11.4 GiB/s, write: 278.7 MiB/s
INFO:  59% (47.6 GiB of 80.0 GiB) in 7s, read: 3.4 GiB/s, write: 0 B/s
INFO:  76% (61.4 GiB of 80.0 GiB) in 10s, read: 4.6 GiB/s, write: 0 B/s
INFO:  94% (75.6 GiB of 80.0 GiB) in 13s, read: 4.7 GiB/s, write: 0 B/s
INFO: 100% (80.0 GiB of 80.0 GiB) in 15s, read: 2.2 GiB/s, write: 0 B/s
INFO: backup is sparse: 76.36 GiB (95%) total zero data
INFO: backup was done incrementally, reused 79.18 GiB (98%)
INFO: transferred 80.00 GiB in 15 seconds (5.3 GiB/s)
INFO: Finished Backup of VM 124 (01:00:26)
INFO: Backup finished at 2022-02-11 01:24:25

Testing fsfreeze:
Code:
# qm guest cmd 124 fsfreeze-status
thawed
# qm guest cmd 124 fsfreeze-freeze
1
# qm guest cmd 124 fsfreeze-status
frozen
# qm guest cmd 124 fsfreeze-thaw
1
# qm guest cmd 124 fsfreeze-status
thawed

While executing the above commands I was logged into the VM, running a ping to google.com. The VM didn't hang or crash; everything worked as expected:
Code:
cncted@server:~$ ping google.com
PING google.com (216.58.212.174) 56(84) bytes of data.
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=1 ttl=60 time=5.49 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=2 ttl=60 time=5.61 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=3 ttl=60 time=5.55 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=4 ttl=60 time=5.57 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=5 ttl=60 time=5.63 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=6 ttl=60 time=5.57 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=7 ttl=60 time=5.57 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=8 ttl=60 time=5.62 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=9 ttl=60 time=5.56 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=10 ttl=60 time=5.62 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=11 ttl=60 time=5.58 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=12 ttl=60 time=5.65 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=13 ttl=60 time=5.55 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=14 ttl=60 time=5.61 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=15 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=16 ttl=60 time=5.59 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=17 ttl=60 time=5.63 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=18 ttl=60 time=5.58 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=19 ttl=60 time=5.58 ms
64 bytes from ams15s22-in-f14.1e100.net (216.58.212.174): icmp_seq=20 ttl=60 time=5.59 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=21 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=22 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=23 ttl=60 time=5.59 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=24 ttl=60 time=5.64 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=25 ttl=60 time=5.60 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=26 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=27 ttl=60 time=5.60 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=28 ttl=60 time=5.58 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=29 ttl=60 time=5.64 ms
64 bytes from ams15s22-in-f174.1e100.net (216.58.212.174): icmp_seq=30 ttl=60 time=5.57 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=31 ttl=60 time=5.58 ms
64 bytes from fra24s01-in-f14.1e100.net (216.58.212.174): icmp_seq=32 ttl=60 time=5.59 ms
^C
--- google.com ping statistics ---
32 packets transmitted, 32 received, 0% packet loss, time 31037ms
rtt min/avg/max/mdev = 5.492/5.592/5.652/0.030 ms

Also I could ping the VM from outside without any packet loss.

Any hints? Thank you!
 
I have had to completely disable all backup jobs because they all hit a QMP timeout or an fs-freeze failure, halting the entire VM for several minutes, which is unacceptable.
 
If you find any solutions please let me know. For me the only solution was to disable the backups until I find a fix.
 
I've had to do the same, as a VM seems to have crashed unexpectedly yesterday:
"The computer has rebooted from a bugcheck. The bugcheck was: 0x000000d1 (0x0000000000000228, 0x0000000000000006, 0x0000000000000000, 0xfffff8008943f715). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: ae6788b6-b50d-4b33-8d79-3253e39c2481."

Which essentially results in this:
Code:
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 0000000000000228, memory referenced
Arg2: 0000000000000006, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff8008943f715, address which referenced memory


I noticed a lot of "storahci" errors in the Windows event log every time the backup runs. It has done so for a few months. This time it appears to have killed the driver and the VM.
 
I found yet another VM that behaves very similarly; the message is this:

Code:
INFO: Backup started at 2022-02-17 00:22:26
INFO: status = running
INFO: VM Name: server.plantemania.ro
INFO: include disk 'scsi0' 'local-zfs:vm-125-disk-0' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/125/2022-02-16T22:22:26Z'
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: started backup task '9d616c2b-739b-46df-803f-619b034b62ca'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (drive clean)
INFO: using fast incremental mode (dirty-bitmap), 0.0 B dirty of 80.0 GiB total
INFO: 100% (0.0 B of 0.0 B) in 1s, read: 0 B/s, write: 0 B/s
INFO: backup was done incrementally, reused 80.00 GiB (100%)
INFO: Finished Backup of VM 125 (00:00:05)
INFO: Backup finished at 2022-02-17 00:22:31
INFO: Backup job finished successfully

All these VMs are running Debian 11.
 
Hey,

I have the same issue, but at the moment only with a single Debian 11 VM out of many. The VM is running a basic web server (Nginx, MariaDB, PHP 7.4).
I already tried reinstalling qemu-guest-agent, ran filesystem checks, and tried qemu-guest-agent from bullseye-backports. I even updated my Proxmox to 7.1 (I already had the issue on 7.0).
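In case it helps anyone trying the same thing, the reinstall/backports steps were roughly the following (this is just what I did; it assumes bullseye-backports is already enabled in the guest's APT sources, so adjust to your setup):
Code:
# inside the Debian 11 guest
apt install --reinstall qemu-guest-agent
# or pull the newer build from backports
apt install -t bullseye-backports qemu-guest-agent
systemctl restart qemu-guest-agent
systemctl status qemu-guest-agent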

I found this issue here: https://gitlab.com/qemu-project/qemu/-/issues/520 but I don't use cPanel or virtfs.
Btw, I have the same issue when backing up to a Proxmox Backup Server and locally to the host itself.

It would be interesting to know what others run in the affected VMs, as I have no issues with a Debian 11 VM where Keycloak + MariaDB is running.

Guest:
Linux 5.10.0-11-cloud-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
Package: qemu-guest-agent
Version: 1:5.2+dfsg-11+deb11u1
 
All our Windows production VMs are affected :/
For the time being, just un-check "Use QEMU Guest Agent" in the VM Options and shut down/start the VM. It'll solve the issue, but note that it might generate corrupted backups, as the VM's filesystems will not be frozen. It's just a workaround, but it's better than nothing.
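Roughly the same workaround from the CLI, in case you have many VMs (VMID 124 is just an example; the change only takes effect after a full shutdown and start, not a reboot from inside the guest):
Code:
# on the host: disable the guest agent option for the VM
qm set 124 --agent 0
# power-cycle the VM so the new setting is applied
qm shutdown 124
qm start 124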
 
Negative. What I could find via an strace of qemu-guest-agent while freezing the FS from the host is the following:
The ioctl call for the FSFREEZE of "/" (https://github.com/qemu/qemu/blob/master/qga/commands-posix.c#L1684), at least in my case, never returns. As this is a syscall, I guess it's more a problem at the kernel level than in QEMU's agent.

But I haven't dug deeper yet, as I don't really have a plan for how to debug it at that level; the system just freezes and does not crash the kernel.
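If anyone wants to reproduce this without waiting for a backup job, this is roughly the procedure I used. VMID 124 and the mountpoint / are only examples, and freezing / on a live system is risky (the unfreeze may never get to run), so only try it on a VM you can afford to reset:
Code:
# inside the guest: trace the agent's syscalls
strace -f -T -p "$(pidof qemu-ga)"

# on the host: trigger the freeze/thaw through the agent
qm guest cmd 124 fsfreeze-freeze
qm guest cmd 124 fsfreeze-thaw

# inside the guest: exercise the freeze ioctl directly via util-linux,
# bypassing the agent (writes will block while / is frozen)
fsfreeze -f / && sleep 5 && fsfreeze -u /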
 
We have the same problem on multiple Debian 11 machines on different Proxmox servers!
Any ideas?
 
Nope. We disabled the QEMU Guest Agent on the guest VMs and we're making backups without it. Yes, the backups can be corrupted (as it doesn't freeze the filesystems), but it's better than nothing.
 
I'm getting the same issue on Windows and Linux VMs. The VM hangs at the end (usually at 100%) and the Proxmox Backup Server hangs. The problem with disabling the QEMU agent is that the VMs won't shut down automatically via the backup schedule in "stop" mode.

This is not conclusive, but it seems that when a backup to my Proxmox Backup Server fails, the second attempt will work after I unlock and stop the VM and restart my Proxmox Backup Server. Also, after getting a successful backup on the Proxmox Backup Server, a restore will eventually hang the restore process and also hang the Proxmox Backup Server.
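For reference, "unlock and stop the VM" is roughly the following on the host; <vmid> is a placeholder for your VM ID, and the lock is usually the stale backup lock left behind by the failed job:
Code:
# clear the stale backup lock, then stop the VM
qm unlock <vmid>
qm stop <vmid>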

Backing up to my NFS share on my other server seems to have no problems.

Here are the messages during a backup:

[screenshot attachment: error.jpeg]

Moving back to backups on my NFS share until I find a fix!
 
I am having the same problem as well. Thank you all for the insight on alternatives to get a backup.
 
