VM Hang during backup (fs-freeze)

Nov 26, 2019
We have been looking at Proxmox Backup Server as a replacement for our current backup system in our Proxmox environment. Unfortunately, we have run into the same issue that others seem to be seeing, with fs-freeze hanging the VM:

https://forum.proxmox.com/threads/e...guest-fsfreeze-thaw-failed-got-timeout.68082/
https://forum.proxmox.com/threads/snapshot-stopping-vm.59701/

In our scenario we are seeing the same as the threads above: on some VMs the backup process hangs at the fs-freeze step, leaving the VM without any disk IO until the backup job is stopped and the VM is unlocked and forcefully shut down and restarted.
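For completeness, our recovery procedure when a VM gets stuck like this is roughly the following; treat it as a sketch rather than an exact recipe, and the VMID is only an example:

Code:
# Stop the hung backup task first (via the task viewer or by killing the vzdump process),
# then remove the backup lock that is left on the VM:
qm unlock 10281

# Ask the agent to thaw the guest filesystems; in our case this usually gets no response:
qm agent 10281 fsfreeze-thaw

# If the guest stays unresponsive, force it off and start it again:
qm stop 10281
qm start 10281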

The section of the backup log in question is:

Code:
INFO: include disk 'scsi0' 'ceph-vm:vm-10281-disk-0' 155G
INFO: exclude disk 'scsi1' 'backup-drives:vm-10281-disk-0' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/10281/2020-12-03T19:02:23Z'
INFO: issuing guest-agent 'fs-freeze' command

We have experienced this in both of our Proxmox clusters (our live enterprise cluster and our lab no-subscription install).
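For anyone trying to narrow this down without waiting for a backup window, the same freeze/thaw cycle that vzdump performs can presumably be triggered by hand through the agent. This is an untested sketch and the VMID is only an example:

Code:
# Check that the agent responds at all:
qm agent 10281 ping

# Issue the same freeze the backup issues; if this hangs, it mirrors the backup hang:
qm agent 10281 fsfreeze-freeze

# Report how many filesystems were frozen, then thaw them again:
qm agent 10281 fsfreeze-status
qm agent 10281 fsfreeze-thaw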

LIVE CLUSTER

Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.0-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

proxmox-backup: not correctly installed (running kernel: 5.4.65-1-pve)
proxmox-backup-server: 1.0.5-1 (running version: 1.0.5)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ifupdown2: not correctly installed
libjs-extjs: 6.0.1-10
proxmox-backup-docs: 1.0.4-1
proxmox-backup-client: 0.9.0-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-xtermjs: 4.7.0-2
smartmontools: 7.1-pve2
zfsutils-linux: 0.8.4-pve1

We have around 250 VMs on this cluster, spread across 25 nodes. After installing PBS I started a backup job on the first node; the first VM backed up fine and the second one failed. As these are live client VMs, I have stopped testing there, since continuing is too high risk. The VMs tested are:

----------------------------------

111 (live) - SUCCESS:

OS: CentOS Linux release 7.9.2009
Kernel: 3.10.0-1127.19.1.el7.x86_64
cPanel: Yes
Kernelcare: No

Qemu Guest Agent: Enabled
Qemu Guest Agent Version: qemu-guest-agent-2.12.0-3.el7.x86_64

[*hidden* ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 455M 0 455M 0% /dev
tmpfs 464M 0 464M 0% /dev/shm
tmpfs 464M 49M 416M 11% /run
tmpfs 464M 0 464M 0% /sys/fs/cgroup
/dev/mapper/vg-root 22G 8.7G 12G 43% /
/dev/sda1 976M 204M 706M 23% /boot
/dev/sdb1 59G 2.1G 54G 4% /backup
/dev/loop0 593M 528K 561M 1% /tmp
tmpfs 93M 0 93M 0% /run/user/0

----------------------------------

10281 (live) - FAILED

OS: CentOS Linux release 7.9.2009
Kernel: 3.10.0-1160.6.1.el7.x86_64
cPanel: Yes
Kernelcare: Yes

Qemu Guest Agent: Enabled
Qemu Guest Agent Version: qemu-guest-agent-2.12.0-3.el7.x86_64

[*hidden* ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 455M 0 455M 0% /dev
tmpfs 464M 3.4M 460M 1% /dev/shm
tmpfs 464M 49M 415M 11% /run
tmpfs 464M 0 464M 0% /sys/fs/cgroup
/dev/mapper/vg-root 150G 126G 18G 88% /
/dev/sda1 976M 199M 711M 22% /boot
/dev/sdb1 197G 119G 69G 64% /backup
/dev/loop0 876M 51M 780M 7% /tmp
tmpfs 503M 0 503M 0% /run/user/0

----------------------------------

LAB CLUSTER

We have also tested on our lab setup, where we have fewer VMs, and the same problem was seen on a PXE-booted Debian server (which uses /dev/loop, as shown in its df output).

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-4.15: 5.4-18
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.15-pve2
ceph-fuse: 14.2.15-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

proxmox-backup: not correctly installed (running kernel: 5.4.73-1-pve)
proxmox-backup-server: 1.0.5-1 (running version: 1.0.5)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-4.15: 5.4-18
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-12-pve: 4.15.18-36
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
proxmox-backup-docs: 1.0.4-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-xtermjs: 4.7.0-3
smartmontools: 7.1-pve2
zfsutils-linux: 0.8.5-pve1

----------------------------------

123 (lab) - FAILED

OS: Debian GNU/Linux 10 (buster)
Kernel: 4.19.0-6-amd64
cPanel: No
Kernelcare: No

Qemu Guest Agent: Enabled
Qemu Guest Agent Version: qemu-guest-agent/stable,stable,now 1:3.1+dfsg-8+deb10u8 amd64

root@*hidden* ~ $ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 395M 40M 355M 11% /run
/dev/loop0 539M 539M 0 100% /run/live/rootfs/filesystem.squashfs
tmpfs 2.0G 709M 1.3G 36% /run/live/overlay
overlay 2.0G 709M 1.3G 36% /
tmpfs 2.0G 4.0K 2.0G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
tmpfs 2.0G 0 2.0G 0% /tmp
*ips-hidden*:/.croit/server-1 4.3T 104G 4.2T 3% /persistent
/dev/sda 30G 142M 30G 1% /var/lib/ceph/mon

----------------------------------



From reading around, it looks like this could be related to /dev/loop, which would make sense. Our VMs may use loop devices in multiple places, and the reports we have seen could also relate to other software we use (CloudLinux, KernelCare, etc.), so making every VM compatible feels like a game of whack-a-mole. And because the problem halts the backup process entirely, we can't leave it at the mercy of a client enabling securetmp inside their VM without us knowing.
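For reference, a quick way to spot loop-backed mounts inside a guest (e.g. a cPanel securetmp /tmp) is something like the checks below, run inside the VM; this is only a sketch of what we would look for:

Code:
# List active loop devices and the files backing them:
losetup -a

# Show any loop-backed mounts (securetmp typically loop-mounts /tmp):
mount | grep loop
df -h | grep loop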

Therefore, it looks like the most reliable way of using PBS across our cluster is to disable the QEMU Guest Agent at the Proxmox level, which would be a shame, as we make use of the features it brings.
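For anyone in the same position, the agent is per-VM configuration, so disabling it looks roughly like the commands below (a sketch; the VMID is an example). The virtio agent device itself only goes away after the VM is restarted, but as far as we can tell the config option alone should stop vzdump from attempting the freeze on the next backup:

Code:
# Check whether the agent is currently enabled for a VM:
qm config 10281 | grep agent

# Disable it so vzdump/PBS no longer issues fs-freeze for this VM:
qm set 10281 --agent 0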

Has anyone found a reliable fix for this problem that does not require significant changes inside the VMs or disabling the QEMU Guest Agent?
 
Hi @uk_user,

Did you ever find a fix for this issue? We are seeing the same.

Cheers
G
 
We turned off the QEMU Guest Agent for all VMs, which made the problem go away.
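Something along the lines of the loop below on each node; this is a sketch rather than the exact commands we ran:

Code:
# Disable the guest agent option for every VM defined on this node:
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    qm set "$vmid" --agent 0
done
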
OK, cool workaround, but not a fix.

I've opened a ticket with Proxmox support and will reply to this thread with any findings that may be of use.

Cheers
G
 
Any update on this?
 
