We have been looking at Proxmox Backup Server as a replacement for our current backup system in our Proxmox environment. Unfortunately, we have run into the same issue that others seem to be seeing, with fs-freeze hanging the VM:
https://forum.proxmox.com/threads/e...guest-fsfreeze-thaw-failed-got-timeout.68082/
https://forum.proxmox.com/threads/snapshot-stopping-vm.59701/
In our scenario we are seeing the same as above: on some VMs the backup process hangs at the fs-freeze step, leaving the VM with no disk I/O, and we then have to stop the backup job, unlock the VM, and forcefully shut it down and restart it.
The section of the backup log in question is:
Code:
INFO: include disk 'scsi0' 'ceph-vm:vm-10281-disk-0' 155G
INFO: exclude disk 'scsi1' 'backup-drives:vm-10281-disk-0' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/10281/2020-12-03T19:02:23Z'
INFO: issuing guest-agent 'fs-freeze' command
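When a VM gets stuck like this, the recovery we end up doing from the PVE node is roughly the following (a rough sketch, using 10281 as an example VMID; the thaw attempt is only worth trying if the agent is still responding at all):
Code:
# Try to thaw the guest filesystems first, in case the agent still responds
# (on older qemu-server versions this was 'qm agent 10281 fsfreeze-thaw')
qm guest cmd 10281 fsfreeze-thaw

# If the VM stays wedged: abort the backup task, then release the backup lock
qm unlock 10281

# Forcefully stop the VM and bring it back up
qm stop 10281
qm start 10281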
We have experienced this in both of our Proxmox clusters (our live enterprise cluster and our lab no-subscription install).
LIVE CLUSTER
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.0-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
proxmox-backup: not correctly installed (running kernel: 5.4.65-1-pve)
proxmox-backup-server: 1.0.5-1 (running version: 1.0.5)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ifupdown2: not correctly installed
libjs-extjs: 6.0.1-10
proxmox-backup-docs: 1.0.4-1
proxmox-backup-client: 0.9.0-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-xtermjs: 4.7.0-2
smartmontools: 7.1-pve2
zfsutils-linux: 0.8.4-pve1
We have around 250 VMs on this cluster spread across 25 nodes. After installing PBS I started a backup job on the first node; the first VM backed up fine and the second one failed. As these are live client VMs I have stopped testing there, since it is too high risk to continue. The VMs tested are:
----------------------------------
111 (live) - SUCCESS:
OS: CentOS Linux release 7.9.2009
Kernel: 3.10.0-1127.19.1.el7.x86_64
cPanel: Yes
Kernelcare: No
Qemu Guest Agent: Enabled
Qemu Guest Agent Version: qemu-guest-agent-2.12.0-3.el7.x86_64
[*hidden* ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 455M 0 455M 0% /dev
tmpfs 464M 0 464M 0% /dev/shm
tmpfs 464M 49M 416M 11% /run
tmpfs 464M 0 464M 0% /sys/fs/cgroup
/dev/mapper/vg-root 22G 8.7G 12G 43% /
/dev/sda1 976M 204M 706M 23% /boot
/dev/sdb1 59G 2.1G 54G 4% /backup
/dev/loop0 593M 528K 561M 1% /tmp
tmpfs 93M 0 93M 0% /run/user/0
----------------------------------
10281 (live) - FAILED
OS: CentOS Linux release 7.9.2009
Kernel: 3.10.0-1160.6.1.el7.x86_64
cPanel: Yes
Kernelcare: Yes
Qemu Guest Agent: Enabled
Qemu Guest Agent Version: qemu-guest-agent-2.12.0-3.el7.x86_64
[*hidden* ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 455M 0 455M 0% /dev
tmpfs 464M 3.4M 460M 1% /dev/shm
tmpfs 464M 49M 415M 11% /run
tmpfs 464M 0 464M 0% /sys/fs/cgroup
/dev/mapper/vg-root 150G 126G 18G 88% /
/dev/sda1 976M 199M 711M 22% /boot
/dev/sdb1 197G 119G 69G 64% /backup
/dev/loop0 876M 51M 780M 7% /tmp
tmpfs 503M 0 503M 0% /run/user/0
----------------------------------
LAB CLUSTER
We have also tested on our lab setup, where we have fewer VMs, and the same problem occurred on a PXE-booted Debian server (which uses /dev/loop, as shown in the df output).
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-4.15: 5.4-18
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.15-pve2
ceph-fuse: 14.2.15-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
proxmox-backup: not correctly installed (running kernel: 5.4.73-1-pve)
proxmox-backup-server: 1.0.5-1 (running version: 1.0.5)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-4.15: 5.4-18
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-12-pve: 4.15.18-36
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
proxmox-backup-docs: 1.0.4-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-xtermjs: 4.7.0-3
smartmontools: 7.1-pve2
zfsutils-linux: 0.8.5-pve1
----------------------------------
123 (lab) - FAILED
OS: Debian GNU/Linux 10 (buster)
Kernel: 4.19.0-6-amd64
cPanel: No
Kernelcare: No
Qemu Guest Agent: Enabled
Qemu Guest Agent Version: qemu-guest-agent/stable,stable,now 1:3.1+dfsg-8+deb10u8 amd64
root@*hidden* ~ $ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 395M 40M 355M 11% /run
/dev/loop0 539M 539M 0 100% /run/live/rootfs/filesystem.squashfs
tmpfs 2.0G 709M 1.3G 36% /run/live/overlay
overlay 2.0G 709M 1.3G 36% /
tmpfs 2.0G 4.0K 2.0G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
tmpfs 2.0G 0 2.0G 0% /tmp
*ips-hidden*:/.croit/server-1 4.3T 104G 4.2T 3% /persistent
/dev/sda 30G 142M 30G 1% /var/lib/ceph/mon
----------------------------------
From reading around, it looks like this could be related to /dev/loop, which would make sense. Our VMs may use loop devices in several places, and from what I have seen this could also relate to other software we use (CloudLinux, KernelCare, etc.), so trying to make every VM compatible feels like a game of whack-a-mole. Because the problem also stalls the whole backup process, we can't leave it at the mercy of a client enabling securetmp in their VM without us knowing.
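To get an idea of which guests could be affected before a backup runs, a quick check for loop-backed mounts is possible either inside the guest or via the agent (a sketch; get-fsinfo availability depends on the installed agent version, and 10281 is just an example VMID):
Code:
# Inside the guest: list active loop devices and any mounts backed by them
losetup -a
mount | grep loop

# Or from the PVE node via the guest agent
qm guest cmd 10281 get-fsinfo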
Therefore, it looks like the most reliable way of using PBS across our cluster is to disable the QEMU Guest Agent at the Proxmox level, which would be a shame as we make use of the features it brings.
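For reference, if we do go that route, disabling the agent option per VM from the node would look something like this (a sketch; the change only fully applies after the VM is stopped and started again):
Code:
# Disable the QEMU guest agent option for a single VM
qm set 10281 --agent enabled=0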
Has anyone found a reliable fix for this problem that doesn't require significant changes inside the VMs or disabling the QEMU Guest Agent?