Guest-agent fs-freeze command breaks the system on backup

Amin Vakil

One of the VMs which I back up weekly broke when Proxmox issued the guest-agent 'fs-freeze' command.

I could ping the server, but nothing worked (almost every port was down, the remaining ones didn't respond to anything, and I couldn't SSH to it).

I hadn't faced this issue in three years of using Proxmox, nor in the last three months since installing and enabling qemu-guest-agent on the host and VM. I waited for 20 minutes and nothing happened, then I stopped the backup, logged in to the host and entered:

qm unlock 201 && qm reset 201

(201 is the VMID of the broken VM.)

Then, when the VM booted, I edited the GRUB command line and added this to the end (just to be sure the filesystem hadn't been corrupted):

fsck.mode=force fsck.repair=yes
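
(These are systemd-fsck kernel parameters; on a stock CentOS 7 GRUB2 setup they get appended to the kernel line after pressing e at the boot menu, roughly like below. The kernel version and root device are placeholders, yours will differ.)

Code:
linux16 /vmlinuz-<version> root=<root-device> ro quiet fsck.mode=force fsck.repair=yes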

When the server was booting up, lots of errors were getting fixed. The errors were like this:

[ 170.013464] systemd-fsck[639]: Free blocks count wrong for group #2160 (23153, counted=23948).
[ 170.014860] systemd-fsck[639]: Fix? yes

and this

[ 170.142921] systemd-fsck[639]: Directories count wrong for group #2000 (1394, counted=871).
[ 170.145106] systemd-fsck[639]: Fix? yes

Then the server booted up and everything was fine.

I ran a manual backup again and the same problem happened: everything froze when the backup reached the guest-agent 'fs-freeze' command, and I couldn't do anything except unlock and forcefully reset the VM.

The server is an up-to-date CentOS 7 (2003) with cPanel installed on it.

I don't know if this helps or not, but the VM has three hard disks, and backup is enabled only for the first one, which the OS is installed on. The first hard disk is on the "sas" storage, and the backup gets written to another storage, "sata".
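
(In case it matters for reproducing: the (backup=no) entries in the log below correspond to the backup=0 flag on the disk options, which would be set with something like this; the volume names are taken from my log:)

Code:
qm set 201 --virtio1 sata:201/vm-201-disk-0.qcow2,backup=0
qm set 201 --virtio2 sata:201/vm-201-disk-1.qcow2,backup=0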

Also log of first scheduled backup is here (manual backup error log is exactly the same):

INFO: Starting Backup of VM 201 (qemu)
INFO: Backup started at 2020-05-09 02:00:58
INFO: status = running
INFO: VM Name: ircpanel1
INFO: include disk 'virtio0' 'sas:201/vm-201-disk-0.qcow2' 80G
INFO: exclude disk 'virtio1' 'sata:201/vm-201-disk-0.qcow2' (backup=no)
INFO: exclude disk 'virtio2' 'sata:201/vm-201-disk-1.qcow2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/sata/dump/vzdump-qemu-201-2020_05_09-02_00_58.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
ERROR: interrupted by signal
INFO: issuing guest-agent 'fs-thaw' command
TASK ERROR: got unexpected control message:
 
Hi,

can you try the following commands and report back what happened?
Code:
qm guest cmd 103 fsfreeze-status
qm guest cmd 103 fsfreeze-freeze
qm guest cmd 103 fsfreeze-status
qm guest cmd 103 fsfreeze-thaw
 
Code:
qm guest cmd 201 fsfreeze-status
thawed
qm guest cmd 201 fsfreeze-freeze
^C
qm guest cmd 201 fsfreeze-status
QEMU guest agent is not running

After 5 minutes I interrupted fsfreeze-freeze, and all services broke. I rebooted the server, but it couldn't reboot gracefully, so I forced it.

Forcing fsck from GRUB again resulted in this:
systemd-fsck[631]: Free blocks count wrong (70620473, counted=70441515).
systemd-fsck[631]: Fix? yes
systemd-fsck[631]: Free inodes count wrong (19338401, counted=19313508).
systemd-fsck[631]: Fix? yes

Although this time there were just these two errors.

Now when I execute fsfreeze-status I get this result:
Code:
qm guest cmd 201 fsfreeze-status
thawed

I think it's pointless and harmful to try executing fsfreeze-freeze again.
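
(If it helps anyone narrow this down: the guest agent's freeze ends up calling the kernel's filesystem freeze ioctl, which util-linux can also trigger from inside the guest. A safer test than freezing the root filesystem would be something like the following on a spare mount point; /mnt/spare is just a placeholder:)

Code:
# run inside the guest, on a non-critical mounted filesystem only
fsfreeze --freeze /mnt/spare
fsfreeze --unfreeze /mnt/spare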
 
Here it works without a problem on a freshly installed CentOS with all updates.
 
Have a look at the kernel logs; maybe you'll see a problem.
Update to the latest kernel.
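
(On a CentOS 7 guest that would be something like the following; note that journalctl -b -1 only works if persistent journaling is enabled:)

Code:
dmesg | tail -n 50      # kernel messages from the current boot
journalctl -k -b -1     # kernel log from the previous boot, if the journal is persistent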
 
I had looked, and couldn't figure out the problem.

Either way I'm going to migrate data to another server.

Thanks anyway!
 
Hello,

same problem here, same logs, same everything:

INFO: issuing guest-agent 'fs-freeze' command
closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
ERROR: interrupted by signal
INFO: issuing guest-agent 'fs-thaw' command


Like Amin, when I run:
Code:
qm guest cmd 103 fsfreeze-freeze
^C
qm guest cmd 103 fsfreeze-status
QEMU guest agent is not running

When I run a qemu-agent ping it doesn't show anything (which I think is the expected behavior):

Code:
qm agent 100 ping
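
(A successful ping prints nothing, so presumably the exit code is the thing to check, e.g.:)

Code:
qm agent 100 ping && echo "agent responded"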



proxmox info:
Code:
pveversion -v

proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
pve-zsync: 2.0-3
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1


Virtual machine info:
Code:
Linux hostname.hostname 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
qemu-guest-agent is enabled and started in the virtual machine.
OS: CentOS 7
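
(For completeness, this is how I'd verify the agent inside a CentOS 7 guest; standard commands, nothing specific to this setup:)

Code:
systemctl status qemu-guest-agent
ls /dev/virtio-ports/   # should show org.qemu.guest_agent.0 when the agent option is enabled in PVE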
 
Having the same issue on a few cPanel VMs: once the backup reaches the fs-freeze stage, the VM hangs.
This happens with PBS only; running the backup to NFS storage causes no problem.
 
Same here, but after upgrading the kernel in CentOS, it worked. PVE 6.2-10, Dell PowerEdge R610 with SSDs. Should anybody need any further details, just write back.
 
@alexvyi - do you recall which kernel you had installed? I've been using 3.10.0-1127.13.1 and qemu-guest-agent-2.12.0-3 in most of my CentOS 7 VMs with no issues (also 6.2-10, fully up to date).

I also have one customer VM that has an even older kernel, (...)-8.1 I think, which is also OK, although it will also have an older qemu-guest-agent installed. All ext4 with LVM.

HW is all Dell 640s with SSDs, backing up to conventional HDDs. VM disk images range from 20GB to 600GB.

All but the older one have just been upgraded to (...)-18.2, although none have been backed up yet with that kernel.

But the more I see reports like this, the more scared I get.
If it isn't file systems being corrupted during backup, it is the partition table (as reported in several other posts).

A backup process that can potentially damage the data I'm trying to protect is not something I am comfortable with, whether it is vzdump, qemu-guest-agent, the kernel, or something else at fault.

Unfortunately, unless the devs can reproduce it (which they can't seem to), they have to rely on people who experience the problem to post a bug report and provide sufficient detail for them to make some progress on what might be going on.

This is understandable. Nobody can complain about that for normal issues. The Devs cannot magically detect the cause if they have no data to go on.

But for a problem of this nature, where the backup process can potentially destroy data, I would really prefer to see a more proactive response. Maybe I'm letting my obsession with backup problems colour my judgement, but I don't think it is unreasonable to want a more positive response, especially with what I consider to be a significant number of users reporting issues. Please?
 
kernel-3.10.0-1127.18.2.el7.x86_64, and reboot the machine of course. Use a cluster setup for even more failover if you're scared, and keep a Proxmox server on your local LAN just for storing backups. The latter is what I do (daily). This issue happened today, so even if the PVE host had broken, I would have had the backup from yesterday.
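
(On CentOS 7 the update itself would be roughly:)

Code:
yum update kernel   # pulls in kernel-3.10.0-1127.18.2.el7 or newer
reboot
uname -r            # confirm the running kernel afterwards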
 
Today it happened again. It started working only after a reboot. It happens only with the enterprise updates; on other machines I have (open-source repo) I never saw this phenomenon.
 
