Guest-agent fs-freeze command breaks the system on backup

Amin Vakil

One of the VMs which I back up weekly broke when Proxmox issued the guest-agent 'fs-freeze' command.

I could ping the server, but nothing worked (almost every port was down, the remaining ones didn't respond to anything, and I couldn't SSH to it).

I hadn't faced this issue in three years of using Proxmox, nor in the last three months since installing and enabling qemu-guest-agent on the host and VM. I waited for 20 minutes and nothing happened, then I stopped the backup, logged in to the host and entered:

qm unlock 201 && qm reset 201

(201 is the VMID of the broken VM.)

Then, when the VM booted, I edited the GRUB command line and added this to the end (just to be sure the filesystem hadn't been corrupted):

fsck.mode=force fsck.repair=yes
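
(These are systemd-fsck kernel parameters; on a stock CentOS 7 GRUB2 setup they get appended to the kernel line after pressing e at the boot menu, roughly like below. The kernel version and root device are placeholders, yours will differ.)

Code:
linux16 /vmlinuz-<version> root=<root-device> ro quiet fsck.mode=force fsck.repair=yes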

When the server was booting up, lots of errors were getting fixed. The errors were like this:

[ 170.013464] systemd-fsck[639]: Free blocks count wrong for group #2160 (23153, counted=23948).
[ 170.014860] systemd-fsck[639]: Fix? yes

and this

[ 170.142921] systemd-fsck[639]: Directories count wrong for group #2000 (1394, counted=871).
[ 170.145106] systemd-fsck[639]: Fix? yes

Then the server booted up and everything was fine.

I ran a manual backup again and the same problem happened: everything froze when the backup reached the guest-agent 'fs-freeze' command, and I couldn't do anything except unlock and forcefully reset the VM.

The server is an up-to-date CentOS 7 (2003) with cPanel installed on it.

I don't know if this helps or not, but the VM has three hard disks, and backup is enabled only for the first one, which the OS is installed on. The first hard disk is on the "sas" storage, and the backup gets written to another storage, "sata".
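
(In case it matters for reproducing: the (backup=no) entries in the log below correspond to the backup=0 flag on the disk options, which would be set with something like this; the volume names are taken from my log:)

Code:
qm set 201 --virtio1 sata:201/vm-201-disk-0.qcow2,backup=0
qm set 201 --virtio2 sata:201/vm-201-disk-1.qcow2,backup=0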

Also log of first scheduled backup is here (manual backup error log is exactly the same):

INFO: Starting Backup of VM 201 (qemu)
INFO: Backup started at 2020-05-09 02:00:58
INFO: status = running
INFO: VM Name: ircpanel1
INFO: include disk 'virtio0' 'sas:201/vm-201-disk-0.qcow2' 80G
INFO: exclude disk 'virtio1' 'sata:201/vm-201-disk-0.qcow2' (backup=no)
INFO: exclude disk 'virtio2' 'sata:201/vm-201-disk-1.qcow2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/sata/dump/vzdump-qemu-201-2020_05_09-02_00_58.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
ERROR: interrupted by signal
INFO: issuing guest-agent 'fs-thaw' command
TASK ERROR: got unexpected control message:
 
Hi,

can you try the following commands and report back what happened?
Code:
qm guest cmd 103 fsfreeze-status
qm guest cmd 103 fsfreeze-freeze
qm guest cmd 103 fsfreeze-status
qm guest cmd 103 fsfreeze-thaw
 
Code:
qm guest cmd 201 fsfreeze-status
thawed
qm guest cmd 201 fsfreeze-freeze
^C
qm guest cmd 201 fsfreeze-status
QEMU guest agent is not running

After 5 minutes I interrupted fsfreeze-freeze, and all services broke. I rebooted the server, but it couldn't reboot gracefully, so I forced it.

Forcing fsck from GRUB again resulted in this:
systemd-fsck[631]: Free blocks count wrong (70620473, counted=70441515).
systemd-fsck[631]: Fix? yes
systemd-fsck[631]: Free inodes count wrong (19338401, counted=19313508).
systemd-fsck[631]: Fix? yes

Although this time there were just these two errors.

Now when I execute fsfreeze-status I get this result:
Code:
qm guest cmd 201 fsfreeze-status
thawed

I think it's pointless and harmful to try executing fsfreeze-freeze again.
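
(If it helps anyone narrow this down: the guest agent's freeze ends up calling the kernel's filesystem freeze ioctl, which util-linux can also trigger from inside the guest. A safer test than freezing the root filesystem would be something like the following on a spare mount point; /mnt/spare is just a placeholder:)

Code:
# run inside the guest, on a non-critical mounted filesystem only
fsfreeze --freeze /mnt/spare
fsfreeze --unfreeze /mnt/spare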
 
Here it works without a problem on a freshly installed CentOS with all updates.
 
Have a look at the kernel logs; maybe you'll see a problem.
Update to the latest kernel.
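
(On a CentOS 7 guest that would be something like the following; note that journalctl -b -1 only works if persistent journaling is enabled:)

Code:
dmesg | tail -n 50      # kernel messages from the current boot
journalctl -k -b -1     # kernel log from the previous boot, if the journal is persistent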
 
I had looked, and couldn't figure out the problem.

Either way I'm going to migrate data to another server.

Thanks anyway!
 
Hello,

same problem here, same logs, same everything:

INFO: issuing guest-agent 'fs-freeze' command
closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
ERROR: interrupted by signal
INFO: issuing guest-agent 'fs-thaw' command


Like Amin, when I run:
Code:
qm guest cmd 103 fsfreeze-freeze
^C
qm guest cmd 103 fsfreeze-status
QEMU guest agent is not running

When I run a qemu-agent ping it doesn't show anything (which I think is the expected behavior):

Code:
qm agent 100 ping
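
(A successful ping prints nothing, so presumably the exit code is the thing to check, e.g.:)

Code:
qm agent 100 ping && echo "agent responded"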



proxmox info:
Code:
pveversion -v

proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
pve-zsync: 2.0-3
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1


Virtual machine info:
Code:
Linux hostname.hostname 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
qemu-guest-agent is enabled and started in the virtual machine.
OS: CentOS 7
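
(For completeness, this is how I'd verify the agent inside a CentOS 7 guest; standard commands, nothing specific to this setup:)

Code:
systemctl status qemu-guest-agent
ls /dev/virtio-ports/   # should show org.qemu.guest_agent.0 when the agent option is enabled in PVE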
 
Having the same issue on a few cPanel VMs: once the backup reaches the fs-freeze stage, the VM hangs.
This happens with PBS only; running the backup to NFS storage causes no problem.
 
Same here, but after upgrading the kernel in CentOS, it worked. PVE 6.2-10, Dell PowerEdge R610 with SSDs. Should anybody need any further details, just write back.
 
@alexvyi - do you recall which kernel you had installed? I've been using 3.10.0-1127.13.1 and qemu-guest-agent-2.12.0-3 in most of my CentOS 7 VMs with no issues (also 6.2-10, fully up to date).

I also have one customer VM that has an even older kernel, (...)-8.1 I think, which is also OK, although it will also have an older qemu-guest-agent installed. All ext4 with LVM.

HW is all Dell 640s with SSDs, backing up to conventional HDDs. VM disk images range from 20GB to 600GB.

All but the older one have just been upgraded to (...)-18.2, although none have been backed up yet with that kernel.

But the more I see reports like this, the more scared I get.
If it isn't file systems being corrupted during backup, it is the partition table (as reported in several other posts).

A backup process that can potentially damage the data I'm trying to protect is not something I am comfortable with, whether it is vzdump, qemu-guest-agent, the kernel, or something else at fault.

Unfortunately, unless the devs can reproduce it (which they can't seem to), they have to rely on people who experience the problem to post a bug report and provide sufficient detail for them to make some progress on what might be going on.

This is understandable. Nobody can complain about that for normal issues. The Devs cannot magically detect the cause if they have no data to go on.

But for a problem of this nature, where the backup process can potentially destroy data, I would really prefer to see a more proactive response. Maybe I'm letting my obsession with backup problems colour my judgement, but I don't think it is unreasonable to want a more positive response, especially with what I consider to be a significant number of users reporting issues. Please?
 
kernel-3.10.0-1127.18.2.el7.x86_64, and reboot the machine of course. Use a cluster setup for even more failover if you're scared, and keep a Proxmox server on your local LAN just for storing backups. The latter is what I do (daily). This issue happened today, so even if the PVE host had broken, I would have had the backup from yesterday.
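
(On CentOS 7 the update itself would be roughly:)

Code:
yum update kernel   # pulls in kernel-3.10.0-1127.18.2.el7 or newer
reboot
uname -r            # confirm the running kernel afterwards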
 
Today it happened again. It started working only after a reboot. It happens only with the enterprise updates; on other machines I have (open-source repo) I never saw this phenomenon.
 
