Disable fs-freeze on snapshot backups

godzilla

Member
May 20, 2021
78
5
13
43
Is it possible to somehow disable the call to guest-agent fsfreeze when performing a snapshot backup?

Despite updating to the latest Proxmox VE 7.3-4 and the opt-in 5.19.17-2-pve kernel, I'm still having the issue where the fsfreeze command blocks the guest filesystem, and there's no solution except forcing a guest reboot.
 
Hi,

there is a feature request and an accompanying patchset implementing such an option, but it's still pending review.
So there will eventually be a knob for this.

But reading the above thread, there seem to be some workarounds/solutions available for this. Are they not viable as a stop-gap measure?
 
Hi @cheiss ,

thanks for your kind reply. Unfortunately, solutions that lower the security level are not an option in my organization; I'd rather lose functionality than security.

I commented on the feature request; I hope they publish the feature soon.
 
Hi,

just to let you know: the backend side of this feature has been committed and should be available as part of the qemu-server 7.3-4 package, which is already available on the pve-no-subscription repository.

As the web GUI part is still pending, in the meantime you can enable it with either qm set <vmid> -agent 1,freeze-fs-on-backup=0 or pvesh set /nodes/<node>/qemu/<vmid>/config -agent 1,freeze-fs-on-backup=0.
(Note that this overwrites the complete agent option; since you have fstrim_cloned_disks set, you need to include that as well.)
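
For reference, a complete command for a VM that already has fstrim_cloned_disks enabled might look like the following sketch; VM ID 100 and node name pve1 are just placeholders for your own setup:
Code:
# Keep the agent enabled and fstrim_cloned_disks, only disable the freeze/thaw calls on backup
qm set 100 -agent enabled=1,fstrim_cloned_disks=1,freeze-fs-on-backup=0

# The same change via the API path
pvesh set /nodes/pve1/qemu/100/config -agent enabled=1,fstrim_cloned_disks=1,freeze-fs-on-backup=0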
 
Hi @cheiss, thanks for the update!

Do you know if the GUI part will be published sometime soon, or should we wait for the next PVE release?

Thank you!
 
Hopefully soon, in the next few weeks. I will ping the GUI patches of this series this week; they are relatively trivial anyway.
They unfortunately just got a bit lost on the mailing list, it seems; there's definitely no need to wait for the next PVE release.
 
The reason this hasn't been fixed, despite being 1) a showstopper and 2) known about for 18 months, is that it has nothing to do with fsfreeze or the guest agent. If you have a VM without the guest agent, you cannot issue fs-freeze and fs-thaw, and the filesystems still break. The problem is therefore more fundamental, and they are looking in the wrong place. fs-thaw just happens to be the first point at which they realise that the filesystem has disappeared, and since a disappearing filesystem usually can't log the error, the guest agent - which has logs outside the VM - is the only place they see it.

What I don't understand is that replication between nodes creates snapshots, and backing up to PBS also creates snapshots, but these don't appear to cause a problem - as far as I know they are doing the same thing, so why don't they cause the issue?

If you're lucky enough to be logging to an external syslog, you see that the filesystem becomes unavailable as if it had been unplugged, and it's impossible to fix without a reboot. Until this is fixed you can't really use PVE for anything critical unless you disable backups; we've had to resort to dodgy rsyncs and handcraft everything into a spaghetti mess that PVE was meant to handle in a much more managed way.
 
I agree with @drjaymz@ - it is NOT fsfreeze, the guest agent or snapshots as such.
I replicate every 15 minutes with no problem.

I have 2 VMs that have this issue (and lots that don't). (Ubuntu 20.04, fully updated)
One VM has the guest agent running, with 'freeze-fs-on-backup: Disabled'; the other VM has no guest agent installed and the Guest Agent option disabled.
(2 different physical servers in 2 different locations)

Frozen state:
I can open the VM console from PVE and I can see the text on the console - but it is unresponsive.
From PVE shell:
Code:
root@pm2:~# qm status 303
status: running
So the status is 'running' - but this is clearly false, as the VM is frozen.

From PVE, on the VM with the guest agent running, I tried
Code:
qm guest exec 303 "test"
This returns something like 'guest not running' when it is frozen.
When the VM is running normally I get:
Code:
root@pm2:~# qm guest exec 303 "test"
{
   "exitcode" : 1,
   "exited" : 1
}

From the backup log:
Code:
303: 2023-05-25 02:21:04 INFO: Starting Backup of VM 303 (qemu)
303: 2023-05-25 02:21:04 INFO: status = running
303: 2023-05-25 02:21:04 INFO: VM Name: Omstillingsbordet
303: 2023-05-25 02:21:04 INFO: include disk 'scsi0' 'local-zfs:vm-303-disk-0'
303: 2023-05-25 02:21:04 INFO: include disk 'scsi1' 'local-zfs:vm-303-disk-1'
303: 2023-05-25 02:21:04 INFO: backup mode: snapshot
303: 2023-05-25 02:21:04 INFO: ionice priority: 7
303: 2023-05-25 02:21:04 INFO: creating Proxmox Backup Server archive 'vm/303/2023-05-25T00:21:04Z'
303: 2023-05-25 02:21:05 INFO: started backup task 'e37c0716-9016-4a92-b12b-82b75d90a6ec'
303: 2023-05-25 02:21:05 INFO: resuming VM again
303: 2023-05-25 02:21:05 INFO: scsi0: dirty-bitmap status: OK (1.0 GiB of 50.0 GiB dirty)
303: 2023-05-25 02:21:05 INFO: scsi1: dirty-bitmap status: OK (928.0 MiB of 200.0 GiB dirty)

I lose access at 02:21:04 - and a
Code:
root@pm2:~# qm reset 303
is necessary.

This freezing is random, but (in my case) ALWAYS in connection with a backup to PBS.
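
If you want to catch this state automatically instead of waiting for users to notice, a rough detection sketch along the following lines could run from cron on the PVE node. The VM IDs are just examples, it only helps for VMs that actually have the guest agent enabled, and the hard reset is left commented out on purpose:
Code:
#!/bin/bash
# Flag VMs that QEMU reports as running but whose guest agent no longer answers.
for vmid in 303 304; do
    qm status "$vmid" | grep -q 'status: running' || continue
    if ! timeout 30 qm guest exec "$vmid" /bin/true >/dev/null 2>&1; then
        logger -t freeze-check "VM $vmid: running per QEMU, but guest agent not responding"
        # qm reset "$vmid"   # uncomment only if an automatic hard reset is acceptable
    fi
done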
 
This isn't the same issue I am getting, but it may be related. In my case, on Linux 2.4 kernel guests ONLY, I get a situation where it looks like the VM is frozen, but if you happened to already have an ssh connection open with a root shell, you can see that it's still running and that the filesystems are very broken: I/O errors, corruption, etc. Resetting the machine fixes it. We tried dumping out QEMU's view of the VM and disks and it was happy, and the problem is definitely to do with the VM not liking something that has happened to the filesystem. I can even save the broken state and reload it, and it stays broken even though that's a completely new instance of QEMU.

It always occurs off the back of a backup, and the VM ran for the last 20 years under KVM/QEMU; it was only when we started using the backup functions that we started getting grief. The snapshotting used for backups does lock the filesystems regardless of the guest agent, and I think something isn't handled correctly, such that the guest seems to end up with a bad map of the drive in memory. It's a problem I've had for 2 years and it really is a showstopper, meaning you can't rely on Proxmox not to corrupt your data, which is about as serious a flaw as you can get.
 
In fact it's not only snapshot mode that can cause this issue; the stop mode is not working either.
This is a log from a backup job to PBS.
Code:
INFO: include disk 'scsi0' 'local-lvm:vm-109-disk-1' 544972M
INFO: include disk 'efidisk0' 'local-lvm:vm-109-disk-0' 4M
INFO: stopping virtual guest
INFO: VM quit/powerdown failed
ERROR: Backup of VM 109 failed - command 'qm shutdown 109 --skiplock --keepActive --timeout 600' failed: exit code 255
INFO: Failed at 2023-05-27 04:10:43
INFO: Backup job finished with errors
TASK ERROR: job errors
The source VM is a freshly created Ubuntu 22.04 cloud image.

The shutdown process is stuck at sending SIGTERM to sys_sync.

pveversion:
pve-manager/7.4-3/9002ab8a (running kernel: 6.2.6-1-pve)

Update: this turned out to be caused by faulty RAM.
 

@drjaymz@ @fbnielsen

Were you able to resolve or isolate the problem in the meantime?

The VM always freezes for me too. I can access the VM, but then I see the following error in the syslog.
Code:
INFO: task jbd2/sda1-8:351 blocked for more than 120 seconds.
    Not tainted 4.19.0-25-amd64 #1 Debian 4.19.289-2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I did the PVE upgrade from 7.4 to 8.1 a few days ago.
With 7.4 I also had these errors in the log from time to time, but at some point they disappeared.
I would say I hadn't seen this error for about a year.

I also use PBS for backups.
As storage backends I use both ZFS on local disks and Ceph on dedicated hardware.
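
If anyone wants to see where exactly those jbd2 tasks are stuck when a VM hangs like this, one option - inside the guest, assuming you still have a root shell open and magic SysRq is available - is to ask the kernel to dump the stacks of all blocked tasks:
Code:
# Enable magic SysRq if it isn't already (1 enables all functions)
echo 1 > /proc/sys/kernel/sysrq
# Dump stack traces of all tasks in uninterruptible (blocked) state to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -100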
 

Still occurs, but less often, so whatever causes it hasn't been fixed. I have both v7 and v8 setups, about 32 of them, with probably about 320 guest VMs. I may see it every couple of months, and then sometimes twice in a row. As I said earlier, what you normally see is that the VM looks frozen and you can't log in; however, if you were already logged in, you'll see that you basically get an I/O error on one or more disks. I have not seen the exact error you describe (haven't looked), but it sounds like the same thing: loss of access to the underlying disk.
 
Maybe it's just my impression, but I feel that it happens more often when you have multiple backup tasks running on the same node. Can anyone confirm?
 
No, that's not the case for me.
I always have only one backup task active per PVE node.

@cheiss
If you want to debug the error: it has been occurring daily for me since the PVE 8.1 upgrade.
At first only the VMs that mainly write their data to an S3 backend (local disk, RBD image) were affected, but yesterday a VM with a CephFS mount was also affected.
 
I'm in the same boat - I hadn't seen this issue for a long time on our cluster.

I thought it might have been caused by my PBS not being updated at the same time - it was only updated about 2 days after the v7 to v8 upgrade - but it is still happening. We run two sets of backup jobs every night - one to PBS and one to ZFS - and it seems to affect the PBS jobs a lot more than the ZFS ones.

Going to have to disable fs-freeze/thaw so we don't have services down over the holidays and tackle it properly in the New Year...
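
For anyone doing the same, a quick way to flip the flag for a whole list of VMs from the CLI could look like this rough sketch; the VM IDs are site-specific, and remember it overwrites the whole agent option, so include any other sub-options (e.g. fstrim_cloned_disks) you already have set:
Code:
for vmid in 101 102 205; do
    qm set "$vmid" -agent enabled=1,freeze-fs-on-backup=0
done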
 
For me it occurs less frequently since I moved the affected VMs to other Proxmox nodes.
The utilization of the Proxmox nodes is about the same, so that's probably not the reason either.

But yes, the only effective countermeasure at the moment is probably to completely disable backups for these VMs...
 
When I see this, it happens because fs-freeze cannot freeze the processes running inside the VM. This causes the VM to lock up while it waits for a response to the fs-freeze command that never arrives. Usually I see this on cPanel servers: cPanel secures the /tmp folder, which prevents fs-freeze from working. Somewhere you have a process that cannot be frozen. If cPanel is in use, try running /scripts/securetmp, answer N, Y, N, and take a backup again.
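
One way to check whether a particular guest can be frozen at all is to run fsfreeze manually inside the guest during a maintenance window. This is only a diagnostic suggestion: it briefly blocks all writes to the filesystem, the mount point / is just an example, and you should keep a second root session open so you can thaw manually if the first command hangs:
Code:
# Freeze the root filesystem, then immediately thaw it again.
# If the freeze (or the thaw) hangs, something on this filesystem cannot be
# frozen cleanly, and the guest agent's fs-freeze during backup will hang the same way.
fsfreeze --freeze / && fsfreeze --unfreeze /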
 
