Disable fs-freeze on snapshot backups

godzilla · Feb 10, 2023

Is it possible to somehow disable the call to guest-agent fsfreeze when performing a snapshot backup?

Despite updating to the latest Proxmox VE 7.3-4 and the opt-in 5.19.17-2-pve kernel, I'm still having the issue where the fsfreeze command blocks the guest filesystem and the only fix is forcing a guest reboot.

cheiss · Feb 11, 2023

Hi,

there is a feature request and an accompanying patchset for implementing such an option out there, but it's still pending review.
So there will eventually be a knob for this.

But reading the above thread, there seem to be some workarounds/solutions available for this. Are they not viable as a stop-gap measure?

godzilla · Feb 13, 2023

Hi @cheiss ,

thanks for your kind reply. Unfortunately, solutions lowering the security level are not an option in my organization. I would rather lose functionality than security.

I commented on the feature request, I hope they publish the feature soon.

cheiss · Mar 9, 2023

Hi,

just to let you know: The backend side of this feature has been committed and should be available as part of the qemu-server 7.3-4 package, which is already available on the pve-no-subscription repository.

As the web GUI part is still pending, in the meantime you can enable it with either qm set <vmid> -agent 1,freeze-fs-on-backup=0 or pvesh set /nodes/<node>/qemu/<vmid>/config -agent 1,freeze-fs-on-backup=0.
(As this overrides the complete agent option and you have fstrim_cloned_disks set, you need to include that as well.)
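
Because -agent replaces the whole option string, one way to avoid dropping existing sub-options is to start from the current value and append the new flag. The following is only a minimal sketch: the helper name with_freeze_disabled is hypothetical, and it assumes the current value is passed in as a string (for example, copied from the agent: line of the VM config).

```shell
#!/bin/sh
# Hypothetical helper (not part of PVE): take the current agent option
# string and return it with freeze-fs-on-backup=0 set, keeping every
# other sub-option such as fstrim_cloned_disks.
with_freeze_disabled() {
    current="${1:-1}"   # fall back to plain "1" (agent enabled) if empty
    # Drop any existing freeze-fs-on-backup setting, then re-join with commas.
    cleaned=$(printf '%s' "$current" | tr ',' '\n' \
        | grep -v '^freeze-fs-on-backup=' | paste -sd, -)
    printf '%s,freeze-fs-on-backup=0\n' "$cleaned"
}
```

The output could then be passed on, for example qm set <vmid> -agent "$(with_freeze_disabled '1,fstrim_cloned_disks=1')".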

godzilla · Mar 9, 2023

Hi @cheiss, thanks for the update!

Do you know if the GUI part will be published sometime soon or should we wait for the next PVE release?

Thank you!

cheiss · Mar 13, 2023

godzilla said:
Do you know if the GUI part will be published sometime soon or should we wait for the next PVE release?

Hopefully soon, in the next few weeks. I will ping the GUI patches of this series this week; they are relatively trivial anyway.
Unfortunately, they just got a bit lost on the mailing list, it seems. There is definitely no need to wait for the next PVE release.

drjaymz@ · Apr 21, 2023

godzilla said:
Is it possible to somehow disable the call to guest-agent fsfreeze when performing a snapshot backup?

Despite updating to the latest Proxmox VE 7.3-4 and the opt-in 5.19.17-2-pve kernel, I'm still having the issue where the fsfreeze command blocks the guest filesystem and the only fix is forcing a guest reboot.

The reason this hasn't been fixed, despite being 1) a showstopper and 2) known about for 18 months, is that it has nothing to do with fsfreeze or the guest agent. If you have a VM without the guest agent, then you cannot issue fs-freeze and fs-thaw, and the filesystems still break. Therefore the problem is more fundamental and they are looking in the wrong place. fs-thaw just happens to be the first point at which they realise that the filesystem has disappeared, and since disappearing filesystems usually don't log the error, the guest agent, which has logs outside the VM, is the only place they see the error.

What I don't understand is that replication between nodes creates snapshots, and PBS also creates snapshots, but these don't appear to cause a problem. AFAIK they are doing the same thing, so why don't they cause the issue?

If you're lucky enough to be logging to an external syslog, you see that the filesystem becomes unavailable, as if unplugged, and it's impossible to fix without a reboot. Until this is fixed, you can't really use PVE for anything critical unless you disable backups. We've had to resort to dodgy rsyncs and handcraft everything in a spaghetti mess that PVE was meant to do in a much more managed way.

fbnielsen · May 25, 2023

I agree with @drjaymz@ - it is NOT fsfreeze, guest agent or snapshot as such.
I replicate every 15 minutes with no problem.

I have 2 VMs that have this issue (and lots that don't). (Ubuntu 20.04, fully updated)
One VM has the guest agent running, with 'freeze-fs-on-backup: Disabled'; the other VM has no guest agent installed and the agent disabled.
(2 different physical servers in 2 different locations)

Frozen state:
I can open the VM Console from PVE and I can see the text on the VM console - but it is unresponsive.
From PVE shell:

Code:

root@pm2:~# qm status 303
status: running

So status is 'running' - but this is clearly false, as the vm is frozen.

From PVE on the VM with guest running, I tried

Code:

qm guest exec 303 "test"

This returns something like 'guest not running' when it is frozen.
When the VM is running normally I get:

Code:

root@pm2:~# qm guest exec 303 "test"
{
   "exitcode" : 1,
   "exited" : 1
}

From backup log:

303: 2023-05-25 02:21:04 INFO: Starting Backup of VM 303 (qemu)
303: 2023-05-25 02:21:04 INFO: status = running
303: 2023-05-25 02:21:04 INFO: VM Name: Omstillingsbordet
303: 2023-05-25 02:21:04 INFO: include disk 'scsi0' 'local-zfs:vm-303-disk-0'
303: 2023-05-25 02:21:04 INFO: include disk 'scsi1' 'local-zfs:vm-303-disk-1'
303: 2023-05-25 02:21:04 INFO: backup mode: snapshot
303: 2023-05-25 02:21:04 INFO: ionice priority: 7
303: 2023-05-25 02:21:04 INFO: creating Proxmox Backup Server archive 'vm/303/2023-05-25T00:21:04Z'
303: 2023-05-25 02:21:05 INFO: started backup task 'e37c0716-9016-4a92-b12b-82b75d90a6ec'
303: 2023-05-25 02:21:05 INFO: resuming VM again
303: 2023-05-25 02:21:05 INFO: scsi0: dirty-bitmap status: OK (1.0 GiB of 50.0 GiB dirty)
303: 2023-05-25 02:21:05 INFO: scsi1: dirty-bitmap status: OK (928.0 MiB of 200.0 GiB dirty)

I lose access at 02:21:04 - and a

Code:

root@pm2:~# qm reset 303

is necessary.

This freezing is random but (in my case) ALWAYS in connection with a backup to PBS.
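
Since qm status keeps reporting 'running' for a frozen VM, a check that also asks the guest agent for a reply can tell the two states apart. This is only a rough sketch: the function name vm_health and its three result strings are made up for illustration, and it only works for VMs where the guest agent is installed and enabled.

```shell
#!/bin/sh
# Hypothetical helper: classify a VM as not-running, responsive, or
# running-but-unresponsive. Assumes the guest agent is enabled for the VM.
vm_health() {
    vmid="$1"
    status=$(qm status "$vmid" 2>/dev/null)
    case "$status" in
        *running*) ;;                          # QEMU process is up, keep checking
        *) echo "not-running"; return 0 ;;
    esac
    # Ask the guest agent for a reply; a frozen guest will not answer.
    if qm guest cmd "$vmid" ping >/dev/null 2>&1; then
        echo "responsive"
    else
        echo "running-but-unresponsive"
    fi
}
```

Called as vm_health 303 from the PVE shell after a backup, it would flag the state described above, where the status is 'running' but the guest no longer answers.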

drjaymz@ · May 25, 2023

This isn't the same issue I am getting, but it may be related. In my case, on Linux 2.4 kernel guests ONLY, I get a situation where it looks like it's frozen, but if you happen to already have an ssh connection open and you're logged in as root, you can see that it's still running but the filesystems are very broken: I/O errors, corruption, etc. Resetting the machine fixes it. We tried dumping out QEMU's view of the VM and disks and it was happy, so the problem is definitely to do with the VM not liking something that has happened to the filesystem. I can even save the broken state, reload it, and it stays broken even though that's a completely new instance of QEMU.

It always occurs off the back of a backup. The VM ran for the last 20 years under KVM/QEMU; it was only when we started using the backup functions that we started getting grief. The snapshotting used for backups does lock the filesystems regardless of guest agent, and I think something isn't handled correctly, such that the guest seems to have a bad map of the drive in memory. It's a problem I have had for 2 years and it really is a showstopper, meaning you can't rely on Proxmox not to corrupt your data, which is about as serious a flaw as you can get.

triatk · May 27, 2023

In fact, not only snapshot mode can cause this issue; the stop mode fails as well.
This is a log from a backup job to PBS.


INFO: include disk 'scsi0' 'local-lvm:vm-109-disk-1' 544972M
INFO: include disk 'efidisk0' 'local-lvm:vm-109-disk-0' 4M
INFO: stopping virtual guest
INFO: VM quit/powerdown failed
ERROR: Backup of VM 109 failed - command 'qm shutdown 109 --skiplock --keepActive --timeout 600' failed: exit code 255
INFO: Failed at 2023-05-27 04:10:43
INFO: Backup job finished with errors
TASK ERROR: job errors

The source VM is a freshly created Ubuntu 22.04 cloud image.

The shutdown process is stuck at sending SIGTERM to sys_sync.

pveversion:
pve-manager/7.4-3/9002ab8a (running kernel: 6.2.6-1-pve)

Update: This was caused by faulty RAM.

privnote · Dec 11, 2023

@drjaymz@ @fbnielsen

Were you able to resolve or isolate the problem in the meantime?

The VM always freezes for me too. I can still access the VM, but then I see the following error in the syslog.

Code:

INFO: task jbd2/sda1-8:351 blocked for more than 120 seconds.
    Not tainted 4.19.0-25-amd64 #1 Debian 4.19.289-2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I did the PVE upgrade from 7.4 to 8.1 a few days ago.
With 7.4 I also had these errors in the log from time to time, but at some point they disappeared.
I would say I haven't seen this error for about a year now.

I also use PBS for backups.
As storage backends I use both ZFS on local disks and Ceph on dedicated hardware.
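
To see which tasks the guest kernel reported as stuck, the hung-task lines can be filtered out of the kernel log. A small sketch, assuming the message format matches the excerpt above; the function name blocked_tasks is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the name of
# every task reported as blocked ("task:pid" as shown in the message).
blocked_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than [0-9]* seconds.*/\1/p'
}
```

Inside the guest it could be fed from the kernel log, for example dmesg | blocked_tasks, to check whether it is always the journal thread (jbd2) of the same disk that hangs.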

drjaymz@ · Dec 12, 2023

Still occurs, but less often, so whatever causes it hasn't been fixed. I have both v7 and v8 setups, about 32, with probably about 320 guest VMs. I may see it every couple of months, and then sometimes twice in a row. As I said earlier, it normally looks like the VM has frozen and you can't log in; however, if you were already logged in, you'll see that you get an I/O error on one or more disks. I have not seen the exact error you describe (or have not looked), but it sounds like the same thing: loss of access to the underlying disk.

godzilla · Dec 14, 2023

Maybe that's just my impression, but I feel that it happens more often when you have multiple backup tasks running on the same node. Can anyone confirm?

privnote · Dec 22, 2023

No, that's not the case for me.
I always have only one backup task active per pve node.

@cheiss
If you want to debug the error, it has been occurring daily for me since the PVE 8.1 upgrade.
At first the affected VMs were those that mainly write their data to an S3 backend (local disk RBD image), but yesterday a VM with a CephFS mount was also affected.

b.miller · Dec 22, 2023

I'm in the same boat. Haven't seen this issue for a long time on our cluster.

I thought that it might have been caused by my PBS not being updated at the same time - about 2 days after the v7 to v8 upgrade - but it is still happening. We run two sets of backup jobs every night - one PBS and one ZFS - and it seems to affect the PBS a lot more than ZFS.

Going to have to disable fs-freeze/thaw so we don't have services down over the holidays and tackle it properly in the New Year...

privnote · Dec 22, 2023

@b.miller
In my case, deactivating fs-freeze/thaw unfortunately did not help

b.miller · Dec 22, 2023

privnote said:
@b.miller
In my case, deactivating fs-freeze/thaw unfortunately did not help

No? Shame. Well just in case I deactivated the PBS jobs and left only ZFS running. I tried to replicate with freeze on/off today with one of the VMs that seemed most susceptible to the issue, but of course no luck. Maybe it's an io issue. I recall something about io_uring being related.

privnote · Dec 22, 2023

For me it occurs less frequently after I moved the affected VMs to other Proxmox nodes.
The utilization of the Proxmox nodes is about the same, so that's probably not the reason either.

But yes, the only effective countermeasure at the moment is probably to completely deactivate the backup for these VMs...

Cha0s · Dec 22, 2023

privnote said:
But yes, the only effective countermeasure at the moment is probably to completely deactivate the backup for these VMs...

For me the solution has been to disable the qemu-agent. This allows backing up the VMs with PBS without them getting "blocked".

Gh0st · Dec 26, 2023

When I see this it happens because fs-freeze cannot freeze the processes running inside the VM. This causes the VM to fuck up whilst it waits for a response from the fs-freeze command that never arrives. Usually, I see this on cPanel servers. cPanel secures the /tmp folder which prevents fs-freeze from working. Somewhere you have a process that cannot be frozen. If cPanel is in use try running /scripts/securetmp and answer N, Y, N and take a backup again.
