Snapshot backup not working (guest-agent fs-freeze times out)

I'm spinning up a test VM right now to test this with and without qemu-guest-agent
 
Took me weeks to figure out I was experiencing the same problem.

My specs:
Ubuntu 22.04 (ct)
10.6.7-MariaDB
lvm-thin-volume
 
Honestly, this issue was so disruptive I almost switched hypervisors over it. In the end, we stayed with Proxmox due to our admin comfort level. We did spin up a new 4-node cluster with dedicated ZFS mirrors in each node. I have had no real luck with Proxmox and SSDs (they just seem to have a short lifespan in these machines), so we went with enterprise spinning drives. Backups still caused issues in the SQL VMs. So, the next step was setting up Proxmox-autosnap (https://github.com/apprell/proxmox-autosnap) and having one of my TrueNAS machines pull those snapshots. Yes, restoring to a brand new VM has a few more steps, but the VM lockup issue has (so far) been eliminated. My snapshots now pull to a local TrueNAS machine, and those snapshots replicate offsite to another TrueNAS.
 
I've run into this issue as well with the latest PVE on both LXC and VMs with the latest Debian and 10.8.3-MariaDB.

Is the only workaround to switch to stop-mode backups or to switch to CentOS for VMs?
 
I've run into this issue as well with the latest PVE on both LXC and VMs with the latest Debian and 10.8.3-MariaDB.

Is the only workaround to switch to stop-mode backups or to switch to CentOS for VMs?
Hi Mach,

Disable the QEMU agent in the VM options; that'll allow the backup to run. I'm not sure if the issue is with Debian or the MariaDB builds.
If you could report this upstream and post the links here, that would be great, as I would love to get this fixed.
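If you prefer the CLI over the GUI, something like this should disable the agent option (101 is just a placeholder VMID):
Bash:
# turn off the QEMU guest agent option for the VM
qm set 101 --agent 0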
 
I'm seeing this issue on two VMs:

  • One runs Debian 11 with kernel 5.10.0. The VM was installed 13 days ago. The issue occurred only once on this VM.
  • One runs PMG 7 (Debian 11) with kernel 5.15.53. The VM was upgraded from PMG 6 to PMG 7 four days ago. The issue started occurring on the 3rd day after the upgrade and has happened on every backup job of that VM since then.

Both VMs run MariaDB 10.6. However:

  • There are many more VMs with Debian 11 and MariaDB 10.6 that haven't experienced this issue yet.
  • The issue occurs on the PMG VM even when MariaDB is not running. I did not test this on the other VM.
  • I upgraded two other PMG VMs, also running MariaDB 10.6, to the same PMG version and kernel, and the issue doesn't occur on them.

On the PMG VM, I can reproduce the issue with the following kernels. I did not test alternative kernels on the other VM.

  • 5.4.195-1-pve
  • 5.15.53-1-pve

I could not reproduce the issue with the kernel 4.13.13-5-pve.
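If it helps anyone reproduce this without waiting for a backup window: as far as I can tell, the freeze/thaw cycle the backup triggers can also be run by hand from the host (101 is a placeholder VMID; the guest agent has to be installed and enabled in the VM):
Bash:
# ask the guest agent to freeze, check, then thaw the guest filesystems
qm guest cmd 101 fsfreeze-freeze
qm guest cmd 101 fsfreeze-status
qm guest cmd 101 fsfreeze-thaw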
 
I'm here because we're using snapshots to back up a warehouse management server, and the snapshot breaks the FS, locking up the VM permanently. But everyone tells me it's the guest agent having an issue with fs-freeze and fs-unfreeze; apparently a bug that's been around for a long while, and everyone just shrugs their shoulders, and all the threads on the issue go nowhere after about a month of chit-chat (this thread is no exception). I guess people just go and find another hypervisor.
I don't have the guest agent installed in the VM. So I have no workable backup solution nor workable backups. Have I misunderstood what is being referred to by 'guest agent'?

[Screenshot: VM summary page from the Proxmox web UI]

So when backing this up using snapshot mode, the VM locks up. If you do get in, you cannot run any command because the binaries cannot be found and all file systems are invalid. The bootdisk size in the screenshot changes to 0B. You have to destroy and restart. There is nothing in the VM's syslog because, obviously, it cannot write to the syslog.

The only thing I can do is disable snapshots and rsync the FS. The reason we are using Proxmox in the first place was that it integrates these things so I wouldn't have to maintain another DIY spaghetti mess.
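In practice that means something crude like this from inside the VM (the target host and path are just placeholders):
Bash:
# copy the running filesystem to a backup host, skipping virtual and runtime mounts
rsync -aAXH --delete \
  --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/run --exclude=/tmp --exclude=/mnt \
  / backup@backup-host:/backups/wms-vm/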

Since the last post was in September, am I to assume there is no solution and that we're all just not bothering with backups?
 
So it affects 10.3 as well.

Please report this upstream to MariaDB as well: https://jira.mariadb.org/browse/MDEV-27196
Also upstream to the QEMU project: https://gitlab.com/qemu-project/qemu/-/issues/881

At the moment we don't know which side the issue lies with; the more people who report it, the more likely we are to find a fix for the issue.

Is it possible it's nothing to do with fs-freeze/thaw?
I don't have the guest agent, but it's very much the same problem on snapshot - basically the disks become inaccessible. I posit that if that's the underlying issue, then the fs-thaw will also fail.

In my case there is nothing in the log to indicate a problem; it's just buggered. But if you were communicating with the guest agent, it would tell you it's buggered, hence why people seem to think that is where the issue lies.
 
Is it possible it's nothing to do with fs-freeze/thaw?
I don't have the guest agent, but it's very much the same problem on snapshot - basically the disks become inaccessible. I posit that if that's the underlying issue, then the fs-thaw will also fail.

In my case there is nothing in the log to indicate a problem; it's just buggered. But if you were communicating with the guest agent, it would tell you it's buggered, hence why people seem to think that is where the issue lies.
Just curious, what's your backing storage on this? We were having similar issues. We beat the lockups by using ZFS snapshots and replication to get everything backed up.
 
Just curious, what's your backing storage on this? We were having similar issues. We beat the lockups by using ZFS snapshots and replication to get everything backed up.
It is ZFS.

It's a 3-node HA cluster and it replicates to the other two. We have backup storage mounted as well.

How did you implement this?
 
It is ZFS.

It's a 3-node HA cluster and it replicates to the other two. We have backup storage mounted as well.

How did you implement this?

Well, I have two different setups. One has a local ZFS pool created with Proxmox. With that, I used this with some modifications - https://github.com/apprell/proxmox-autosnap. Then I pull the snapshots with a TrueNAS server. In the other setup I have TrueNAS machines with NFS shares as the backing storage. Those are simple push replications set up in TrueNAS.
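On the local-pool setup, the snapshots TrueNAS pulls are just ordinary ZFS snapshots on the VM disks, so something like this shows what's there (the dataset name is only an example):
Bash:
# list the snapshots proxmox-autosnap has created for one VM disk
zfs list -t snapshot -o name,used,creation rpool/data/vm-101-disk-1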
 
Well, I have two different setups. One has a local ZFS pool created with Proxmox. With that, I used this with some modifications - https://github.com/apprell/proxmox-autosnap. Then I pull the snapshots with a TrueNAS server. In the other setup I have TrueNAS machines with NFS shares as the backing storage. Those are simple push replications set up in TrueNAS.
OK, here's my plan, which I have gathered from about two dozen posts on the issue:

I am using the autosnap you mention there. Then I send that to a backup NFS share we have mounted:
Bash:
$ zfs send rpool/data/vm-101-disk-1@Backup_2023_01_20  | bzip2 | pv > /mnt/pve/pittolm/snap/vm-101-disk-1.zstream.bzip2
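For the runs after the first full send, I'm thinking an incremental send against the previous snapshot should keep the stream small - something like this, with @Backup_2023_01_19 standing in for whatever the previous snapshot is actually called:
Bash:
# send only the changes between the previous and the current snapshot
$ zfs send -i rpool/data/vm-101-disk-1@Backup_2023_01_19 rpool/data/vm-101-disk-1@Backup_2023_01_20 | bzip2 | pv > /mnt/pve/pittolm/snap/vm-101-disk-1.incr.bzip2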

Now, why won't the ZFS snapshot cause the same problem I was having in the first place with VM lockups?
The reason I ask is that I have also noticed that if you create a clone from a snapshot and then run the automated backup on that, it does exactly the same thing - the original VM locks up. This is another reason I think there is fundamentally a bug right down at the bottom to do with ZFS.
See here: cant-backup-a-mariadb-server-without-breaking-it - another report that seems to be the same thing, and yet again it's shrugged shoulders, rolled eyes, a whistle, and walking away...
 
We are experiencing the same problem on multiple servers!
Why hasn't this been fixed sooner? This is a major issue!
Please fix this bug ASAP! Thank you!
 
We are experiencing the same problem on multiple servers!
Why hasn't this been fixed sooner? This is a major issue!
Please fix this bug ASAP! Thank you!
It's been around for some time - I believe it hasn't been fixed because they are looking in the wrong place, concentrating on the fs-thaw and scratching their heads. We don't have the agent installed, so fs-freeze/thaw is never called inside the VM, and yet the problem still occurs. So the "fs-thaw failed" error isn't a problem with fs-thaw itself; it's because the file system has broken underneath, and that is what produces the fs-thaw error message.

I first saw this error in March 2022! What I don't understand is why there are not more people shouting about it, unless they are all running hobbyist systems and crashing isn't a problem for them.
To reproduce it, all you need to do is enable a snapshot-mode backup job on a VM (probably on ZFS) and wait.
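Or, instead of waiting for the schedule, kick one off by hand on the host (the VMID and storage name are placeholders):
Bash:
# run a one-off snapshot-mode backup of the VM to the given storage
vzdump 101 --mode snapshot --storage backup-nfs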
 
Well, I have two different setups. One has a local ZFS pool created with Proxmox. With that, I used this with some modifications - https://github.com/apprell/proxmox-autosnap. Then I pull the snapshots with a TrueNAS server. In the other setup I have TrueNAS machines with NFS shares as the backing storage. Those are simple push replications set up in TrueNAS.

What did you need to have in place to restore from a snapshot to a fresh Proxmox node?

I assume, as a bare minimum (rough guess at the commands below):
  1. the VM config file
  2. then use zfs receive for:
    1. the snapshotted partitions (I have 3)
    2. the machine state, if we had one. I think this is independent of the partitions, but the filename must match what is in the config file.
  3. Then maybe you can find it in the UI and restore - otherwise we'll have to work out what the command is to undo it all.
It's a massive pain that backup doesn't work without destroying your VM, because it does all of the above and probably a load of other steps we don't know about.
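Rough guess at what the receive side would look like on the new node (the VMID, dataset, and config path are just examples based on my send command above, and assume the VM's config file was saved somewhere beforehand):
Bash:
# recreate the disk dataset from the saved stream
bzcat /mnt/pve/pittolm/snap/vm-101-disk-1.zstream.bzip2 | zfs receive rpool/data/vm-101-disk-1
# drop the saved VM config back in place so the VM shows up in the UI (source path is just an example)
cp /root/vm-configs/101.conf /etc/pve/qemu-server/101.conf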
 
