Snapshot backup not working (guest-agent fs-freeze times out)

Maybe, just thinking... Regarding mariadb, another interesting input might be:
instead of running mariadb natively in the Debian VM, install the docker engine and run mariadb as a docker container (https://hub.docker.com/_/mariadb), and see what happens when creating a live backup of the VM...
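If someone wants to try that, a minimal sketch of running the official image inside the VM could look like this (the password, volume path, and tag are just placeholders):
Code:
# inside the Debian VM, after installing the docker engine
docker run -d --name mariadb \
  -e MARIADB_ROOT_PASSWORD=changeme \
  -v /srv/mariadb:/var/lib/mysql \
  -p 3306:3306 \
  mariadb:10.8
Note that the bind mount keeps the data on the VM's own filesystem, so if the freeze really is a filesystem-level problem the container would presumably hit it too; the experiment mostly isolates mariadb's own interaction with fs-freeze.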
 
We've had the issue on all containers and VMs. MariaDB just seems to notice it early on, and that's where we first noticed it. I still don't think it's actually anything to do with fs-freeze at all, since it has occurred without the guest agent; the guest agent just happens to be the only process able to report that the FS is broken. Everything else is inside the VM and can't write the error, because its error log lives on the very FS that is broken. I think it's a fundamental bug, and it's not getting fixed because they are looking in the wrong place.
 
The idea would be to try to "isolate" mariadb in a container running inside the VM. If the issue no longer appeared for that VM, it could mean that the issue is related to mariadb. In my environment I have several Debian VMs, but only the one where mariadb is installed is causing the issue.
But as drjaymz@ mentioned below, the issue might not be related to mariadb at all.
 
What problem with MariaDB did you see? The first issue we saw with MariaDB is that queries suddenly started never returning; when that happened in a cluster, the entire cluster would then stop returning queries, presumably because it was waiting forever for a node to commit. My memory is sketchy on this, but a restart of the mariadb service got them working again. So at that point it was not a broken FS, and MariaDB could have been more helpful about why queries were never returning, but there was nothing in the logs at all. This would start after a snapshot had concluded, and it wasn't random; it was every single time without fail. This was an Ubuntu 22.04 container with MariaDB 10.6.7; we upgraded to 10.8, where that problem stopped. It seemed others had the same problem, so it was not just us.

But then we encountered the next problem, which was intermittent: when you do a snapshot, on completion the entire filesystem disappears. You can log in to a VM or container, but you can't call any binaries or list any directories; it just freezes. The only way you know what is going on is if you have a process inside that reports the problem, or your syslog goes to an external syslog server. This is why I think people have concentrated on the qemu guest agent: it notices that it can't access the filesystem and then tells the outside world that the fs-thaw has timed out, but we know better. The same issue occurs on a VM not running the guest agent, which means it's nothing to do with fs-freeze/thaw IMO.
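As an aside, getting syslog out of the guest before the filesystem locks up is a one-line rsyslog change; a minimal sketch, assuming a collector at 192.0.2.10 (placeholder address):
Code:
# /etc/rsyslog.d/90-remote.conf inside the guest
# (@@ forwards over TCP; a single @ would use UDP)
*.* @@192.0.2.10:514
After a systemctl restart rsyslog, the last messages before the hang survive outside the VM.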

Dig a bit deeper and you find that it seems to be an issue with qemu since a kernel change over 12 months ago.

What puzzles me is why, if it is fundamentally broken, there isn't more noise in the groups. So maybe it's a specific combination of our hardware? Just in case, here it is:
Code:
DELL poweredge r450, Intel(R) Xeon(R) Silver 4310 CPU, 32Gb of HMA82GR7DJR8N-XN, MegaRAID Tri-Mode SAS3516 , 4x MTFDDAK480TDT, ZFS
Kernel : Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1
 
Thanks for that recap.
I mentioned mariadb because in my situation the only VM having the issue is the one where mariadb 10.7.7 is installed on Debian 11. All other VMs (including one with an older mariadb 10.3.22 on Ubuntu 16.04.7) are not affected. All VMs are on local SSD disks (ZFS) and are periodically replicated across the hosts within the cluster. A backup with Proxmox Backup Server also runs nightly, and a local backup weekly. But the issue happens seldom (1-2 times per month) and is hard to reproduce manually... I'm not convinced that upgrading mariadb to version 10.8 would solve it, since it seems other people still see the issue with 10.8: https://gitlab.com/qemu-project/qemu/-/issues/881
Maybe this issue has nothing to do with mariadb. My intention here is not to debate that, but just to give further inputs that might help lead someone in the right direction by adding more use cases where the issue happens.
 
That could well be correct. Once I upgraded (for a different reason, a bug requiring tables to be constantly manually analyzed) and we still had the problem, I stopped pursuing it.
I believe the problem is linked to the kernel, so upgrading a userland program isn't going to fix anything. I say that because people reported kernel versions that never have the problem and versions after a certain point that always have it. But I never paid attention to the exact version, nor do I remember where I read that.

The reason I asked is because MariaDB exhibited a different problem there: it wasn't the whole FS that was locked, and after restarting mariadb everything was fine, whereas with the other failure mode you have no choice but to destroy the VM and start again.
 
So... posting in this old thread. I am seeing this issue on multiple machines - Ceph filesystem, backed up to an iSCSI SAN. Not all of them are running mariadb.

Anyone found a solution?
 
TLDR: No.

However, I haven't seen the issue for some time now, yet I don't think it's fixed. I could never understand why replication, which effectively does the same thing, doesn't cause an issue - but we are now many versions down the road for PVE and ZFS.

Instead of snapshots, I now tend to use Proxmox Backup Server, and that has never caused the issue. You can use it in the same way, and if you don't want to install it on a separate box you can install it as a VM on PVE, as long as you don't back it up to itself.
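For reference, a hedged example of driving that from the host CLI; "pbs-store" is a placeholder for whatever PBS storage is defined in your PVE datacenter:
Code:
# one-off backup of VM 100 to a PBS storage, using snapshot mode
vzdump 100 --mode snapshot --storage pbs-store
Scheduled jobs under Datacenter > Backup end up doing the same thing.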

As far as I know, an issue sometimes still occurs when you snapshot a running VM, causing that VM to completely lose access to its filesystems; not only when you take the snapshot, but also if you later use the snapshot you created, such as to create a clone. However, if you use PBS to back up with the snapshot method, no problems.

Qemu agent: some suggest that fs-freeze and thaw are the issue, but that's a symptom, not the cause - i.e. thaw doesn't work BECAUSE the filesystem is not accessible, not the other way around.
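If you want to test that ordering yourself, the freeze/thaw cycle a backup performs can be driven by hand from the host; a quick sketch, assuming VMID 100:
Code:
qm guest cmd 100 fsfreeze-freeze   # the same call a snapshot backup issues
qm guest cmd 100 fsfreeze-status   # should report "frozen"
qm guest cmd 100 fsfreeze-thaw     # if the FS is already gone, this is the step that times out
If the thaw hangs even though the freeze returned promptly, that supports the symptom-not-cause reading.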
 
Actually I am seeing this issue using PBS, which afaik does snapshot-based backups.

I found in another thread that disabling iothread might help. Trying that out.
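For anyone else who wants to try it, the flag sits on each disk line of the VM config; a sketch assuming VMID 100 with a scsi0 disk on local-zfs (adjust the disk spec to match your own config):
Code:
# check the current setting
qm config 100 | grep iothread
# re-specify the disk with iothread turned off
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=0
The VM needs a full stop and start afterwards for the change to take effect.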
 
Ah - very good. Yes, it does exactly the same thing and should trigger the issue. But we have had dozens of VMs backing up hourly for over a year now without a single issue, which is why I said we have not seen it for a while. And instead of cloning or snapshotting in the GUI, I tend to use the closest backup instead, and thought maybe that's part of the reason the problem appears to have gone away; although I don't think any silver bullets were found or announced, so I suspect the original issue is still there.

Not sure why disabling IOThread helps, but if you can reproduce the issue then I guess you'll be able to test it.
 
On my end, I lost hope of seeing a fix for that issue...
I'm now on Debian 12 and MariaDB 11.3, but I'm afraid to re-include this specific VM in the backup job.

I'm dumping the DBs to SQL files and backing those up with proxmox-backup-client installed in the VM...
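In case it helps someone, a sketch of that dump-then-push approach; the repository string and paths are placeholders for my setup:
Code:
# consistent dump of all databases (InnoDB) to a file
mariadb-dump --all-databases --single-transaction > /var/backups/all-dbs.sql
# push the dump to PBS from inside the VM
proxmox-backup-client backup dbdump.pxar:/var/backups \
  --repository backup@pbs@pbs.example.com:datastore1
Since the dump is an ordinary file, no snapshot or fs-freeze is involved for the database itself.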

Has anyone seen a change with Debian 12? Is it working well now?
 
I don't have these issues on Bookworm with 10.11.6 from the Debian repos. I can't remember exactly when the issue stopped, but I had it once too. Previously I used the MariaDB repo; I believe it was pinned to 10.x.
 

Just move critical machines to VMware; Proxmox will not fix this issue :(
It has been here for years and no fix is available... I also lost hope and have started moving to VMware!
 

I don't know either. However, disabling IOThread has completely fixed the issue for me. I'm pretty sure of this, because the issue re-appeared after some templates with IOThread enabled were cloned.

Running mainly:
ubuntu 22
proxmox 8.1.3
ceph 18.2.0
 
