Snapshot backup not working (guest-agent fs-freeze times out)

NoLine · Nov 17, 2021

Trying to take a snapshot backup does not work.
It looks like the issue described here:
https://forum.proxmox.com/threads/vm-hang-during-backup-fs-freeze.80152/

So far I've only seen this on a guest running Debian 11 (with MariaDB from mariadb.org).

Code:

INFO: starting new backup job: vzdump 144 --node kvm02 --remove 0 --mode snapshot --storage local --compress zstd
INFO: Starting Backup of VM 144 (qemu)
INFO: Backup started at 2021-11-17 16:29:37
INFO: status = running
INFO: VM Name: NFY-isengard
INFO: include disk 'scsi0' 'local-zfs:vm-144-disk-0' 60G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-qemu-144-2021_11_17-16_29_37.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command

and then it freezes.

In the guest syslog:

Code:

Nov 17 16:29:36 isengard qemu-ga: info: guest-ping called
Nov 17 16:29:37 isengard qemu-ga: info: guest-fsfreeze called
Nov 17 16:32:06 isengard kernel: [  363.556779] INFO: task qemu-ga:370 blocked for more than 120 seconds.
Nov 17 16:32:06 isengard kernel: [  363.556814]       Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1
Nov 17 16:32:06 isengard kernel: [  363.556829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 17 16:32:06 isengard kernel: [  363.556852] task:qemu-ga         state:D stack:    0 pid:  370 ppid:     1 flags:0x00004000
Nov 17 16:32:06 isengard kernel: [  363.556861] Call Trace:
Nov 17 16:32:06 isengard kernel: [  363.556881]  __schedule+0x282/0x870
Nov 17 16:32:06 isengard kernel: [  363.556886]  schedule+0x46/0xb0
Nov 17 16:32:06 isengard kernel: [  363.556888]  percpu_down_write+0xd2/0xe0
Nov 17 16:32:06 isengard kernel: [  363.556891]  freeze_super+0x7f/0x130
Nov 17 16:32:06 isengard kernel: [  363.556893]  __x64_sys_ioctl+0x62/0xb0
Nov 17 16:32:06 isengard kernel: [  363.556895]  do_syscall_64+0x33/0x80
Nov 17 16:32:06 isengard kernel: [  363.556897]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 17 16:32:06 isengard kernel: [  363.556903] RIP: 0033:0x7f26d5183cc7
Nov 17 16:32:06 isengard kernel: [  363.556905] RSP: 002b:00007ffc4a00b2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 17 16:32:06 isengard kernel: [  363.556907] RAX: ffffffffffffffda RBX: 0000559616e42790 RCX: 00007f26d5183cc7
Nov 17 16:32:06 isengard kernel: [  363.556916] RDX: 0000000000080000 RSI: 00000000c0045877 RDI: 0000000000000006
Nov 17 16:32:06 isengard kernel: [  363.556917] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000002a
Nov 17 16:32:06 isengard kernel: [  363.556918] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Nov 17 16:32:06 isengard kernel: [  363.556927] R13: 00007ffc4a00b3f8 R14: 00000000c0045877 R15: 0000000000000006
Nov 17 16:36:07 isengard kernel: [  605.220798] INFO: task qemu-ga:370 blocked for more than 120 seconds.
Nov 17 16:36:07 isengard kernel: [  605.220837]       Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1
Nov 17 16:36:07 isengard kernel: [  605.220860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 17 16:36:07 isengard kernel: [  605.220893] task:qemu-ga         state:D stack:    0 pid:  370 ppid:     1 flags:0x00004000
Nov 17 16:36:07 isengard kernel: [  605.220896] Call Trace:
Nov 17 16:36:07 isengard kernel: [  605.220904]  __schedule+0x282/0x870
Nov 17 16:36:07 isengard kernel: [  605.220906]  schedule+0x46/0xb0
Nov 17 16:36:07 isengard kernel: [  605.220909]  percpu_down_write+0xd2/0xe0
Nov 17 16:36:07 isengard kernel: [  605.220911]  freeze_super+0x7f/0x130
Nov 17 16:36:07 isengard kernel: [  605.220914]  __x64_sys_ioctl+0x62/0xb0
Nov 17 16:36:07 isengard kernel: [  605.220925]  do_syscall_64+0x33/0x80
Nov 17 16:36:07 isengard kernel: [  605.220926]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 17 16:36:07 isengard kernel: [  605.220929] RIP: 0033:0x7f26d5183cc7
Nov 17 16:36:07 isengard kernel: [  605.220932] RSP: 002b:00007ffc4a00b2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 17 16:36:07 isengard kernel: [  605.220934] RAX: ffffffffffffffda RBX: 0000559616e42790 RCX: 00007f26d5183cc7
Nov 17 16:36:07 isengard kernel: [  605.220935] RDX: 0000000000080000 RSI: 00000000c0045877 RDI: 0000000000000006
Nov 17 16:36:07 isengard kernel: [  605.220935] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000002a
Nov 17 16:36:07 isengard kernel: [  605.220936] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Nov 17 16:36:07 isengard kernel: [  605.220936] R13: 00007ffc4a00b3f8 R14: 00000000c0045877 R15: 0000000000000006
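
For anyone who wants to check their own guest for the same symptom, here is a minimal Python sketch that scans syslog lines for the kernel's hung-task reports and flags those whose call trace goes through freeze_super, as in the trace above. The function name and the returned fields are made up for illustration; only the log format is taken from the output above.

```python
import re

# Matches the kernel hung-task header, e.g.
# "INFO: task qemu-ga:370 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return one dict per hung-task report found in the given syslog lines.

    'in_freeze_super' is True when freeze_super shows up in the call trace
    that follows the report, i.e. the task is stuck freezing a filesystem.
    """
    reports = []
    current = None
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            current = {
                "task": match.group("task"),
                "pid": int(match.group("pid")),
                "blocked_secs": int(match.group("secs")),
                "in_freeze_super": False,
            }
            reports.append(current)
        elif current is not None and "freeze_super+" in line:
            current["in_freeze_super"] = True
    return reports


# Three lines copied from the syslog excerpt above.
sample = [
    "Nov 17 16:32:06 isengard kernel: [  363.556779] INFO: task qemu-ga:370 blocked for more than 120 seconds.",
    "Nov 17 16:32:06 isengard kernel: [  363.556861] Call Trace:",
    "Nov 17 16:32:06 isengard kernel: [  363.556891]  freeze_super+0x7f/0x130",
]
for report in find_hung_tasks(sample):
    print(report)
```

To run it against a real log, pass the lines of the file instead of the sample, for example `find_hung_tasks(open("/var/log/syslog", errors="replace"))`. The path is an assumption and depends on how logging is set up in the guest.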

Host:

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.143-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-helper: 6.4-8
pve-kernel-5.4: 6.4-7
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1

Any suggestions?

SOLTECSIS - Carles Munyoz · Nov 17, 2021

I would advise upgrading Proxmox and the QEMU Guest Agent to the latest versions.
Is that possible?

nerthazrim · Nov 18, 2021

Hi,

Same problem here: VM on PVE 7.5.1, Debian 11, MariaDB 10.6.5 installed from the MariaDB repos. Nothing else except an NFS mount point to my NAS.
EDIT: there is also filebeat installed to push logs to an ELK stack.

Proxmox is fully up to date (according to the repos, anyway), and Debian 11 is fully up to date too, including the QEMU guest agent (unless there is a specific repo to enable?): qemu-guest-agent 1:5.2+dfsg-11+deb11u1
No configuration tweaks on qemu-guest-agent.

No other Debian 11 or Ubuntu VM has the issue, only this one.

I've set up manual mysql backups in the meantime, but help would be much appreciated so that I can re-include this VM in the PBS backup jobs.

SOLTECSIS - Carles Munyoz · Nov 18, 2021

What about disabling Qemu Guest Agent?

nerthazrim · Nov 18, 2021

From what I could read in other threads, this seems to work, yes. However, I did not try it, as this VM hosts a database.
I would not trust a backup of a database VM taken without properly freezing the FS first. As far as I'm concerned, I prefer disabling the PBS backups and running mysql dumps manually on a regular basis.

Furthermore, disabling the guest agent is a workaround. I'm asking whether anyone has an idea for a proper fix, or what information someone would need to come up with one.

SOLTECSIS - Carles Munyoz · Nov 18, 2021

I understand and agree with you.
Maybe you can do both things:

Database backups over a directory.
Backup of the VM with PBS (make sure that this is launched after the database backup).

This way, PBS will hold a history of consistent database backups that you can restore if you need them.
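
The first of the two steps above could look roughly like this inside the guest. It is only a sketch: the dump directory and file naming are hypothetical, and it assumes mariadb-dump is on the PATH and can log in without a password prompt (for example via unix_socket authentication or a client config file). The PBS job itself is still scheduled on the Proxmox side, late enough that the dump has finished.

```python
import datetime
import pathlib
import subprocess

# Hypothetical target directory inside the VM; adjust to your setup.
DUMP_DIR = pathlib.Path("/var/backups/mariadb")


def dump_command(outfile):
    """Build the mariadb-dump call for a consistent dump of all databases."""
    return [
        "mariadb-dump",
        "--all-databases",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={outfile}",
    ]


def run_dump(now=None):
    """Write a timestamped SQL dump into DUMP_DIR and return its path."""
    now = now or datetime.datetime.now()
    DUMP_DIR.mkdir(parents=True, exist_ok=True)
    outfile = DUMP_DIR / f"all-databases-{now:%Y%m%d-%H%M%S}.sql"
    subprocess.run(dump_command(outfile), check=True)
    return outfile


if __name__ == "__main__":
    # Dry run: only show the command. Call run_dump() to actually dump.
    print(dump_command(DUMP_DIR / "example.sql"))
```

Because the dump ends up as a plain file on the VM disk, it is picked up by the next VM backup, even if that backup is taken without a filesystem freeze.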

nerthazrim · Nov 18, 2021

Actually, you're describing my current setup

Still, workarounds have already been discussed in other threads and I don't think they're of any interest here.
We're looking for a permanent fix if possible, or some advice on how to investigate further and find that fix, please.

SOLTECSIS - Carles Munyoz · Nov 18, 2021

OK, sorry about that.

nerthazrim · Nov 24, 2021

Another piece of information: I upgraded PVE yesterday, all packages. I restarted the node and started only the MariaDB VM.

I tried to freeze it, and it worked. I thawed it, and that worked too. I started a PVE backup job on this VM only and it worked as well.
I thought the update had fixed the issue, so I re-enabled this VM in the global backup job that runs twice a day, but the VM froze again at 2 am last night, during the job.

So, apparently, something "changed" in the VM that made the problem appear. I have no idea what... Maybe MariaDB client connections? But why that would hang the VM when PVE tries to freeze it, I have no idea...
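
A manual freeze/thaw test like the one described above could be scripted roughly as follows, to be run on the PVE host. This is a sketch rather than a definitive procedure: it relies on the `qm guest cmd <vmid> <command>` subcommands fsfreeze-status, fsfreeze-freeze and fsfreeze-thaw, while the helper names and the 60-second timeout are arbitrary choices for illustration.

```python
import subprocess


def guest_cmd(vmid, command):
    """Build a 'qm guest cmd' call for the given VM and agent command."""
    return ["qm", "guest", "cmd", str(vmid), command]


def freeze_thaw_test(vmid, timeout=60):
    """Freeze and thaw the guest filesystems once; return True if all calls succeed.

    The thaw is attempted even when the status check or the freeze fails
    or times out, so the guest is never left frozen on purpose.
    """
    ok = True
    for command in ("fsfreeze-status", "fsfreeze-freeze"):
        try:
            subprocess.run(guest_cmd(vmid, command), check=True, timeout=timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
            print(f"{command} failed: {err}")
            ok = False
            break
    try:
        subprocess.run(guest_cmd(vmid, "fsfreeze-thaw"), check=True, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
        print(f"fsfreeze-thaw failed: {err}")
        ok = False
    return ok
```

Calling `freeze_thaw_test(144)` (the VM ID is just the one from the first post) right after a reboot and again a few hours later would show whether the freeze only starts hanging after the VM has been running for a while. If the agent is already stuck inside the guest, the thaw will most likely time out as well; the script only reports that and cannot recover the VM.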

NoLine · Nov 29, 2021

I gave up on Debian 11.

Now I'm testing CentOS 8 Stream and MariaDB from mariadb.org.
So far I see no problems.

nerthazrim · Nov 29, 2021

NoLine said:
I gave up on Debian 11.

Now I'm testing CentOS 8 Stream and MariaDB from mariadb.org.
So far I see no problems.

Good to know it works well on CentOS...
Have you tried opening a ticket or forum post with MariaDB? Maybe they could explain what's wrong, since it looks like other Debian 11 VMs are not affected.

NoLine · Nov 29, 2021

If there are no problems for a few days, I will try to submit a ticket to Debian and MariaDB.

Has anyone tested Debian 11 and MariaDB from the Debian repo?

nerthazrim · Jan 20, 2022

Solution found here:
https://julian-huebenthal.de/2021/1...bian-11-vm-freezes-on-backups-guestfs-freeze/

EDIT: it didn't work. After a while, the problem comes back. Reinstalling the guest agent makes fsfreeze work for some time, and then any fsfreeze completely freezes the VM again...

FingerlessGloves · Feb 21, 2022

I have the exact same problem, with Debian and MariaDB from MariaDB's own repo.
Debian 11
MariaDB 10.7

I only set up the VM last week; so far no other Debian VMs have been affected. I did reinstall the agent on the MariaDB VM, which got it working for a while, but eventually the issue came back.

NoLine said:
If there are no problems for a few days, I will try to submit a ticket to Debian and MariaDB.

Has anyone tested Debian 11 and MariaDB from the Debian repo?

Have you reported it? I can also add to the ticket if you have.

Edit: I've also reported the issue upstream: https://gitlab.com/qemu-project/qemu/-/issues/881

Backup log below. I had to unlock the VM with qm and then force a reset for the VM to become functional again, which I did at 8 am.

Code:

103: 2022-02-21 07:00:25 INFO: Starting Backup of VM 103 (qemu)
103: 2022-02-21 07:00:25 INFO: status = running
103: 2022-02-21 07:00:25 INFO: VM Name: mariadb1
103: 2022-02-21 07:00:25 INFO: include disk 'scsi0' 'local-zfs:vm-103-disk-0' 10G
103: 2022-02-21 07:00:25 INFO: backup mode: snapshot
103: 2022-02-21 07:00:25 INFO: ionice priority: 7
103: 2022-02-21 07:00:25 INFO: creating Proxmox Backup Server archive 'vm/103/2022-02-21T07:00:25Z'
103: 2022-02-21 07:00:25 INFO: issuing guest-agent 'fs-freeze' command
103: 2022-02-21 08:00:25 ERROR: VM 103 qmp command 'guest-fsfreeze-freeze' failed - got timeout
103: 2022-02-21 08:00:25 INFO: issuing guest-agent 'fs-thaw' command
103: 2022-02-21 08:00:35 ERROR: VM 103 qmp command 'guest-fsfreeze-thaw' failed - got timeout
103: 2022-02-21 08:00:35 INFO: started backup task '5c3798d2-4a18-45f9-8120-46c47b27ad69'
103: 2022-02-21 08:00:35 INFO: resuming VM again
103: 2022-02-21 08:00:35 ERROR: VM 103 qmp command 'cont' failed - Resetting the Virtual Machine is required
103: 2022-02-21 08:00:35 INFO: aborting backup job
103: 2022-02-21 08:00:35 INFO: resuming VM again
103: 2022-02-21 08:00:35 ERROR: Backup of VM 103 failed - VM 103 qmp command 'cont' failed - Resetting the Virtual Machine is required
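
To see how long a job sat in fs-freeze before giving up, the timestamps in a log like the one above can be compared with a few lines of Python. This is a sketch that assumes the `<vmid>: <date> <time> <LEVEL>: <message>` layout shown above; the function name is made up for illustration.

```python
import datetime
import re

# One task-log line: "<vmid>: <YYYY-MM-DD HH:MM:SS> <LEVEL>: <message>"
LINE = re.compile(
    r"^\d+: (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+): (?P<msg>.*)$"
)


def freeze_wait(lines):
    """Return how long the job waited between issuing fs-freeze and the
    'guest-fsfreeze-freeze' error, or None if the freeze did not fail."""
    issued = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        ts = datetime.datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        msg = match.group("msg")
        if "issuing guest-agent 'fs-freeze' command" in msg:
            issued = ts
        elif (
            issued is not None
            and match.group("level") == "ERROR"
            and "'guest-fsfreeze-freeze' failed" in msg
        ):
            return ts - issued
    return None


# Two lines copied from the backup log above.
log = [
    "103: 2022-02-21 07:00:25 INFO: issuing guest-agent 'fs-freeze' command",
    "103: 2022-02-21 08:00:25 ERROR: VM 103 qmp command 'guest-fsfreeze-freeze' failed - got timeout",
]
print(freeze_wait(log))
```

For the two lines above this prints 1:00:00, so the job waited a full hour on the freeze before it moved on to the thaw attempt.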

FingerlessGloves · Feb 23, 2022

This has now also been reported on the MariaDB bug tracker
https://jira.mariadb.org/browse/MDEV-27196

vix9 · Mar 4, 2022

Experiencing a similar issue on machines with Ubuntu 20 and MariaDB. Backups will lock up the machine.

FingerlessGloves · Mar 5, 2022

vix9 said:
Experiencing a similar issue on machines with Ubuntu 20 and MariaDB. Backups will lock up the machine.

Are you using MariaDB's own repo or Ubuntu's?

vix9 · Mar 7, 2022

Looks like the Ubuntu repo.

FingerlessGloves · Mar 7, 2022

vix9 said:
Looks like the Ubuntu repo.

Which MariaDB version does Ubuntu 20.04 carry?

vix9 · Mar 7, 2022

mysqld Ver 10.3.34-MariaDB-0ubuntu0.20.04.1 for debian-linux-gnu on x86_64 (Ubuntu 20.04)
