Snapshot hangs if qemu-guest-agent is running / Cloudlinux

TheMrg

We use KVM.
We run CentOS 7 with cPanel (I don't think that matters) and have qemu-guest-agent installed.
Snapshots were working fine. Yesterday we installed CloudLinux on this machine (it is based on CentOS 7).
Today we wanted to create a new snapshot. It ran for over an hour and we aborted it.
Here is some log output from another attempt:

Feb 27 09:11:12 backup pvedaemon[15239]: <root@pam> snapshot VM 10001: test2
After 40 minutes we canceled the snapshot task:
Feb 27 09:58:47 backup pvedaemon[15239]: closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
Feb 27 09:58:47 backup pvedaemon[15239]: VM 10001 qmp command failed - received interrupt
Feb 27 09:58:47 backup pvedaemon[15239]: guest-fsfreeze-freeze problems - received interrupt
Feb 27 09:58:52 backup pvedaemon[26912]: <root@pam> end task UPID:backup:00003B87:0055B0BF:5E5779A0:qmsnapshot:10001:root@pam: unexpected status

The machine is locked, so we ran qm unlock 10001.
But the VM is still frozen. We had to STOP and START the machine.

We investigated for a long time and nothing helped. Only
systemctl stop qemu-guest-agent
does the trick. Now snapshots work as expected.

Other VMs work fine with snapshots and the QEMU agent.

What can it be?

Nothing in the Proxmox log, only the lines above.
message.log of the VM:
Feb 27 09:10:01 c01cpan systemd: Started Session c1248 of user root.
Feb 27 09:10:01 c01cpan systemd: Started Session c1249 of user root.
Feb 27 09:10:05 c01cpan systemd: Removed slice User Slice of root.
Feb 27 09:10:07 c01cpan qemu-ga: info: guest-ping called
Feb 27 09:10:19 c01cpan qemu-ga: info: guest-ping called
Feb 27 09:10:32 c01cpan qemu-ga: info: guest-ping called
Feb 27 09:10:44 c01cpan qemu-ga: info: guest-ping called
Feb 27 09:10:56 c01cpan qemu-ga: info: guest-ping called
Feb 27 09:11:12 c01cpan qemu-ga: info: guest-ping called
Feb 27 09:11:12 c01cpan qemu-ga: info: guest-fsfreeze called
Feb 27 09:11:12 c01cpan qemu-ga: info: executing fsfreeze hook with arg 'freeze'
The next line is from after the STOP and START:
Feb 27 10:02:14 c01cpan kernel: Initializing cgroup subsys cpuset
Feb 27 10:02:14 c01cpan kernel: Initializing cgroup subsys cpu
Feb 27 10:02:14 c01cpan kernel: Initializing cgroup subsys cpuacct
Feb 27 10:02:14 c01cpan kernel: Linux version 3.10.0-962.3.2.lve1.5.28.el7.x86_64

Thanks so far.
 
Looks like the guest-agent fsfreeze command took too long.
It is run in order to maximize the consistency of the guest's filesystem before snapshotting.

My guess is that at the point where you tried to snapshot, something in the guest was blocking the `sync` call.
Usually the journal should contain some hints. You could also look for processes in D or D+ state.
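
One way to look for such blocked processes inside the guest is to filter `ps` output on the state column. This is only a minimal sketch, assuming a procps-style `ps`; the helper name `filter_d_state` is made up for illustration.

```shell
# Keep the header line plus every process whose STAT column starts with D
# (uninterruptible sleep, usually waiting on I/O). Reads ps output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# wchan shows the kernel function the process is currently sleeping in.
ps -eo pid,stat,wchan,comm | filter_d_state
```

If only the header line is printed, nothing was in D state at that moment. It may be worth running this a few times around the time of the snapshot.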

I hope this helps!
 
Thanks. Our workaround is to disable the QEMU agent.
Can you explain how exactly to read the journal?
 
The journal is the system log. You can read it with `journalctl`; see the man page (`man journalctl`).

I'd suggest that you still check what's causing the I/O load in the VM.
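
As a rough sketch of what reading the journal could look like inside the guest: the unit name `qemu-guest-agent` is the one used earlier in this thread, while the function names are invented for this example. On CentOS 7 the journal is usually not kept across reboots unless persistent storage is configured, so after a STOP and START the older messages may only be in `/var/log/messages`.

```shell
# Agent messages from the current boot (unit name as used in this thread).
show_agent_log() {
    journalctl -b -u qemu-guest-agent --no-pager
}

# Kernel messages of priority warning or worse from the current boot.
show_kernel_warnings() {
    journalctl -b -k -p warning --no-pager
}

# Keep only the kernel's hung-task reports from a log read on stdin.
filter_hung_tasks() {
    grep -E 'blocked for more than [0-9]+ seconds' || true
}
```

For example, `show_kernel_warnings | filter_hung_tasks` or `filter_hung_tasks < /var/log/messages` would show whether the kernel reported tasks stuck in uninterruptible sleep.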
 
OK, but when this happens that is not possible. The server hangs and we cannot do anything on it.
With qemu-agent disabled we tried to create a backup:
INFO: issuing guest-agent 'fs-freeze' command
and the server hangs.
 
"INFO: issuing guest-agent 'fs-freeze' command
and the server hangs."
If you stop the guest-agent inside the VM, you also need to disable it in the VM's options!
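
On the command line, the VM option can be switched off on the Proxmox host with `qm set`. This is a sketch, not an official procedure: 10001 is the VM ID from this thread, and `disable_agent` is a hypothetical wrapper name. Whether the change takes effect for an already running VM without a restart is not confirmed anywhere in this thread.

```shell
# Turn off the "QEMU Guest Agent" option for one VM, then show the setting.
# Run on the Proxmox host; pass the VM ID as the first argument.
disable_agent() {
    qm set "$1" --agent 0
    # Print the agent line from the VM config, if one is still present.
    qm config "$1" | grep '^agent' || true
}
```

Calling `disable_agent 10001` should leave the VM config with `agent: 0`, after which backups and snapshots would no longer issue the fs-freeze command.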

try logging into the VM before running a backup and running `sync` - that might show if the problem only occurs during backup or if the st
 
"If you stop the guest-agent inside the VM - you also need to disable it in the VM's options! "
Thanks :)
The server is in production, so we cannot test much :/

But with the guest-agent running, snapshot and backup hang because of the freeze.

Is it unsafe to run without the guest-agent?
 
"But with the guest-agent running, snapshot and backup hang because of the freeze."
As said, I would guess that this is the result of too much load on the storage where the guest has its disks!

"Is it unsafe to run without the guest-agent?"

Not in general. The guest-agent is used (while doing a backup/snapshot) to freeze the filesystem and thus increase the consistency of the snapshot.
You can (and should) try to restore a backup and see if everything runs as expected.
 
how long does a `sync` take?
`time sync`
 
Just ran into this today too. When using CloudLinux 8 with the QEMU guest agent enabled, the VM locks up on the freeze operation. With the guest agent turned off in Proxmox there are no issues.
 
Securetmp enables /dev/loop mounts

You'll need to disable it if you want to run snapshots/backups with CloudLinux

https://support.cpanel.net/hc/en-us/articles/360058525333-How-to-disable-scripts-securetmp

This is what CloudLinux said

Hello,

The issue is not related to CloudLinux directly, but to Qemu agent, which does not freeze the file system(s) correctly. What is actually happening:

When VM backup is invoked, Qemu agent freezes the file systems, so no single change will be made during the backup. But Qemu agent does not respect the loop* devices in freezing order (we have checked its sources), which leads to the next situation:
1) freeze loopback fs
---> send async reqs to loopback thread
2) freeze main fs
3) loopback thread wakes up and trying to write data to the main fs, which is still frozen, and this finally leads to the hung task and kernel crash.

I'm afraid we have no further recommendations at this point.

Thank you.
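
To check whether a guest has the kind of loop-backed filesystem described in that reply, the mount table can be filtered for `/dev/loop*` sources. This is a small sketch assuming a Linux guest with `/proc/mounts`; the helper name `loop_mounts` is invented here.

```shell
# Print device, mount point and filesystem type for every mount whose
# source is a loop device. Reads /proc/mounts-formatted lines on stdin.
loop_mounts() {
    awk '$1 ~ /^\/dev\/loop[0-9]+$/ { print $1, $2, $3 }'
}

if [ -r /proc/mounts ]; then
    loop_mounts < /proc/mounts
fi

# Show which file backs each loop device, if losetup is available.
if command -v losetup >/dev/null 2>&1; then
    losetup -a
fi
```

Any line printed by `loop_mounts` is a filesystem that lives in a file on another filesystem, which is the situation the freeze-order description above is about.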
 
Is there any way the Proxmox team can fix this in QEMU?

CloudLinux's explanation above says the QEMU agent does not respect loop* devices in its freezing order, and securetmp enables /dev/loop mounts. So how can we use qemu-guest-agent for snapshot backups on CloudLinux and/or cPanel VMs without having to disable securetmp?

Thanks
 
OK, I found out that when using cPanel without CloudLinux, setting a user to jailed shell in cPanel causes fs-freeze to hang. When you remove the user(s) from jailed shell, fs-freeze works as it should for snapshot backups.

Posting this in case anyone else has this issue. If you need to keep your users secured in a jailed shell, you need to disable the QEMU guest agent in PVE for that VM. Otherwise snapshot backups will not work and fs-freeze will hang.

And I suppose in CloudLinux every user is jailed using CageFS for security. That's why this is an issue.
 