[SOLVED] snapshot stopping VM

I don't understand which process you want me to strace. The qm process?

The qemu agent process that is not working correctly. The qm process will tell you that it got a timeout, but not why. Therefore you need to strace the guest agent.
 
Ok, here is the strace log. The first 95% seems to be a bunch of missing Perl scripts, but since it works on other VMs I suppose that's not important. Then at line 7791 there's a "Resource temporarily unavailable":
futex(0x7f9f16ab6010, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1591953038, tv_nsec=874605592}, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
 

Attachments

  • qm-fsfreeze-freeze.strace.log
    623.6 KB
The strace looks like it was done on the PVE guest, is that right? You will not have these files on your guest:

Code:
stat("/etc/pve/nodes/server36/qemu-server/105.conf", {st_mode=S_IFREG|0640, st_size=506, ...}) = 0
open("/var/run/qemu-server/105.pid", O_RDONLY) = 5
 
No, it was of course done on the host:

Code:
# strace -o qm-fsfreeze-freeze.strace qm guest cmd 105 fsfreeze-freeze
 
Sorry, I meant to write host, but the strace has to be done on the guest, because the guest is not responding properly. The host is getting a timeout, but we don't know why.
 
Sorry, maybe I'm thick, but how would I do that? Please be specific.

I think you want me to run the command on the host while running an strace in the guest, is that right? Which process do I strace in the guest?
 
The qemu guest agent, because it is not answering correctly. It would be interesting to see what it does while the outer qm has problems. We saw from the host that the guest does not answer in time, so we need to find out what is going on in the guest.
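For reference, attaching to the running agent inside the guest could look something like the sketch below (assuming the agent binary is qemu-ga, as on most distributions, and that it is already running):

Code:
# inside the guest: attach strace to the running agent and follow its forks
strace -f -tt -o /root/qemu-ga.strace -p "$(pidof qemu-ga)"
# on the host: trigger the command that times out
qm guest cmd 105 fsfreeze-freeze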
 
I did an strace on /usr/bin/qemu-ga - I hope that is what you mean - and the last thing it did was go through cPanel's /home/virtfs directory used for their jailshell implementation.

Code:
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcc9b17fa50) = 22174
wait4(22174, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 22174
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=22174, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
open("/proc/self/mountinfo", O_RDONLY)  = 6
fstat(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcc9b18c000
read(6, "17 39 0:17 / /sys rw,nosuid,node"..., 1024) = 1024
read(6, "/blkio rw,nosuid,nodev,noexec,re"..., 1024) = 1024
read(6, " 0:32 / /sys/fs/cgroup/cpuset rw"..., 1024) = 1024
read(6, ":0 / /var/tmp rw,nosuid,noexec,r"..., 1024) = 1024
read(6, "dable - xfs /dev/mapper/centos-r"..., 1024) = 1024
read(6, "etc/alternatives ro,nosuid,relat"..., 1024) = 1024
read(6, "me/virtfs/xxxxxxxx/usr/local/cpa"..., 1024) = 1024
read(6, "virtfs/xxxxxxxx/usr/local/cpanel"..., 1024) = 1024
read(6, "fs /dev/mapper/centos-root rw,se"..., 1024) = 1024
read(6, "fs/xxxxxxxxxxxxxxxx/var/spool rw"..., 1024) = 1024
read(6, "uid,noexec,relatime unbindable -"..., 1024) = 1024
read(6, "/3rdparty/mailman/spam rw,relati"..., 1024) = 1024
read(6, "xxxxxxxxxxxx/home/xxxxxxxxxxxxxx"..., 1024) = 1024
read(6, ",noexec,relatime unbindable - xf"..., 1024) = 1024
read(6, "/home/virtfs/xxxxxxxx/etc/altern"..., 1024) = 1024
read(6, "party/mailman/data /home/virtfs/"..., 1024) = 1024
read(6, "/mailman/qfiles /home/virtfs/xxx"..., 1024) = 1024
read(6, "pper/centos-root rw,seclabel,att"..., 1024) = 1024
read(6, " - xfs /dev/mapper/centos-root r"..., 1024) = 1024
read(6, "/cpanel/email_send_limits /home/"..., 1024) = 1024
read(6, "ilman/lists /home/virtfs/xxxxxxx"..., 1024) = 1024
read(6, ",size=2452204k,mode=700\n", 1024) = 24
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x7fcc9b18c000, 4096)            = 0
open("/var/run/qga.state.isfrozen", O_WRONLY|O_CREAT, 0600) = 6
close(6)                                = 0
open("/home/virtfs/xxxxxxxx/home/xxxxxxxx", O_RDONLY|O_CLOEXEC) = 6
ioctl(6, FIFREEZE

I've obfuscated some directory names as they are identifiable.
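The trace ends in the FIFREEZE ioctl on one of the virtfs bind mounts (fd 6 was opened on the /home/virtfs/.../home/... path just above). If it helps to narrow it down, the same freeze can be issued by hand per mount point with util-linux's fsfreeze; a sketch, with SOMEUSER standing in for the obfuscated names:

Code:
# inside the guest: freeze and immediately thaw a single suspect mount point
fsfreeze -f /home/virtfs/SOMEUSER/home/SOMEUSER
fsfreeze -u /home/virtfs/SOMEUSER/home/SOMEUSER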
 
Well, since it seems it might have something to do with cPanel's jailshells, I went ahead and disabled all jailshells and tried again ... and still NO difference. :/

So I removed the qemu-guest-agent and tried running the vzdump. That works OK - of course without the freeze - but then, due to the relatively large size of the VM and the relatively small size of the dump dir, I ran out of space in the dump dir. The VM's disk is 350G, the dump dir 220G. Thing is, only 170G is actually used by the VM, so I would have to fstrim it before running the vzdump -- and obviously I can't do that from the host without qemu-guest-agent support. Catch-22... :/

So back to trying to get the fsfreeze thing working. There have been suggestions that the problem could be caused by slow IO. I doubt it: the VM has 24G of RAM by now, the disks are SSDs (hence the limited disk space), and a sync completes in a fraction of a second. Any other suggestions for what I can do to fix this problem?
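For what it's worth, the slow-IO theory is easy to sanity-check on the host side; pveperf ships with PVE, and the path below is just an example -- it should be whatever storage backs the VM disk:

Code:
# on the host: rough fsync/s and throughput numbers for the storage backing the VM
pveperf /var/lib/vz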

Alternatively, is there any way to reduce a running VM's disk size? It's LVM.
 
I just tried the fsfreeze while having top running in the guest, and it turns out that the guest isn't actually freezing up immediately. It starts by running up the loadavg like crazy. I don't see any particular process doing this; it just keeps increasing until it finally causes the guest to become unresponsive.
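A quick way to see what is piling up at that point (a sketch, run inside the guest while the load climbs; load rising without CPU use usually means processes stuck in uninterruptible sleep):

Code:
# inside the guest: list processes stuck in uninterruptible sleep (D state) and where they wait
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'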
 
So here's the plan.

After having disabled the qemu-guest-agent and run an fstrim in the VM, I was finally able to successfully create a vzdump of the VM.
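In case it's useful to anyone: the trim was just the standard util-linux tool run inside the VM; the lsblk line is only there to confirm the virtual disk advertises discard support at all:

Code:
# inside the VM: check that the virtual disk advertises discard support
lsblk -D
# trim all mounted filesystems that support it
fstrim -av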

This VM is using 180G of a 350G disk. Now I'd like to restore this vzdump with qmrestore, to a smaller disk, to see if that would make any difference to the fsfreeze process, but I am a little unsure of the right syntax for qmrestore.

How do I tell qmrestore to use a smaller disk for this restore, like 250G ? Or how do you restore a VM to a smaller disk? I see it suggested often, but the man page is not very clear about it - to say the least.

EDIT: hint: qemu-img. See e.g. https://forum.proxmox.com/threads/re-sized-vm-hard-disk-too-large-how-to-make-smaller.70303/
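For a file-based image, that hint boils down to something like the sketch below; the path and format are only examples, and the filesystem inside the guest must already fit within the new size or any data beyond it is lost:

Code:
# on the host, with the VM stopped: shrink a file-based disk image
qemu-img resize --shrink /var/lib/vz/images/105/vm-105-disk-0.qcow2 250G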
 
Problem solved. After conferring with cPanel support, they suggested removing a loopback /tmp partition used for security (https://docs.cpanel.net/knowledge-base/security/tips-to-make-your-server-more-secure/) - after removing this, the fsfreeze/thaw now appears to function correctly.
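For anyone hitting the same thing, whether /tmp is on such a loopback file is quick to check inside the guest (cPanel's "securetmp" feature usually backs it with a file such as /usr/tmpDSK, though that path may vary):

Code:
# inside the guest: is /tmp a loop-mounted file, and which loop devices are in use?
findmnt /tmp
losetup -a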

Nice that you finally solved your problem.

How do I tell qmrestore to use a smaller disk for this restore, like 250G ?

You cannot. You need to do that manually, because there is no way the backup mechanism can know where you wrote data and where you did not. For some OSes there might be software around that can do that (e.g. for Windows), but there is no general solution, and therefore there is no such thing as restoring to a smaller disk or reducing a disk via the GUI.
 
Hi, the same thing happened to me last night. I am running cPanel on a VM. It got completely unresponsive until I rebooted it. I did a manual backup in snapshot mode afterwards and it went well :confused:.

105: 2020-10-29 22:15:51 INFO: creating archive '/qnap/backups/dump/vzdump-qemu-105-2020_10_29-22_15_50.vma.lzo'
105: 2020-10-29 22:15:51 INFO: issuing guest-agent 'fs-freeze' command
105: 2020-10-29 23:15:51 ERROR: VM 105 qmp command 'guest-fsfreeze-freeze' failed - got timeout
105: 2020-10-29 23:15:52 INFO: issuing guest-agent 'fs-thaw' command
105: 2020-10-29 23:16:02 ERROR: VM 105 qmp command 'guest-fsfreeze-thaw' failed - got timeout
105: 2020-10-29 23:16:02 INFO: started backup task '41548ee6-4f72-4ec0-8013-c065b8e526be'
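When the freeze step times out like this, a quick way to tell "agent dead" from "freeze stuck" is to query the agent directly from the host (105 being the VMID from the log above):

Code:
# on the host: is the guest agent responding at all?
qm guest cmd 105 ping
# does the agent think any filesystems are still frozen?
qm guest cmd 105 fsfreeze-status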
 
Hi, it happened again last night! I have this backup job running every night, but it has only failed twice. This time it was very difficult to reboot the VM; I had to disable it in HA and then reset it.

105: 2020-12-02 20:00:03 INFO: creating archive '/qnap/backups/dump/vzdump-qemu-105-2020_12_02-20_00_02.vma.lzo'
105: 2020-12-02 20:00:03 INFO: issuing guest-agent 'fs-freeze' command
105: 2020-12-02 21:00:03 ERROR: VM 105 qmp command 'guest-fsfreeze-freeze' failed - got timeout
105: 2020-12-02 21:00:04 INFO: issuing guest-agent 'fs-thaw' command
105: 2020-12-02 21:00:14 ERROR: VM 105 qmp command 'guest-fsfreeze-thaw' failed - got timeout
105: 2020-12-02 21:00:14 INFO: started backup task 'a3a6cad3-33a4-44a6-bcba-31f899daadc8'
 
Problem solved. After conferring with cPanel support, they suggested removing a loopback /tmp partition used for security (https://docs.cpanel.net/knowledge-base/security/tips-to-make-your-server-more-secure/) - after removing this, the fsfreeze/thaw now appears to function correctly.
Hi @rcd

thanks for your notes.

We are experiencing the same issue with cPanel. Can you please share what exactly was changed for the loopback device? The referenced link doesn't make any mention of loopback, so I'm a little confused as to what we should be looking at specifically.

thanks in advance.

speak soon

""Cheers
G
 
Sorry for entering this thread in 2022.
Same problem with my vzdump for one VM.
Version 7.2 - latest Proxmox updates. The VM is a Debian Bullseye with guest tools.
Question: in my VM I mount another partition for /var/www. Might this cause the problem?
Question: is there a limit on the partition size? My /var/www is 8T (yes, tera) - the root disk is just several GB.
 
Hi @high_performer

it may be that 8 TB is way too big to be backed up with VZDump as it may time out.

how much of the 8 TB is being consumed?

do you have any log output?

it may be a different issue, so it would be better to open a new thread and add your log details and more info there to be investigated.
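(You can check the actual usage quickly inside the VM with a plain df; /var/www being the mount point you mentioned:)

Code:
# inside the VM: how much of the 8T partition is actually used?
df -h /var/www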

""Cheers
G
 
