[SOLVED] snapshot stopping VM

rcd

I have a problem with snapshots. I run "vzdump 105 --mode snapshot", but it stops the virtual machine for the complete duration of the backup. The VM also does not seem to start again after the dump completes; I had to reset it to get it working again.

The backup device is the standard /var/lib/vz, an LVM device.

Code:
# vzdump 105 --mode snapshot
INFO: starting new backup job: vzdump 105 --mode snapshot
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2019-11-02 09:13:22
INFO: status = running
INFO: update VM 105: -lock backup
INFO: VM Name: 3605-EU
INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-105-2019_11_02-09_13_22.vma'

At this point, after about 30 min, the vma file size still shows zero, and the VM hangs hard. I can't ssh to it, can't do anything.
$ ssh 3605
ssh: connect to host 3605 port 22: Network is unreachable

I then run vzdump -stop, and it kind of triggers something so that the backup continues:

Code:
ERROR: interrupted by signal
ERROR: VM 105 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 105 qga socket - timeout after 101 retries
INFO: started backup task '07c365f3-5d7b-4367-b5d6-b6ea07fb16d5'
INFO: status: 1% (3611295744/214748364800), sparse 0% (1058140160), duration 3, read/write 1203/851 MB/s
INFO: status: 2% (4366991360/214748364800), sparse 0% (1061658624), duration 6, read/write 251/250 MB/s
INFO: status: 3% (6603407360/214748364800), sparse 0% (1084887040), duration 16, read/write 223/221 MB/s
INFO: status: 4% (8758231040/214748364800), sparse 0% (1084887040), duration 26, read/write 215/215 MB/s
INFO: status: 5% (10879369216/214748364800), sparse 0% (1084903424), duration 35, read/write 235/235 MB/s
INFO: status: 6% (12886605824/214748364800), sparse 0% (1096822784), duration 45, read/write 200/199 MB/s
INFO: status: 7% (15205138432/214748364800), sparse 0% (1096822784), duration 56, read/write 210/210 MB/s
INFO: status: 8% (17340039168/214748364800), sparse 0% (1097998336), duration 65, read/write 237/237 MB/s
INFO: status: 9% (19419561984/214748364800), sparse 0% (1191800832), duration 74, read/write 231/220 MB/s
INFO: status: 10% (21576351744/214748364800), sparse 0% (1201209344), duration 83, read/write 239/238 MB/s
INFO: status: 11% (23652925440/214748364800), sparse 0% (1219858432), duration 92, read/write 230/228 MB/s

The VM is still completely unresponsive, so I run the vzdump -stop again:

Code:
ERROR: interrupted by signal
INFO: aborting backup job
ERROR: Backup of VM 105 failed - interrupted by signal
INFO: Failed at 2019-11-02 09:47:49
ERROR: Backup job failed - interrupted by signal
interrupted by signal
root@server36:~#

Now the snapshot appears to have been stopped but the VM is still completely unresponsive. I try everything but the only way I can get it back is by issuing a qm reset.

All other VMs on the machine snapshot just fine, but this one is bigger: 200G total, 120G used. /var/lib/vz is 200G. I don't know why that should be the problem though, as it doesn't even start writing anything before it freezes.

What is going on here? I need to be able to snapshot this VM. Any logfiles that could give a hint?
 
qemu-guest-agent is installed in the VM:

Code:
# rpm -qa | grep qemu
qemu-guest-agent-2.12.0-3.el7.x86_64

And it is enabled in Proxmox:
 
Proxmox GUI: [screenshot 1572778409698.png]

Client:
Bash:
# ps -ef | grep qemu
root      1100     1  0 05:49 ?        00:00:05 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook

Hypervisor:
Bash:
# qm agent 105 ping
# qm agent 105 info
{
   "supported_commands" : [
      {
         "enabled" : true,
         "name" : "guest-get-osinfo",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-timezone",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-users",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-host-name",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-exec",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-exec-status",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-memory-block-info",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-set-memory-blocks",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-memory-blocks",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-set-user-password",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-fsinfo",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-set-vcpus",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-vcpus",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-network-get-interfaces",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-suspend-hybrid",
         "success-response" : false
      },
      {
         "enabled" : true,
         "name" : "guest-suspend-ram",
         "success-response" : false
      },
      {
         "enabled" : true,
         "name" : "guest-suspend-disk",
         "success-response" : false
      },
      {
         "enabled" : true,
         "name" : "guest-fstrim",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-fsfreeze-thaw",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-fsfreeze-freeze-list",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-fsfreeze-freeze",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-fsfreeze-status",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-file-flush",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-file-seek",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-file-write",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-file-read",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-file-close",
         "success-response" : true
      },
      {
         "enabled" : false,
         "name" : "guest-file-open",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-shutdown",
         "success-response" : false
      },
      {
         "enabled" : true,
         "name" : "guest-info",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-set-time",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-get-time",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-ping",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-sync",
         "success-response" : true
      },
      {
         "enabled" : true,
         "name" : "guest-sync-delimited",
         "success-response" : true
      }
   ],
   "version" : "2.12.0"
}
#


It happens to be a live server with hundreds of users on it, so I'm not keen on trying random stuff without at least checking whatever logs might hint at what the problem is. Is there some logfile that may give an idea of what is going on?
 
If you have a guest agent enabled and working, we issue a 'guest-fsfreeze-freeze' before the backup and a 'thaw' afterwards.
Depending on how the guest agent inside is configured, this can take a while (our timeout is 60 min) and can of course create load inside the guest (e.g. in a VM with a lot of data in the cache, doing a 'sync' will produce high load if the storage is not fast enough).

Your first 'stop' command then interrupts the wait for the freeze to complete; vzdump tries again but fails (because it is already running) and backs up the VM even though the freeze never finished.

So my guess is that the freeze command overloads your storage/VM, which makes it appear to stop. The fix would be to reconfigure the guest agent or disable it altogether.
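
To get a rough idea of how much data the guest would have to flush when the freeze hits, you could look at the dirty page cache inside the VM shortly before a backup. This is only a sketch: dirty_kb is a made-up helper name, and it assumes a Linux guest where /proc/meminfo reports "Dirty:" and "Writeback:" in kB.

```shell
#!/bin/sh
# Sketch only: dirty_kb is a hypothetical helper, not part of any package.
# It sums the Dirty and Writeback figures (in kB) from a meminfo-style file.
# The file defaults to /proc/meminfo; a path can be passed for testing.
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}
```

Running dirty_kb followed by "time sync" inside the guest should show how long flushing that amount of data takes. If a plain sync is already slow, the freeze is likely to be slow as well.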
 
The storage is SSD, so it can't be much faster. The VM has 8 GB RAM and 4 cores and runs with a load average of around 1.00 at the times I tried.

My objective is to run a full backup without stopping the machine. It was my understanding that this is the purpose of snapshot backups? Will that still work with the guest agent disabled?
 
Try to disable the qemu-agent option in Proxmox and see if the problem disappears. This would confirm an issue with the freeze operation run by the guest, and will help in debugging it further.
 
I still don't understand what is expected to happen if I disable the qemu-agent option?
 
Without the qemu agent, the backup will be made the same way, but without the FS freeze in the guest. So the FS might need an fsck in case of a restore (it will be in the same state as it would be after a power cut at the time of the backup). The qemu agent just makes this a little more reliable by freezing activity in the guest.
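
If you prefer the shell over the GUI, toggling the option could look roughly like this. set_agent and show_agent are hypothetical helper names, and 105 is just the VM ID from this thread. I'm assuming that "qm set <vmid> --agent <0|1>" does the same as the GUI checkbox, and that a change made to a running VM stays pending until the VM is fully stopped and started again.

```shell
#!/bin/sh
# Sketch only: set_agent and show_agent are hypothetical helper names.

# Enable (1) or disable (0) the QEMU guest agent option for a VM.
set_agent() {
    # $1 = VM ID, $2 = 1 to enable or 0 to disable
    case $2 in
        0|1) qm set "$1" --agent "$2" ;;
        *)   echo "usage: set_agent <vmid> <0|1>" >&2; return 2 ;;
    esac
}

# Show the agent line of the active config, then any pending agent change.
# Assumption: a pending change only applies after a full stop and start.
show_agent() {
    qm config "$1" | grep '^agent'
    qm pending "$1" | grep 'agent'
}

# Example (run on the Proxmox host):
#   set_agent 105 0
#   show_agent 105
```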
 
I tried to disable qemu-agent, but the option showed Enabled|Disabled (on top of each other), as if the VM has to be rebooted to activate the change?
 
I rebooted the VM but it didn't suffice; I had to actually shut it down and restart it.

Anyway, up again, without the qemu agent, and a vzdump -snapshot performs as it should.
 
OK, now that we're sure the issue occurs when using the guest agent, you can try to debug a bit more. Try to enable it again and see what happens when you manually send a guest-fsfreeze-freeze command.
 
From the command line of the host:
Code:
root@pve:~# qm agent 119 fsfreeze-status
thawed
root@pve:~# qm agent 119 fsfreeze-freeze
5
root@pve:~# qm agent 119 fsfreeze-status
frozen
root@pve:~# qm agent 119 fsfreeze-thaw
5
root@pve:~# qm agent 119 fsfreeze-status
thawed
root@pve:~#

If you can reproduce the issue manually (fsfreeze-freeze hanging indefinitely), you'll have to debug it inside your guest.
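
If you are worried about the manual test itself hanging, a guarded version could look something like this. It is only a sketch: guarded_freeze is a hypothetical helper, the timings are arbitrary, and it relies on the coreutils "timeout" command to bound each agent call. If the agent itself stops answering, the thaw will fail too, so this may not save you from a reset in that case.

```shell
#!/bin/sh
# Sketch only: guarded_freeze is a hypothetical helper, not a Proxmox tool.
# It freezes the guest, holds the freeze briefly, and always attempts a thaw.
guarded_freeze() {
    vmid=$1
    limit=${2:-60}   # seconds to wait for each agent call
    hold=${3:-5}     # seconds to stay frozen before thawing

    # Bound the freeze call so the shell is never stuck waiting on the agent.
    if timeout "$limit" qm agent "$vmid" fsfreeze-freeze; then
        timeout "$limit" qm agent "$vmid" fsfreeze-status
        sleep "$hold"
    else
        echo "freeze failed or timed out after ${limit}s" >&2
    fi

    # Attempt a thaw in every case; the exit code of this call is returned.
    timeout "$limit" qm agent "$vmid" fsfreeze-thaw
}

# Example (run on the Proxmox host):
#   guarded_freeze 105 60 5
```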
 
If it freezes up again, is there any way to make it unfreeze other than a reset of the VM?

Also, before doing this, would it make any sense to remove/reinstall the qemu-agent software?
 
That seemed to go fine ...

Code:
root@pve:~# qm agent 105 fsfreeze-status
thawed
root@pve:~# qm agent 105 fsfreeze-freeze
3
root@pve:~# qm agent 105 fsfreeze-status
frozen
root@pve:~# qm agent 105 fsfreeze-thaw
3
root@pve:~# qm agent 105 fsfreeze-status
thawed
root@pve:~#
 
Drunk with the success I decided to try the snapshot again.

Code:
root@pme:~# qm agent 105 fsfreeze-status
thawed
root@pme:~# vzdump 105 --mode snapshot
INFO: starting new backup job: vzdump 105 --mode snapshot
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2019-11-05 05:37:47
INFO: status = running
INFO: update VM 105: -lock backup
INFO: VM Name: 3605-EU
INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-105-2019_11_05-05_37_47.vma'
...

At this point vm 105 had stopped responding, so I opened another terminal ...

Code:
root@pme:~# qm agent 105 fsfreeze-status
QEMU guest agent is not running
root@pme:~# qm agent 105 fsfreeze-thaw
QEMU guest agent is not running

So, before the snapshot the qemu-agent was running fine, but as soon as I start the snapshot it appears to die.

So I had to reset the vm, and after the reboot it seems the agent runs again.

Code:
root@pve:~# qm agent 105 fsfreeze-status
thawed

Ideas? Any logfile that can give a hint?
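
One thing that might narrow this down is polling the agent from a second terminal while the backup starts, to see at which point it stops answering. A minimal sketch, where watch_agent is a hypothetical helper name and I'm assuming "qm agent <vmid> ping" exits non-zero when the agent does not respond:

```shell
#!/bin/sh
# Sketch only: watch_agent is a hypothetical helper name.
# It probes the guest agent at a fixed interval and prints a timestamped state.
watch_agent() {
    vmid=$1
    interval=${2:-5}   # seconds between probes
    count=${3:-60}     # number of probes before giving up
    i=0
    while [ "$i" -lt "$count" ]; do
        # Assumption: a failed ping returns a non-zero exit code.
        if qm agent "$vmid" ping >/dev/null 2>&1; then
            state=up
        else
            state=down
        fi
        printf '%s agent=%s\n' "$(date '+%H:%M:%S')" "$state"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: start this first, then run vzdump in another terminal.
#   watch_agent 105 5 120
```

Comparing the timestamp of the first "agent=down" line with the vzdump log should show whether the agent goes away right at the freeze or only later.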
 