[SOLVED] snapshot stopping VM

You have to activate debugging on the qemu-guest-agent to get more out of it. Please see /etc/default/qemu-guest-agent.

I don't have any /etc/default/qemu-guest-agent on my VM:

Code:
]# rpm -ql qemu-guest-agent
/etc/qemu-ga
/etc/qemu-ga/fsfreeze-hook
/etc/qemu-ga/fsfreeze-hook.d
/etc/sysconfig/qemu-ga
/usr/bin/qemu-ga
/usr/lib/systemd/system/qemu-guest-agent.service
/usr/lib/udev/rules.d/99-qemu-guest-agent.rules
/usr/share/doc/qemu-guest-agent-2.12.0
/usr/share/doc/qemu-guest-agent-2.12.0/COPYING
/usr/share/doc/qemu-guest-agent-2.12.0/README
/usr/share/man/man8/qemu-ga.8.gz
/usr/share/qemu-kvm/qemu-ga
/usr/share/qemu-kvm/qemu-ga/fsfreeze-hook.d
/usr/share/qemu-kvm/qemu-ga/fsfreeze-hook.d/mysql-flush.sh.sample
/var/log/qemu-ga

Google and man also does not tell me anything about this file?
 
Now this is the weirdest ... the only change I made was adding

Code:
[general]
verbose = 1

To the bottom of /etc/sysconfig/qemu-ga. I did't know how to "restart" qemu so I rebooted the VM, then ran the vzdump command like before, but now it runs as it should.

o_O
 
So here is the continuation of the story. Trying to do an automatic snapshot last night I woke up to a unresponsive server, and this logfile:

2019-11-19 01:15:02 INFO: Starting Backup of VM 105 (qemu)
2019-11-19 01:15:02 INFO: status = running
2019-11-19 01:15:03 INFO: update VM 105: -lock backup
2019-11-19 01:15:03 INFO: VM Name: 3605-EU
2019-11-19 01:15:03 INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-0' 200G
2019-11-19 01:15:03 INFO: backup mode: snapshot
2019-11-19 01:15:03 INFO: ionice priority: 7
2019-11-19 01:15:03 INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-105-2019_11_19-01_15_02.vma.lzo'
2019-11-19 02:15:03 ERROR: VM 105 qmp command 'guest-fsfreeze-freeze' failed - got timeout
2019-11-19 02:15:13 ERROR: VM 105 qmp command 'guest-fsfreeze-thaw' failed - got timeout
2019-11-19 02:15:13 INFO: started backup task 'd08c500e-f900-43ca-91d1-a99f90af0e69'

The snapshot/backup was taken, the problem is just that the server was dead in the water for two hours. Off peak hours but still.

Anyway, any idea why I got the timeout on the freeze/thaw commands?

Would it be possible and/or make sense to do the freeze/thaw command first, before starting the snapshot, to make sure everything works before running the snapshot, or would that just freeze the server in the same way?
 
Ok, some more details:

Code:
root@server36:~# qm agent 105 fsfreeze-status
thawed
root@server36:~# qm agent 105 fsfreeze-freeze
# ... 10 minutes pass, nothing happens, 105 is dead to the world, so I decide to break this and thaw it.
^C
root@server36:~# qm agent 105 fsfreeze-thaw
QEMU guest agent is not running
root@server36:~# qm agent 105 fsfreeze-status
QEMU guest agent is not running
# ... so now the agent is gone AWOL, ok, so let's try to reset it:
root@server36:~# qm stop 105
trying to acquire lock...
can't lock file '/var/lock/qemu-server/lock-105.conf' - got timeout
# ... let's skip the lockfile
root@server36:~# qm stop 105 --skiplock true
trying to acquire lock...
can't lock file '/var/lock/qemu-server/lock-105.conf' - got timeout

In the end I simply rm'd the lockfile and reset the VM, but why on earth have a --skiplock command if it doesn't work? The man page simply says "qm reset vmid --skiplock boolean" without detailing what values are defined for "boolean": I tried {TRUE, true, 1} but nothing works.

In the end I disabled the QEMU guest agent as suggested, but I think it would be nice to have the QEMU guest agent for the other things it does. Why isn't there an option in vzdump to ignore the QEMU agent?
 
Yeah, I had those timeouts on occasion too, but it was very volatile. We have over 100 VMs that are backed up, some of them daily and we saw one or two years on each backup run at most 2 that failed sporadically with the same error. Each VM lost its connection to the virtualised storage and freezed in the end too so that we needed to reset the VM and everything was up and running again. In the last year or so, we do not see this behaviour anymore, so I guess is was some underlying problem in QEMU/KVM that was fixed with of the of updates.
 
I believe I have the latest QEMU drivers available:

Hypervisor:
Code:
# dpkg -l | grep -i qemu
ii  pve-qemu-kvm                         3.0.1-4                        amd64        Full virtualization on x86 hardware
ii  qemu-server                          5.0-54                         amd64        Qemu Server Tools

VM:
Code:
# rpm -qa | grep -i qemu
qemu-guest-agent-2.12.0-3.el7.x86_64
 
I tried to get around this by disabling the qemu agent, but now it is just crashing the server instead.

INFO: starting new backup job: vzdump 105 --compress lzo --mode snapshot
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2019-12-08 09:19:26
INFO: status = running
INFO: update VM 105: -lock backup
INFO: VM Name: 3605-EU
INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-105-2019_12_08-09_19_26.vma.lzo'
INFO: started backup task 'a57a2dab-4ba8-4705-b219-10407a15e711'
INFO: status: 0% (1337982976/214748364800), sparse 0% (884207616), duration 3, read/write 445/151 MB/s
INFO: status: 1% (2233073664/214748364800), sparse 0% (1005961216), duration 8, read/write 179/154 MB/s
INFO: status: 2% (4345233408/214748364800), sparse 0% (1129299968), duration 20, read/write 176/165 MB/s
INFO: status: 3% (6456868864/214748364800), sparse 0% (1167572992), duration 33, read/write 162/159 MB/s
INFO: status: 4% (8734900224/214748364800), sparse 0% (1202794496), duration 46, read/write 175/172 MB/s
INFO: status: 5% (10904141824/214748364800), sparse 0% (1270341632), duration 59, read/write 166/161 MB/s
INFO: status: 6% (13033930752/214748364800), sparse 0% (1279619072), duration 73, read/write 152/151 MB/s
INFO: status: 7% (15062728704/214748364800), sparse 0% (1284661248), duration 85, read/write 169/168 MB/s
INFO: status: 8% (17287348224/214748364800), sparse 0% (1286201344), duration 98, read/write 171/171 MB/s
INFO: status: 9% (19526189056/214748364800), sparse 0% (1483706368), duration 110, read/write 186/170 MB/s
INFO: status: 10% (21613838336/214748364800), sparse 0% (1534369792), duration 122, read/write 173/169 MB/s
INFO: status: 11% (23688118272/214748364800), sparse 0% (1556529152), duration 134, read/write 172/171 MB/s
INFO: status: 12% (25915817984/214748364800), sparse 0% (1593987072), duration 147, read/write 171/168 MB/s
INFO: status: 13% (27976990720/214748364800), sparse 0% (1614819328), duration 160, read/write 158/156 MB/s
INFO: status: 14% (30184439808/214748364800), sparse 0% (1624965120), duration 172, read/write 183/183 MB/s
INFO: status: 15% (32255442944/214748364800), sparse 0% (1643663360), duration 183, read/write 188/186 MB/s
INFO: status: 16% (34422652928/214748364800), sparse 0% (1720737792), duration 196, read/write 166/160 MB/s
INFO: status: 17% (36585472000/214748364800), sparse 0% (1748467712), duration 209, read/write 166/164 MB/s
INFO: status: 18% (38690488320/214748364800), sparse 0% (1933090816), duration 221, read/write 175/160 MB/s
INFO: status: 19% (41015377920/214748364800), sparse 1% (3174694912), duration 228, read/write 332/154 MB/s
INFO: status: 20% (43097849856/214748364800), sparse 1% (3286736896), duration 241, read/write 160/151 MB/s
INFO: status: 21% (45141721088/214748364800), sparse 1% (3290038272), duration 252, read/write 185/185 MB/s
INFO: status: 22% (47431548928/214748364800), sparse 1% (3535233024), duration 265, read/write 176/157 MB/s
INFO: status: 23% (49410801664/214748364800), sparse 1% (3651522560), duration 275, read/write 197/186 MB/s
INFO: status: 24% (51646758912/214748364800), sparse 1% (3714195456), duration 288, read/write 171/167 MB/s
INFO: status: 25% (53721563136/214748364800), sparse 1% (3740983296), duration 300, read/write 172/170 MB/s
INFO: status: 26% (55959945216/214748364800), sparse 1% (4093202432), duration 312, read/write 186/157 MB/s
INFO: status: 27% (58036518912/214748364800), sparse 2% (4879020032), duration 320, read/write 259/161 MB/s
INFO: status: 28% (60243443712/214748364800), sparse 2% (4894195712), duration 333, read/write 169/168 MB/s
INFO: status: 29% (62284627968/214748364800), sparse 2% (4907098112), duration 345, read/write 170/169 MB/s
INFO: status: 30% (64607092736/214748364800), sparse 2% (5151129600), duration 357, read/write 193/173 MB/s
INFO: status: 31% (66671083520/214748364800), sparse 2% (5215911936), duration 369, read/write 171/166 MB/s
INFO: status: 32% (68740186112/214748364800), sparse 2% (5365456896), duration 381, read/write 172/159 MB/s
INFO: status: 33% (70890291200/214748364800), sparse 2% (5465829376), duration 392, read/write 195/186 MB/s
INFO: status: 34% (73123037184/214748364800), sparse 2% (5691367424), duration 403, read/write 202/182 MB/s
INFO: status: 35% (75258527744/214748364800), sparse 2% (5729705984), duration 415, read/write 177/174 MB/s
INFO: status: 36% (77426130944/214748364800), sparse 2% (5730824192), duration 427, read/write 180/180 MB/s
INFO: status: 37% (79606185984/214748364800), sparse 2% (6110089216), duration 438, read/write 198/163 MB/s
INFO: status: 38% (81700913152/214748364800), sparse 2% (6119223296), duration 450, read/write 174/173 MB/s

Then BANG reboot.

I had a look around the files available under /var/log but don't see anything obvious. For example this from kern.log:

Dec 8 09:19:26 server36 vzdump[15269]: <root@pam> starting task UPID:server36:00003BA7:0041A611:5DECC01E:vzdump:105:root@pam:
Dec 8 09:19:27 server36 qm[15274]: <root@pam> update VM 105: -lock backup
Dec 8 09:25:26 server36 pvedaemon[1683]: <root@pam> successful auth for user 'root@pam'
Dec 8 09:28:16 server36 kernel: [ 0.000000] microcode: microcode updated early to revision 0x21, date = 2019-02-13
Dec 8 09:28:16 server36 kernel: [ 0.000000] Linux version 4.15.18-19-pve (root@dev5) (gcc version 6.3.0 20170516 (Deb

help?
 
No takers? No suggestions?

Code:
# pveperf
CPU BOGOMIPS:      54277.84
REGEX/SECOND:      1791183
HD SIZE:           19.10 GB (/dev/sda1)
BUFFERED READS:    455.75 MB/sec
AVERAGE SEEK TIME: 0.05 ms
FSYNCS/SECOND:     2527.58
DNS EXT:           27.17 ms
DNS INT:           2.06 ms
 
Last edited:
Hello there.

I can confirm i have the exact same problem.
Recently updated to pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-16-pve) and now, CentOS 7 backup bug
 
  • Like
Reactions: rcd
Still no takers? I accidentally tried to make a snapshot of this vm again today and the same thing happened. Not pleased !
 
In another problem with Proxmox I realized that it is not possible to take a snapshot of a CT with bindmounts. I'm wondering if this is the same problem causing my VM snapshots. This is in fact a VM with bindmounts.

Still don't see what that has to do with freezing and thawing the container though...
 
Is there somewhere I can open a bug report on this? It is a little frustrating that after so many months of this being a confirmed bug, there still seem to be no action taken to fix it.

I just tried again, same as always:

Code:
# qm guest cmd 105 fsfreeze-status
thawed
# qm guest cmd 105 fsfreeze-freeze
^C  << hangs, having to break out of the command
# qm guest cmd 105 fsfreeze-status
QEMU guest agent is not running
# qm reset 105 --skiplock
# qm guest cmd 105 fsfreeze-status
thawed

/var/log/vzdump/qemu-105.log
Code:
...
2020-05-31 03:15:58 INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-105-2020_05_31-03_15_57.vma.lzo'
2020-05-31 04:15:58 ERROR: VM 105 qmp command 'guest-fsfreeze-freeze' failed - got timeout
2020-05-31 04:16:08 ERROR: VM 105 qmp command 'guest-fsfreeze-thaw' failed - got timeout
2020-05-31 04:16:08 INFO: started backup task '973576ea-d503-4800-b826-874b7b4c874b'
...

Isn't there anything I can do to try to debug this?
 
Isn't there anything I can do to try to debug this?

What about the debug information I asked for over a half year ago?

Is there somewhere I can open a bug report on this? It is a little frustrating that after so many months of this being a confirmed bug, there still seem to be no action taken to fix it.

Why should it be a confirmed bug? There are a lot of other people - me included - not having this problem:

I just installed a fresh CentOS 7 including all updates up to today and also installed qemu-guest-agent-2.12.0-3 and it works:


Code:
root@proxmox ~ > qm guest cmd 104 get-osinfo
{
   "id" : "centos",
   "kernel-release" : "3.10.0-1127.10.1.el7.x86_64",
   "kernel-version" : "#1 SMP Wed Jun 3 14:28:03 UTC 2020",
   "machine" : "x86_64",
   "name" : "CentOS Linux",
   "pretty-name" : "CentOS Linux 7 (Core)",
   "version" : "7 (Core)",
   "version-id" : "7"
}

root@proxmox ~ > qm guest cmd 104 fsfreeze-status
thawed
root@proxmox ~ > qm guest cmd 104 fsfreeze-freeze
2
root@proxmox ~ > qm guest cmd 104 fsfreeze-status
frozen
root@proxmox ~ > qm guest cmd 104 fsfreeze-thaw
2
root@proxmox ~ > qm guest cmd 104 fsfreeze-status
thawed

root@proxmox ~ > qm guest cmd 104 info | jq ".version"
"2.12.0"
 
What debug information? I already added the suggested "verbose = 1" to /etc/sysconfig/qemu-ga on Nov 8, and as I explained that the immediate impression was that it fixed the problem, but the week after it did the same thing again.

The only "debugging" info I find is that now it shows in the log that the fsfreeze-freeze "fails". There are no further details. I knew that already, I need to find out WHY the freeze fails.

It was my impression from your post on Nov 22 that you were seeing the same problem sometimes, obviously I misunderstood you.

Sorry if I didn't make it clear: I have it running on all other VM's, the problem is one particular VM. Yes I understand that the problem is specific to this VM, but I still need to fix it somehow, preferably without having to reinstall the VM altogether.

Let me know exactly what debugging information you expect, and I will of course provide it.
 
Last edited:
It was my impression from your post on Nov 22 that you were seeing the same problem sometimes, obviously I misunderstood you.

I do have the same problem, but it sometimes works and sometimes does not. I cannot reproduce my problem. I can also not reproduce your problem, so this is hard on me to help.

The only "debugging" info I find is that now it shows in the log that the fsfreeze-freeze "fails". There are no further details. I knew that already, I need to find out WHY the freeze fails.

That is bad, I thought there would be better or more information on why it fails. Maybe you can attach to the process with strace and have a look at the output there?
 
I don't understand which process you want me to strace? The qm process?