one VM behaving badly - backup fails and it becomes unresponsive/HighCPU use

Glowsome

Renowned Member
Jul 25, 2017
184
45
68
52
The Netherlands
www.comsolve.nl
I have the strangest thing with just -one- of my VM's where when its being backed up it times out :
Code:
INFO: Starting Backup of VM 125 (qemu)
INFO: Backup started at 2021-11-26 03:12:21
INFO: status = running
INFO: VM Name: vm-lx-01
INFO: include disk 'scsi0' 'vms01:125/vm-125-disk-0.qcow2' 64G
INFO: include disk 'scsi1' 'vms01:125/vm-125-disk-1.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/data/backups01/dump/vzdump-qemu-125-2021_11_26-03_12_21.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 125 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '067e0aaf-5f51-421a-a0b6-2af4ffc5491c'
INFO: resuming VM again
ERROR: VM 125 qmp command 'cont' failed - unable to connect to VM 125 qmp socket - timeout after 448 retries
INFO: aborting backup job

Strange thing is that 'sometimes' the backup does register a success, so i cannot really get a bearing on this issue.

Details about the VM:
Code:
cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"

rpm -qa | grep qemu
qemu-guest-agent-3.1.1.1-57.2.x86_64

As this is not the only VM running this OS/Patchlevel (i am running 3) it puzzles me as to why only this machine is having issues.

The one thing i did see on a SSH-session is a message mentioning CPU hung for x seconds as described in : What are all these "Bug: soft lockup" messages about?
Still this does not account for just this _one_ machine suffering it.

When a/the backup fails it peaks at CPU-usage, and the only solve is to shutdown the machine and restart it to set it free/ resume functionality.

So looking for answers after i have exhausted all my knowledge in this.


- Glowsome
 
Last edited:
Additional info on the VM ( logged from an SSH-session i still had running :
Code:
Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39135.014041] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [sshd:25223]

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39163.013409] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [sshd:25223]

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39171.164887] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 62s!

Message from syslogd@vm-lx-01 at Nov 28 03:19:31 ...
 kernel:[39653.606352] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 41s!

Message from syslogd@vm-lx-01 at Nov 28 03:32:15 ...
 kernel:[40417.589978] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 39s! [kworker/0:0:29821]

Message from syslogd@vm-lx-01 at Nov 28 03:32:15 ...
 kernel:[40417.590332] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 37s! [swapper/1:0]
 
I have a very similar issue, which I discovered because the repeated
Code:
Message from syslogd@dc1 at Dec  3 05:28:11 ...
 kernel:[992745.113396] watchdog: BUG: soft lockup - CPU#1 stuck for 30s! [kworker/1:1:29984]
errors on my Domain Controller caused disconnections on some clients, that caused alerts.

It happens since I updated from 6.4 to 7.1 (7.1-5)
 
FYI:
As far as i have been able to determine it was due to a/the qemu agent just sucking up all runtime....
So as a measure i have re-installed the agent forcefully with :
Code:
 zypper in -f qemu-guest-agent

Need to determine if this solves anything, but hey, its atleast an action aimed at the package that seems to be the cause :)
 
  • Like
Reactions: alc
Likewise : after re-installing qemu-agent exactly the same thing happened last night
 
Tonight it even got more weird ... the VM was shutdown after it was touched by the backup-job :
Code:
INFO: VM Name: vm-lx-01
INFO: include disk 'scsi0' 'vms01:125/vm-125-disk-0.qcow2' 64G
INFO: include disk 'scsi1' 'vms01:125/vm-125-disk-1.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/data/backups01/dump/vzdump-qemu-125-2021_12_08-03_10_57.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 125 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task 'f9d4887d-f2ba-439b-be69-d75fb74c47a4'
INFO: resuming VM again
ERROR: VM 125 qmp command 'cont' failed - unable to connect to VM 125 qmp socket - timeout after 447 retries
INFO: aborting backup job
ERROR: VM 125 qmp command 'backup-cancel' failed - client closed connection
INFO: resuming VM again
ERROR: Backup of VM 125 failed - VM 125 not running
INFO: Failed at 2021-12-08 03:11:57
INFO: Backup job finished with errors
TASK ERROR: job errors

I had to manually 'start' the VM again as it was in a down-state.
 
Last edited:
No luck either.
I still get 4 alert messages from my TrueNAS everynight because my AD (VM) disconnects during every daily backup, which did not happen before version 7 of proxmox.
 
As i have not found any means to further debug or resolve this i will be planning on replacing this box.

However that will have quite the impact, as it is running some vital roles - making it quite painfull to implement changes on this box.