one VM behaving badly - backup fails and it becomes unresponsive/HighCPU use

Glowsome

Renowned Member
Jul 25, 2017
178
45
68
51
The Netherlands
www.comsolve.nl
I have the strangest thing with just -one- of my VM's where when its being backed up it times out :
Code:
INFO: Starting Backup of VM 125 (qemu)
INFO: Backup started at 2021-11-26 03:12:21
INFO: status = running
INFO: VM Name: vm-lx-01
INFO: include disk 'scsi0' 'vms01:125/vm-125-disk-0.qcow2' 64G
INFO: include disk 'scsi1' 'vms01:125/vm-125-disk-1.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/data/backups01/dump/vzdump-qemu-125-2021_11_26-03_12_21.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 125 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '067e0aaf-5f51-421a-a0b6-2af4ffc5491c'
INFO: resuming VM again
ERROR: VM 125 qmp command 'cont' failed - unable to connect to VM 125 qmp socket - timeout after 448 retries
INFO: aborting backup job

Strange thing is that 'sometimes' the backup does register a success, so i cannot really get a bearing on this issue.

Details about the VM:
Code:
cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"

rpm -qa | grep qemu
qemu-guest-agent-3.1.1.1-57.2.x86_64

As this is not the only VM running this OS/Patchlevel (i am running 3) it puzzles me as to why only this machine is having issues.

The one thing i did see on a SSH-session is a message mentioning CPU hung for x seconds as described in : What are all these "Bug: soft lockup" messages about?
Still this does not account for just this _one_ machine suffering it.

When a/the backup fails it peaks at CPU-usage, and the only solve is to shutdown the machine and restart it to set it free/ resume functionality.

So looking for answers after i have exhausted all my knowledge in this.


- Glowsome
 
Last edited:
Additional info on the VM ( logged from an SSH-session i still had running :
Code:
Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39135.014041] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [sshd:25223]

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39163.013409] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [sshd:25223]

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39171.164887] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 62s!

Message from syslogd@vm-lx-01 at Nov 28 03:19:31 ...
 kernel:[39653.606352] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 41s!

Message from syslogd@vm-lx-01 at Nov 28 03:32:15 ...
 kernel:[40417.589978] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 39s! [kworker/0:0:29821]

Message from syslogd@vm-lx-01 at Nov 28 03:32:15 ...
 kernel:[40417.590332] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 37s! [swapper/1:0]
 
I have a very similar issue, which I discovered because the repeated
Code:
Message from syslogd@dc1 at Dec  3 05:28:11 ...
 kernel:[992745.113396] watchdog: BUG: soft lockup - CPU#1 stuck for 30s! [kworker/1:1:29984]
errors on my Domain Controller caused disconnections on some clients, that caused alerts.

It happens since I updated from 6.4 to 7.1 (7.1-5)
 
FYI:
As far as i have been able to determine it was due to a/the qemu agent just sucking up all runtime....
So as a measure i have re-installed the agent forcefully with :
Code:
 zypper in -f qemu-guest-agent

Need to determine if this solves anything, but hey, its atleast an action aimed at the package that seems to be the cause :)
 
  • Like
Reactions: alc
Likewise : after re-installing qemu-agent exactly the same thing happened last night
 
Tonight it even got more weird ... the VM was shutdown after it was touched by the backup-job :
Code:
INFO: VM Name: vm-lx-01
INFO: include disk 'scsi0' 'vms01:125/vm-125-disk-0.qcow2' 64G
INFO: include disk 'scsi1' 'vms01:125/vm-125-disk-1.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/data/backups01/dump/vzdump-qemu-125-2021_12_08-03_10_57.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 125 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task 'f9d4887d-f2ba-439b-be69-d75fb74c47a4'
INFO: resuming VM again
ERROR: VM 125 qmp command 'cont' failed - unable to connect to VM 125 qmp socket - timeout after 447 retries
INFO: aborting backup job
ERROR: VM 125 qmp command 'backup-cancel' failed - client closed connection
INFO: resuming VM again
ERROR: Backup of VM 125 failed - VM 125 not running
INFO: Failed at 2021-12-08 03:11:57
INFO: Backup job finished with errors
TASK ERROR: job errors

I had to manually 'start' the VM again as it was in a down-state.
 
Last edited:
No luck either.
I still get 4 alert messages from my TrueNAS everynight because my AD (VM) disconnects during every daily backup, which did not happen before version 7 of proxmox.
 
As i have not found any means to further debug or resolve this i will be planning on replacing this box.

However that will have quite the impact, as it is running some vital roles - making it quite painfull to implement changes on this box.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!