one VM behaving badly - backup fails and it becomes unresponsive/HighCPU use

Glowsome · Nov 27, 2021

I have the strangest thing with just -one- of my VM's where when its being backed up it times out :

Code:

INFO: Starting Backup of VM 125 (qemu)
INFO: Backup started at 2021-11-26 03:12:21
INFO: status = running
INFO: VM Name: vm-lx-01
INFO: include disk 'scsi0' 'vms01:125/vm-125-disk-0.qcow2' 64G
INFO: include disk 'scsi1' 'vms01:125/vm-125-disk-1.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/data/backups01/dump/vzdump-qemu-125-2021_11_26-03_12_21.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 125 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '067e0aaf-5f51-421a-a0b6-2af4ffc5491c'
INFO: resuming VM again
ERROR: VM 125 qmp command 'cont' failed - unable to connect to VM 125 qmp socket - timeout after 448 retries
INFO: aborting backup job

Strange thing is that 'sometimes' the backup does register a success, so i cannot really get a bearing on this issue.

Details about the VM:

Code:

cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"

rpm -qa | grep qemu
qemu-guest-agent-3.1.1.1-57.2.x86_64

As this is not the only VM running this OS/Patchlevel (i am running 3) it puzzles me as to why only this machine is having issues.

The one thing i did see on a SSH-session is a message mentioning CPU hung for x seconds as described in : What are all these "Bug: soft lockup" messages about?
Still this does not account for just this _one_ machine suffering it.

When a/the backup fails it peaks at CPU-usage, and the only solve is to shutdown the machine and restart it to set it free/ resume functionality.

So looking for answers after i have exhausted all my knowledge in this.

- Glowsome

Glowsome · Nov 29, 2021

Additional info on the VM ( logged from an SSH-session i still had running :

Code:

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39135.014041] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [sshd:25223]

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39163.013409] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [sshd:25223]

Message from syslogd@vm-lx-01 at Nov 28 03:11:29 ...
 kernel:[39171.164887] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 62s!

Message from syslogd@vm-lx-01 at Nov 28 03:19:31 ...
 kernel:[39653.606352] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 41s!

Message from syslogd@vm-lx-01 at Nov 28 03:32:15 ...
 kernel:[40417.589978] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 39s! [kworker/0:0:29821]

Message from syslogd@vm-lx-01 at Nov 28 03:32:15 ...
 kernel:[40417.590332] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 37s! [swapper/1:0]

Glowsome · Nov 29, 2021

Extra info, this is not a 'Pure SLES box, its an Open Enterprise Server:

Code:

vm-lx-01:~ # cat /etc/novell-release
Open Enterprise Server 2018 (x86_64)
VERSION = 2018.2
PATCHLEVEL = 2

alc · Dec 3, 2021

I have a very similar issue, which I discovered because the repeated

Code:

Message from syslogd@dc1 at Dec  3 05:28:11 ...
 kernel:[992745.113396] watchdog: BUG: soft lockup - CPU#1 stuck for 30s! [kworker/1:1:29984]

errors on my Domain Controller caused disconnections on some clients, that caused alerts.

It happens since I updated from 6.4 to 7.1 (7.1-5)

Glowsome · Dec 6, 2021

FYI:
As far as i have been able to determine it was due to a/the qemu agent just sucking up all runtime....
So as a measure i have re-installed the agent forcefully with :

Code:

 zypper in -f qemu-guest-agent

Need to determine if this solves anything, but hey, its atleast an action aimed at the package that seems to be the cause

alc · Dec 6, 2021

I'll check this and will update after tonight's backups

Glowsome · Dec 6, 2021

Well unfortunately no change in the issue, so reinstalling the agent did not solve it

Below is a screenie of the agent's consumption of resourses on the VM

alc · Dec 7, 2021

Likewise : after re-installing qemu-agent exactly the same thing happened last night

Glowsome · Dec 8, 2021

Tonight it even got more weird ... the VM was shutdown after it was touched by the backup-job :

Code:

INFO: VM Name: vm-lx-01
INFO: include disk 'scsi0' 'vms01:125/vm-125-disk-0.qcow2' 64G
INFO: include disk 'scsi1' 'vms01:125/vm-125-disk-1.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/data/backups01/dump/vzdump-qemu-125-2021_12_08-03_10_57.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 125 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task 'f9d4887d-f2ba-439b-be69-d75fb74c47a4'
INFO: resuming VM again
ERROR: VM 125 qmp command 'cont' failed - unable to connect to VM 125 qmp socket - timeout after 447 retries
INFO: aborting backup job
ERROR: VM 125 qmp command 'backup-cancel' failed - client closed connection
INFO: resuming VM again
ERROR: Backup of VM 125 failed - VM 125 not running
INFO: Failed at 2021-12-08 03:11:57
INFO: Backup job finished with errors
TASK ERROR: job errors

I had to manually 'start' the VM again as it was in a down-state.

Glowsome · Mar 3, 2022

Still no closer to a solve on my end..
Anyone out there who might have found a solution ti this behaviour ?

alc · Mar 7, 2022

No luck either.
I still get 4 alert messages from my TrueNAS everynight because my AD (VM) disconnects during every daily backup, which did not happen before version 7 of proxmox.

Glowsome · Mar 13, 2022

As i have not found any means to further debug or resolve this i will be planning on replacing this box.

However that will have quite the impact, as it is running some vital roles - making it quite painfull to implement changes on this box.

Search

Search

one VM behaving badly - backup fails and it becomes unresponsive/HighCPU use

Glowsome

Renowned Member

Glowsome

Renowned Member

Glowsome

Renowned Member

alc

Active Member

Glowsome

Renowned Member

alc

Active Member

Glowsome

Renowned Member

alc

Active Member

Glowsome

Renowned Member

Glowsome

Renowned Member

alc

Active Member

Glowsome

Renowned Member

We value your privacy