Launching a backup in Proxmox VE 6.4-13 increases server load and makes containers stutter

Dec 17, 2021
Hi.

I have a single VE server deployed (48 x AMD EPYC 7451 24-Core Processor, 258 GB RAM, 2 x 4 TB SATA drives mirrored with LVM).
This VE runs 36 LXC containers.
I store my backups in a separate Proxmox Backup Server.

When I launch a backup for a single CT, the system load goes from 4-5 to 62 for the duration of the backup. I also start to get messages like this one in the containers:
kernel:[4840272.152002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!

I really feel this is not right, but I have very limited experience with containers.

I have been unable to identify any contention source in my configuration.

Can anyone suggest a way to better analyze what is going on, or explain the reason for this behavior?
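One possible starting point, offered as a sketch rather than a diagnosis: kernels built with pressure stall information (PSI) expose `/proc/pressure/io`, which reports the share of time tasks were stalled waiting on I/O. Whether the kernel shipped with this Proxmox VE release has PSI enabled is an assumption to verify on the node. The helper name `parse_pressure` is made up for this example.

```python
from pathlib import Path


def parse_pressure(text: str) -> dict:
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        # Each field is "key=value"; values are numeric.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


if __name__ == "__main__":
    psi_file = Path("/proc/pressure/io")
    if psi_file.exists():
        pressure = parse_pressure(psi_file.read_text())
        for kind, values in pressure.items():
            print(kind, "avg10 =", values["avg10"], "%")
    else:
        print("PSI is not available on this kernel")
```

Sampling this on the node shortly before and during a backup would show whether the `avg10` values rise together with the load.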

Thanks in advance
Javier Vilarroig
 
Hello, I have a data point for comparison for you:
  • Node: "Empty" Dell Server with EPYC 7443 Dual CPU 24C48T; NVMe; ZFS
  • PBS = a remote VM via 10 GBit/s Copper
Container:
  • freshly installed Debian Bullseye, 48 GB disk
  • incompressible dummy file: dd if=/dev/urandom of=/opt/42g.dat bs=1M count=42000

State = started and kept running for three Backups
  • First Backup took 161s; System Load ~1.0, maybe peak CPU=200% - unclear
  • 2nd. Backup = 60 s; System Load ~1.0
  • 3rd. Backup = 61s
Now shutdown
  • 4th. Backup = 60s
During these tests I saw no relevant load beyond the expected "a little bit above 1.0 with peaks below 2.0", and (because local storage is on NVMe) there was zero IO delay on the node.
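For reference, the timings above can be turned into effective read rates. This is only a rough sketch: it counts just the 42000 MiB dummy file written by the dd command and ignores the rest of the container's disk. The function name `effective_rate` is invented for this example.

```python
def effective_rate(size_mib: float, seconds: float) -> float:
    """Return the effective rate in MiB/s for reading size_mib in seconds."""
    return size_mib / seconds


DUMMY_FILE_MIB = 42000  # dd bs=1M count=42000

if __name__ == "__main__":
    for label, seconds in [("first backup", 161), ("later backups", 60)]:
        rate = effective_rate(DUMMY_FILE_MIB, seconds)
        print(f"{label}: {rate:.1f} MiB/s")
```

Under these assumptions the first backup works out to roughly 261 MiB/s and the later ones to 700 MiB/s.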

Does this help?

----
Edit, PS: "journalctl -f" shows zero messages during another (5th + 6th) Backup. Remember: this container is idle.
 
Thanks for your data :)

Clearly my experience is not the same as yours.

The biggest difference I can see is the mass storage setup: mine is regular SATA, yours is NVMe. My expectation is that this affects how long the backup takes, but not the system load. While the backup waits on the disks, that time should be usable by other processes.
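One point worth checking against this expectation: on Linux the load average counts tasks in uninterruptible sleep (state D, typically waiting on disk) as well as runnable ones, so slow storage can raise the load figure without the CPUs being busy. A minimal sketch that counts task states from `/proc`, assuming it is run on the node itself. The function names `task_state` and `count_states` are made up for this example.

```python
from collections import Counter
from pathlib import Path


def task_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count the states of all threads visible under proc_root."""
    counts = Counter()
    # Threads are counted, not just processes: each thread of a Java
    # application is a separate task for load-average purposes.
    for stat in Path(proc_root).glob("[0-9]*/task/[0-9]*/stat"):
        try:
            counts[task_state(stat.read_text(errors="replace"))] += 1
        except (OSError, IndexError):
            continue  # the task exited while we were scanning
    return counts


if __name__ == "__main__":
    states = count_states()
    print("runnable (R):", states["R"], "uninterruptible (D):", states["D"])
```

If the D count climbs toward the load figure while a backup runs, the load is mostly tasks waiting on storage rather than competing for CPU.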

Also, I think we are talking about a different level of base load. I have 36 LXC containers, most of them running a heavy Java web app.

Still, my system normally manages a load of only 4 to 6. A peak of 60 seems unreasonable to me.

I'm still trying to find the culprit, but have found nothing yet.

Thanks again.
 
