Guest VM CPU spike during backup

ozdjh

Happy new year everybody.

We've migrated some production VMs to our new Proxmox cluster. The only issue we are seeing is that our monitoring system is generating "host unreachable" alerts for the VMs each night during backups. We assumed it was related to pings getting dropped due to network traffic, but everything checked out fine. Looking at the graphs for the guest VMs, we can see that CPU load goes to 100% while each VM is being backed up. My understanding was that a snapshot of the vdisk was copied by the PVE node and that the guest VM wasn't directly involved (other than calling the guest agent during the snapshot). I've disabled the guest agent on one VM and tested, which still resulted in a significant CPU spike.
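For what it's worth, disabling the agent was just a per-VM setting change; roughly something like this from the node's CLI (VMID 100 is a placeholder, and the VM needs a full stop/start for the change to take effect):

    # turn the QEMU guest agent off for VM 100, then power-cycle the VM
    qm set 100 --agent 0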

Can someone with an understanding of the backup process explain why we'd be seeing the VMs spike like this? I didn't think the guest was involved in any way.


Thanks

David
 

Attachments

  • Screen Shot 2020-01-05 at 8.19.04 am.png
This does not generally happen. Does limiting I/O (bwlimit option for vzdump) help? What is the resource usage like on the host during the backup? Can you give us more details about what storages you use?
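If it's easier to test outside the scheduled job, a one-off run with a limit looks something like this (the value is in KiB/s, and 50000 is just an arbitrary figure to start with):

    # back up VM 100 with I/O bandwidth capped at roughly 50 MB/s
    vzdump 100 --bwlimit 50000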
 
Hi

IO on the entire platform is minimal at the moment as we're still evaluating Proxmox. The storage side of the platform is Ceph on 100% NVMe drives, with a 40GbE public network and a 10GbE cluster network supporting about 8 test VMs. There are also dual PVE cluster networks and a network for the VMs. Those VMs are spread over 4 nodes with 1.5TB of RAM and a bunch of CPU cores, so it's more idle than any platform ever should be :)

Several days have passed since I posted this without a response, so I've opened a support ticket on it. When we get to the bottom of the problem I'll update this thread in case others experience a similar problem.


Thanks

David
 
We experience the same issue here on 6.1-2 (running kernel: 5.3.13-1-pve), pve-manager: 6.1-5 (running version: 6.1-5/9bf06119).
Disabling the guest agent and/or limiting I/O does not help. The backup storage used is NFSv4.

The CPU spike in Windows and Linux VMs goes up to 100% for several seconds during backup (the VM loses network and becomes unresponsive).
 
Hi sumsum

Thanks for your post. That is exactly the same situation we are seeing on 6.1-2, kernel 5.3.10-1-pve, and pve-manager 6.1-3. And yes, disabling the guest agent doesn't improve things for us either. I've had a console session open during the backups and the console freezes for many seconds at a time. Also, on a Windows guest I can see a spike in CPU usage for the "System interrupts" process when the console starts working again.

It appears that it only happens if the backup storage is NFS. We did tests with backups to local storage, which were fine. We've tested multiple different NFS targets (FreeNAS and Debian buster) and both show the same problem. We've been using NFSv4 too, so I may force it to NFSv3 just to see if that makes a difference. I'm happy to try to gather any useful info I can at the moment.
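In case anyone else wants to test the same thing, the NFS version can be pinned on the storage definition in /etc/pve/storage.cfg; something roughly like this (the storage name, server and paths are just examples):

    nfs: backup-nfs
            server 192.168.10.50
            export /mnt/tank/pve-backups
            path /mnt/pve/backup-nfs
            content backup
            options vers=3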

I'll update here when we make progress on the support ticket.


Thanks

David
 
Just to sum up what we've tested so far:

  • Backup to local storage works more or less: no high CPU spikes, but occasional interruptions
  • NFS target with vers=4.2: high CPU spikes and issues as described
  • NFS target with vers=4.1: high CPU spikes and issues as described
  • We played with bwlimit and ionice without any positive impact (see the vzdump.conf sketch below)
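For reference, the limits were set globally in /etc/vzdump.conf, roughly like this (the values here are just placeholders, not a recommendation):

    # /etc/vzdump.conf
    bwlimit: 100000
    ionice: 7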
Let me know if I can help for testing other use cases.

Thx
sumsum
 
I just found something very interesting. I set up a CIFS share on the same FreeNAS box we've been testing against and used that as the backup target. I sat on the console of a VM and watched its CPU graph while it was being backed up. No spikes, and it never became unresponsive. Also, none of the VMs generated alerts from our monitoring system (which they do every night during backups). Everything worked perfectly.
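For reference, the CIFS target was added as a normal storage entry; from memory it was something along these lines (storage name, server address and credentials are placeholders):

    # add the FreeNAS CIFS share as a backup storage
    pvesm add cifs freenas-cifs --server 192.168.10.60 --share pve-backups \
        --username backup --password 'secret' --content backup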

It's strange that CIFS is the best way to get a Linux box to talk to a BSD box. Very unnatural :)


David
 
The kernel NFS client and server have a lot of flaws.
That's also the reason large setups use nfs-ganesha instead of the kernel server.
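For anyone who wants to try it, a minimal ganesha export block looks roughly like this (the path and export id are just examples):

    EXPORT {
        Export_Id = 1;
        Path = /mnt/tank/pve-backups;
        Pseudo = /pve-backups;
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL { Name = VFS; }
    }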
 
Just following up on this again. We installed samba on our primary backup target and exposed the raid set as a CIFS volume. A full backup to the CIFS target worked perfectly. No CPU spikes or network drop-outs.
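The share itself is nothing fancy; the smb.conf section is roughly this (path and user are placeholders):

    [pve-backups]
        path = /mnt/raid/pve-backups
        valid users = pvebackup
        read only = no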

Interestingly though, the network throughput to the CIFS target looked significantly slower, so NFS should be the better choice once the interaction between the client-side NFS code and KVM/QEMU is sorted out.


David
 
I don't think there will be a stable solution anytime soon. Kernel NFS has a lot of issues and deadlocks, at least in a VM context.
The reason NFS appears faster might be "async" mode, which is very unsafe, just like the "writeback (unsafe)" cache option in Proxmox.

If you don't mind the risks you can also tune SMB. For non-VM workloads, SMB has the same performance but better support on most operating systems.
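For comparison, on a plain Linux NFS server that's just the sync/async flag on the export; e.g. in /etc/exports (subnet and path are examples):

    # safe but slower: the server commits writes before acknowledging them
    /mnt/tank/pve-backups 192.168.10.0/24(rw,sync,no_subtree_check)
    # faster but can lose data if the server crashes
    # /mnt/tank/pve-backups 192.168.10.0/24(rw,async,no_subtree_check)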
 
There's been no activity on bugzilla. I think it's going to take some time to resolve this one.

We'd rather not use a file server for writing the backups anyway, due to the potential for hard lockups if the NFS/CIFS target goes away. We decided to run up another node in the cluster just for backups, and generate VMA files from Ceph exports. That way the file server is only in use when a backup is restored, which greatly reduces the chance of a problem. Also, the nodes running the VMs aren't involved in the backup process at all. It's working well so far.

Restoring the VMAs over HTTPS would be even better, as would making the vma tool "rbd aware" so we didn't have to spool the exports to disk before creating the archive. We may look at adding that functionality in time.
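In rough terms the backup node does something like this per disk (the pool, VMID and paths are illustrative; our actual scripts wrap it with snapshot handling and cleanup):

    # export the disk image from Ceph to the local spool area
    rbd export rbd/vm-100-disk-0 /spool/vm-100-disk-0.raw
    # wrap the raw image plus the VM config into a VMA archive
    vma create /backups/vzdump-qemu-100.vma -c /etc/pve/qemu-server/100.conf \
        drive-scsi0=/spool/vm-100-disk-0.raw

The resulting archive restores through the normal qmrestore workflow.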
 
I can assure you we are working on it. I will post any further updates to Bugzilla if I have meaningful information for you.
 
No, it's most likely not related, but anything that prevents the guest/QEMU/KVM from doing I/O will result in the symptoms described.
 
Just following up on this here as there's been some movement on this issue. As mentioned in the Bugzilla thread by Tim Marx:

Some first improvements are now available in the no-subscription-repo for testing.

pve-qemu-kvm: 4.1.1-3

We've rolled our own solution for backups that doesn't use NFS or CIFS, so it's not a burning issue for us anymore. But I thought others following this thread would be interested in testing this.
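If you want to try it, the package comes from the pve-no-subscription repo; on a PVE 6.x / buster node it's something along these lines:

    # /etc/apt/sources.list.d/pve-no-subscription.list should contain:
    #   deb http://download.proxmox.com/debian/pve buster pve-no-subscription
    apt update
    apt install pve-qemu-kvm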

David
 
