Guest VM CPU spike during backup

ozdjh

Happy new year everybody.

We've migrated some production VMs to our new Proxmox cluster. The only issue we're seeing is that our monitoring system generates "host unreachable" alerts for the VMs each night during backups. We assumed it was related to pings being dropped due to network traffic, but everything checked out fine. Looking at the graphs for the guest VMs, we can see that CPU load goes up to 100% while a VM is being backed up. My understanding was that a snapshot of the vdisk was copied by the PVE node and that the guest VM wasn't directly involved (other than the guest agent being called during the snapshot). I've disabled the guest agent on one VM and tested, which still resulted in a significant CPU spike.
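In case it helps anyone reproduce the test, the agent was toggled off with something along these lines (the VM ID below is just a placeholder; the same switch is exposed in the VM's Options tab):

```
# Disable the QEMU guest agent option for VM 101 (hypothetical ID)
qm set 101 --agent 0
```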

Can someone with an understanding of the backup process explain why we'd be seeing the VMs spike like this? I didn't think the guest was involved in any way.


Thanks

David
...
 

Attachments

  • Screen Shot 2020-01-05 at 8.19.04 am.png
This does not generally happen. Does limiting I/O (bwlimit option for vzdump) help? What is the resource usage like on the host during the backup? Can you give us more details about what storages you use?
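If you want to test the I/O limit quickly without changing node-wide defaults, a one-off run could look roughly like this (the VM ID, storage name, and limit are placeholders; bwlimit is in KiB/s):

```
# Snapshot-mode backup of VM 101 to a storage called "backup-nfs",
# with backup bandwidth capped at ~50 MiB/s (51200 KiB/s)
vzdump 101 --storage backup-nfs --mode snapshot --bwlimit 51200
```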
 
Hi

I/O on the entire platform is minimal at the moment as we're still evaluating Proxmox. The storage side of the platform is Ceph on 100% NVMe drives, with a 40GbE public network and a 10GbE cluster network supporting about 8 test VMs. There's also a pair of PVE cluster networks and a network for the VMs. Those VMs are spread over 4 nodes with 1.5TB of RAM and a bunch of CPU cores, so it's more idle than any platform ever should be :)

Several days have passed since I posted this without a response, so I've opened a support ticket on it. When we get to the bottom of the problem I'll update this thread in case others hit the same issue.


Thanks

David
...
 
We're seeing the same issue here on 6.1-2 (running kernel: 5.3.13-1-pve), pve-manager: 6.1-5 (running version: 6.1-5/9bf06119).
Disabling the guest agent and/or limiting I/O does not help. The backup storage used is NFSv4.

The CPU spike in Windows and Linux VMs goes up to 100% for several seconds during backup (the VM loses network and is unresponsive).
 
Hi sumsum

Thanks for your post. That is exactly the situation we are seeing on 6.1-2, kernel 5.3.10-1-pve, and pve-manager 6.1-3. And yes, disabling the guest agent doesn't improve things for us either. I've had a console session open during the backups and the console freezes for many seconds at a time. Also, on a Windows guest I can see a spike in CPU usage for the "System interrupts" process when the console starts working again.

It appears to only happen if the backup storage is NFS. We did tests with backups to local storage, which were fine. We've tested multiple different NFS targets (FreeNAS and Debian buster) and both show the same problem. We've been using NFSv4 too, so I may force it to NFSv3 just to see if that makes a difference. I'm happy to gather any useful info I can at this point.
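For reference, forcing a different NFS version is just a mount-option change on the storage definition; a sketch of what we'd try in /etc/pve/storage.cfg (the storage name, server, and export path are made up):

```
# /etc/pve/storage.cfg -- NFS backup storage pinned to NFSv3 via mount options
nfs: backup-nfs
        server 192.0.2.10
        export /mnt/tank/pve-backups
        path /mnt/pve/backup-nfs
        content backup
        options vers=3
```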

I'll update here when we make progress on the support ticket.


Thanks

David
...
 
Just to sum up what we've tested so far:

  • Backup to local storage works more or less: no high CPU spikes, but occasional interruptions
  • NFS target with vers=4.2: high CPU spikes and issues as described
  • NFS target with vers=4.1: high CPU spikes and issues as described
  • We played with bwlimit and ionice without any positive impact (see the vzdump.conf sketch below).
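For completeness, this is roughly what those node-wide limits look like; the values are only examples:

```
# /etc/vzdump.conf -- node-wide defaults used by vzdump
# bwlimit is in KiB/s (51200 = ~50 MiB/s), ionice sets the I/O priority (0-8)
bwlimit: 51200
ionice: 7
```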
Let me know if I can help for testing other use cases.

Thx
sumsum
 
I just found something very interesting. I set up a CIFS share on the same FreeNAS box we've been testing against and used that as the backup target. I sat on the console of a VM and watched its CPU graph while it was being backed up. No spikes, and it never became unresponsive. Also, none of the VMs generated alerts from our monitoring system (which they do every night during backups). Everything worked perfectly.
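For anyone wanting to reproduce this, the CIFS target is just a normal storage entry; a sketch of the resulting /etc/pve/storage.cfg section (names and address are placeholders, and the password is supplied separately when the storage is created via pvesm or the GUI):

```
# /etc/pve/storage.cfg -- hypothetical CIFS backup target on the FreeNAS box
cifs: backup-cifs
        server 192.0.2.10
        share pve-backups
        path /mnt/pve/backup-cifs
        content backup
        username backup
```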

It's strange that CIFS is the best way to get a Linux box to talk to a BSD box. Very unnatural :)


David
 
The kernel NFS client and server have a lot of flaws.
That's also the reason large setups use nfs-ganesha instead of the kernel server.
 
Just following up on this again. We installed Samba on our primary backup target and exposed the RAID set as a CIFS volume. A full backup to the CIFS target worked perfectly. No CPU spikes or network drop-outs.

Interestingly though, network throughput to the target looked significantly lower, so NFS should still be the better choice once the interaction between the client-side NFS code and KVM/QEMU is sorted out.
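For anyone wanting to do the same, the Samba side is just a plain share; a minimal sketch of the smb.conf section we'd use (share name, path, and user are made up):

```
# /etc/samba/smb.conf on the backup target -- minimal share for vzdump files
[pve-backups]
    path = /srv/raid/pve-backups
    writable = yes
    valid users = pvebackup
```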


David
 
I don't think there will be a stable solution anytime soon. Kernel NFS has a lot of issues and deadlocks, at least in a VM context.
The reason NFS looks faster might be "async" mode, which is very unsafe, much like the "writeback (unsafe)" cache option in Proxmox.

If you don't mind the risks you can also tune SMB. For non-VM workloads, SMB has comparable performance and better support across most operating systems.
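To make the sync/async distinction concrete, this is where it's set on a Linux NFS server; the export path and client network below are made up, and "async" means the server acknowledges writes before they reach disk:

```
# /etc/exports -- "sync" is the safe default; "async" is faster but can lose
# data on a server crash or power failure
/srv/pve-backups 10.0.0.0/24(rw,sync,no_subtree_check)
# /srv/pve-backups 10.0.0.0/24(rw,async,no_subtree_check)
```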
 
There's been no activity on bugzilla. I think it's going to take some time to resolve this one.

We'd rather not use a file server for writing the backups anyway, due to the potential for hard lockups if the NFS / CIFS target goes away. We decided to run up another node in the cluster just for backups and generate VMA files from Ceph exports. That way the file server is only in use when a backup is restored, which greatly reduces the chance of a problem. Also, the nodes running the VMs aren't involved in the backup process at all. It's working well so far.
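The rough shape of that export path, in case it's useful to anyone; pool, image, and file names are placeholders, and the exact vma argument form may differ between PVE versions:

```
# On the dedicated backup node: export the VM's RBD image to a spool file,
# then wrap it in a VMA archive together with the VM's config
rbd export rbd/vm-101-disk-0 /spool/vm-101-disk-0.raw
vma create /backups/vzdump-qemu-101.vma -c /etc/pve/qemu-server/101.conf \
    drive-scsi0=/spool/vm-101-disk-0.raw
```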

Restoring the VMAs over HTTPS would be even better, as would making the vma tool "rbd aware" so we didn't have to spool the exports to disk before creating the archive. We may look at adding that functionality in time.
 
I can assure you we are working on it, I will post any further update to bugzilla if I have any meaningful information for you.
 
No, it's most likely not related, but anything preventing the guest/QEMU/KVM from doing I/O will result in the symptoms described.
 
Just following up on this here as there's been some movement on the issue. As mentioned in the bugzilla thread by Tim Marx:

Some first improvements are now available in the no-subscription-repo for testing.

pve-qemu-kvm: 4.1.1-3

We've rolled our own solution for backups that doesn't use NFS or CIFS, so it's not a burning issue for us anymore. But I thought others following this thread would be interested in testing this.
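For anyone who wants to try it early, pulling the updated package from the no-subscription repo looks roughly like this on PVE 6.x (Debian buster):

```
# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve buster pve-no-subscription

# then update and install the newer QEMU build
apt update && apt install pve-qemu-kvm
```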

David
...
 
