Nightly backup causes VM instability

dosmage

Hello! We're having issues: even after upgrading to 7.1, every night one of our file servers becomes unstable during a backup. We've used NFS, SMB, and now Proxmox Backup Server.

The VM shows hung tasks in dmesg, and occasionally the NFS services fail; only a reboot clears the issue. Are there any workarounds that would let us keep using backups while mitigating the VM freezing during the backup process?
 
Are the VMs also located on an NFS? You could try limiting the network traffic to the Proxmox Backup Server [1], so that the network is not saturated during the backup.

[1] https://pbs.proxmox.com/docs/network-management.html#traffic-control
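
For example, a rule roughly like the following (the subnet and rate here are placeholders, not values from your setup) would cap traffic from backup clients on a given network:

Code:
# Cap traffic to/from backup clients in the given subnet to ~80 MB/s
proxmox-backup-manager traffic-control create limit-backups \
  --network 192.168.2.0/24 --rate-in 80MB --rate-out 80MB \
  --comment "limit nightly backup traffic"
# The linked docs also describe restricting such rules to specific time windows.
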
Hello! Thank you for this! A little more information: the VM that becomes unstable during backups is stored on our Ceph storage, as are all of our VMs. There are 5 enterprise hosts, soon to be 8, participating in the Ceph network, with 35 SSDs spread equally across the participating hosts and 4 replicas.

The Ceph, backup (or private), and data Ethernet networks have dedicated 10GbE NICs with jumbo frames enabled, which we've tested with iperf3 from every host to every storage server and vice versa. The iperf3 tests show all paths attaining 10GbE speeds, and a ping test with large MTUs shows there are no jumbo frame MTU issues.
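
For reference, the checks were essentially of the following form (hostnames are placeholders):

Code:
# Throughput test between two hosts: server on one end, client on the other
iperf3 -s                       # on the receiving host
iperf3 -c <other-host>          # on the sending host
# Jumbo-frame check: 8972-byte payload + 28 bytes of headers = 9000-byte MTU, fragmentation disallowed
ping -M do -s 8972 <other-host>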

Ethernet traffic analysis does not indicate saturation. It's possible that the Ceph storage system is saturating on reads, but anecdotally, reducing the bwlimit speed has only prolonged the backups and, with them, the length of the VMs' instability. We haven't figured out a way to graph Ceph saturation, although I would imagine that if Ceph itself were saturating, we'd see more instability from other hosts during our nightly backups on any of our 7 enterprise PVE hosts.

If I could get a recommendation on the bwlimit option, I'm all for writing a traffic shaping rule on the Proxmox Backup Server to reduce backup speeds during the night to somewhere between 0 and 50 Mbps, down from the average speed of 84 Mbps.
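
(For clarity, the bwlimit I mean is the vzdump one; as far as I understand it is set per node in /etc/vzdump.conf and takes a value in KiB/s, so roughly:)

Code:
# /etc/vzdump.conf -- node-wide default for backup jobs
# ~50 Mbit/s expressed in KiB/s (approximate)
bwlimit: 6100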

We may want to use one of our support requests for this issue, as the nightly backups, which are essential to maintaining our business continuity, are causing some of our websites to page as down during the 7 hours it takes to back up our fileserver. I will try to hunt down our license key and link it to my forum profile to properly indicate our license status. Thank you again in advance.
 
Given your description, the Proxmox VE cluster doesn't sound like the issue, and if there is a dedicated link to the Proxmox Backup Server, a bandwidth limitation on Proxmox Backup Server shouldn't be necessary.

Could you provide some more information on your backup server, such as disk types, speeds, and configuration? There is some discussion on another thread [1] regarding how a slow backup server can be the root of such issues.

Could you also provide more information on the VM? Does it have a high amount of write activity (or at least high relative to the others)? What exactly is it running? Could you also post its config (cat /etc/pve/qemu-server/<VMID>.conf)?

[1] https://forum.proxmox.com/threads/vms-freezing-and-unreachable-when-backup-server-is-slow.96521/
 
I certainly can provide more information!

Our backup server is a 24-core Xeon system with 256 GB of RAM running Proxmox Backup Server. The main storage is a 24-drive raidz2; all constituent drives are Seagate 2TB 12 Gb/s SATA 7200 RPM drives.

In the course of attempting to maximize bandwidth, we tried a ZFS raid10 setup and included SSD write cache and special metadata drives, although this didn't improve transfer speeds during backups. Given that all configurations resulted in similar write speeds, we decided to maximize storage capacity instead. Additionally, in our tests, backing up a Ceph-backed VM to a local ZFS raidz1 with 10 SSDs showed the same transfer speeds. From that test we assumed that Ceph was the limiting factor.

Code:
cat /etc/pve/nodes/****/qemu-server/163.conf
agent: 1
bootdisk: scsi0
cores: 8
cpu: host
ide2: none,media=cdrom
memory: 16384
name: ***.****
net0: virtio=**:**:**:**:**:**,bridge=vmbr0,firewall=1,tag=***
numa: 0
ostype: l26
protection: 1
scsi0: ceph_vm:vm-163-disk-0,discard=on,iothread=1,size=20G,ssd=1
scsi1: ceph_vm:vm-163-disk-1,discard=on,iothread=1,size=11T,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=********-****-****-****-************
sockets: 2
vmgenid: ********-****-****-****-************

The fileserver above only exports NFS filesystems to a couple hundred of our computer laboratories and a handful of our web servers and research boxes. The system only runs NFS, Samba, and sshd. It sits idle at a load of 1.8. The system's writes are generally very low, especially at night when the university is closed. Right now the hourly graph shows that disk writes are about 1 Mbps on average, with spikes ranging from 4 Mbps to 10 Mbps. It is still relatively the highest-write server on the Ceph storage. If there is a way to pull a 24-hour period out of the Proxmox RRD graphs, I can answer the question a little better.
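
(If the API is the way to go, I assume something like this would pull a day's worth of averaged samples, including diskread/diskwrite, for the VM; <node> is a placeholder:)

Code:
# 24 hours of averaged RRD samples for VM 163
pvesh get /nodes/<node>/qemu/163/rrddata --timeframe day --cf AVERAGE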

Please let me know if there's any other information I can provide. I did redact some identifying information from the VM config output.
 
I can't say for certain, but unfortunately the issue may just be down to your raidz2 array and the design of Proxmox Backup Server.

The backup process intercepts write calls from the VM and backs up the relevant block [1]. We cache some blocks in memory, but generally if the backup storage is too slow, this kind of thing can happen.

In the course of attempting to maximize bandwidth, we tried a ZFS raid10 setup and included SSD write cache and special metadata drives, although this didn't improve transfer speeds during backups.
Just to clarify, did you notice the same VM lag after adding a special device [2] to the backup server?
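
In case it helps with re-testing: a special device can usually be added to an existing pool along these lines (device names are placeholders), keeping in mind that only data and metadata written after adding it will land on the new vdev.

Code:
# Add a mirrored special (metadata) vdev to an existing pool
zpool add <pool> special mirror /dev/disk/by-id/<ssd1> /dev/disk/by-id/<ssd2>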

Does the VM get powered off often? In theory, if VMs see relatively few writes, the dirty bitmap [3] can enable the actual backup transfer process to run quite quickly.

[1] https://git.proxmox.com/?p=pve-qemu...16aeb06259c2bcd196e4949c;hb=refs/heads/master
[2] https://pbs.proxmox.com/docs/sysadmin.html#local-zfs-special-device
[3] https://pbs.proxmox.com/docs/technical-overview.html?highlight=dirty bitmap#fixed-sized-chunks
 
Not sure if this topic is still relevant, but I've seen similar issues over the last few months.
I've got a few VM load balancers that have become unstable on a regular basis (about once a week). After first suspecting the internet line, as well as many other things, I now seem to have also found a link between the backup times and the VM instability, with specific error messages showing up in the VM logs only while the backup is running, making the backup system a very likely candidate.

Although I can't say for certain when the problems started, it wouldn't surprise me if they also coincided with an upgrade to 7.1 (currently running 7.2-4).

Setup:
- 4 Proxmox nodes
- Ceph cluster with 32 standard HDDs on the compute nodes - no separate backup or storage systems
- 1 Gbit networking between all nodes (often busy, but links are never saturated for more than a few seconds)

If useful, happy to provide any further information.
 
In case it is useful for anyone else: switching my backup mode from "Snapshot" to "Suspend" seems to have done the trick, and the particular VMs are now more stable.
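
(For anyone wanting to try the same: the mode can be changed on the backup job in the GUI, or tested per run with vzdump, e.g. with a placeholder storage name:)

Code:
# One-off backup of VM 163 in suspend mode
vzdump 163 --mode suspend --storage <pbs-storage>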
 
