LXC Backup Interrupts LAN

May 18, 2019
I have a VM A that needs to be in constant connection with CT B.

VM A gets a daily backup:

Code:
INFO: status = running
INFO: update VM A: -lock backup
INFO: include disk 'scsi0' '________-disk-0'
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating archive '/mnt/dump/vzdump-qemu-___-2019_12_13-05_00_02.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'xxxxxxxxxxxxx'

And everything is still fine. Then the CT backups start. CT B is not even included in this backup job (it comes later in a separate job).

Code:
INFO: starting new backup job: vzdump ___ ___ ___ ___ ___ ___ ___ ___ ___  -storage dir_zfs-dir --quiet 1 --node node2 --mode snapshot --mailnotification failure --compress lzo
INFO: filesystem type on dumpdir is 'zfs' -using /var/tmp/vzdumptmp9156 for temporary files
INFO: Starting Backup of VM ___ (lxc)
INFO: Backup started at 2019-12-13 05:15:03
INFO: status = running
INFO: CT Name: example.com
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
  Logical volume "snap_vm-___-disk-0_vzdump" created.
INFO: creating archive '/mnt/dump/vzdump-lxc-___-2019_12_13-05_15_03.tar.lzo'
INFO: Total bytes written: 23857827840 (23GiB, 195MiB/s)
INFO: archive file size: 19.61GB
INFO: delete old backup '/mnt/dump/vzdump-lxc-___-2019_12_11-05_15_02.tar.lzo'
INFO: remove vzdump snapshot
  Logical volume "snap_vm-___-disk-0_vzdump" successfully removed
INFO: Finished Backup of VM ___ (00:02:04)

At 05:15:23 the connection between VM A and CT B shows as failed. Yesterday the same thing happened at the same time (less than 10 seconds after the first LXC backup starts). Starting an LXC backup manually also causes the connection to fail.

The services in both VM A and CT B use IP addresses to talk to each other (VM A reaches out to CT B every few seconds). The connection between the two doesn't stop, it is only a hiccup. But it is enough to cause the two services to fail. During the backup, the IO delay gets to 40-45%, but no other machines have a problem with that (most of them perform a task every few seconds or at most every couple of minutes, so I would know if one of them failed). This is the only scenario on my node where services depend on intra-LAN communication, aside from SMTP and DNS (which are sporadic).

Moving VM A to the same NVMe disk where the CTs are, and off the spinning disks that store the backups, keeps the issue from happening. But I don't understand how I/O delay, a disk bottleneck, affects network communication, especially since VM A writes very little to disk. I don't absolutely have to keep it on the spinning disks, but it is a crucial service with low disk usage, and I feel safer having it on ZFS RAID10 than on a single NVMe disk.

Extra info: VM A and CT B share a public IP. pve-firewall is not active at any level (the service is actually masked).
 
The connection between the two doesn't stop, it is only a hiccup.

Have you been able to reproduce this with tools like ping or iperf3? Additionally, does setting bandwidth limits in the backup change something? Like this:
Code:
vzdump 101 --storage local --bwlimit 100
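Something along these lines could show whether the hiccup is visible at the network layer; run it while the CT backup is going on (the address below is just a placeholder for CT B):
Code:
# on CT B
iperf3 -s
# on VM A, while the CT backup runs (replace 192.0.2.10 with CT B's address)
iperf3 -c 192.0.2.10 -t 300 -i 1
ping -i 0.2 192.0.2.10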
 
Have you been able to reproduce this with tools like ping or iperf3? Additionally, does setting bandwidth limits in the backup change something? Like this:
Code:
vzdump 101 --storage local --bwlimit 100

I will try it out, but communication between these two is so crucial that I am going to have to wait for this to happen again with another machine.
The question is: how do I set bwlimit for a backup job created via the GUI?
 
Backup jobs scheduled in the GUI will generate a cron entry in /etc/cron.d/vzdump. You can add the option there, but you have to be careful with the syntax.
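For example, a scheduled job's entry in /etc/cron.d/vzdump looks roughly like this, and appending --bwlimit (in KiB/s) at the end is one way to do it (the schedule and VMIDs below are placeholders):
Code:
# /etc/cron.d/vzdump (placeholder schedule and VMIDs)
PATH="/usr/sbin:/usr/bin:/sbin:/bin"
0 5 * * * root vzdump 100 101 --mode snapshot --storage dir_zfs-dir --quiet 1 --mailnotification failure --compress lzo --bwlimit 10240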

Alternatively, you can edit /etc/vzdump.conf. This file controls the global vzdump configuration. You can find more details in this chapter of our reference documentation. Note that the section Bandwidth Limit (and thus also the example with pvesm set) in this chapter is a subsection of "Restore".
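For the global route, /etc/vzdump.conf takes simple key: value lines, and a bwlimit set there applies to all backup jobs on the node (the value below is just an example):
Code:
# /etc/vzdump.conf - global vzdump defaults
# bwlimit is in KiB/s, so 10240 is roughly 10 MiB/s (example value)
bwlimit: 10240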

Is one of those options helpful for you?
 
This is helpful, thanks. But

You can use the `--bwlimit <integer>` option from the restore CLI commands to set up a restore-job-specific bandwidth limit. KiB/s is used as the unit for the limit, which means passing `10240` will limit the read speed of the backup to 10 MiB/s, ensuring that the rest of the possible storage bandwidth is available for the already running virtual guests, and thus the backup does not impact their operations.

My concern is the write speed of the backup. Unless "read" refers to the read speed of the source, not the target.
 
My concern is the write speed of the backup.
The read limit indirectly affects the write limit, as we cannot write more than we read.

The exact behavior of the bwlimit option when creating a backup depends on the guest type. For virtual machines we use it as the "speed" parameter for QEMU/QMP. For containers we use cstream with the -t option.
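For containers the effect is roughly like piping the archive stream through cstream; a simplified sketch of the principle, not the exact vzdump pipeline (paths are placeholders):
Code:
# rough illustration only - cstream -t limits throughput in bytes per second (~10 MiB/s here)
tar cpf - /mnt/vzsnap0 | cstream -t 10485760 | lzop > /mnt/dump/backup.tar.lzo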
 
Code:
ionice: <integer> (0 - 8) (default = 7)
       Set CFQ ionice priority.

Assuming ionice here is --class 2/best-effort (it would be absurd to have it be realtime), why is it 0-8 when only 0-7 are priority levels?

Is this just a proxy to ionice, and 8 becomes --class 3/idle? If I can use idle, I might as well not use --bwlimit.
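For reference, the distinction I am asking about, written as plain ionice invocations (illustrative only, not what vzdump actually runs):
Code:
ionice -c 2 -n 7 some_backup_command   # best-effort class, lowest priority within it
ionice -c 3 some_backup_command        # idle class: only gets disk time when nothing else wants it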
 
