I have a VM A that needs to be in constant connection with CT B.
VM A gets a daily backup:
Code:
INFO: status = running
INFO: update VM A: -lock backup
INFO: include disk 'scsi0' '________-disk-0'
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating archive '/mnt/dump/vzdump-qemu-___-2019_12_13-05_00_02.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'xxxxxxxxxxxxx'
Everything is still fine at this point. Then the CT backups start. CT B is not even included in this backup job (it comes later in a separate job):
Code:
INFO: starting new backup job: vzdump ___ ___ ___ ___ ___ ___ ___ ___ ___ -storage dir_zfs-dir --quiet 1 --node node2 --mode snapshot --mailnotification failure --compress lzo
INFO: filesystem type on dumpdir is 'zfs' -using /var/tmp/vzdumptmp9156 for temporary files
INFO: Starting Backup of VM ___ (lxc)
INFO: Backup started at 2019-12-13 05:15:03
INFO: status = running
INFO: CT Name: example.com
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
Logical volume "snap_vm-___-disk-0_vzdump" created.
INFO: creating archive '/mnt/dump/vzdump-lxc-___-2019_12_13-05_15_03.tar.lzo'
INFO: Total bytes written: 23857827840 (23GiB, 195MiB/s)
INFO: archive file size: 19.61GB
INFO: delete old backup '/mnt/dump/vzdump-lxc-___-2019_12_11-05_15_02.tar.lzo'
INFO: remove vzdump snapshot
Logical volume "snap_vm-___-disk-0_vzdump" successfully removed
INFO: Finished Backup of VM ___ (00:02:04)
At 05:15:23 the connection between VM A and CT B shows as failed. Yesterday the same thing happened at the same time, less than 10 seconds after the first LXC backup started. Starting an LXC backup manually also causes the connection to fail.
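For the record, triggering it by hand is just a plain vzdump call with the same options the scheduled job uses (the CT ID is a placeholder, redacted like above):
Code:
# Any CT from the job above reproduces the hiccup; <ctid> is a placeholder
vzdump <ctid> --storage dir_zfs-dir --mode snapshot --compress lzo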
The services in VM A and CT B talk to each other by IP address (VM A reaches out to CT B every few seconds). The connection between the two doesn't drop entirely; it is only a hiccup, but it is enough to make both services fail. During the backup, I/O delay reaches 40-45%, yet no other machine has a problem with it (most of them perform a task every few seconds, or at most every couple of minutes, so I would know if one failed). Aside from SMTP and DNS, which are sporadic, this is the only scenario on my node where services depend on intra-LAN communication.
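To put a number on the hiccup, I can time connections from VM A's side while a backup runs. A minimal sketch, assuming CT B's service listens on 10.0.0.2:8080 (both the address and the port are placeholders for my real ones):
Code:
#!/bin/bash
# Run inside VM A during an LXC backup. Logs any TCP connect to CT B
# that fails outright or stalls for more than 2 seconds.
while true; do
    if ! timeout 2 bash -c 'cat </dev/null >/dev/tcp/10.0.0.2/8080'; then
        echo "$(date '+%H:%M:%S') connect to CT B failed or stalled"
    fi
    sleep 1
done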
Moving VM A to the same NVMe disk where the CTs live, off the spinning disks that store the backups, prevents the issue. But I don't understand how I/O delay, a disk bottleneck, affects network communication, especially since VM A writes to disk very little. I don't absolutely have to keep it on the spinning disks, but it is a crucial service with low disk usage, and I feel safer having it on ZFS RAID10 than on a single NVMe disk.
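One thing I may try before giving up on the spinning disks is throttling vzdump so the I/O delay never spikes that high. A sketch of /etc/vzdump.conf; the bwlimit value is a guess I would have to tune:
Code:
# /etc/vzdump.conf -- node-wide vzdump defaults
# bwlimit is in KiB/s; 51200 (~50 MiB/s) is an assumed starting point
bwlimit: 51200
# lower the backup's I/O priority (0 = highest, 8 = lowest; default is 7)
ionice: 8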
Extra info: VM A and CT B share a public IP. pve-firewall is not active at any level (the service is actually masked).
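For completeness, this is how the firewall state can be verified on the node:
Code:
systemctl is-enabled pve-firewall   # prints "masked"
systemctl is-active pve-firewall    # prints "inactive"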