I have a VM A that needs to be in constant connection with CT B.
VM A gets a daily backup:
Code:
INFO: status = running
INFO: update VM A: -lock backup
INFO: include disk 'scsi0' '________-disk-0'
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating archive '/mnt/dump/vzdump-qemu-___-2019_12_13-05_00_02.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'xxxxxxxxxxxxx'
Everything is still fine at this point. Then the CT backups start. CT B is not even included in this backup job (it comes later in a separate job):
Code:
INFO: starting new backup job: vzdump ___ ___ ___ ___ ___ ___ ___ ___ ___ -storage dir_zfs-dir --quiet 1 --node node2 --mode snapshot --mailnotification failure --compress lzo
INFO: filesystem type on dumpdir is 'zfs' -using /var/tmp/vzdumptmp9156 for temporary files
INFO: Starting Backup of VM ___ (lxc)
INFO: Backup started at 2019-12-13 05:15:03
INFO: status = running
INFO: CT Name: example.com
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
Logical volume "snap_vm-___-disk-0_vzdump" created.
INFO: creating archive '/mnt/dump/vzdump-lxc-___-2019_12_13-05_15_03.tar.lzo'
INFO: Total bytes written: 23857827840 (23GiB, 195MiB/s)
INFO: archive file size: 19.61GB
INFO: delete old backup '/mnt/dump/vzdump-lxc-___-2019_12_11-05_15_02.tar.lzo'
INFO: remove vzdump snapshot
Logical volume "snap_vm-___-disk-0_vzdump" successfully removed
INFO: Finished Backup of VM ___ (00:02:04)
At 05:15:23 the connection between VM A and CT B shows as failed. Yesterday the same thing happened at the same time, less than 10 seconds after the first LXC backup started. Starting an LXC backup manually also causes the connection to fail.
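For the record, triggering it by hand is just a plain vzdump call with the same options the scheduled job uses (the CT ID is a placeholder, redacted like above):
Code:
# Any CT from the job above reproduces the hiccup; <ctid> is a placeholder
vzdump <ctid> --storage dir_zfs-dir --mode snapshot --compress lzo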
The services in VM A and CT B talk to each other by IP address (VM A reaches out to CT B every few seconds). The connection between the two doesn't drop entirely; it is only a hiccup, but it is enough to make both services fail. During the backup, I/O delay reaches 40-45%, yet no other machine has a problem with it (most of them perform a task every few seconds, or at most every couple of minutes, so I would know if one failed). Aside from SMTP and DNS, which are sporadic, this is the only scenario on my node where services depend on intra-LAN communication.
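To put a number on the hiccup, I can time connections from VM A's side while a backup runs. A minimal sketch, assuming CT B's service listens on 10.0.0.2:8080 (both the address and the port are placeholders for my real ones):
Code:
#!/bin/bash
# Run inside VM A during an LXC backup. Logs any TCP connect to CT B
# that fails outright or stalls for more than 2 seconds.
while true; do
    if ! timeout 2 bash -c 'cat </dev/null >/dev/tcp/10.0.0.2/8080'; then
        echo "$(date '+%H:%M:%S') connect to CT B failed or stalled"
    fi
    sleep 1
done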
Moving VM A to the same NVMe disk where the CTs live, off the spinning disks that store the backups, prevents the issue. But I don't understand how I/O delay, a disk bottleneck, affects network communication, especially since VM A writes to disk very little. I don't absolutely have to keep it on the spinning disks, but it is a crucial service with low disk usage, and I feel safer having it on ZFS RAID10 than on a single NVMe disk.
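One thing I may try before giving up on the spinning disks is throttling vzdump so the I/O delay never spikes that high. A sketch of /etc/vzdump.conf; the bwlimit value is a guess I would have to tune:
Code:
# /etc/vzdump.conf -- node-wide vzdump defaults
# bwlimit is in KiB/s; 51200 (~50 MiB/s) is an assumed starting point
bwlimit: 51200
# lower the backup's I/O priority (0 = highest, 8 = lowest; default is 7)
ionice: 8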
Extra info: VM A and CT B share a public IP. pve-firewall is not active at any level (the service is actually masked).
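For completeness, this is how the firewall state can be verified on the node:
Code:
systemctl is-enabled pve-firewall   # prints "masked"
systemctl is-active pve-firewall    # prints "inactive"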