Hey folks. Im gonna need some assistance in debugging and possibly fixing this issue.
I have a Proxmox server on a Lenovo SR550 host, with an LXC running OL 8.5.
The LXC has a Bacula server on it.
We are observing:
1. Slow network transfer speeds to Bacula - 250Mbps at most.
2. Network failures (timeouts/connection reset by peer) when running big network file transfers.
3. Only ingress traffic towards the Bacula LXC is affected. Egress traffic from it is mostly fine (or at least not 10x slower).
4. Network transfer speeds may start high (near 1Gbps), but drop quickly.
Host information:
Bash:
root@pve:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
Stepping: 7
root@pve:~# free -m
total used free shared buff/cache available
Mem: 63996 21853 12703 847 29440 40573
1x 1Gbps Ethernet port in use, configured in access mode.
PVE information:
Bash:
pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)
root@pve:~# uname -a
Linux pve 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
root@pve:~# cat /etc/debian_version
11.6
Storage information:
2x4TB SAS HDD, running in RAID1.
PVE OS and LVM storage are both located on the RAID1.
LXC Container info:
Code:
arch: amd64
cores: 4
features: fuse=1,mount=nfs;cifs,nesting=1
hostname: bacula.ZZZ
memory: 4096
mp0: local-lvm:vm-108-disk-3,mp=/mnt/storage,size=1400G
net0: name=eth0,bridge=vmbr0,firewall=1,gw=X.X.X.X.133,hwaddr=3E:D6:F9:B2:2D:68,ip=X.X.X.X.254/24,type=veth
net1: name=eth1,bridge=vmbr0,firewall=1,hwaddr=6E:98:5D:BA:8E:5D,ip=Y.Y.Y.89/16,type=veth
onboot: 1
ostype: centos
rootfs: local-lvm:vm-108-disk-2,size=8G
swap: 1024
tags: ol8
Performance tests:
Disk tests:
Run directly in the LXC container. Performance is the same as on the host.
Bash:
[root@bacula /mnt/storage]# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4096k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=64
fio-3.19
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
Starting 1 process
Jobs: 1 (f=1): [m(1)][94.4%][r=179MiB/s,w=59.8MiB/s][r=44,w=14 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=224836: Sat Jan 13 10:34:22 2024
read: IOPS=45, BW=183MiB/s (192MB/s)(3000MiB/16398msec)
bw ( KiB/s): min=130549, max=236621, per=98.92%, avg=185309.61, stdev=28990.91, samples=31
iops : min= 31, max= 57, avg=44.42, stdev= 7.10, samples=31
write: IOPS=16, BW=66.8MiB/s (70.1MB/s)(1096MiB/16398msec); 0 zone resets
bw ( KiB/s): min=24478, max=106071, per=98.84%, avg=67646.90, stdev=26073.74, samples=31
iops : min= 5, max= 25, avg=15.58, stdev= 6.39, samples=31
cpu : usr=0.28%, sys=8.03%, ctx=1917, majf=0, minf=7
IO depths : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.6%, 32=3.1%, >=64=93.8%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=750,274,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=183MiB/s (192MB/s), 183MiB/s-183MiB/s (192MB/s-192MB/s), io=3000MiB (3146MB), run=16398-16398msec
WRITE: bw=66.8MiB/s (70.1MB/s), 66.8MiB/s-66.8MiB/s (70.1MB/s-70.1MB/s), io=1096MiB (1149MB), run=16398-16398msec
Disk stats (read/write):
dm-14: ios=48814/17879, merge=0/0, ticks=10439368/5696676, in_queue=16136044, util=98.55%, aggrios=49643/18043, aggrmerge=0/0, aggrticks=10520540/5791300, aggrin_queue=16311840, aggrutil=99.93%
dm-4: ios=49643/18043, merge=0/0, ticks=10520540/5791300, in_queue=16311840, util=99.93%, aggrios=24822/9021, aggrmerge=0/0, aggrticks=5261990/2895632, aggrin_queue=8157622, aggrutil=99.94%
dm-2: ios=2/0, merge=0/0, ticks=3536/0, in_queue=3536, util=21.18%, aggrios=12731/4569, aggrmerge=36916/13531, aggrticks=2612828/1442517, aggrin_queue=4055346, aggrutil=99.96%
sda: ios=12731/4569, merge=36916/13531, ticks=2612828/1442517, in_queue=4055346, util=99.96%
dm-3: ios=49643/18043, merge=0/0, ticks=10520444/5791264, in_queue=16311708, util=99.94%
Network performance tests were run from a VM located in a vCenter cluster. The VM has both public and private IP addresses on a single interface. Runs CentOS 7.9.
Network tests with IPERF3:
No issues when testing between the vCenter VM > PVE host or the vCenter VM > LXC, regardless of direction. Both reached almost a full 1Gbps (around 950Mbps). Test duration was 120s (120 one-second samples). Tested with both interfaces/networks.
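For reference, the iperf3 tests looked roughly like this (a sketch; the IPs are placeholders for the real addresses, and the exact options may differ slightly from what was actually run):

```shell
# On the LXC (and separately on the PVE host): start an iperf3 server
iperf3 -s

# From the vCenter VM: 120-second test towards the LXC (ingress direction)
iperf3 -c Y.Y.Y.89 -t 120

# Same path with -R, so the server sends instead (egress direction)
iperf3 -c Y.Y.Y.89 -t 120 -R
```

The `-R` run matters here because only ingress towards the LXC shows the problem with SCP, yet iperf3 was fine both ways.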
Network tests with SCP:
Here is where the big difference shows up. Transferring with SCP from the vCenter VM > LXC on either interface is hugely slower than transferring to an Ubuntu 22.04 (kernel 6.5) VM running on the same Proxmox host, with 2 interfaces in the same networks as the LXC and the vCenter VM.
vCenter VM > Proxmox host, private IP (the only thing the host has directly attached).
Code:
[root@gitss backups]# scp registry.tar.gz root@Y.Y.Y.1:/root/
registry.tar.gz 100% 7056MB 111.1MB/s 01:03
vCenter VM > LXC, private IP interface.
Code:
[root@gitss backups]# scp registry.tar.gz root@Y.Y.Y.89:/mnt/storage/
registry.tar.gz 100% 7056MB 18.0MB/s 06:33
vCenter VM > LXC, public IP interface.
Code:
[root@gitss backups]# scp registry.tar.gz root@X.X.X.254:/mnt/storage/
registry.tar.gz 100% 7056MB 12.2MB/s 09:38
vCenter VM > Proxmox Ubuntu VM, private IP interface (virtio drives, LSI controller; poor disk performance on the SCSI drives).
Code:
[root@gitss backups]# scp registry.tar.gz ubuntu@Y.Y.Y.79:/home/ubuntu/
registry.tar.gz 100% 7056MB 103.7MB/s 01:08
vCenter VM > Proxmox Ubuntu VM, public IP interface (virtio drives, LSI controller; poor disk performance on the SCSI drives).
Code:
[root@gitss backups]# scp registry.tar.gz ubuntu@X.X.X.226:/home/ubuntu/
registry.tar.gz 100% 7056MB 110.3MB/s 01:03
Things I tried:
Changing between firewall=0 and firewall=1, no difference.
Changing between mount=none and mount=cifs,nfs, no difference.
Changing the net.ipv4.tcp_keepalive_* sysctls on the vCenter VM, but no difference between values.
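For reference, the firewall toggle and the keepalive changes were applied roughly like this (a sketch: container ID 108 is taken from the disk names in the config above, and the keepalive values shown are examples, not the exact ones tested; the TCP keepalive sysctls live under net.ipv4):

```shell
# On the PVE host: toggle the Proxmox firewall on the container's first
# interface (the full net0 line must be repeated to change one option)
pct set 108 -net0 name=eth0,bridge=vmbr0,firewall=0,gw=X.X.X.X.133,hwaddr=3E:D6:F9:B2:2D:68,ip=X.X.X.X.254/24,type=veth

# On the vCenter VM: inspect the current TCP keepalive settings
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Try more aggressive keepalives for the duration of a transfer
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=5
```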
If we can't find a solution to this, we will have to move the Bacula service from the LXC to a VM, which is undesirable since it means more work.