PVE upgrade 7.1 -> 7.2: speed of a passed-through SAS card drops dramatically

ChAoS

Hi,

we are hosting a Veeam backup server within Proxmox for backing up our primary VMware environment.

The machine is a Lenovo SR650 with two Intel 8-core CPUs and 96 GB of RAM.
It contains two Avago MegaRAID controllers: one for booting Proxmox, the other passed through into the VM for the Veeam storage.
In addition there are:
an old HP SmartArray P410 controller, passed through into the same VM to extend the storage (for other purposes)
a Fujitsu Avago CP400e SAS controller, passed through into the same VM so Veeam can talk to an LTO8 drive (daily backup to tape at around 300 MB/s)

Additionally it has a Lenovo 10G/25G SFP card for the LAN connection (not passed through) and a QLogic Fibre Channel card (currently unused).

The VM is a German Windows Server 2019 with the current version of Veeam Backup installed. All available virtual CPUs were assigned to the VM (2 sockets with 16 cores each), along with 32 GB of RAM. It is a q35 v6.0 VM.

With PVE 7.0 and 7.1 it was working like a charm: the backup of our VMware environment took about one hour into the storage and about 5.5 hours to copy it to tape (about 5 terabytes).

After upgrading to 7.2, the problems started (immediately after the upgrade):
Writing to tape starts at >200 MB/s and drops after roughly 15 minutes to a steady 20 MB/s. While I can still copy files between the two RAID controllers at 200-300 MB/s, the speed to the tape stays pinned at 20 MB/s.
Only after a reboot of the VM do we regain full speed, and only for a few minutes.

Checked: /var/log/syslog and dmesg show no messages at the moment of the drop. The Windows event log shows nothing either. No other errors detected.

What is going wrong?

Do we need to reinstall the server with 7.1?

Thx
Dirk
 
The upgrade to 7.2 probably included both a new kernel (possibly including new mitigations for "retbleed", which can cost a lot of performance) and QEMU 7.x (although I don't know of any open issue that would match your symptoms, it's not entirely impossible that some changes there affect your use case negatively). The kernel part is easy to test if you can afford a reboot: either boot the previous kernel or disable the retbleed mitigations on the kernel command line. It's of course also possible that something is wrong with the tape or drive and it just coincided with the upgrade.
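For reference, disabling the retbleed mitigation could look roughly like this (a sketch, assuming a GRUB-booted system; on ZFS-on-UEFI installs using systemd-boot, the option belongs in /etc/kernel/cmdline instead):

Code:
# /etc/default/grub -- keep your existing options and append retbleed=off
GRUB_CMDLINE_LINUX_DEFAULT="quiet retbleed=off"

# apply the change and reboot
update-grub
reboot

# verify afterwards (the file only exists on kernels that know about retbleed)
cat /sys/devices/system/cpu/vulnerabilities/retbleed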

Could you maybe include:
- pveversion -v
- qm config XXX (where XXX is the VMID)
- the /var/log/apt/history.log entry for the upgrade(s)
 
Hello and good morning Fabian,

thanks for your quick reply.

As we reinstalled the node three times yesterday, I am pretty sure that this is NOT a hardware issue.

The history:
We have 6 nodes (5x very old, recycled hardware; 1x a one-year-old Lenovo SR650 v1). Proxmox was last updated in spring 2022 with 7.1,
and everything was running fine.
We upgraded Proxmox to 7.2 via dist-upgrade and the problems started on the Lenovo.
Tried many, many things within the VM; nothing helped.
Yesterday we:
removed the node from the datacenter (after first moving the 2019 VM offline to another node)
did a clean install from the 7.2 ISO from the website
applied NO updates
moved the VM back again
tested the speed -> slow
removed the node from the datacenter (after first moving the 2019 VM offline to another node)
did a clean install from the 7.1 ISO from the website
applied NO updates
moved the VM back again
tested the speed -> normal!

AFTER this I read the following in the changelog:
Certain systems may need to explicitly enable iommu=pt (SR-IOV pass-through) on the kernel command line. There are some reports for this to solve issues with Avago/LSI RAID controllers, for example in a Dell R340 Server booted in legacy mode

So we did a clean upgrade to the current 7.2 via dist-upgrade
added the option iommu=pt to the GRUB command line and ran "update-grub" (see the sketch below)
rebooted
moved the VM back again
tested the speed -> slow
removed the node from the datacenter (after first moving the 2019 VM offline to another node)
did a clean install from the 7.1 ISO from the website
applied NO updates
moved the VM back again
tested the speed -> normal!
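For context, the iommu=pt option mentioned in the changelog amounts to roughly this (a sketch, assuming GRUB/legacy boot):

Code:
# /etc/default/grub -- append iommu=pt to the existing options
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# regenerate the GRUB config and reboot
update-grub
reboot

# verify the running command line afterwards
cat /proc/cmdline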

As we see a flapping state (sometimes after rebooting the VM the speed is high again and drops after a few minutes), I explicitly blacklisted the drivers and bound the devices to vfio-pci:

/etc/modprobe.d/pve-blacklist.conf
Code:
# bind the passed-through controllers to vfio-pci at boot
# (vendor:device IDs of the HP SmartArray and the LSI/Avago SAS HBA)
options vfio-pci ids=103c:323a,1000:0097
# prevent the host from loading drivers for the passed-through devices
blacklist nvidiafb
blacklist hpsa
blacklist mpt3sas
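For these modprobe options to take effect at early boot, the initramfs usually has to be rebuilt (a sketch, assuming the stock initramfs-tools setup):

Code:
# rebuild the initramfs for all installed kernels, then reboot
update-initramfs -u -k all

# after the reboot, check that vfio-pci (not hpsa/mpt3sas) is bound to the devices
lspci -nnk | grep -B 2 -A 2 vfio-pci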

Code:
~# qm config 100
agent: 1
boot: order=virtio0;sata0
cores: 8
cpu: host
description: hostpci1%3A 0000%3A58%3A00,pcie=1%0Ahostpci2%3A 0000%3A06%3A00,pcie=1%0Ahostpci3%3A 0000%3Aaf%3A00,pcie=1%0Ahostpci4%3A 0000%3A86%3A00,pcie=1
hostpci0: 0000:06:00,pcie=1
hostpci1: 0000:86:00,pcie=1
hostpci2: 0000:2f:00,pcie=1
machine: pc-q35-5.1
memory: 32768
name: BACKUP-K-1
net0: virtio=56:36:58:DA:36:37,bridge=vmbr888,queues=4,tag=20
numa: 0
ostype: win10
sata0: none,media=cdrom
scsihw: virtio-scsi-pci
smbios1: uuid=bXXXXXXXXXXXXX05ceb0
sockets: 2
virtio0: VMs:vm-100-disk-0,cache=writeback,size=80G
vmgenid: xxxxxxx

Taken from another server upgraded on the same day:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

Taken from the server that is currently running fine:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

I know these are basic questions, but:
How can I start Proxmox with an older kernel?
How do I disable the retbleed mitigations? (maybe in /etc/default/grub?)


Today I cannot promise that I will have enough time to test, but I will do it as soon as possible, as I do not want a version-mixed datacenter!

Thx
Dirk
 
Dear Fabian,

tested:
Upgraded to the current version without touching anything:
slow speed.
Booted into an older kernel via GRUB:
fast speed.
Disabled the mitigations via the GRUB cmdline:
slow speed...

Current config:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-2
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

Current lscpu output:
Code:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz
Stepping:                        7
CPU MHz:                         3200.000
BogoMIPS:                        6400.00
Virtualization:                  VT-x
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        16 MiB
L3 cache:                        22 MiB
NUMA node0 CPU(s):               0-7,16-23
NUMA node1 CPU(s):               8-15,24-31
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable
Vulnerability Retbleed:          Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

EDIT:
See the attached file:
I am wondering about the errors that appear with the new kernel (they do not appear with the old one),

and about the massive number of errors in dmesg:
Code:
[   16.075606] i40e 0000:09:00.2: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
[   16.100523] i40e 0000:09:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
 

Attachment: boot.png
Which older kernel did you test?
 
Tested the kernel shipped with the 7.1 ISO:
5.13.19-2
It works fine with Proxmox 7.2.

I am now looking into making this kernel the default in GRUB (see the sketch below).
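A sketch of what pinning an older kernel in GRUB could look like (the exact menu entry names have to be taken from your own grub.cfg):

Code:
# list the available menu entries
grep "menuentry '" /boot/grub/grub.cfg

# in /etc/default/grub, point GRUB_DEFAULT at "submenu>entry", e.g.:
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.13.19-2-pve"

# apply and reboot
update-grub
reboot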

As this machine only backs up our VMware environment, it is only important at night. During the day I can test and reinstall it as often as time allows.

If you have other ideas about what I can test, please let me know.
 
Thanks. Could you verify that your test with the mitigation disabled on 5.15 actually had the mitigation disabled? E.g. check the output of cat /sys/devices/system/cpu/vulnerabilities/retbleed?
 
The mitigation seems to be off:

Code:
root@xxx:~# cat /sys/devices/system/cpu/vulnerabilities/retbleed

Vulnerable

root@xxx:~#
 
If you are willing to do more tests: could you go back through the 5.15 kernel release series one by one and check when the performance regression started? If you want to cut it short, you can start with the earliest one (to see whether it's a general 5.15 issue). Don't forget to remove test kernels once you no longer need them, to avoid running out of space on /boot or the ESP(s) ;)
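Installing and removing a specific kernel version for such a test could look like this (a sketch; 5.15.5-1 is just an example version):

Code:
# list the available 5.15 kernel packages
apt list "pve-kernel-5.15*"

# install one specific version for testing, then reboot into it via the GRUB menu
apt install pve-kernel-5.15.5-1-pve

# clean up again once the test is done
apt remove pve-kernel-5.15.5-1-pve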
 
Hi,

thx,
now I can say for sure that the problem occurs between these kernels:

5.15.5-1 (first 5.15 found; currently installed, and tests are running fine)
and
5.15.12-1 (tested now: slow)
5.15.17-1 (tested now: slow)
5.15.30-2 (shipped with the 7.2 ISO; tests failed immediately)...

Will test pve-kernel-5.15.17-1 next :)

EDIT:
I am now sure, that retbleet fix has nothing to do with my problem as:

Code:
root@xxx:~# cat /sys/devices/system/cpu/vulnerabilities/retbleed
cat: /sys/devices/system/cpu/vulnerabilities/retbleed: No such file or directory
root@xxx:~# uname -a
Linux PVE-BKP-K-1 5.15.17-1-pve #1 SMP PVE 5.15.17-1 (Mon, 31 Jan 2022 09:41:30 +0100) x86_64 GNU/Linux
and speed is SLOW!!!


EDIT: now testing:
pve-kernel-5.15.12-1-pve (EDIT: slow!!!)
pve-kernel-5.15.7-1-pve
 
One thing that can be done quite fast (if you have persistent journalling available):
* compare the kernel messages between a known good kernel (5.15.5-1) and a known bad one (5.15.17-1) - see the sketch below
* see man journalctl for details, but `journalctl -b0` gives the journal for the current boot, `journalctl -b-1` the one for the previous boot, and `journalctl --list-boots` the times of each boot

That might point to where the issue occurs.
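Comparing the kernel messages of two boots could be as simple as this (a sketch; pick the boot offsets from `journalctl --list-boots`):

Code:
# kernel messages of the previous boot (e.g. the known good kernel) ...
journalctl -k -b -1 > good.log

# ... and of the current boot (the known bad kernel)
journalctl -k -b 0 > bad.log

diff good.log bad.log | less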
 
Hi all,

now I have figured it out: the last good working kernel was 5.15.5-1.

All kernels I could download after 5.15.5-1 (beginning with 5.15.7-1) seem to have this bug. I also tested the most recent 5.19.5-1; all of them dropped, immediately or after some time, to 26-41 MB/s.

Thx
Dirk
 
There are some changes with regard to TLB flushing and nested guests. Is your Windows VM itself using virtualization in any form?
 
