I'm a relative novice with this stuff, so please excuse this noise if I am wrong, but that ^ smells a lot like: https://forum.proxmox.com/threads/a...-with-kernel-7-and-multiple-nic-types.183574/
Could be related, but Tailscale is specifically UDP and not TCP, so it would be a similar offloading problem on a different datapath. It could just be that the whole offloading feature set is somewhat broken with some NIC combinations.
Leaving a link to this here as well so others looking for problems can see it: https://bugzilla.proxmox.com/show_bug.cgi?id=7627
TCP checksum offloading for virtio is broken for at least some NIC types on the latest kernel + QEMU for newer Linux guests / Windows guests.
I've seen similar issues but with a Realtek NIC. I just figured it was Realtek being Realtek and turned off checksum offloading.
See the bug: I can reproduce this on Broadcom and Intel NICs as well, so I think it's a wider problem. The problem appears when newer Linux kernel features in 6.16+ interact with new QEMU versions 11.0+, which introduce new offloading scenarios. I don't know why these break, though, or whether it's a problem in QEMU or the host kernel.
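For anyone who wants to try the same workaround of turning off checksum offloading, here is a minimal sketch. The interface name `eth0` is a placeholder and the helper name `offload_off_cmd` is made up for this example; the thread does not say whether the change was made on the host NIC or inside the guest, so apply it on whichever side you are testing. The function only prints the `ethtool` command so it can be reviewed before running it as root.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the ethtool command that
# disables RX/TX checksum offloading on one interface.
# Nothing is changed by this function; it only prints the command.
offload_off_cmd() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: offload_off_cmd <interface>" >&2
        return 1
    fi
    printf 'ethtool -K %s rx off tx off\n' "$iface"
}

# Check the current offload state first (placeholder interface name):
#   ethtool -k eth0 | grep -i checksum
# Then review the printed command and run it as root:
offload_off_cmd eth0
```

Settings changed with `ethtool -K` do not survive a reboot, which makes this a low-risk way to test whether offloading is involved.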
Hi,
with QEMU 10.2, there was a switch to using io_uring for the IO thread event loops and the IO pressure/wait accounting is set via the io_uring subsystem now. It's a different kernel subsystem from before, so it's not unexpected if it's different.
So, if I understood correctly, this is just a different way the graph is calculated after the update, and it does not necessarily mean that performance is affected, right?
Yes. To be precise: a different way the IO wait metric is calculated.
I am trying to install the NVIDIA host GRID drivers on 7.0.2-7-pve and I am getting this error:
fatal error: os-interface.h: No such file or directory
I have these installed:
proxmox-headers-6.17.13-12-pve
proxmox-headers-7.0.2-7-pve
What am I missing?
Which driver version? This could indicate an incompatibility between the driver and the kernel version.
550.144.02 looks like the latest version that the P100 and V100 support.
Linux 6.15 or newer has no support for the EXTRA_CFLAGS variable in out-of-tree module Kbuild files, which 550.144.02 needs.
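To check whether an unpacked driver source tree still relies on EXTRA_CFLAGS, something like the following sketch could help. The helper name `find_extra_cflags` and the directory argument are assumptions for illustration; point it at wherever the installer's kernel module sources were extracted. In current Kbuild, `ccflags-y` is the variable that takes the place of EXTRA_CFLAGS, but whether renaming it is enough to build this particular driver is untested here.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): list Kbuild/Makefile files
# under a directory that still assign EXTRA_CFLAGS.
# Uses GNU grep's --include, which is available on Debian/Proxmox.
find_extra_cflags() {
    dir="${1:-.}"
    grep -rlE --include=Kbuild --include=Makefile \
        '^[[:space:]]*EXTRA_CFLAGS[[:space:]]*[+:]?=' "$dir"
}

# Default to the current directory; pass the extracted source directory
# as the first argument to scan somewhere else.
find_extra_cflags "${1:-.}" || echo "no EXTRA_CFLAGS assignments found"
```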
Is this the correct way to set it to force TSC?
nano /etc/default/grub
and then change the line to: GRUB_CMDLINE_LINUX_DEFAULT="quiet clocksource=tsc tsc=reliable"
Is there any risk to set this? Do I risk the host not booting at all?
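Before and after changing the kernel command line, it may help to see which clocksource is actually in use. A minimal sketch, assuming the standard sysfs location; the function name `clocksource_info` and its optional directory argument are made up for this example (the argument only exists so the function can be pointed at a different path).

```shell
#!/bin/sh
# Sketch: report the active and the available clocksources from sysfs.
# The optional argument overrides the sysfs directory; it defaults to the
# standard kernel path.
clocksource_info() {
    cs_dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    printf 'current: %s\n' "$(cat "$cs_dir/current_clocksource")"
    printf 'available: %s\n' "$(cat "$cs_dir/available_clocksource")"
}

# Only query the real sysfs path when it exists on this machine.
if [ -r /sys/devices/system/clocksource/clocksource0/current_clocksource ]; then
    clocksource_info
fi

# After editing /etc/default/grub, the change is applied with:
#   update-grub
# Assumption: the host boots via GRUB. Hosts that boot via systemd-boot
# keep the kernel command line in /etc/kernel/cmdline and apply it with:
#   proxmox-boot-tool refresh
```

Running the same check after the reboot shows whether the parameter took effect.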
[ 436.209061] pcieport 0000:80:1b.4: AER: Correctable error message received from 0000:80:1b.4
[ 436.209134] pcieport 0000:80:1b.4: device [8086:7f44] error status/mask=00300000/00000000
[ 436.209138] pcieport 0000:80:1b.4: [20] UnsupReq
[ 436.209140] pcieport 0000:80:1b.4: [21] ACSViol (First)
[ 437.238805] thunderbolt 0000:84:00.0: AER: can't recover (no error_detected callback)
[ 437.238815] xhci_hcd 0000:97:00.0: AER: can't recover (no error_detected callback)
[ 437.238832] pcieport 0000:80:1b.4: AER: device recovery failed
... (repeats continuously until host reboot)
I can confirm similar behavior on a 4-node cluster running Proxmox VE 9.2.2.
Cluster hardware:
- 2x Intel Xeon E3-1220L v2
- 1x AMD EPYC 7551P
- 1x AMD EPYC 3251
Only the EPYC 3251 node is affected.
Symptoms:
- Progressive performance degradation after ~2 days uptime on kernel 7.0.2-6-pve
- CPU usage gradually rises until the host reaches nearly 100% system CPU usage
- High load average with almost no IO wait
- All KVM guests are affected equally
- Host becomes nearly unusable
Additional notes:
- Current clocksource is already tsc
- read_hpet usage is minimal (~1%)
- RAM, swap and IO usage remain normal
- The issue appears related to virtualization syscalls / context switching / scheduler activity
- powertop shows very high tick_nohz_handler, sched(softirq) and APIC timer activity
- dbs_work_handler activity is also unusually high
Downgrading back to 6.8.12-15-pve restores normal behavior.
Important observations:
- The EPYC 7551P node running the same Proxmox/kernel version does NOT show the issue
- Changing CPU governor from ondemand to performance did not solve the problem
- Issue seems specific to the EPYC 3251 embedded platform
Do you have any new updates or solutions other than reverting to the previous kernel version?
echo scan-time > /sys/kernel/mm/ksm/advisor_mode