Opt-in Linux 6.14 Kernel for Proxmox VE 8 available on test & no-subscription

6.14.0 is working fine on my home server with an AMD EPYC 3251.
6.14.5 is not working:
I get 100% I/O on loop0 after starting an LXC container.


Kernel log:
Code:
Jun 22 13:58:42.578293 kernel: INFO: task kworker/u65:3:1219 blocked for more than 368 seconds.
Jun 22 13:58:42.578399 kernel:       Tainted: P           O       6.14.5-1-bpo12-pve #1
Jun 22 13:58:42.578444 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 22 13:58:42.578460 kernel: task:kworker/u65:3   state:D stack:0     pid:1219  tgid:1219  ppid:2      task_flags:0x4208060 flags:0x00004000
Jun 22 13:58:42.578481 kernel: Workqueue: writeback wb_workfn (flush-7:0)
Jun 22 13:58:42.578495 kernel: Call Trace:
Jun 22 13:58:42.578509 kernel:  <TASK>
Jun 22 13:58:42.578524 kernel:  __schedule+0x495/0x13f0
Jun 22 13:58:42.578538 kernel:  ? __pfx_wbt_inflight_cb+0x10/0x10
Jun 22 13:58:42.578553 kernel:  ? __pfx_wbt_inflight_cb+0x10/0x10
Jun 22 13:58:42.578567 kernel:  schedule+0x29/0x130
Jun 22 13:58:42.578580 kernel:  io_schedule+0x4c/0x80
Jun 22 13:58:42.578593 kernel:  rq_qos_wait+0xbb/0x160
Jun 22 13:58:42.578607 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.578624 kernel:  ? __pfx_wbt_cleanup_cb+0x10/0x10
Jun 22 13:58:42.578638 kernel:  ? __pfx_rq_qos_wake_function+0x10/0x10
Jun 22 13:58:42.578652 kernel:  ? __pfx_wbt_inflight_cb+0x10/0x10
Jun 22 13:58:42.578665 kernel:  wbt_wait+0xb5/0x130
Jun 22 13:58:42.578678 kernel:  __rq_qos_throttle+0x28/0x40
Jun 22 13:58:42.578689 kernel:  blk_mq_submit_bio+0x4d9/0x820
Jun 22 13:58:42.578714 kernel:  __submit_bio+0x75/0x290
Jun 22 13:58:42.578729 kernel:  ? aggsum_add+0x1ac/0x1d0 [zfs]
Jun 22 13:58:42.578744 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.578757 kernel:  submit_bio_noacct_nocheck+0x2ea/0x3b0
Jun 22 13:58:42.578772 kernel:  submit_bio_noacct+0x1a0/0x5b0
Jun 22 13:58:42.578786 kernel:  submit_bio+0xb1/0x110
Jun 22 13:58:42.578800 kernel:  submit_bh_wbc+0x164/0x1a0
Jun 22 13:58:42.578815 kernel:  __block_write_full_folio+0x1e3/0x420
Jun 22 13:58:42.578828 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Jun 22 13:58:42.578841 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Jun 22 13:58:42.578854 kernel:  block_write_full_folio+0x133/0x180
Jun 22 13:58:42.578868 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.578883 kernel:  ? writeback_iter+0x101/0x2d0
Jun 22 13:58:42.578896 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Jun 22 13:58:42.578909 kernel:  ? __pfx_block_write_full_folio+0x10/0x10
Jun 22 13:58:42.578922 kernel:  write_cache_pages+0x63/0xb0
Jun 22 13:58:42.578940 kernel:  blkdev_writepages+0x5b/0xa0
Jun 22 13:58:42.578954 kernel:  do_writepages+0x86/0x290
Jun 22 13:58:42.578967 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.578992 kernel:  ? select_task_rq_fair+0x176/0x2270
Jun 22 13:58:42.579006 kernel:  ? sched_clock_noinstr+0x9/0x10
Jun 22 13:58:42.579021 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.579034 kernel:  ? sched_clock+0x10/0x30
Jun 22 13:58:42.579048 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.579059 kernel:  __writeback_single_inode+0x44/0x350
Jun 22 13:58:42.579072 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.579085 kernel:  ? inode_to_bdi+0x3c/0x50
Jun 22 13:58:42.579099 kernel:  writeback_sb_inodes+0x252/0x540
Jun 22 13:58:42.579112 kernel:  __writeback_inodes_wb+0x54/0x100
Jun 22 13:58:42.579126 kernel:  ? queue_io+0x113/0x120
Jun 22 13:58:42.579139 kernel:  wb_writeback+0x1ad/0x320
Jun 22 13:58:42.579152 kernel:  ? get_nr_inodes+0x41/0x70
Jun 22 13:58:42.579165 kernel:  wb_workfn+0x351/0x400
Jun 22 13:58:42.579179 kernel:  process_one_work+0x17b/0x3b0
Jun 22 13:58:42.579194 kernel:  worker_thread+0x2b8/0x3e0
Jun 22 13:58:42.579207 kernel:  ? __pfx_worker_thread+0x10/0x10
Jun 22 13:58:42.579219 kernel:  kthread+0xfe/0x230
Jun 22 13:58:42.579232 kernel:  ? __pfx_kthread+0x10/0x10
Jun 22 13:58:42.579269 kernel:  ret_from_fork+0x47/0x70
Jun 22 13:58:42.579292 kernel:  ? __pfx_kthread+0x10/0x10
Jun 22 13:58:42.579305 kernel:  ret_from_fork_asm+0x1a/0x30
Jun 22 13:58:42.579319 kernel:  </TASK>
Jun 22 13:58:42.579334 kernel: INFO: task kmmpd-loop0:3699 blocked for more than 368 seconds.
Jun 22 13:58:42.579348 kernel:       Tainted: P           O       6.14.5-1-bpo12-pve #1
Jun 22 13:58:42.579363 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 22 13:58:42.579376 kernel: task:kmmpd-loop0     state:D stack:0     pid:3699  tgid:3699  ppid:2      task_flags:0x208040 flags:0x00004000
Jun 22 13:58:42.579390 kernel: Call Trace:
Jun 22 13:58:42.579403 kernel:  <TASK>
Jun 22 13:58:42.579415 kernel:  __schedule+0x495/0x13f0
Jun 22 13:58:42.579426 kernel:  ? __pfx_wbt_inflight_cb+0x10/0x10
Jun 22 13:58:42.579437 kernel:  ? __pfx_wbt_inflight_cb+0x10/0x10
Jun 22 13:58:42.579450 kernel:  schedule+0x29/0x130
Jun 22 13:58:42.579462 kernel:  io_schedule+0x4c/0x80
Jun 22 13:58:42.579473 kernel:  rq_qos_wait+0xbb/0x160
Jun 22 13:58:42.579486 kernel:  ? __pfx_wbt_cleanup_cb+0x10/0x10
Jun 22 13:58:42.579499 kernel:  ? __pfx_rq_qos_wake_function+0x10/0x10
Jun 22 13:58:42.579525 kernel:  ? __pfx_wbt_inflight_cb+0x10/0x10
Jun 22 13:58:42.579540 kernel:  wbt_wait+0xb5/0x130
Jun 22 13:58:42.579553 kernel:  __rq_qos_throttle+0x28/0x40
Jun 22 13:58:42.579566 kernel:  blk_mq_submit_bio+0x4d9/0x820
Jun 22 13:58:42.579582 kernel:  __submit_bio+0x75/0x290
Jun 22 13:58:42.579595 kernel:  ? srso_return_thunk+0x5/0x5f
Jun 22 13:58:42.579610 kernel:  submit_bio_noacct_nocheck+0x2ea/0x3b0
Jun 22 13:58:42.579624 kernel:  submit_bio_noacct+0x1a0/0x5b0
Jun 22 13:58:42.579637 kernel:  submit_bio+0xb1/0x110
Jun 22 13:58:42.579650 kernel:  submit_bh_wbc+0x164/0x1a0
Jun 22 13:58:42.579664 kernel:  submit_bh+0x12/0x20
Jun 22 13:58:42.579677 kernel:  write_mmp_block_thawed.isra.0+0x5e/0xa0
Jun 22 13:58:42.579690 kernel:  write_mmp_block+0x4a/0xd0
Jun 22 13:58:42.579704 kernel:  kmmpd+0x1ab/0x420
Jun 22 13:58:42.579717 kernel:  ? __pfx_kmmpd+0x10/0x10
Jun 22 13:58:42.579728 kernel:  kthread+0xfe/0x230
Jun 22 13:58:42.579752 kernel:  ? __pfx_kthread+0x10/0x10
Jun 22 13:58:42.579766 kernel:  ret_from_fork+0x47/0x70
Jun 22 13:58:42.579780 kernel:  ? __pfx_kthread+0x10/0x10
Jun 22 13:58:42.579792 kernel:  ret_from_fork_asm+0x1a/0x30
Jun 22 13:58:42.579808 kernel:  </TASK>
Jun 22 13:58:42.579818 kernel: Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
 
Thank you for the update! I have one of the newer Intel Arrow Lake CPUs and was struggling to get the iGPU detected. Coming from a stock install of Proxmox VE 8.4, I can now see the device /dev/dri/renderD128 after the update.

Device: ASUS NUC 15 Pro
CPU: Intel Core Ultra 5 225H
RAM: 64 GB DDR5-5600
Storage: 1 TB Samsung 990 Pro (M.2 2280, PCIe Gen 4)
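For anyone else verifying iGPU detection after a kernel change, here is a minimal sketch. It assumes the usual /dev/dri location for DRM render nodes; the helper name is mine:

```shell
# has_render_node DIR: succeeds if DIR contains a renderD* device node.
# /dev/dri is the usual location once the iGPU driver has bound.
has_render_node() {
    ls "$1"/renderD* >/dev/null 2>&1
}

has_render_node /dev/dri && echo "render node present" || echo "no render node found"
```

A missing render node after boot usually means the driver did not bind; `dmesg | grep -i drm` is the next place to look.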
 
Upgraded both of my prod EPYC nodes to 6.14 and power consumption dropped on both... no known regressions so far.

Example on my second node. 128 x AMD EPYC 7713 64-Core Processor (1 Socket)

Before: 203 - 315 (screenshot attached)
After: 128 low (-36%) and 298 max (-5%) (screenshot attached)
 

I upgraded my EPYC node to 8.4 and the opt-in kernel 6.14. This caused my TrueNAS VM to fail to start, with nothing in the logs.

"qm start 108" would simply hang forever. Shutdown of the host would never time out, and I had to power it off. The VM would start if I removed the PCIe SATA controller passed through to it. The motherboard is an ASRock Rack SIENAD8-2L2T; the PCIE7 slot is set to SATA mode and holds the 5 disks that the TrueNAS VM normally uses.

I reverted to kernel 6.11 and the VM booted fine again.

Some more info here: https://forum.proxmox.com/threads/t...e-to-8-4-need-urgent-help.165189/#post-764699
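With a passthrough hang like this, one thing worth checking is whether the SATA controller shares its IOMMU group with other devices, since all members of a group are handed to the VM together. A minimal sketch (the sysfs path is the standard one; the helper name is mine):

```shell
# list_iommu_groups [BASE]: print "group N: device" for every device,
# reading /sys/kernel/iommu_groups by default.
list_iommu_groups() {
    base="${1:-/sys/kernel/iommu_groups}"
    for d in "$base"/*/devices/*; do
        [ -e "$d" ] || continue            # skip if glob did not match
        g=${d#"$base"/}                    # strip base prefix
        printf "group %s: %s\n" "${g%%/devices/*}" "${d##*/}"
    done
}

list_iommu_groups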

Quoting this because it's the same hardware; I also have a SIENAD8-2L2T (EPYC 8004 series) board, but I am running the latest BIOS/firmware.

RealPjotr, can you confirm which BIOS version you're using? I suspect it is the original version (1.13).

As for me, I'm running 2.03, which includes AMD GenoaPI version 1.0.0.D:

The 6.8 kernel boots fine, and so does 6.11, but 6.14 (proxmox-kernel-6.14.5-1-bpo12-pve-signed) results in the same error posted back in April here: https://forum.proxmox.com/threads/o...est-no-subscription.164497/page-2#post-761316

Code:
Jul 01 10:25:00.680751 hakomini01 kernel: BERT: Error records from previous boot:
Jul 01 10:25:00.680758 hakomini01 kernel: [Hardware Error]: event severity: fatal
Jul 01 10:25:00.680766 hakomini01 kernel: [Hardware Error]:  Error 0, type: fatal
Jul 01 10:25:00.680774 hakomini01 kernel: [Hardware Error]:  fru_text: ProcessorError
Jul 01 10:25:00.680782 hakomini01 kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Jul 01 10:25:00.680792 hakomini01 kernel: [Hardware Error]:   Local APIC_ID: 0x32
Jul 01 10:25:00.680800 hakomini01 kernel: [Hardware Error]:   CPUID Info:
Jul 01 10:25:00.680808 hakomini01 kernel: [Hardware Error]:   00000000: 00aa0f02 00000000 32200800 00000000
Jul 01 10:25:00.680816 hakomini01 kernel: [Hardware Error]:   00000010: 76fa320b 00000000 178bfbff 00000000
Jul 01 10:25:00.680824 hakomini01 kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Jul 01 10:25:00.680831 hakomini01 kernel: [Hardware Error]:   Error Information Structure 0:
Jul 01 10:25:00.680839 hakomini01 kernel: [Hardware Error]:    Error Structure Type: cache error
Jul 01 10:25:00.680847 hakomini01 kernel: [Hardware Error]:    Check Information: 0x000000000602001f
Jul 01 10:25:00.680855 hakomini01 kernel: [Hardware Error]:     Transaction Type: 2, Generic
Jul 01 10:25:00.680863 hakomini01 kernel: [Hardware Error]:     Operation: 0, generic error
Jul 01 10:25:00.680871 hakomini01 kernel: [Hardware Error]:     Level: 0
Jul 01 10:25:00.680881 hakomini01 kernel: [Hardware Error]:     Processor Context Corrupt: true
Jul 01 10:25:00.680889 hakomini01 kernel: [Hardware Error]:     Uncorrected: true
Jul 01 10:25:00.680896 hakomini01 kernel: [Hardware Error]:   Context Information Structure 0:
Jul 01 10:25:00.680904 hakomini01 kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 01 10:25:00.680912 hakomini01 kernel: [Hardware Error]:    Register Array Size: 0x0050
Jul 01 10:25:00.680920 hakomini01 kernel: [Hardware Error]:    MSR Address: 0xc0002051
Jul 01 10:25:00.680928 hakomini01 kernel: [Hardware Error]:   Context Information Structure 1:
Jul 01 10:25:00.680936 hakomini01 kernel: [Hardware Error]:    Register Context Type: Unclassified Data
Jul 01 10:25:00.680944 hakomini01 kernel: [Hardware Error]:    Register Array Size: 0x0030
Jul 01 10:25:00.680952 hakomini01 kernel: [Hardware Error]:    Register Array:
Jul 01 10:25:00.680960 hakomini01 kernel: [Hardware Error]:    00000000: 00000010 00000000 20f80028 00000200
Jul 01 10:25:00.680968 hakomini01 kernel: [Hardware Error]:    00000010: 00000014 00000000 bb200334 00000000
Jul 01 10:25:00.680978 hakomini01 kernel: [Hardware Error]:    00000020: 00000015 00000000 bb300024 00000000
Jul 01 10:25:00.680985 hakomini01 kernel: BERT: Total records found: 1
Jul 01 10:25:00.680993 hakomini01 kernel: mce: [Hardware Error]: Machine check events logged
Jul 01 10:25:00.681015 hakomini01 kernel: PM:   Magic number: 1:938:433
Jul 01 10:25:00.681025 hakomini01 kernel: mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 5: aea0000001000108
Jul 01 10:25:00.681153 hakomini01 kernel: thermal cooling_device25: hash matches
Jul 01 10:25:00.681167 hakomini01 kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffffc04c91a1 MISC d0140ff600000000 PPIN 2b7259cbeedc031 SYND 4d0>
Jul 01 10:25:00.681176 hakomini01 kernel: mce: [Hardware Error]: PROCESSOR 2:aa0f02 TIME 1751379887 SOCKET 0 APIC 32 microcode aa00215



I also gave the latest 6.14 kernel (6.14.4-1) a try on my EPYC 9474F/Supermicro H13SSL setup, but it again ends up in a boot loop because of ae4dma issues.

@Stoiko Ivanov, is the "removal of deprecated PCI IDs" (https://github.com/torvalds/linux/commit/b87c29c007e80f4737a056b3c5c21b5b5106b7f7) finally in?

Is there any update on when we can expect this patch to be built into the 6.14 kernel?
 
It looks like 6.14.6 or 6.15 may be released into test soon.

Does anyone have an update on when that's happening?
 
FYI, the slower, older processors don't show the reported issues and, I believe, do run without problems on 6.14.5.

However, I'm not sure whether the older processors are affected by the 32-bit OS bug in 6.14.5 or not.
 
I really hope you consider abandoning 6.14 and moving straight to 6.15, even if it means extending the testing. For months I've had two Fedora Server VMs, v41 and v42, where 6.14 has been a nightmare because ballooning just didn't work: the machines would never get more than the starting amount and would never request extra memory. The smaller VM wasn't too bad, so I used it to test the later 6.14 releases, but it was 1GB/2GB and would only occasionally crash when it ran out of memory; the other VM just couldn't get started, so I kept it on the final release of kernel 6.13. Now I've upgraded both to 6.15 kernels and they run fine; even after stressing them with some large memory requests, ballooning works.

Anyway, I just thought I would document my experience with 6.14. I still have my test host running it without issues; it just feels suspect that I had zero issues with Fedora until kernel 6.14, and as soon as 6.15 came out everything was rosy again.
 
I spoke too soon: although the two Fedora VMs, both now on v42, did use some memory above their initial allocation, it wouldn't go far. For example, the VM with 4GB/10GB would go up to around 6GB but never above it, and eventually the server ground to a halt. I've had to disable ballooning for both now.

I'll continue this on the original thread I started for this.
 
@t.lamprecht
Can you maybe include this fix in the 6.14 kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?id=bc0b828ef6e5

I stumbled on this on one server with an I219-LM that I pass through to OPNsense (it's a Hetzner server).
It's only an issue on kernel 6.14, where this patch would be great; on 6.8 there are no reset issues with the I219-LM.

This is the only "bug" I've found so far, and it's not really a bug; it's actually an I219-LM issue, sadly...
Cheers
 
With the latest 6.14.8 kernel I have VFIO problems: VMs that use my X710 VF will not start. (Same with 6.8.12-12; 6.8.12-11 is OK.)

Example:
Code:
➜ ~ qm start 100
GUEST HOOK: 100 pre-start
100 is starting, doing preparations.
kvm: -device vfio-pci,host=0000:03:02.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0,rombar=0: vfio 0000:03:02.0: error getting device from group 19: Permission denied
Verify all devices in group 19 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1

A self-compiled 6.14.8 works fine, as does 6.15.6.
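The error suggests something in group 19 is not bound to vfio-pci. A quick sketch for checking the driver bound to each group member (the group number comes from the error message; the helper name is mine):

```shell
# group_drivers GROUP_DIR: print "device -> driver" for every device in
# an IOMMU group directory such as /sys/kernel/iommu_groups/19.
group_drivers() {
    for d in "$1"/devices/*; do
        [ -e "$d" ] || continue                       # skip if glob did not match
        drv=$(readlink -f "$d/driver" 2>/dev/null)    # resolve driver symlink
        printf "%s -> %s\n" "${d##*/}" "${drv##*/}"
    done
}

group_drivers /sys/kernel/iommu_groups/19
```

Any member showing a driver other than vfio-pci (or pci-stub) would explain the "Permission denied" from QEMU.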
 