> The latest kernel update is very unstable on my machine(s) (see attached journalctl.log). The older kernels 6.14.11-4 and 6.14.8-2 don't have this issue.

Please open a new thread for this and post more details about your HW, e.g. the server vendor and model.
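Something like this should print those details (a minimal sketch, assuming dmidecode is installed; run as root):

# Query DMI/SMBIOS for the server vendor and model
dmidecode -s system-manufacturer
dmidecode -s system-product-name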
> see the known issues for PVE 9.1:

Both servers are identical Dell R630s (see fastfetch below), running with intel_iommu=off as a kernel parameter.

root@triton:~# fastfetch
OS: Proxmox VE 9.1.1 x86_64
Host: PowerEdge R630
Kernel: Linux 6.14.11-4-pve
Uptime: 22 mins
Packages: 890 (dpkg)
Shell: bash 5.2.37
Display (VGA-1): 1024x768 @ 60 Hz
Terminal: /dev/pts/0
CPU: Intel(R) Xeon(R) E5-2643 v3 (12) @ 3.70 GHz
GPU: Matrox Electronics Systems Ltd. G200eR2
Memory: 3.86 GiB / 31.25 GiB (12%)
Swap: 0 B / 8.00 GiB (0%)
Disk (/): 96.65 GiB / 641.23 GiB (15%) - zfs
Disk (/rpool): 128.00 KiB / 544.58 GiB (0%) - zfs
Local IP (vmbr0): 192.168.180.100/24
Locale: en_US.UTF-8
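For reference, a parameter like that can be set roughly like this on a ZFS-booted node (a sketch, assuming the systemd-boot/proxmox-boot-tool setup that ZFS-root installs use and that /etc/kernel/cmdline is a single line; GRUB-booted systems edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub instead):

# Append intel_iommu=off to the kernel command line and sync the boot entries
sed -i 's/$/ intel_iommu=off/' /etc/kernel/cmdline
proxmox-boot-tool refresh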
> I'm also running Podman in LXC and ran into this issue, so I can't use 6.17. It's good that there's a fix, but you didn't report this anywhere but here yet, @jaminmc?

Yes!!! I am not crazy, and I'm not the only one this has happened to! I have an LXC container in which I compile the kernel with my patch. Here is how to do it: create a Debian 13 container, then paste these steps into it:
# 1. Update base system
apt update && apt upgrade -y
# 2. Add Proxmox repo and key
wget -q https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg \
-O /etc/apt/trusted.gpg.d/proxmox-release-trixie.gpg
cat > /etc/apt/sources.list.d/pve-src.sources <<EOF
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /etc/apt/trusted.gpg.d/proxmox-release-trixie.gpg
EOF
# 3. Append Debian deb-src if missing
if ! grep -q "deb-src" /etc/apt/sources.list.d/debian.sources 2>/dev/null; then
cat >> /etc/apt/sources.list.d/debian.sources <<EOF
Types: deb-src
URIs: http://deb.debian.org/debian
Suites: trixie trixie-updates
Components: main contrib non-free non-free-firmware
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
Types: deb-src
URIs: http://security.debian.org/debian-security
Suites: trixie-security
Components: main contrib non-free non-free-firmware
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
EOF
fi
# 4. Update
apt update
# 5. Install build tools
apt install -y build-essential git git-email debhelper devscripts fakeroot \
libncurses-dev bison flex libssl-dev libelf-dev bc cpio kmod pahole dwarves \
rsync python3 python3-pip pve-doc-generator python-is-python3 dh-python \
sphinx-common quilt libtraceevent-dev libunwind-dev libzstd-dev pkg-config equivs
# 6. Clone and prepare repo
git clone https://git.proxmox.com/git/pve-kernel.git
cd pve-kernel
git checkout master # Latest kernel + patches
# 7. Prep and deps
make distclean
make build-dir-fresh
cd proxmox-kernel-*/ ; mk-build-deps -i -r -t "apt-get -o Debug::pkgProblemResolver=yes --no-install-recommends -y" debian/control ; cd ..
# 8. Add patch
cat > patches/kernel/0014-apparmor-fix-NULL-pointer-dereference-in-aa_file.patch <<'EOF'
diff --git a/security/apparmor/file.c b/security/apparmor/file.c
--- a/security/apparmor/file.c
+++ b/security/apparmor/file.c
@@ -777,6 +777,9 @@ static bool __unix_needs_revalidation(struct file *file, struct aa_label *label
 		return false;
 	if (request & NET_PEER_MASK)
 		return false;
+	/* sock and sock->sk can be NULL for sockets being set up or torn down */
+	if (!sock || !sock->sk)
+		return false;
 
 	if (sock->sk->sk_family == PF_UNIX) {
 		struct aa_sk_ctx *ctx = aa_sock(sock->sk);
EOF
make build-dir-fresh
# 9. Build
make
echo "=== BUILD COMPLETE ==="
ls -lh *.deb
# For updates, pull the latest changes:
git reset --hard HEAD
git clean -df
git pull
git submodule update --init --recursive
# Then redo steps 8 & 9
After it is all built, since the container lives on ZFS, I just run this on my Proxmox host to install the kernel I compiled:
apt --reinstall install /rpool/data/subvol-114-disk-0/root/pve-kernel/proxmox-{kernel,headers}-6.17.2-1-pve_6.17.2-1*.deb
Replace 114 with your container number. That will reinstall the kernel with the patched one. Alternatively, scp the created .deb files to your Proxmox servers and install them from there.
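For the scp route, something along these lines should work (a sketch; pve2 is a placeholder hostname):

# Copy the built packages to another node and install them there
scp proxmox-{kernel,headers}-6.17.2-1-pve_6.17.2-1*.deb root@pve2:/root/
ssh root@pve2 "apt --reinstall install /root/proxmox-{kernel,headers}-6.17.2-1-pve_6.17.2-1*.deb"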
Check https://git.proxmox.com/?p=pve-kernel.git;a=summary for kernel updates
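To see whether a new tag has landed without cloning, something like this works (tag order is by name, not date, so skim the output):

# List recent proxmox-kernel tags straight from the remote
git ls-remote --tags https://git.proxmox.com/git/pve-kernel.git | tail -n 5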
> Thanks, but compiling is not the issue. I'd rather have this issue fixed upstream so I don't have to compile every new version after 6.14 from now on, but for that it'd need to be reported. Or will you just upstream your patch?

Prior to the release of 6.17, I raised this issue with the developers. Fiona even suggested that I submit the patches to https://bugzilla.proxmox.com/. However, in the subsequent post it was noted that I had used Cursor for debugging, and that my second patch in that post was fixable with an AppArmor change instead of at the kernel level. But the main patch that I have been using cannot be fixed at the user level, and a NULL pointer should not be able to take down a whole system. Consequently, the issue has been ignored. I assume that @fiona is the same individual as https://git.proxmox.com/?p=pve-kernel.git;a=search;s=Fiona+Ebner;st=author in the git repository. I have recently submitted a bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=7083, so hopefully it will get the attention needed.
> I have recently submitted a bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=7083, so hopefully it will get the attention needed.

Yes, thank you for this! Having clear reproducer steps and the issue filed stand-alone goes a long way. Like this, it's clear that it's not related to the other issue, and other developers who don't stumble upon it by chance in the forum will be aware of it too. Personally, I had too many other things to look at in the context of the Proxmox VE 9.1 release. AFAIK, no other user has reported the same issue as of yet.
That was FAST! And the patch that Fabian Grünbichler came up with was only a one-line change, so I guess it is more efficient.
> Any chance we could get the extra PCI IDs for Alder Lake EDAC from https://lore.kernel.org/all/20250819161739.3241152-1-kyle@kylemanna.com/ backported in the next kernel release? I opened an issue on Ubuntu for it a while ago, but it looks like they're not going to do it: https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/2127919

If you want to try it out yourself, follow this post https://forum.proxmox.com/threads/o...le-on-test-no-subscription.173920/post-819545, but for step 8 do:
# 8. Add patch
cat > patches/kernel/0014-alder-lake-more-EDAC-pci-ids.patch <<'EOF'
diff --git a/drivers/edac/ie31200_edac.c b/drivers/edac/ie31200_edac.c
index 5c1fa1c0d12e..5a080ab65476 100644
--- a/drivers/edac/ie31200_edac.c
+++ b/drivers/edac/ie31200_edac.c
@@ -99,6 +99,8 @@
 
 /* Alder Lake-S */
 #define PCI_DEVICE_ID_INTEL_IE31200_ADL_S_1	0x4660
+#define PCI_DEVICE_ID_INTEL_IE31200_ADL_S_2	0x4668 /* 8P+4E, e.g. i7-12700K */
+#define PCI_DEVICE_ID_INTEL_IE31200_ADL_S_3	0x4648 /* 6P+4E, e.g. i5-12600K */
 
 /* Bartlett Lake-S */
 #define PCI_DEVICE_ID_INTEL_IE31200_BTL_S_1	0x4639
@@ -761,6 +763,8 @@ static const struct pci_device_id ie31200_pci_tbl[] = {
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_RPL_S_6), (kernel_ulong_t)&rpl_s_cfg},
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_RPL_HX_1), (kernel_ulong_t)&rpl_s_cfg},
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_ADL_S_1), (kernel_ulong_t)&rpl_s_cfg},
+	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_ADL_S_2), (kernel_ulong_t)&rpl_s_cfg},
+	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_ADL_S_3), (kernel_ulong_t)&rpl_s_cfg},
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_BTL_S_1), (kernel_ulong_t)&rpl_s_cfg},
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_BTL_S_2), (kernel_ulong_t)&rpl_s_cfg},
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IE31200_BTL_S_3), (kernel_ulong_t)&rpl_s_cfg},
--
2.50.1
EOF
make build-dir-fresh
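After rebooting into the patched kernel, you can sanity-check that the driver bound (a sketch, assuming the mainline module name ie31200_edac):

# Confirm EDAC picked up the memory controller on the new kernel
dmesg | grep -i edac
ls /sys/devices/system/edac/mc/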
> Brand new HP ProLiant Gen11, Intel(R) Xeon(R) Silver 4514Y, megaraid_sas LSI MegaRAID 12GSAS/PCIe Secure SAS39xx. With the 6.17 kernel, an immediate kernel panic from nothing more than a simple proxmox-boot-tool refresh. Serious matter.

What does the panic log say, and where does it occur? (If possible as text, otherwise as a screenshot.)
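If the journal is persistent, you can pull the kernel messages from the crashed boot like this (a sketch; -b -1 selects the previous boot):

# Show the tail of the previous boot's kernel log
journalctl -k -b -1 --no-pager | tail -n 100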
Seeing I/O errors after upgrading from 6.14 to 6.17 on one Dell R740XD out of three identical units.
All three have Dell BOSS-S1 cards with two NVMe drives in a RAID1 configuration. The two units without issues have Micron MTFDDAV480T drives; the one with errors has Intel SSDSCKKB480G8 drives installed.
The problematic machine works just fine on 6.14, and the drive diagnostics all pass.
Anybody else seeing this, or just me?
[  143.971830] ata15.00: NCQ disabled due to excessive errors
[  143.971841] ata15.00: exception Emask 0x0 SAct 0x400 SErr 0x0 action 0x6 frozen
[  143.971854] ata15.00: failed command: READ FPDMA QUEUED
[  143.971860] ata15.00: cmd 60/00:50:20:d0:92/20:00:01:00:00/40 tag 10 ncq dma 4194304 in
                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  143.971877] ata15.00: status: { DRDY }
[  143.971888] ata15: hard resetting link
[  145.007739] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  145.007925] ata15.00: Security Log not supported
[  145.008074] ata15.00: Security Log not supported
[  145.008083] ata15.00: configured for UDMA/133
[  145.008113] sd 14:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
[  145.008124] sd 14:0:0:0: [sda] tag#10 Sense Key : Aborted Command [current]
               sd 14:0:0:0: [sda] tag#10 Add. Sense: No additional sense information
               sd 14:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 01 92 d0 20 00 20 00 00
[  145.008144] I/O error, dev sda, sector 26398752 op 0x0:(READ) flags 0x84700 phys_seg 65 prio class 2
[  145.008230] ata15: EH complete
[  185.954538] ata15.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  185.954555] ata15.00: failed command: READ DMA EXT
[  185.954561] ata15.00: cmd 25/00:00:f8:21:38/00:20:01:00:00/e0 tag 3 dma 4194304 in
                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  185.954579] ata15.00: status: { DRDY }
[  185.954592] ata15: hard resetting link
               ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  186.985727] ata15.00: Security Log not supported
[  186.985876] ata15.00: Security Log not supported
[  186.985885] ata15.00: configured for UDMA/133
[  186.985919] sd 14:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
[  186.985930] sd 14:0:0:0: [sda] tag#3 Sense Key : Aborted Command [current]
[  186.985936] sd 14:0:0:0: [sda] tag#3 Add. Sense: No additional sense information
[  186.985944] sd 14:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 01 38 21 f8 00 20 00 00
[  186.985949] I/O error, dev sda, sector 20455928 op 0x0:(READ) flags 0x84700 phys_seg 99 prio class 2
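Since the log shows NCQ timeouts, one diagnostic I might try is forcing NCQ off for that port at boot (a sketch, not a fix: the port number 15 is taken from this log and can change between boots, and this assumes a proxmox-boot-tool setup; on GRUB, edit /etc/default/grub instead):

# Diagnostic only: disable NCQ on the suspect ATA port via the kernel command line
sed -i 's/$/ libata.force=15.00:noncq/' /etc/kernel/cmdline
proxmox-boot-tool refresh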
> I can confirm this: BOSS-S1 controllers with Intel SSDs do not work with the 6.17 kernel. In contrast, BOSS-S1/2 controllers with Micron SSDs work without any issues. Same behavior as described.

On 6.17.2-1-pve my system appears to boot normally, but I see disk errors in dmesg and the system never comes online for network traffic (at least not for Proxmox; ping works fine, as does SSH, although login generally fails because of random disk errors). The console randomly allows login, though most of the time it refuses. Rebooting back into 6.14.11-4-pve, the system boots normally, shows no disk errors, and rejoins the Ceph cluster as if no hardware issue existed. Of note, I am also using Dell BOSS cards in my systems, as reported earlier in this thread. I suspect something in this kernel is not playing nice with that storage controller.
Edit:
I updated one of my other nodes to test. It is a Dell C6420 with the same BOSS-S1 card as the other node, and it booted on the newer kernel just fine, with no errors or issues. The server I had issues with is an R640. The R640 has Intel SSDs in the BOSS card, while the C6420 has SK hynix drives. I'm not sure where the issue lies at this point, as I noticed the dmesg errors I was seeing were for other SSDs connected to the HBA330. The C6420 uses an S140 controller.
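For anyone comparing drive models behind these controllers, this is a quick way to list them (a sketch):

# List each disk with its model string to spot Intel vs Micron/SK hynix units
lsblk -d -o NAME,MODEL,SIZE,TRAN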