**Setup:** Proxmox VE 9.1.9, `pve-manager/9.1.9/ee7bad0a3d1546c9`
**Kernel:** `7.0.0-3-pve #1 SMP PREEMPT_DYNAMIC PMX 7.0.0-3 (2026-04-21T22:56Z)`
**Hardware:** Mini-PC, Intel Tigerlake, 64 GB RAM, BIOS *AHWSA.1.22 03/12/2024*
**Repo:** pve-no-subscription (kernel 7.0 was pulled in automatically as a dependency of `proxmox-default-kernel`, no opt-in to test repo)
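For anyone who wants to check what their own setup is about to pull in, this is roughly how I'd verify it (plain apt, nothing assumed beyond the package name above):
```
# Which kernel series does the meta-package currently depend on?
apt-cache depends proxmox-default-kernel

# Installed vs. candidate version on this repo
apt-cache policy proxmox-default-kernel

# What is actually installed / running right now
pveversion -v | grep -Ei 'pve-manager|kernel'
uname -r
```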
## Symptom
After ~10 days of uptime, the host became unreachable for SSH sessions and the Web UI showed it with `?` (status `unknown` from cluster API). Specifically:
- `ping` and TCP connect to `:22` / `:8006` worked fine
- `ssh` completed authentication, hung at `Entering interactive session` — no shell could be forked
- `https://<host>:8006/` returned the static landing page (HTTP 200, 7 ms)
- `pvesh get /nodes` from another node showed `status: unknown` for the affected host (no metrics)
- Cluster (corosync) still saw the node as quorate and ring-connected
- All KVM guests on the host kept running normally (network, disk I/O, agent ping all OK)
So: corosync (kernel UDP) and existing KVM threads survived, but anything in the management plane that needed `fork()` or filesystem traversal was stuck.
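(For completeness, the checks behind the list above were roughly the following, run from another cluster node; nothing exotic, adjust hostnames as needed.)
```
ping -c3 boromir                          # ICMP: fine
nc -vz boromir 22 && nc -vz boromir 8006  # TCP connects: fine
ssh -vvv root@boromir                     # auth completes, hangs after "Entering interactive session"
curl -sk -o /dev/null -w '%{http_code} %{time_total}\n' https://boromir:8006/
pvesh get /nodes                          # affected node: status "unknown"
pvecm status                              # corosync still quorate, ring OK
```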
## What the kernel logged
Two consecutive Oops in `pvestatd`, ending in `Fixing recursive fault but reboot is needed!`:
```
May 02 21:45:02 boromir kernel: BUG: kernel NULL pointer dereference, address: 0000000000000001
May 02 21:45:02 boromir kernel: #PF: supervisor write access in kernel mode
May 02 21:45:02 boromir kernel: #PF: error_code(0x0002) - not-present page
May 02 21:45:02 boromir kernel: PGD 0 P4D 0
May 02 21:45:02 boromir kernel: Oops: Oops: 0002 [#1] SMP NOPTI
May 02 21:45:02 boromir kernel: CPU: 4 UID: 0 PID: 1318 Comm: pvestatd Tainted: P O 7.0.0-3-pve #1 PREEMPT(lazy)
May 02 21:45:02 boromir kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
May 02 21:45:02 boromir kernel: RIP: 0010:memset+0xb/0x20
May 02 21:45:02 boromir kernel: Call Trace:
May 02 21:45:02 boromir kernel: ? __kmalloc_cache_noprof+0x12f/0x470
May 02 21:45:02 boromir kernel: ? obj_cgroup_charge_account+0xd8/0x160
May 02 21:45:02 boromir kernel: ? __pfx_uptime_proc_show+0x10/0x10
May 02 21:45:02 boromir kernel: single_open+0x2f/0xd0
May 02 21:45:02 boromir kernel: ? __pfx_proc_reg_open+0x10/0x10
May 02 21:45:02 boromir kernel: proc_single_open+0x20/0x30
May 02 21:45:02 boromir kernel: proc_reg_open+0x52/0x190
May 02 21:45:02 boromir kernel: do_dentry_open+0x13c/0x4b0
May 02 21:45:02 boromir kernel: vfs_open+0x2a/0xf0
May 02 21:45:02 boromir kernel: path_openat+0x892/0x13f0
May 02 21:45:02 boromir kernel: do_filp_open+0xe0/0x1b0
May 02 21:45:02 boromir kernel: do_sys_openat2+0x7f/0xf0
May 02 21:45:02 boromir kernel: __x64_sys_openat+0x52/0xa0
May 02 21:45:02 boromir kernel: do_syscall_64+0x11c/0x14e0
[...]
```
`RDI = 0x1` at the `memset` call, matching the faulting address: either `__kmalloc_cache_noprof` returned a junk pointer (corrupted freelist entry?) or it was handed garbage further up in `single_open`, and the kernel walked straight into `memset(0x1, 0, 0x20)` while zeroing the allocation.
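(If anyone wants to compare register dumps against their own trace, this is roughly how I pulled them out of the journal of the crashed boot; adjust the `-b` offset to point at the right boot.)
```
# Kernel ring buffer from the previous (crashed) boot
journalctl -k -b -1 --no-pager > oops.txt

# Quick view of both Oopses, RIP and the register lines
grep -E 'Oops:|RIP:|RDI:|Call Trace' oops.txt
```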
Immediately after, on `do_exit()` cleanup of the same PID, a second Oops:
```
Oops: Oops: 0002 [#2] SMP NOPTI
CPU: 4 PID: 1318 Comm: pvestatd Tainted: P D O 7.0.0-3-pve
RIP: 0010:memset+0xb/0x20
Call Trace:
? __kmalloc_noprof+0x1b7/0x560
inotify_handle_inode_event+0x97/0x270
inotify_ignored_and_remove_idr+0x26/0x60
inotify_freeing_mark+0xe/0x20
fsnotify_free_mark+0x4e/0x80
fsnotify_clear_marks_by_group+0x177/0x200
fsnotify_destroy_group+0x46/0x120
inotify_release+0x18/0x80
__fput+0xed/0x2d0
____fput+0x15/0x20
task_work_run+0x60/0xa0
do_exit+0x2d0/0xad0
make_task_dead+0x93/0xa0
rewind_stack_and_make_dead+0x16/0x20
[...]
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: pvestatd/1318/0x00000000
[...]
BUG: scheduling while atomic: pve-ha-lrm/1421/0x00000000
```
After this, `pvestatd` and `pve-ha-lrm` were dead, sshd could no longer fork shells, the pveproxy worker pool degraded over the following hours, and eventually the management plane was fully stuck. KVM guests kept running for the entire ~10 hours until I hard power-cycled the host.
## Early-warning signal in the days before
Three times in the ~36 hours around the crash, `pveproxy` logged:
```
got inotify poll request in wrong process - disabling inotify
```
(May 01 20:41, May 02 16:42, May 02 22:36; the first two before the kernel BUG, the last roughly 50 minutes after it.) It's the same inotify subsystem that ultimately blew up. May be a coincidence, but worth flagging.
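If anyone wants to check for the same warning on their nodes, the grep I now run on the other cluster members is simply:
```
journalctl -u pveproxy --since "7 days ago" | grep -i 'disabling inotify'
```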
## Cluster impact
Six-node Proxmox VE cluster, all six nodes on `7.0.0-3-pve` (pulled in automatically by `proxmox-default-kernel`). Only the node described above has hit this so far, after ~10 days of uptime on that kernel; the other five had been running the same kernel for only 1–2 days at the time of the crash.
After the crash I rolled all six nodes back via:
```
proxmox-boot-tool kernel pin 6.17.13-6-pve
proxmox-boot-tool refresh
systemctl reboot
```
All stable on 6.17.13-6-pve since.
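To confirm the pin actually took effect after the reboot:
```
proxmox-boot-tool kernel list   # should show 6.17.13-6-pve as the pinned kernel
uname -r                        # running kernel after reboot
```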
## Questions
1. Has anyone else seen this signature on 7.0.0-3-pve? Specifically `memset+0xb` in `single_open` via `__kmalloc_cache_noprof`, plus the secondary Oops in `inotify_freeing_mark`?
2. Is there a known fix in the pvetest repo (7.0.0-4 or later)? `apt list --upgradable` on no-subscription shows nothing newer at the moment.
3. Given that `proxmox-default-kernel` quietly pulled in 7.0 (while the docs still describe it as opt-in / RC), is there guidance on how to keep production no-subscription clusters on 6.17 until 7.0 stabilises? (My interim workaround is sketched below.)
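The interim workaround mentioned in question 3 is nothing more clever than an apt hold plus the bootloader pin; the package name is simply the one from above, so adjust to whatever `apt list --installed | grep kernel` shows on your nodes:
```
# Stop apt from pulling the 7.0 series back in until it settles
apt-mark hold proxmox-default-kernel
apt-mark showhold                      # verify the hold

# Keep the bootloader on the known-good kernel
proxmox-boot-tool kernel pin 6.17.13-6-pve
```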
The full 391-line `journalctl -k` excerpt around the crash (registers, the full `Modules linked in` list, the second Oops in full) is attached; happy to pull out anything else if useful.
Thanks for any pointers.