Hi all,
I recently upgraded my cluster from PVE 8 to PVE 9 and I'm experiencing VM freezes that I've now partially diagnosed via the host journals. Posting for confirmation and to find a proper fix.
Environment
- pve-manager: 9.1.6/71482d1833ded40a
- Kernel: Linux 6.17.13-2-pve (2026-03-13T08:06Z)
- Enterprise repository, multi-node cluster
| Node | CPU | Affected |
|---|---|---|
| proxmox-0 (Node 1) | 96 × AMD EPYC 7401 24-Core (2 Sockets) | Yes |
| proxmox-2 (Node 3) | 96 × AMD EPYC 7401 24-Core (2 Sockets) | Yes |
| Node 2 | 96 × AMD EPYC 7402 24-Core (2 Sockets) | No |
Note on Node 2
Node 2 is currently running our core production application on a Debian VM, which has been completely stable since the PVE 9 upgrade. However, since this is a critical production workload, I'm not willing to migrate the problematic VMs to Node 2 to test whether the EPYC 7402 is genuinely the differentiating factor — the risk of introducing instability there is too high. The node comparison is therefore observational only.
Symptoms
After an unpredictable period (minutes to hours), VMs become completely unresponsive:
- Proxmox console fails to connect or freezes on a single frame
- VM web interfaces (e.g. pfSense) become unreachable
- CPU usage for the affected VM spikes and stays elevated in the Proxmox UI
- PSI (pressure stall information) spikes simultaneously
- VM reset is unreliable (~50% success for pfSense); only a full node reboot reliably recovers
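When a guest locks up, I grab a quick snapshot of the QEMU process and host pressure before rebooting, roughly like this (VMID 100 is a placeholder; the pidfile path is the standard qemu-server location):

```shell
# Placeholder VMID: substitute the frozen guest's ID.
VMID=100
PIDFILE=/var/run/qemu-server/${VMID}.pid
if [ -r "$PIDFILE" ]; then
    # Per-thread state of the QEMU process; threads stuck in 'D'
    # (uninterruptible sleep) point at the host rather than the guest.
    ps -L -o tid,stat,psr,comm -p "$(cat "$PIDFILE")"
fi
# Host-wide CPU pressure (PSI), if the kernel exposes it.
if [ -r /proc/pressure/cpu ]; then
    cat /proc/pressure/cpu
fi
```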
Root cause finding — host journal
Both affected nodes show an identical sequence in journalctl -k immediately before each freeze:
```
kernel: clocksource: timekeeping watchdog on CPU16: Marking clocksource 'tsc' as unstable because the skew is too large:
kernel: clocksource: 'tsc' skewed 580600 ns (0 ms) over watchdog 'hpet' interval of 496033364 ns
kernel: tsc: Marking TSC unstable due to clocksource watchdog
kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
kernel: sched_clock: Marking unstable [...]
kernel: clocksource: Switched to clocksource hpet
```
Immediately after the HPET fallback, a cascading perf interrupt storm follows, with the sample rate throttled progressively from ~40 000 down to 1 000 over the next ~15 minutes and hrtimer interrupts reported taking >1 ms:
```
kernel: perf: interrupt took too long (5557 > 4911), lowering kernel.perf_event_max_sample_rate to 35000
kernel: perf: interrupt took too long (7617 > 6946), lowering kernel.perf_event_max_sample_rate to 26000
...
kernel: hrtimer: interrupt took 1209720 ns
...
kernel: perf: interrupt took too long (115918 > 115180), lowering kernel.perf_event_max_sample_rate to 1000
```
This same pattern appears independently on both Node 1 and Node 3. Node 2 (EPYC 7402) does not show this behaviour at all.
My interpretation: The TSC becomes skewed (possibly due to cross-socket drift or a NUMA sync issue on the EPYC 7401), the kernel falls back to HPET, and HPET's much higher interrupt latency then causes the perf subsystem to spiral into an interrupt storm that effectively starves QEMU processes and locks up the VMs.
VM configs (relevant parts)
pfSense:
```
cpu: host
machine: pc-i440fx-8.1
sockets: 1
cores: 4
net0..net12: virtio, multiple VLANs with MTU 9000
scsi0: ceph-vm (iothread=1)
```
Alpine (bastion-0):

```
cpu: x86-64-v2-AES
sockets: 2
cores: 1
net0: virtio, mtu=1
scsi0: ceph-vm (iothread=1)
```
Questions
- Is there a known regression in kernel 6.17 or PVE 9 regarding TSC stability on EPYC Naples (Zen 1 / 7401)?
- Is adding tsc=unstable to the host kernel boot parameters the right mitigation, or is there a better approach (e.g. clocksource=hpet from the start, or disabling TSC entirely)?
- Could the EPYC 7401's dual-socket TSC sync be the underlying cause? Both affected nodes are 2-socket systems and the drift appears cross-socket.
- Any known differences in how the 7401 and 7402 handle TSC synchronization that would explain why Node 2 is unaffected?
- Are there BIOS/firmware settings (TSC sync policy, C-states) known to help stabilise TSC on Naples-generation EPYC under newer PVE kernels?
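For context on the second question, the change I'm considering is the standard kernel-cmdline edit on a GRUB-booted PVE host (a sketch only, not yet applied; systemd-boot installs use /etc/kernel/cmdline instead):

```shell
# /etc/default/grub on the affected nodes; append the parameter under
# discussion to the existing line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet tsc=unstable"
# (or clocksource=hpet to bypass the TSC from the start)
# Then regenerate the boot config with update-grub, or
# 'proxmox-boot-tool refresh' on ZFS/UEFI installs.
```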