Hi,
A PVE node keeps rebooting unexpectedly.
PVE Facts:
- Kernel: Linux 6.5.13-5-pve
- Storage: 2x SSD (ZFS RAID 1)
- CPU: AMD Ryzen 9 7950X3D (16 cores)
- RAM: 128GB
VMs on the PVE node:
- 6x VM with 4 CPUs each (Processor type: host)
- 1x VM with 8 CPUs doing nested virtualization (Processor type: host)
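For reference, the vCPU allocation implied by the list above can be tallied with a quick sketch. The 16 cores come from the facts above; the 32 hardware threads are an assumption (SMT enabled), since the SMT setting isn't stated here.

```python
# Tally of the vCPUs assigned to the VMs listed above.
vms = [(6, 4), (1, 8)]   # (number of VMs, vCPUs per VM)
host_cores = 16          # Ryzen 9 7950X3D physical cores
host_threads = 32        # ASSUMPTION: SMT enabled (2 threads per core)

total_vcpus = sum(count * vcpus for count, vcpus in vms)
vcpus_per_core = total_vcpus / host_cores
vcpus_per_thread = total_vcpus / host_threads

print(f"total vCPUs:      {total_vcpus}")
print(f"vCPUs per core:   {vcpus_per_core:.2f}")
print(f"vCPUs per thread: {vcpus_per_thread:.2f}")
```

So the VMs together are assigned 32 vCPUs, i.e. two per physical core, or one per hardware thread if SMT is on.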
After somewhere between 5 minutes and 24 hours of uptime, the PVE node reboots unexpectedly. Neither journalctl nor /var/log/ contains any entries about the crash; the logs only show the PVE node booting again, including the filesystem checks.
Steps taken so far to solve the problem:
- Enabled and tested kernel crash dumps: no crash dumps written
- Disabled the automatic reboot for the case that a kernel crash dump can't be written: the node still reboots automatically
- Provider ran a stress test on all components: no problems detected
- Added a custom CPU Type (x86-64-v4 with svm flag) (see https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/#post-642039): still crashes / reboots
- Reduced the total number of vCPUs to 8 (see https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-643308): PVE node doesn't crash anymore
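One more thing that could be tried, since nothing reaches the logs before the reset: a small heartbeat logger that appends a timestamp, the load average and the highest hwmon temperature to a file and fsyncs after every line. After the next reboot, the last line shows roughly when the host died and what it looked like shortly before. This is only a sketch under assumptions: the function names, the log path and the 5 second interval are made up for illustration, and the temperature readout assumes the sensors are exposed under /sys/class/hwmon.

```python
import glob
import os
import time


def read_temps_millic():
    """Return raw hwmon temperature readings in millidegrees C (may be empty)."""
    temps = []
    for path in sorted(glob.glob("/sys/class/hwmon/hwmon*/temp*_input")):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()))
        except (OSError, ValueError):
            continue  # sensor missing or unreadable; skip it
    return temps


def heartbeat_line(now=None):
    """Build one log line: local timestamp, 1-minute load, hottest sensor."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    load1, _load5, _load15 = os.getloadavg()
    temps = read_temps_millic()
    max_temp = f"{max(temps) / 1000:.1f}" if temps else "n/a"
    return f"{stamp} load1={load1:.2f} max_temp_c={max_temp}"


def write_heartbeat(path):
    """Append one heartbeat line and force it to disk before returning."""
    line = heartbeat_line()
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a sudden reset
    return line


def run(path, interval_s=5, iterations=None):
    """Write heartbeats forever, or `iterations` times if given."""
    n = 0
    while iterations is None or n < iterations:
        write_heartbeat(path)
        n += 1
        time.sleep(interval_s)
```

Calling run("/root/heartbeat.log") on the PVE host (hypothetical path) keeps it logging until the next reset. If the last lines before a crash show nothing unusual in load or temperature, that at least points away from a thermal shutdown, though it can't tell a kernel problem from a CPU or PSU fault.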
My main problem right now is that I can't determine the root cause. Is it a kernel problem? A faulty CPU? A faulty PSU? ...
Has anyone had similar problems and solved them? Any further ideas for finding the root cause?