Random kernel panic/crashes/reboot

matthias77

Hi all,

since October 10th, my Proxmox server has been randomly crashing with a kernel panic about once a day.
Initially I was looking for hardware issues, since I thought I had not done any updates prior to the first crash. However, I recently realized I had forgotten that, according to /var/log/apt, I did some upgrades two days before the first crash.

I enabled kdump-tools to collect dmesg and reboot the server after a kernel panic. Occasionally it just reboots, according to syslog, without a kernel panic being logged. This is the syslog from the most recent reboot this morning:

Code:
2023-11-24T08:52:56.268970+01:00 proxmox1 pveproxy[2505222]: worker exit
2023-11-24T08:52:56.296703+01:00 proxmox1 pveproxy[2707]: worker 2505222 finished
2023-11-24T08:52:56.296808+01:00 proxmox1 pveproxy[2707]: starting 1 worker(s)
2023-11-24T08:52:56.299943+01:00 proxmox1 pveproxy[2707]: worker 2528632 started
2023-11-24T08:55:03.029844+01:00 proxmox1 systemd-modules-load[424]: Inserted module 'vhost_net'
2023-11-24T08:55:03.029939+01:00 proxmox1 kernel: [    0.000000] Linux version 6.5.11-4-pve (fgruenbichler@yuna) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-4 (2023-11-20T10:19Z) ()
2023-11-24T08:55:03.029944+01:00 proxmox1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-4-pve root=/dev/mapper/pve-root ro quiet intel_idle.max_cstate=1 processor.max_cstate=1 crashkernel=384M-:256M
2023-11-24T08:55:03.029945+01:00 proxmox1 kernel: [    0.000000] KERNEL supported cpus:
2023-11-24T08:55:03.029945+01:00 proxmox1 kernel: [    0.000000]   Intel GenuineIntel
2023-11-24T08:55:03.029946+01:00 proxmox1 kernel: [    0.000000]   AMD AuthenticAMD
2023-11-24T08:55:03.029946+01:00 proxmox1 kernel: [    0.000000]   Hygon HygonGenuine
2023-11-24T08:55:03.029947+01:00 proxmox1 kernel: [    0.000000]   Centaur CentaurHauls
2023-11-24T08:55:03.029951+01:00 proxmox1 kernel: [    0.000000]   zhaoxin   Shanghai
2023-11-24T08:55:03.029944+01:00 proxmox1 dmeventd[443]: dmeventd ready for processing.
2023-11-24T08:55:03.029951+01:00 proxmox1 kernel: [    0.000000] BIOS-provided physical RAM map:
2023-11-24T08:55:03.029952+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
2023-11-24T08:55:03.029954+01:00 proxmox1 systemd[1]: Starting systemd-journal-flush.service - Flush Journal to Persistent Storage...
2023-11-24T08:55:03.029958+01:00 proxmox1 lvm[415]:   5 logical volume(s) in volume group "pve" monitored
2023-11-24T08:55:03.029953+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
2023-11-24T08:55:03.029967+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
2023-11-24T08:55:03.029967+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
2023-11-24T08:55:03.029965+01:00 proxmox1 systemd-udevd[444]: Using default interface naming scheme 'v252'.
2023-11-24T08:55:03.029970+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
2023-11-24T08:55:03.029970+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a200000-0x000000000a20efff] ACPI NVS
2023-11-24T08:55:03.029971+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a20f000-0x000000000affffff] usable
2023-11-24T08:55:03.029969+01:00 proxmox1 dmeventd[443]: Monitoring thin pool pve-data-tpool.
2023-11-24T08:55:03.029971+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000000b000000-0x000000000b01ffff] reserved
2023-11-24T08:55:03.029972+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000000b020000-0x0000000098f90fff] usable
2023-11-24T08:55:03.029972+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x0000000098f91000-0x000000009a6d0fff] reserved
2023-11-24T08:55:03.029973+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000009a6d1000-0x000000009a70afff] ACPI data
2023-11-24T08:55:03.029975+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000009a70b000-0x000000009c1bdfff] ACPI NVS
2023-11-24T08:55:03.029973+01:00 proxmox1 systemd[1]: Started systemd-udevd.service - Rule-based Manager for Device Events and Files.
2023-11-24T08:55:03.029976+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000009c1be000-0x000000009cf76fff] reserved
2023-11-24T08:55:03.029976+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000009cf77000-0x000000009cffefff] type 20
2023-11-24T08:55:03.029977+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000009cfff000-0x000000009dffffff] usable
2023-11-24T08:55:03.029977+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000009e000000-0x00000000bfffffff] reserved
2023-11-24T08:55:03.029977+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
2023-11-24T08:55:03.029980+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fd200000-0x00000000fd2fffff] reserved
2023-11-24T08:55:03.029981+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fd600000-0x00000000fd6fffff] reserved
2023-11-24T08:55:03.029982+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fea00000-0x00000000fea0ffff] reserved
2023-11-24T08:55:03.029982+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000feb80000-0x00000000fec01fff] reserved
2023-11-24T08:55:03.029982+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
2023-11-24T08:55:03.029983+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fec30000-0x00000000fec30fff] reserved
2023-11-24T08:55:03.029983+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
2023-11-24T08:55:03.029978+01:00 proxmox1 udevadm[460]: systemd-udev-settle.service is deprecated. Please fix nut-driver@apc.service, zfs-import-scan.service, zfs-import-cache.service not to pull it in.
2023-11-24T08:55:03.029986+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved
2023-11-24T08:55:03.029986+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed8ffff] reserved
2023-11-24T08:55:03.029987+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fedc2000-0x00000000fedcffff] reserved
2023-11-24T08:55:03.029987+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fedd4000-0x00000000fedd5fff] reserved
2023-11-24T08:55:03.029986+01:00 proxmox1 systemd[1]: Finished systemd-udev-trigger.service - Coldplug All udev Devices.
2023-11-24T08:55:03.029988+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
2023-11-24T08:55:03.029988+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000183e2fffff] usable
2023-11-24T08:55:03.029990+01:00 proxmox1 kernel: [    0.000000] BIOS-e820: [mem 0x000000183e300000-0x000000183fffffff] reserved
2023-11-24T08:55:03.029991+01:00 proxmox1 kernel: [    0.000000] NX (Execute Disable) protection: active
2023-11-24T08:55:03.029991+01:00 proxmox1 kernel: [    0.000000] efi: EFI v2.7 by American Megatrends
2023-11-24T08:55:03.029989+01:00 proxmox1 systemd[1]: Starting ifupdown2-pre.service - Helper to synchronize boot up for ifupdown...
2023-11-24T08:55:03.029992+01:00 proxmox1 kernel: [    0.000000] efi: ACPI=0x9c1a7000 ACPI 2.0=0x9c1a7014 TPMFinalLog=0x9b171000 SMBIOS=0x9ce25000 SMBIOS 3.0=0x9ce24000 MEMATTR=0x95a0c018 ESRT=0x95a81698
2023-11-24T08:55:03.029992+01:00 proxmox1 kernel: [    0.000000] efi: Remove mem306: MMIO range=[0xf0000000-0xf7ffffff] (128MB) from e820 map
2023-11-24T08:55:03.029993+01:00 proxmox1 kernel: [    0.000000] e820: remove [mem 0xf0000000-0xf7ffffff] reserved
2023-11-24T08:55:03.029995+01:00 proxmox1 kernel: [    0.000000] efi: Remove mem307: MMIO range=[0xfd200000-0xfd2fffff] (1MB) from e820 map
2023-11-24T08:55:03.029994+01:00 proxmox1 systemd[1]: Starting systemd-udev-settle.service - Wait for udev To Complete Device Initialization...
2023-11-24T08:55:03.029996+01:00 proxmox1 kernel: [    0.000000] e820: remove [mem 0xfd200000-0xfd2fffff] reserved
2023-11-24T08:55:03.029996+01:00 proxmox1 kernel: [    0.000000] efi: Remove mem308: MMIO range=[0xfd600000-0xfd6fffff] (1MB) from e820 map
2023-11-24T08:55:03.029997+01:00 proxmox1 kernel: [    0.000000] e820: remove [mem 0xfd600000-0xfd6fffff] reserved
2023-11-24T08:55:03.029997+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem309: MMIO range=[0xfea00000-0xfea0ffff] (64KB) from e820 map
2023-11-24T08:55:03.029998+01:00 proxmox1 kernel: [    0.000000] efi: Remove mem310: MMIO range=[0xfeb80000-0xfec01fff] (0MB) from e820 map
2023-11-24T08:55:03.029998+01:00 proxmox1 kernel: [    0.000000] e820: remove [mem 0xfeb80000-0xfec01fff] reserved
2023-11-24T08:55:03.029998+01:00 proxmox1 systemd[1]: Finished systemd-journal-flush.service - Flush Journal to Persistent Storage.
2023-11-24T08:55:03.030000+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem311: MMIO range=[0xfec10000-0xfec10fff] (4KB) from e820 map
2023-11-24T08:55:03.030001+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem312: MMIO range=[0xfec30000-0xfec30fff] (4KB) from e820 map
2023-11-24T08:55:03.030001+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem313: MMIO range=[0xfed00000-0xfed00fff] (4KB) from e820 map
2023-11-24T08:55:03.030002+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem314: MMIO range=[0xfed40000-0xfed44fff] (20KB) from e820 map
2023-11-24T08:55:03.030002+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem315: MMIO range=[0xfed80000-0xfed8ffff] (64KB) from e820 map
2023-11-24T08:55:03.030002+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem316: MMIO range=[0xfedc2000-0xfedcffff] (56KB) from e820 map
2023-11-24T08:55:03.030005+01:00 proxmox1 kernel: [    0.000000] efi: Not removing mem317: MMIO range=[0xfedd4000-0xfedd5fff] (8KB) from e820 map
2023-11-24T08:55:03.030005+01:00 proxmox1 kernel: [    0.000000] efi: Remove mem318: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map
2023-11-24T08:55:03.030006+01:00 proxmox1 kernel: [    0.000000] e820: remove [mem 0xff000000-0xffffffff] reserved
2023-11-24T08:55:03.030001+01:00 proxmox1 systemd[1]: Found device dev-pve-swap.device - /dev/pve/swap.
2023-11-24T08:55:03.030006+01:00 proxmox1 kernel: [    0.000000] secureboot: Secure boot could not be determined (mode 0)
2023-11-24T08:55:03.030006+01:00 proxmox1 kernel: [    0.000000] SMBIOS 3.3.0 present.
2023-11-24T08:55:03.030007+01:00 proxmox1 kernel: [    0.000000] DMI: To Be Filled By O.E.M. B550 Phantom Gaming 4/B550 Phantom Gaming 4, BIOS P3.20 09/27/2023
2023-11-24T08:55:03.030008+01:00 proxmox1 systemd[1]: Activating swap dev-pve-swap.swap - /dev/pve/swap...
2023-11-24T08:55:03.030013+01:00 proxmox1 kernel: [    0.000000] tsc: Fast TSC calibration using PIT
2023-11-24T08:55:03.030014+01:00 proxmox1 kernel: [    0.000000] tsc: Detected 3693.185 MHz processor
2023-11-24T08:55:03.030012+01:00 proxmox1 systemd[1]: Activated swap dev-pve-swap.swap - /dev/pve/swap.
2023-11-24T08:55:03.030014+01:00 proxmox1 kernel: [    0.000643] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
2023-11-24T08:55:03.030016+01:00 proxmox1 kernel: [    0.000645] e820: remove [mem 0x000a0000-0x000fffff] usable
2023-11-24T08:55:03.030017+01:00 proxmox1 kernel: [    0.000654] last_pfn = 0x183e300 max_arch_pfn = 0x400000000
2023-11-24T08:55:03.030017+01:00 proxmox1 kernel: [    0.000660] MTRR map: 5 entries (3 fixed + 2 variable; max 20), built from 9 variable MTRRs
2023-11-24T08:55:03.030017+01:00 proxmox1 kernel: [    0.000662] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
2023-11-24T08:55:03.030017+01:00 proxmox1 systemd[1]: Reached target swap.target - Swaps.
2023-11-24T08:55:03.030020+01:00 proxmox1 kernel: [    0.000850] e820: update [mem 0xc0000000-0xffffffff] usable ==> reserved
2023-11-24T08:55:03.030020+01:00 proxmox1 kernel: [    0.000858] last_pfn = 0x9e000 max_arch_pfn = 0x400000000


I attached the dmesg output from the last three kernel panics.
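For anyone trying to tell a silent reboot from a logged panic, a minimal sketch of how the lines written just before each boot could be pulled out of syslog. The function name `last_lines_before_boot` is hypothetical, and it assumes the boot banner looks like the `Linux version` line in the excerpt above. Because the journal is flushed slightly out of order at boot (as the excerpt shows), a line or two from the new boot can appear ahead of the banner.

```shell
# Print the last N lines that were logged before each kernel boot banner.
# $1: log file (for example /var/log/syslog), $2: context lines (default 3)
last_lines_before_boot() {
    awk -v n="${2:-3}" '
        /kernel: \[ +0\.000000\] Linux version/ {
            print "--- boot detected at line " NR " ---"
            # Replay the ring buffer holding the previous n lines, oldest first.
            for (i = NR - n; i < NR; i++)
                if (i > 0) print buf[i % n]
        }
        { buf[NR % n] = $0 }
    ' "$1"
}
```

Calling `last_lines_before_boot /var/log/syslog 5` would then show whether anything at all was logged right before the machine went down.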

So far I have not been able to find a pattern except that it always includes

Code:
BUG: kernel NULL pointer dereference, address: ...
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page

Searching for this, I found this post (https://bbs.archlinux.org/viewtopic.php?id=288632&p=2) stating that disabling C-states helped, so I tried that yesterday, with no luck.
There is also a comment about nftables, and my last upgrade prior to the first kernel panic did upgrade libnftables1 and nftables (cf. attached apt logs), but that may be just a coincidence, as I found no signs of nftables in the kernel panic call traces. I'm no expert, so I'm probably missing something here.
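One way to look for a pattern across several panics is to count where the faulting instruction pointer landed in each saved dmesg. The sketch below is only an illustration: `oops_locations` is a hypothetical helper, it assumes the oops reports contain the usual `RIP: 0010:symbol+offset/size` line, and the `/var/crash/*/dmesg.*` location in the usage note is an assumption about where kdump-tools stores its files.

```shell
# Count kernel-mode fault locations across one or more saved dmesg files.
# Prints "count symbol+offset/size", most frequent first.
oops_locations() {
    grep -h -o -E 'RIP: 0010:[^ ]+' "$@" \
        | sed 's/^RIP: 0010://' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}
```

For example, `oops_locations /var/crash/*/dmesg.*` would list each location once with its count. If the same symbol keeps coming up, that points at one subsystem; locations scattered all over the kernel are more typical of memory corruption or unstable hardware.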

Soon after the first crash I booted with a 5.15 kernel with same result.

What I have done on the hardware side to diagnose:

* replaced multiple old storage SSDs with a new NVMe
* replaced a HDD in a ZFS mirror that showed bad SMART values (but was not indicated as failing)
* cloned the boot SSD to a different SSD
* ran a memory test
* upgraded the BIOS
* disconnected all USB devices (e.g. UPS, RS232 adapter)
* moved the boot SSD, storage SSD and the ZFS HDD pool (4 HDDs) to a different server case with the same type of motherboard, CPU and PSU, using other SATA cables

Since none of that helped I ruled out hardware issues.

Does anyone have ideas what I can do next?



1700816234031.png

pveversion -v
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-4-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.0.9
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
proxmox-kernel-6.5: 6.5.11-4
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2: 6.2.16-19
proxmox-kernel-6.2.16-18-pve: 6.2.16-18
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.4
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.0.4-1
proxmox-backup-file-restore: 3.0.4-1
proxmox-kernel-helper: 8.0.9
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-1
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.2
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.0-pve3
 
Confirmed. I am experiencing the exact same issue. I am running multiple Hetzner servers (EX101 series), and while most of them are working fine, two of them are facing the same problem. I have ruled out hardware issues because I already had one of them replaced (keeping only the NVMe drives), yet the problem persisted.

I noticed a kernel crash during the night, especially during inactive hours when the system has very little load and is mostly idle. I suspect this may be related to older hardware and newer kernels, particularly how they operate with power management, potentially causing PCIe bus crashes. However, I am not certain about this.

I am currently using Linux 5.15.131-1-pve #1 SMP PVE 5.15.131-1 (2023-11-09T09:29Z). Does anyone else have more information on this issue? I cannot afford to use these two servers in production, and I am concerned that a similar problem could affect my other servers as well.
 
Thanks for your reply, I'm glad I'm not the only one with this issue.

I stopped all VMs a few days ago and run only LXC containers and it's been without a crash for the longest time since the problem started:

1701860795970.png
 
Thank you for your reply. May I ask what kind of hardware you have? I am interested in the motherboard and BIOS. Thank you.
 
sure: ASRock B550 Phantom Gaming 4/B550 Phantom Gaming 4, BIOS P2.20 02/24/2022
with Kingston ECC RAM
 
How is it going with your server? Mine crashed again tonight. I updated to the latest Linux kernel, Linux 5.15.131-2-pve #1 SMP PVE 5.15.131-3 (2023-12-01T13:42Z), but I am afraid it is not going to help. I am starting to lose hope. Also, I would like to ask about your uptime chart: is it from the Proxmox summary, or do you have a standalone monitoring solution?
 
Mine has now been running for 7 days, 1 hour. That is almost 2 days longer than any previous run. I have been using VMs only rarely, and always shutting them down or hibernating them once I was finished. Luckily all services that need to run constantly are in LXC containers.

Are you not running Proxmox 8.x? Why are you on 5.x Linux kernels?

When did your problems start? Also in October?

I'm using Proxmox's built-in metrics data, which is sent to InfluxDB and visualized with Grafana. I'm also running a Telegraf agent on the hypervisor to collect additional data, like SMART data, that is not included in the built-in metrics. I'm doing that since I was expecting this issue to be caused by faulty drives.
 
The problems started in the middle of October. I am using Proxmox 7.4-17.
 
I suspect that something I upgraded on 2023-10-08 is causing this. Do you have apt logs from around that time so we could compare? This is what I installed:
Code:
Start-Date: 2023-10-08  06:34:58
Commandline: apt upgrade
Upgrade: telnet:amd64 (0.17+2.4-2, 0.17+2.4-2+deb12u1), libcups2:amd64 (2.4.2-3+deb12u1, 2.4.2-3+deb12u4), libcurl4:amd64 (7.88.1-10+deb12u1, 7.88.1-10+deb12u3), libpam-runtime:amd64 (1.5.2-6, 1.5.2-6+deb12u1), udev:amd64 (252.12-pmx1, 252.17-1~deb12u1), krb5-locales:amd64 (1.20.1-2, 1.20.1-2+deb12u1), bind9-host:amd64 (1:9.18.16-1~deb12u1, 1:9.18.19-1~deb12u1), libgssapi-krb5-2:amd64 (1.20.1-2, 1.20.1-2+deb12u1), libcurl3-gnutls:amd64 (7.88.1-10+deb12u1, 7.88.1-10+deb12u3), openssh-client:amd64 (1:9.2p1-2, 1:9.2p1-2+deb12u1), dbus-user-session:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), libnftables1:amd64 (1.0.6-2+deb12u1, 1.0.6-2+deb12u2), libx11-xcb1:amd64 (2:1.8.4-2+deb12u1, 2:1.8.4-2+deb12u2), libldb2:amd64 (2:2.6.2+samba4.17.10+dfsg-0+deb12u1, 2:2.6.2+samba4.17.11+dfsg-0+deb12u1), libpam-systemd:amd64 (252.12-pmx1, 252.17-1~deb12u1), ssh:amd64 (1:9.2p1-2, 1:9.2p1-2+deb12u1), grub-pc-bin:amd64 (2.06-13, 2.06-13+deb12u1), dbus-daemon:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), libsystemd0:amd64 (252.12-pmx1, 252.17-1~deb12u1), libnss-systemd:amd64 (252.12-pmx1, 252.17-1~deb12u1), debianutils:amd64 (5.7-0.4, 5.7-0.5~deb12u1), libwbclient0:amd64 (2:4.17.10+dfsg-0+deb12u1, 2:4.17.11+dfsg-0+deb12u1), openssh-server:amd64 (1:9.2p1-2, 1:9.2p1-2+deb12u1), libdbus-1-3:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), dbus-bin:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), libkrb5support0:amd64 (1.20.1-2, 1.20.1-2+deb12u1), libsmbclient:amd64 (2:4.17.10+dfsg-0+deb12u1, 2:4.17.11+dfsg-0+deb12u1), libxpm4:amd64 (1:3.5.12-1.1, 1:3.5.12-1.1+deb12u1), systemd:amd64 (252.12-pmx1, 252.17-1~deb12u1), libudev1:amd64 (252.12-pmx1, 252.17-1~deb12u1), debian-archive-keyring:amd64 (2023.3, 2023.3+deb12u1), libx11-data:amd64 (2:1.8.4-2+deb12u1, 2:1.8.4-2+deb12u2), libc6:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), locales:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), libssl3:amd64 (3.0.9-1, 3.0.11-1~deb12u1), libkrb5-3:amd64 (1.20.1-2, 1.20.1-2+deb12u1), libpam-modules:amd64 (1.5.2-6, 
1.5.2-6+deb12u1), inetutils-telnet:amd64 (2:2.4-2, 2:2.4-2+deb12u1), bind9-dnsutils:amd64 (1:9.18.16-1~deb12u1, 1:9.18.19-1~deb12u1), base-files:amd64 (12.4+deb12u1, 12.4+deb12u2), libk5crypto3:amd64 (1.20.1-2, 1.20.1-2+deb12u1), libunbound8:amd64 (1.17.1-2, 1.17.1-2+deb12u1), dbus-session-bus-common:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), python3-ldb:amd64 (2:2.6.2+samba4.17.10+dfsg-0+deb12u1, 2:2.6.2+samba4.17.11+dfsg-0+deb12u1), libpam-modules-bin:amd64 (1.5.2-6, 1.5.2-6+deb12u1), grub-efi-amd64-bin:amd64 (2.06-13, 2.06-13+deb12u1), grub2-common:amd64 (2.06-13, 2.06-13+deb12u1), smbclient:amd64 (2:4.17.10+dfsg-0+deb12u1, 2:4.17.11+dfsg-0+deb12u1), libc-dev-bin:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), openssh-sftp-server:amd64 (1:9.2p1-2, 1:9.2p1-2+deb12u1), libpam0g:amd64 (1.5.2-6, 1.5.2-6+deb12u1), libc-l10n:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), grub-common:amd64 (2.06-13, 2.06-13+deb12u1), bind9-libs:amd64 (1:9.18.16-1~deb12u1, 1:9.18.19-1~deb12u1), samba-libs:amd64 (2:4.17.10+dfsg-0+deb12u1, 2:4.17.11+dfsg-0+deb12u1), libc-bin:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), libsystemd-shared:amd64 (252.12-pmx1, 252.17-1~deb12u1), libc-devtools:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), libc6-dev:amd64 (2.36-9+deb12u1, 2.36-9+deb12u3), systemd-sysv:amd64 (252.12-pmx1, 252.17-1~deb12u1), samba-common:amd64 (2:4.17.10+dfsg-0+deb12u1, 2:4.17.11+dfsg-0+deb12u1), curl:amd64 (7.88.1-10+deb12u1, 7.88.1-10+deb12u3), libx11-6:amd64 (2:1.8.4-2+deb12u1, 2:1.8.4-2+deb12u2), dbus-system-bus-common:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), dbus:amd64 (1.14.8-2~deb12u1, 1.14.10-1~deb12u1), grub-pc:amd64 (2.06-13, 2.06-13+deb12u1), openssl:amd64 (3.0.9-1, 3.0.11-1~deb12u1), nftables:amd64 (1.0.6-2+deb12u1, 1.0.6-2+deb12u2), linux-libc-dev:amd64 (6.1.52-1, 6.1.55-1)
End-Date: 2023-10-08  06:35:41
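To compare upgrade logs between two affected hosts, it may help to reduce each log to a plain list of package names first. The following is a sketch under the assumption that the log has the same `Upgrade: name:arch (old, new), ...` layout as the excerpt above, with each `Upgrade:` entry on a single line. `upgraded_packages` is a hypothetical helper name.

```shell
# Read apt history text on stdin and print every upgraded "package:arch",
# one per line, sorted and de-duplicated.
upgraded_packages() {
    grep '^Upgrade: ' \
        | grep -o -E '[A-Za-z0-9.+-]+:[a-z0-9]+ \(' \
        | sed 's/ ($//' \
        | sort -u
}
```

Running `upgraded_packages < /var/log/apt/history.log > host-a.txt` on each machine and then `comm -12 host-a.txt host-b.txt` would print only the packages that were upgraded on both, which narrows down the list of suspects.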
 
I'm experiencing the same issue. Ever since I did some major updates in October, my system keeps crashing every few days. Sometimes it doesn't crash for a week, and then it crashes multiple times a day. It happens at any time of the day and does not seem related to load.

So far I have not been able to collect any crash logs; the regular system log does not show any record of the crash. I have yet to set up kdump.

I've updated from kernel 5.15.39-2-pve to 6.2.16-15-pve (with the intermediate step to 5.15.116-1-pve) on October 10. I've updated again to 6.5.11-6-pve two days ago which did not fix the problem, the machine already has crashed once since.
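For reference, a rough outline of what setting up kdump on a Debian-based Proxmox host could look like. The `crashkernel=384M-:256M` value is taken from the kernel command line shown earlier in this thread. The manual steps in the comments assume a GRUB-booted system and are not verified against any particular Proxmox release. `has_crashkernel` is a hypothetical helper that only checks whether a command line reserves crash-kernel memory.

```shell
# Return success if the given kernel command line reserves crash-kernel memory.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Manual steps (run as root; shown as comments, not executed here):
#   apt install kdump-tools
#   add crashkernel=384M-:256M to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
#   update-grub && reboot
#   kdump-config show    # should report whether the crash kernel is loaded

if has_crashkernel "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "crashkernel reservation present"
else
    echo "no crashkernel= parameter on the running kernel"
fi
```

Without the reservation on the running kernel, a panic cannot hand over to the capture kernel, so no dmesg gets saved.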
 
I have 5 identical servers. All of them have the same configuration, even the same kernel, but 2 of them crash regularly and 3 don't. I don't have any useful info in the logs, so I don't know what is going on. Also, those servers are in the DC, so I don't have any physical access to them. I am thinking about decommissioning them completely and ordering a totally different model. I have no clue and no better idea of how to fix this mess. Two of the servers are simply impossible to use in production.
 
Another kernel version got released today (6.5.11-7-pve). Just installed it on the server in question. Let's see...
 
Hi everyone, I strongly believe the issue is related to power management, specifically p-states, c-states, and how the CPU driver manages power savings. Currently, I am conducting an experiment. I've changed the CPU driver from p-state to intel_cpufreq and forced the CPU to run at a constant frequency of 4.20GHz. Additionally, I've added the boot parameter idle=poll, effectively disabling the idle system. This will undoubtedly disappoint Greta, but, at the very least, I aim to eliminate the possibility of power management issues from the equation.
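A sketch of how such an experiment might be made to survive reboots on a GRUB-booted host. `intel_pstate=passive` and `idle=poll` are documented kernel parameters. The helper `add_kernel_params` is hypothetical, it is not idempotent (running it twice appends twice), and systems that boot via systemd-boot keep their command line elsewhere, so treat this as an illustration only.

```shell
# Append kernel parameters to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults file.
# $1: path to the file (normally /etc/default/grub), remaining args: parameters
add_kernel_params() {
    file=$1
    shift
    sed -i -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 $*\"|" "$file"
}

# Intended use (as root; shown as comments, not executed here):
#   add_kernel_params /etc/default/grub intel_pstate=passive idle=poll
#   update-grub && reboot
# Pinning the frequency afterwards, assuming the cpupower tool is installed:
#   cpupower frequency-set -d 4.2GHz -u 4.2GHz
```

Note that `idle=poll` keeps every core busy-waiting, so power draw and temperatures go up noticeably. It is a diagnostic setting rather than something to leave on permanently.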
 
I also installed the new kernel today and rebooted, after the server had run for more than 11 days without a crash. It's far from proof, but something looks different now compared to the number of reboots (most were crashes) before. The only thing I did was keep all VMs powered off and only power them on for a few minutes when I needed them. I have no idea if that is even possible, but maybe VMs are causing the power management issue you suspect? I will now turn the VMs on again, and as soon as it crashes again I will also run your experiment (if I can figure out how; I have never even heard of changing the CPU driver :)). If you want me to try something else, feel free to let me know.
1702400660226.png
 
The new kernel did not fix the issue for me. The server already crashed the same night after installing it. I'm only running VMs, no containers. Turning them off is no option for me.

I've now followed collider18's idea and switched to intel_cpufreq. I have done that by setting p-states to passive:

Code:
echo "passive" > /sys/devices/system/cpu/intel_pstate/status

This of course will revert back to the default after a reboot, but I'll deal with a permanent solution later if this works.

cpufreq-info will confirm you're using intel_cpufreq as the driver.
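If cpufreq-info is not installed, the active driver can also be read directly from sysfs. A small sketch, where `cpufreq_drivers` is a hypothetical helper and the optional argument exists only so the function can be pointed at a different directory for testing:

```shell
# Print the distinct cpufreq scaling drivers in use across all CPUs.
# $1: sysfs CPU root (defaults to /sys/devices/system/cpu)
cpufreq_drivers() {
    root=${1:-/sys/devices/system/cpu}
    cat "$root"/cpu[0-9]*/cpufreq/scaling_driver 2>/dev/null | sort -u
}
```

On an Intel system in passive mode this should print `intel_cpufreq`. If it still prints `intel_pstate`, the write to the status file did not take effect.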
 
It crashed just 6 hours after I started one VM (Windows 10).
I'm using the following settings, which are not the defaults when creating Windows VMs:
1702449734693.png
I changed them to the defaults for new Windows VMs and will test again:
1702449772850.png

I haven't yet applied changes to power management. I only have /sys/.../amd_pstate/... but I assume that would work the same as intel_pstate above.
 
The concept is the same; the only difference is that you have AMD. My server is still up and running, but I am not overly optimistic because it has only been 6 days. I will let you know in a few days.
 
