Crashes since Proxmox v9.0

gstyle

New Member
Jan 26, 2024
Hi everybody,
Since I updated to 9.0, my server occasionally crashes: a complete freeze, IPMI display frozen, not pingable, nothing in the logs.
ASROCK Rack Mainboard: E3C256D2I
Intel Xeon E-2356G
64GB Kingston ECC RAM

The crashes happened a few times per day, but the server also ran for 3 days in a row without crashing.
This was with the new 6.14 kernel.
I still had the 6.8.12-13-pve kernel installed. I have booted into it now, and there have been no crashes for 2 days.

Does anybody have a similar experience? Any ideas on how to narrow down the root cause? That is quite difficult without anything in the logs.

The server is running several VMs (Debian, Ubuntu, OPNsense, Home Assistant, RaspberryMatic) and some LXC containers.
One USB passthrough, no PCI passthrough.

Cheers
Mario
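
While testing, the known-good kernel can be kept as the boot default instead of selecting it by hand at every boot. A minimal sketch, assuming proxmox-boot-tool manages the boot entries on this host; the function names and the PBT variable are made up for illustration, and only the "kernel list", "kernel pin" and "kernel unpin" subcommands come from the tool itself:

```shell
#!/bin/sh
# Sketch only: function names and the PBT variable are hypothetical.
PBT="${PBT:-proxmox-boot-tool}"

# Show the kernels the boot tool knows about, including any pinned one.
list_kernels() {
    "$PBT" kernel list
}

# Make the given kernel version the boot default until it is unpinned.
pin_kernel() {
    if [ -z "$1" ]; then
        echo "usage: pin_kernel <kernel-version>" >&2
        return 2
    fi
    "$PBT" kernel pin "$1"
}

# Go back to booting the newest installed kernel by default.
unpin_kernel() {
    "$PBT" kernel unpin
}
```

With the version from this post, that would be "pin_kernel 6.8.12-13-pve" (run as root), and "unpin_kernel" once 6.14 is to be tested again.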
 
If it happened during the previous boot you can look at the end of the journal like this:

journalctl -b -1 -p warning -e

For a description of "-b" etc. consult "man journalctl". You might post the last few dozen lines or so (depending on your findings) here - in [code]...[/code]-tags, please.
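
If you want something reusable, the command above can be wrapped roughly like this. It is only a sketch: the function names and the JOURNALCTL override are made up for illustration, while the options themselves (-b, -p, -k, -n, --no-pager) are standard journalctl options. It uses "-n" instead of "-e" so the output can be redirected into a file for posting.

```shell
#!/bin/sh
# Sketch only: function names and the JOURNALCTL variable are hypothetical.
JOURNALCTL="${JOURNALCTL:-journalctl}"

# Last N entries (default 50) of priority "warning" or worse
# from the boot before the current one.
prev_boot_warnings() {
    "$JOURNALCTL" -b -1 -p warning -n "${1:-50}" --no-pager
}

# Last N kernel messages (default 50) from the boot before the current one.
# A hard freeze often leaves nothing here, but a panic or MCE sometimes does.
prev_boot_kernel() {
    "$JOURNALCTL" -k -b -1 -n "${1:-50}" --no-pager
}
```

For example, "prev_boot_warnings 80 > warnings.txt" saves the last 80 such entries. If "-b -1" returns nothing, "journalctl --list-boots" shows which boots the journal still holds.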

Is your system up to date? Make sure it is.
 
These are the current messages; however, they are from a boot with kernel 6.8.12.
Will do a boot with 6.14 and post it later.

Code:
Aug 15 08:44:25 proxmox kernel: spl: loading out-of-tree module taints kernel.
Aug 15 08:44:25 proxmox kernel: zfs: module license 'CDDL' taints kernel.
Aug 15 08:44:25 proxmox kernel: Disabling lock debugging due to kernel taint
Aug 15 08:44:25 proxmox kernel: zfs: module license taints kernel.
Aug 15 08:44:25 proxmox systemd-sysv-generator[661]: SysV service '/etc/init.d/openipmi' lacks a native systemd unit file, automatically generating a unit file for compatibility.
Aug 15 08:44:25 proxmox systemd-sysv-generator[661]: Please update package to include a native systemd unit file.
Aug 15 08:44:25 proxmox systemd-sysv-generator[661]: ! This compatibility logic is deprecated, expect removal soon. !
Aug 15 08:44:25 proxmox kernel: pstore: backend 'erst' already in use: ignoring 'efi_pstore'
Aug 15 08:44:25 proxmox systemd-journald[686]: File /var/log/journal/35670867229e442a97a1d2b566329d64/system.journal corrupted or uncleanly shut down, renaming and replacing.
Aug 15 08:44:25 proxmox (udev-worker)[749]: lo: Invalid network interface name, ignoring:
Aug 15 08:44:25 proxmox lvm[816]: /dev/zd16p3 excluded: device is rejected by filter config.
Aug 15 08:44:25 proxmox lvm[942]: /dev/zd80p3 excluded: device is rejected by filter config.
Aug 15 08:44:25 proxmox kernel: spi-nor spi0.0: supply vcc not found, using dummy regulator
Aug 15 08:44:25 proxmox kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Aug 15 08:44:25 proxmox kernel: power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
Aug 15 08:44:30 proxmox blkmapd[1823]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
Aug 15 08:44:31 proxmox pmxcfs[2049]: [quorum] crit: quorum_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 15 08:44:31 proxmox pmxcfs[2049]: [quorum] crit: can't initialize service
Aug 15 08:44:31 proxmox pmxcfs[2049]: [confdb] crit: cmap_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 15 08:44:31 proxmox pmxcfs[2049]: [confdb] crit: can't initialize service
Aug 15 08:44:31 proxmox pmxcfs[2049]: [dcdb] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 15 08:44:31 proxmox pmxcfs[2049]: [dcdb] crit: can't initialize service
Aug 15 08:44:31 proxmox pmxcfs[2049]: [status] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 15 08:44:31 proxmox pmxcfs[2049]: [status] crit: can't initialize service
Aug 15 08:44:32 proxmox postfix/postfix-script[2164]: warning: /var/spool/postfix/etc/services and /etc/services differ
Aug 15 08:44:33 proxmox (corosync)[2191]: corosync.service: Referenced but unset environment variable evaluates to an empty string: COROSYNC_OPTIONS
Aug 15 08:44:33 proxmox upsd[2219]: Running as foreground process, not saving a PID file
Aug 15 08:44:33 proxmox corosync[2191]:   [WD    ] Watchdog not enabled by configuration
Aug 15 08:44:33 proxmox corosync[2191]:   [WD    ] resource load_15min missing a recovery key.
Aug 15 08:44:33 proxmox corosync[2191]:   [WD    ] resource memory_used missing a recovery key.
Aug 15 08:44:33 proxmox corosync[2191]:   [KNET  ] host: host: 2 has no active links
Aug 15 08:44:33 proxmox corosync[2191]:   [KNET  ] host: host: 2 has no active links
Aug 15 08:44:33 proxmox corosync[2191]:   [KNET  ] host: host: 2 has no active links
Aug 15 08:44:33 proxmox corosync[2191]:   [KNET  ] host: host: 2 has no active links
Aug 15 08:44:33 proxmox corosync[2191]:   [KNET  ] host: host: 2 has no active links
Aug 15 08:44:33 proxmox corosync[2191]:   [KNET  ] host: host: 2 has no active links
Aug 15 08:44:36 proxmox corosync-qdevice[2225]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Aug 15 08:44:41 proxmox corosync-qdevice[2225]: Connect timeout
Aug 15 08:45:09 proxmox pve-guests[2908]: removed left over backup lock from '105'!
Aug 15 08:45:10 proxmox kernel: kauditd_printk_skb: 115 callbacks suppressed
Aug 15 08:45:11 proxmox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Aug 15 08:45:19 proxmox pvescheduler[4005]: VM 105 qmp command failed - VM 105 qmp command 'guest-ping' failed - got timeout
Aug 15 08:45:19 proxmox pvescheduler[4005]: QEMU Guest Agent is not running - VM 105 qmp command 'guest-ping' failed - got timeout
Aug 15 10:28:04 proxmox systemd-sysv-generator[82511]: SysV service '/etc/init.d/openipmi' lacks a native systemd unit file, automatically generating a unit file for compatibility.
Aug 15 10:28:04 proxmox systemd-sysv-generator[82511]: Please update package to include a native systemd unit file.
Aug 15 10:28:04 proxmox systemd-sysv-generator[82511]: ! This compatibility logic is deprecated, expect removal soon. !
Aug 15 10:28:18 proxmox pveproxy[82630]: got inotify poll request in wrong process - disabling inotify
 
"These are the current messages; however, they are from a boot with kernel 6.8.12."
You said in your first post:
"I still had the 6.8.12-13-pve kernel installed. Booted into it now and since 2 days no crashes anymore."
So... these logs are irrelevant, aren't they?

What does your storage look like? I see ZFS modules loaded. Do you have redundancy?

openipmi is not in a recommended state.

I am not sure, but shouldn't those "quorum" and "corosync" messages only appear when a cluster is present?
 
Rebooted today into kernel 6.14.
Some weeks ago, I also made some changes to the C-state settings in the BIOS. I have set them back to default. Maybe this changes something.


The server is part of a 2-node cluster with a Raspberry Pi as an additional quorum device.
Storage is two SATA SSDs as a ZFS mirror for the system and VMs, and a ZFS mirror of two SATA HDDs for multimedia data.


These are the current messages:

Code:
Aug 17 16:00:06 proxmox sshd-session[2187556]: error: mm_reap: preauth child terminated by signal 15
Aug 17 16:00:06 proxmox blkmapd[1735]: exit on signal(15)
Aug 17 16:00:06 proxmox corosync-qdevice[2259]: Lost connection with heuristics worker
Aug 17 16:00:06 proxmox upsd[2333]: mainloop: Interrupted system call
Aug 17 16:00:06 proxmox systemd-logind[1656]: Failed to start session scope session-9644.scope: Transaction for session-9644.scope/start is destructive (reboot.target has 'start' job queued, but 'stop' is included in transaction).
Aug 17 16:00:06 proxmox sshd-session[2187584]: pam_systemd(sshd:session): Failed to create session: Transaction for session-9644.scope/start is destructive (reboot.target has 'start' job queued, but 'stop' is included in transaction).
Aug 17 16:00:06 proxmox sshd-session[2187707]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:07 proxmox sshd-session[2187719]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:07 proxmox sshd-session[2187742]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:08 proxmox sshd-session[2187758]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:08 proxmox kernel: apparmor mqueue disconnected TODO
Aug 17 16:00:08 proxmox kernel: apparmor mqueue disconnected TODO
Aug 17 16:00:08 proxmox kernel: apparmor mqueue disconnected TODO
Aug 17 16:00:08 proxmox kernel: apparmor mqueue disconnected TODO
Aug 17 16:00:08 proxmox kernel: apparmor mqueue disconnected TODO
Aug 17 16:00:08 proxmox kernel: apparmor mqueue disconnected TODO
Aug 17 16:00:08 proxmox sshd-session[2188460]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:09 proxmox sshd-session[2189059]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:10 proxmox sshd-session[2189169]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:10 proxmox sshd-session[2189186]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:11 proxmox sshd-session[2189197]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:11 proxmox sshd-session[2189265]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:12 proxmox sshd-session[2189518]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:13 proxmox sshd-session[2189596]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:13 proxmox sshd-session[2189624]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:13 proxmox sshd-session[2189655]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:14 proxmox sshd-session[2189666]: pam_systemd(sshd:session): Failed to create session: Transport endpoint is not connected
Aug 17 16:00:16 proxmox sshd-session[2189774]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:17 proxmox sshd-session[2189785]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:17 proxmox sshd-session[2189795]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:18 proxmox sshd-session[2189806]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:19 proxmox sshd-session[2189819]: pam_systemd(sshd:session): Failed to connect to system bus: Broken pipe
Aug 17 16:00:32 proxmox systemd[1]: lxcfs.service: Failed with result 'exit-code'.
Aug 17 16:00:39 proxmox pmxcfs[2087]: [confdb] crit: cmap_dispatch failed: 2
Aug 17 16:00:39 proxmox pmxcfs[2087]: [quorum] crit: quorum_dispatch failed: CS_ERR_LIBRARY
Aug 17 16:00:39 proxmox pmxcfs[2087]: [dcdb] crit: cpg_dispatch failed: CS_ERR_LIBRARY
Aug 17 16:00:39 proxmox pmxcfs[2087]: [dcdb] crit: cpg_leave failed: CS_ERR_LIBRARY
Aug 17 16:00:39 proxmox pmxcfs[2087]: [status] crit: cpg_dispatch failed: CS_ERR_LIBRARY
Aug 17 16:00:39 proxmox pmxcfs[2087]: [status] crit: cpg_leave failed: CS_ERR_LIBRARY
Aug 17 16:00:39 proxmox pmxcfs[2087]: [quorum] crit: quorum_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 17 16:00:39 proxmox pmxcfs[2087]: [quorum] crit: can't initialize service
Aug 17 16:00:39 proxmox pmxcfs[2087]: [confdb] crit: cmap_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 17 16:00:39 proxmox pmxcfs[2087]: [confdb] crit: can't initialize service
Aug 17 16:00:39 proxmox pmxcfs[2087]: [dcdb] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 17 16:00:39 proxmox pmxcfs[2087]: [dcdb] crit: can't initialize service
Aug 17 16:00:39 proxmox pmxcfs[2087]: [status] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
Aug 17 16:00:39 proxmox pmxcfs[2087]: [status] crit: can't initialize service
Aug 17 16:00:40 proxmox pmxcfs[2087]: [quorum] crit: quorum_finalize failed: CS_ERR_BAD_HANDLE
Aug 17 16:00:40 proxmox pmxcfs[2087]: [confdb] crit: cmap_track_delete nodelist failed: CS_ERR_BAD_HANDLE
Aug 17 16:00:40 proxmox pmxcfs[2087]: [confdb] crit: cmap_track_delete version failed: CS_ERR_BAD_HANDLE
Aug 17 16:00:40 proxmox pmxcfs[2087]: [confdb] crit: cmap_finalize failed: CS_ERR_BAD_HANDLE
Aug 17 16:00:41 proxmox kernel: watchdog: watchdog0: watchdog did not stop!
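
An excerpt like the one above is dominated by one repeated line, which makes the rarer messages easy to miss. A small sketch for collapsing such output into counts per distinct message; it assumes journalctl's default short format ("Mon DD HH:MM:SS host unit[pid]: message"), and the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch only: summarize_journal is a hypothetical helper name.
# Reads journal lines on stdin and prints "count unit: message",
# most frequent first.
summarize_journal() {
    # 1st expression: drop the "Mon DD HH:MM:SS host " prefix.
    # 2nd expression: drop the "[pid]" so identical messages from
    #                 different processes are counted together.
    sed -E -e 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //' \
           -e 's/\[[0-9]+\]:/:/' \
        | sort | uniq -c | sort -rn
}
```

Usage would be along the lines of "journalctl -b -1 -p warning --no-pager | summarize_journal". Note that the timestamps are lost, so this is a complement to the raw log, not a replacement for it.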