Hard reboots and lockups since PVE 8.2 => 8.3

blackpaw

@mira

As per the title, my main server (part of a 3-node cluster) has been rebooting or locking up daily since upgrading to 8.3. It runs 8 containers and a Windows VM. It's headless, so I can't see what's on the console :(

Bash:
journalctl --since '2024-11-24'|grep -i panic
Nov 24 10:33:58 px-server kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
Nov 24 11:15:21 px-server kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
Nov 26 16:57:04 px-server kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
 
Hello,

Could you please show all the journal entries of the host from 2 minutes before the reboot to the reboot? The journal has a line starting with

Code:
-- Boot

or

Code:
-- Reboot

on each new boot. You can use that marker to find the last messages logged before the previous boot ended.
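
A minimal sketch of that search, assuming the journal has first been saved to a text file (for example with `journalctl --since '2024-11-24' > journal.txt`). The helper name `lines_before_boot` and the file name are made up for illustration. On a live host, `journalctl --list-boots` and `journalctl -b -1 -n 50` are a more direct way to reach the tail of the previous boot.

```shell
#!/bin/sh
# lines_before_boot FILE [N]
# Print the N lines (default 20) that precede every "-- Boot" / "-- Reboot"
# marker in a saved journal dump, followed by the marker line itself.
lines_before_boot() {
    file=$1
    n=${2:-20}
    awk -v n="$n" '
        /^-- (Boot|Reboot)/ {
            # Replay the ring buffer, oldest line first.
            start = (count > n) ? count - n : 0
            for (i = start; i < count; i++) print buf[i % n]
            print
            count = 0
            next
        }
        # Keep only the most recent n lines in a ring buffer.
        { buf[count % n] = $0; count++ }
    ' "$file"
}

# Usage (hypothetical file name):
#   lines_before_boot journal.txt 40
```

The ring buffer keeps memory use flat even for a large dump, since only the last N lines before each marker are of interest.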

Do you have any guests with HA enabled in the cluster?
 
Do you have any guests with HA enabled in the cluster?

No HA Enabled.

I did notice I had quorum votes set to 2 for the main node (4 in total for the cluster), a hangover from when I only had 2 nodes. I have set it back to 1 vote; dunno if that could have munged things?

Code:
Nov 25 23:16:14 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:17:01 px-server CRON[2991581]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 25 23:17:01 px-server CRON[2991582]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Nov 25 23:17:01 px-server CRON[2991581]: pam_unix(cron:session): session closed for user root
Nov 25 23:26:14 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.crash failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:26:14 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.crash.px-server failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:26:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:36:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.crash failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:36:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.crash.px-server failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:36:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:46:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.crash failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:46:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.crash.px-server failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:46:15 px-server ceph-crash[1686]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-10-15T02:10:11.444255Z_68dc1539-7c3e-4ab7-b056-d14c2b9f4c89 as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Nov 25 23:55:02 px-server systemd[1]: Started log2ram-daily.service - Daily Log2Ram writing activities.
Nov 25 23:55:02 px-server systemd[1]: Reloading log2ram.service - Log2Ram...
-- Boot 556f36b28d4e4fff930ca31fa30b2438 --
Nov 26 16:57:04 px-server kernel: Linux version 6.8.12-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2024-11-06T15:04Z) ()
Nov 26 16:57:04 px-server kernel: Command line: initrd=\EFI\proxmox\6.8.12-4-pve\initrd.img-6.8.12-4-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=off mitigations=off usbcore.autosuspend=-1 net.ifnames=0 biosdevname=0
Nov 26 16:57:04 px-server kernel: KERNEL supported cpus:
Nov 26 16:57:04 px-server kernel:   Intel GenuineIntel
Nov 26 16:57:04 px-server kernel:   AMD AuthenticAMD
Nov 26 16:57:04 px-server kernel:   Hygon HygonGenuine
Nov 26 16:57:04 px-server kernel:   Centaur CentaurHauls
Nov 26 16:57:04 px-server kernel:   zhaoxin   Shanghai
 
Ceph would not explain a reboot/shutdown of the hosts. Having a node with more than 1 vote would definitively explain it if there are HA resources in the cluster. Please take a look at how fencing works in our documentation [1].

You can check the number of votes per host at /etc/pve/corosync.conf and using

Code:
pvecm status

The latter command will also tell you how many votes the current node has. You can use

Code:
ha-manager status

to check if there are any HA services.
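
A rough way to spot leftover extra votes, assuming the membership table printed by `pvecm status` uses the columns `Nodeid Votes Name`. The function name `nodes_with_extra_votes` is made up for this sketch; it only reads text on stdin and changes nothing.

```shell
#!/bin/sh
# nodes_with_extra_votes: read `pvecm status` output on stdin and print
# "<name> <votes>" for every cluster member whose vote count is not 1.
nodes_with_extra_votes() {
    awk '
        # Only look at rows below the membership header.
        /^Membership information/ { in_members = 1; next }
        # Member rows start with a hex node ID such as 0x00000001.
        in_members && $1 ~ /^0x[0-9a-fA-F]+$/ && $2 != 1 { print $3, $2 }
    '
}

# Usage:
#   pvecm status | nodes_with_extra_votes
```

No output means every member has exactly one vote.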

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
 
Ceph would not explain a reboot/shutdown of the hosts. Having a node with more than 1 vote would definitively explain it if there are HA resources in the cluster. Please take a look at how fencing works in our documentation [1].

Thanks, yah I just noticed the quorum issue while checking things, have updated it.

Code:
 pvecm status
Cluster information
-------------------
Name:             blackpaw
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Nov 27 01:41:56 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.4de
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.50 (local)
0x00000002          1 192.168.1.201
0x00000003          1 192.168.1.219

I don't have any HA services running though.

I'll monitor the nodes for now, see if fixing the quorum votes makes a difference.

Thanks!
 
As mentioned in [0], do you have files in /var/lib/systemd/pstore/?

If so, those could help shed light on the crashes.
From what we can see in the logs, the server was actually hanging for a while until you reset it, right?
Code:
Nov 25 23:55:02 px-server systemd[1]: Reloading log2ram.service - Log2Ram...
-- Boot 556f36b28d4e4fff930ca31fa30b2438 --
Nov 26 16:57:04 px-server kernel: Linux version 6.8.12-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2024-11-06T15:04Z) ()


[0] https://forum.proxmox.com/threads/proxmox-ve-8-3-released.157793/post-724089
 
As mentioned in [0], do you have files in /var/lib/systemd/pstore/

Sorry, yes I do

Would uploading the dmesg.txt be sufficient?

Bash:
root@px-server:/var/lib/systemd/pstore/173240759# ls -lah
total 227K
drwxr-xr-x 2 root root   45 Nov 24 10:33 .
drwxr-xr-x 3 root root    3 Nov 24 10:33 ..
-rw------- 1 root root 1.4K Nov 24 10:19 dmesg-efi_pstore-173240759601001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759602001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759603001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759604001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759605001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759606001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759607001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759608001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759609001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759610001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759611001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759612001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759613001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759614001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759615001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759616001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759617001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759618001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759619001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759620001
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759621001
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759701002
-rw------- 1 root root 1.1K Nov 24 10:19 dmesg-efi_pstore-173240759702002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759703002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759704002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759705002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759706002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759707002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759708002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759809002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759810002
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759811002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759812002
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759813002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759814002
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759815002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759816002
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759817002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759818002
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759819002
-rw------- 1 root root 1.7K Nov 24 10:19 dmesg-efi_pstore-173240759820002
-rw------- 1 root root 1.6K Nov 24 10:19 dmesg-efi_pstore-173240759821002
-rw-r----- 1 root root  68K Nov 24 10:33 dmesg.txt
root@px-server:/var/lib/systemd/pstore/173240759#
 


Yes, thank you!

Code:
<1>[201501.817138] BUG: kernel NULL pointer dereference, address: 0000000000000069
<1>[201501.817145] #PF: supervisor read access in kernel mode
<1>[201501.817147] #PF: error_code(0x0000) - not-present page
<6>[201501.817149] PGD 0 P4D 0
<4>[201501.817151] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[201501.817153] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O       6.8.12-4-pve #1
dmesg-efi_pstore-173240759704002:
Panic#2 Part4
<4>[201501.817156] Hardware name: Gigabyte Technology Co., Ltd. B560M DS3H PLUS/B560M DS3H PLUS, BIOS F5 03/25/2022
<4>[201501.817158] RIP: 0010:psi_task_change+0x76/0xd0
<4>[201501.817162] Code: ff 44 89 f7 e8 ab 3b ff ff 49 89 c7 66 90 48 8b 83 30 0e 00 00 48 c7 c3 e0 0f 08 a7 48 8b 80 80 00 00 00 48 8b 90 f8 00 00 00 <48> 83 7a 68 01 74 07 48 8b 98 88 04 00 00 48 89 df 41 b9 01 00 00
<4>[201501.817166] RSP: 0018:ffffffffa7003cf8 EFLAGS: 00010046
<4>[201501.817168] RAX: ffffffffa71c8280 RBX: ffffffffa7080fe0 RCX: 0000000000000000
<4>[201501.817170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
<4>[201501.817172] RBP: ffffffffa7003d20 R08: 0000000000000000 R09: 0000000000000000
<4>[201501.817174] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[201501.817175] R13: 0000000000000004 R14: 0000000000000000 R15: 0000b743cc3b6744
<4>[201501.817177] FS:  0000000000000000(0000) GS:ffff890040000000(0000) knlGS:0000000000000000
<4>[201501.817179] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[201501.817181] CR2: 0000000000000069 CR3: 0000000d88c5a001 CR4: 0000000000772ef0
<4>[201501.817183] PKRU: 55555554
<4>[201501.817184] Call Trace:
<4>[201501.817186]  <TASK>
<4>[201501.817189]  ? show_regs+0x6d/0x80
<4>[201501.817193]  ? __die+0x24/0x80
<4>[201501.817195]  ? page_fault_oops+0x176/0x500
<4>[201501.817200]  ? do_user_addr_fault+0x2ed/0x660
<4>[201501.817203]  ? exc_page_fault+0x83/0x1b0
<4>[201501.817207]  ? asm_exc_page_fault+0x27/0x30
<4>[201501.817210]  ? psi_task_change+0x76/0xd0
<4>[201501.817212]  ? psi_task_change+0x55/0xd0
<4>[201501.817214]  enqueue_task+0xd6/0x1a0
dmesg-efi_pstore-173240759703002:
Panic#2 Part3
<4>[201501.817217]  ttwu_do_activate+0x5f/0x250
<4>[201501.817220]  sched_ttwu_pending+0xf1/0x1a0
<4>[201501.817223]  __flush_smp_call_function_queue+0x143/0x450
<4>[201501.817226]  flush_smp_call_function_queue+0x3a/0x90
<4>[201501.817229]  do_idle+0x16f/0x260
<4>[201501.817232]  cpu_startup_entry+0x2a/0x30
<4>[201501.817234]  rest_init+0xd0/0xd0
<4>[201501.817236]  arch_call_rest_init+0xe/0x30
<4>[201501.817239]  start_kernel+0x729/0xb00
<4>[201501.817242]  x86_64_start_reservations+0x18/0x30
<4>[201501.817245]  x86_64_start_kernel+0xbf/0x110
<4>[201501.817248]  secondary_startup_64_no_verify+0x184/0x18b
<4>[201501.817252]  </TASK>
<4>[201501.817253] Modules linked in: tcp_diag inet_diag xt_connmark xt_mark iptable_mangle xt_comment wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel scsi_transport_iscsi nf_tables softdog nvme_fabrics nvme_core nvme_auth sunrpc xfs binfmt_misc bonding tls nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_hda_codec_realtek snd_sof_intel_hda_mlink soundwire_cadence snd_hda_codec_generic snd_sof_intel_hda xe snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils intel_rapl_msr intel_rapl_common snd_soc_hdac_hda snd_hda_ext_core intel_uncore_frequency intel_uncore_frequency_common
dmesg-efi_pstore-173240759702002:
Panic#2 Part2
<4>[201501.817287]  snd_soc_acpi_intel_match drm_gpuvm x86_pkg_temp_thermal snd_soc_acpi drm_exec soundwire_generic_allocation intel_powerclamp gpu_sched soundwire_bus drm_suballoc_helper drm_ttm_helper snd_soc_core kvm_intel snd_compress ac97_bus ppdev i915 snd_pcm_dmaengine kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec crct10dif_pclmul polyval_clmulni polyval_generic snd_hda_core ghash_clmulni_intel sha256_ssse3 sha1_ssse3 snd_hwdep aesni_intel drm_buddy mei_hdcp mei_pxp snd_pcm ttm crypto_simd jc42 drm_display_helper cryptd cmdlinepart snd_timer snd intel_cstate mei_me cec spi_nor gigabyte_wmi rc_core intel_wmi_thunderbolt pcspkr soundcore wmi_bmof mei mtd i2c_algo_bit ee1004 intel_pmc_core parport_pc intel_vsec pmt_telemetry parport pmt_class acpi_pad acpi_tad mac_hid vhost_net vhost vhost_iotlb tap coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbhid uas
dmesg-efi_pstore-173240759701002:
Panic#2 Part1
<4>[201501.817339]  hid usb_storage mpt3sas xhci_pci xhci_pci_renesas crc32_pclmul r8169 i2c_i801 ahci spi_intel_pci xhci_hcd raid_class i2c_smbus realtek spi_intel scsi_transport_sas libahci video wmi pinctrl_tigerlake
<4>[201501.817361] CR2: 0000000000000069
<4>[201501.817363] ---[ end trace 0000000000000000 ]---
<4>[201501.936084] RIP: 0010:psi_task_change+0x76/0xd0
<4>[201501.936096] Code: ff 44 89 f7 e8 ab 3b ff ff 49 89 c7 66 90 48 8b 83 30 0e 00 00 48 c7 c3 e0 0f 08 a7 48 8b 80 80 00 00 00 48 8b 90 f8 00 00 00 <48> 83 7a 68 01 74 07 48 8b 98 88 04 00 00 48 89 df 41 b9 01 00 00
<4>[201501.936100] RSP: 0018:ffffffffa7003cf8 EFLAGS: 00010046
<4>[201501.936103] RAX: ffffffffa71c8280 RBX: ffffffffa7080fe0 RCX: 0000000000000000
<4>[201501.936106] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
<4>[201501.936107] RBP: ffffffffa7003d20 R08: 0000000000000000 R09: 0000000000000000
<4>[201501.936109] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[201501.936111] R13: 0000000000000004 R14: 0000000000000000 R15: 0000b743cc3b6744
<4>[201501.936112] FS:  0000000000000000(0000) GS:ffff890040000000(0000) knlGS:0000000000000000
<4>[201501.936115] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[201501.936116] CR2: 0000000000000069 CR3: 0000000d88c5a001 CR4: 0000000000772ef0
<4>[201501.936118] PKRU: 55555554
<0>[201501.936120] Kernel panic - not syncing: Attempted to kill the idle task!
<0>[201502.982818] Shutting down cpus with NMI
<0>[201502.982825] Kernel Offset: 0x23e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

It seems the kernel hit a NULL pointer dereference in `psi_task_change` while running the `swapper` (idle) task on CPU 0, which then led to the panic.
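
To pull just the headline lines out of a long pstore dump, something like the following may help. It is a sketch that assumes the dump has the `<level>[timestamp] message` layout seen in `dmesg.txt`; the function name `oops_summary` is hypothetical.

```shell
#!/bin/sh
# oops_summary FILE
# Print the key lines of a kernel oops (BUG, Oops, Comm, RIP, panic) from a
# pstore dump, with the "<level>[timestamp] " prefix stripped and repeated
# lines collapsed.
oops_summary() {
    grep -E 'BUG:|Oops:|Comm:|RIP:|Kernel panic' "$1" |
        sed -E 's/^<[0-9]+>\[ *[0-9.]+\] //' |
        awk '!seen[$0]++'
}

# Usage (the directory name under pstore/ differs per crash):
#   oops_summary /var/lib/systemd/pstore/<crash-dir>/dmesg.txt
```

The `RIP:` line names the function the CPU was executing when it faulted, and the `Comm:` field names the task, which is usually enough to search for matching reports.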

Is there a newer BIOS version available for your mainboard?
Do you have the latest amd-microcode package installed?

Could you run a memtest just to make sure there's no memory error involved?
 
Is there a newer BIOS version available for your mainboard?

Updated to the latest earlier this year

Do you have the latest amd-microcode package installed?

No, I do not. My CPU is an i5-11500, though; would the Intel microcode be the one to use?

Could you run a memtest just to make sure there's no memory error involved?

I'll schedule some downtime for that.

Thanks!
 
Oh sorry! I mixed up the chipset names. There are `B550` and `B650` chipsets for AMD, but no `B560`.
Yes, for Intel it's the `intel-microcode` package. If you don't have it installed, please install it.
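
As a small sketch for picking the right package, the CPU vendor can be read from `/proc/cpuinfo`. The function name `microcode_package` is made up here. The package names are the Debian ones (`intel-microcode` and `amd64-microcode`); on Debian 12 based systems they come from the `non-free-firmware` component, so this assumes that component is enabled in the APT sources.

```shell
#!/bin/sh
# microcode_package [CPUINFO]
# Print the Debian microcode package matching the CPU vendor found in a
# /proc/cpuinfo-style file (defaults to /proc/cpuinfo).
microcode_package() {
    vendor=$(awk -F': ' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}")
    case $vendor in
        GenuineIntel) echo intel-microcode ;;
        AuthenticAMD) echo amd64-microcode ;;
        *) echo "unknown vendor: $vendor" >&2; return 1 ;;
    esac
}

# Usage:
#   apt install "$(microcode_package)"
#   grep -m1 microcode /proc/cpuinfo   # loaded microcode revision
```

A reboot is needed before the newly installed microcode is loaded.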

Can you also check your APT history log (/var/log/apt/history.log) to see whether the kernel changed when you updated?
If so, you could try booting the previous kernel to see if there are still issues.
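
One possible way to scan the history log for kernel changes, as a sketch: it assumes the usual `Start-Date:` / `Install:` / `Upgrade:` layout of APT's history file and that kernel packages are named `proxmox-kernel-*` or `pve-kernel-*`. The function name `kernel_changes` is hypothetical.

```shell
#!/bin/sh
# kernel_changes FILE
# For every APT transaction in a history.log that touched a Proxmox kernel
# package, print its Start-Date line followed by one indented line per
# kernel package with the action and version information.
kernel_changes() {
    awk '
        /^Start-Date:/ { start = $0; shown = 0 }
        /^(Install|Upgrade|Remove|Purge):/ {
            line = $0
            # Walk through every kernel package entry on this line.
            while (match(line, /(proxmox|pve)-kernel[^ ]* \([^)]*\)/)) {
                if (!shown) { print start; shown = 1 }
                print "  " $1 " " substr(line, RSTART, RLENGTH)
                line = substr(line, RSTART + RLENGTH)
            }
        }
    ' "$1"
}

# Usage:
#   kernel_changes /var/log/apt/history.log
```

If an older kernel is still installed, `proxmox-boot-tool kernel list` should show it, and it can be selected from the boot menu or pinned with `proxmox-boot-tool kernel pin <version>` on hosts that boot via proxmox-boot-tool.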
 
Oh sorry! I mixed up the chipset names. There are `B550` and `B650` chipsets for AMD, but no `B560`.
Yes, for Intel it's the `intel-microcode` package. If you don't have it installed, please install it.

No worries! I ended up installing both :) Also fixed up my quorum votes.

I've had no issues since then, so hopefully one of those fixed the issue.

Thanks!