One node in a cluster of 3 constantly restarting every ~2-5 minutes

HenryX

Member
Oct 30, 2022
Hi,

Today I set up a new node in my Proxmox cluster. Before that I had two nodes; now I have three.
I moved all running containers from node A to the new node (node C).

Now node A is constantly restarting and I have no clue why.
Can you help me out?

Here is the output of journalctl -b -1 -e
--> I uploaded it to Pastebin because the forum software here would not let me post more than x characters: https://pastebin.com/2S9N20Tm
 
Does nobody have an idea?

Here's also the output of
journalctl -p err |tail -n 20
Code:
root@pve:~# journalctl -p err |tail -n 20
Jan 10 00:45:05 pve smartd[714]: Device: /dev/nvme0, number of Error Log entries increased from 2924 to 2926
Jan 10 00:45:06 pve pmxcfs[959]: [quorum] crit: quorum_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [quorum] crit: can't initialize service
Jan 10 00:45:06 pve pmxcfs[959]: [confdb] crit: cmap_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [confdb] crit: can't initialize service
Jan 10 00:45:06 pve pmxcfs[959]: [dcdb] crit: cpg_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [dcdb] crit: can't initialize service
Jan 10 00:45:06 pve pmxcfs[959]: [status] crit: cpg_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [status] crit: can't initialize service
-- Boot 120b51980cf1430d972f6808090373e7 --
Jan 10 00:47:19 pve kernel: x86/cpu: SGX disabled by BIOS.
Jan 10 00:47:22 pve smartd[723]: Device: /dev/nvme0, number of Error Log entries increased from 2926 to 2928
Jan 10 00:47:23 pve pmxcfs[956]: [quorum] crit: quorum_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [quorum] crit: can't initialize service
Jan 10 00:47:23 pve pmxcfs[956]: [confdb] crit: cmap_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [confdb] crit: can't initialize service
Jan 10 00:47:23 pve pmxcfs[956]: [dcdb] crit: cpg_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [dcdb] crit: can't initialize service
Jan 10 00:47:23 pve pmxcfs[956]: [status] crit: cpg_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [status] crit: can't initialize service


And the output of dmesg --level=err,warn:
Code:
[    0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[    0.000000] secureboot: Secure boot could not be determined (mode 0)
[    0.012784] secureboot: Secure boot could not be determined (mode 0)
[    0.135024] x86/cpu: SGX disabled by BIOS.
[    0.445331] hpet_acpi_add: no address or irqs in _CRS
[    0.477783] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[    0.477913] platform eisa.0: EISA: Cannot allocate resource for mainboard
[    0.477915] platform eisa.0: Cannot allocate resource for EISA slot 1
[    0.477918] platform eisa.0: Cannot allocate resource for EISA slot 2
[    0.477920] platform eisa.0: Cannot allocate resource for EISA slot 3
[    0.477922] platform eisa.0: Cannot allocate resource for EISA slot 4
[    0.477924] platform eisa.0: Cannot allocate resource for EISA slot 5
[    0.477926] platform eisa.0: Cannot allocate resource for EISA slot 6
[    0.477928] platform eisa.0: Cannot allocate resource for EISA slot 7
[    0.477930] platform eisa.0: Cannot allocate resource for EISA slot 8
[    0.806269] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    1.141012] acpi PNP0C14:02: duplicate WMI GUID 2B814318-4BE8-4707-9D84-A190A859B5D0 (first instance was on PNP0C14:00)
[    1.141022] acpi PNP0C14:02: duplicate WMI GUID 41227C2D-80E1-423F-8B8E-87E32755A0EB (first instance was on PNP0C14:00)
[    1.141025] wmi_bus wmi_bus-PNP0C14:02: WQZZ data block query control method not found
[    1.174320] r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
[    1.493834] ata2.00: supports DRM functions and may not be fully accessible
[    1.497311] ata2.00: supports DRM functions and may not be fully accessible
[    5.551022] systemd-journald[436]: File /var/log/journal/cc84ca0daa7b4d9e8f095ebff0d8c78c/system.journal corrupted or uncleanly shut down, renaming and replacing.
[    6.033842] hp_wmi: query 0x4 returned error 0x5
[    6.579609] i915 0000:00:02.0: Direct firmware load for i915/gvt/vid_0x8086_did_0x3e92_rid_0x00.golden_hw_state failed with error -2
[    6.590756] spl: loading out-of-tree module taints kernel.
[    6.624149] zfs: module license 'CDDL' taints kernel.
[    6.624153] Disabling lock debugging due to kernel taint
[    6.624178] zfs: module license taints kernel.


--> On top of that, the node does not seem to restart when I unplug the Ethernet cable...
 
Hi,
what do you mean by restarting? Is it shutting down and rebooting, or is it a reset without shutting down services first?
If it is the latter, do some hardware checks to try to identify the culprit.
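A rough sketch of what I would check first (device name taken from your smartd log above; memtest86+ needs a reboot into its own boot entry or a USB stick):
Code:
# full SMART report for the NVMe drive that shows up in your logs
smartctl -a /dev/nvme0

# NVMe-specific health and error counters (from the nvme-cli package)
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0

# RAM is best tested offline, e.g. with memtest86+ booted from GRUB or USB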
 

I am not sure, but based on the journalctl line "File /var/log/journal/cc84ca0daa7b4d9e8f095ebff0d8c78c/system.journal corrupted or uncleanly shut down, renaming and replacing." I suspect a reset/crash without a proper shutdown.

The strange thing is that it only seems to occur when the node has network access.

How exactly can I check my hardware? I think the only likely culprits are the SSD or the RAM, but smartmontools says the SSD passed without problems.
 
Stop the cluster services while the node is disconnected, so you can identify the service causing the issue.
pmxcfs shows up in your logs.
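Something along these lines, assuming a standard PVE setup (re-enable the services once you are done testing):
Code:
# on node A, while the network cable is unplugged
systemctl stop pve-cluster corosync

# watch whether the reboots stop, then bring the cluster stack back up
systemctl start corosync pve-cluster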
 
Code:
Jan 10 00:47:22 pve smartd[723]: Device: /dev/nvme0, number of Error Log entries increased from 2926 to 2928

That does sound like a potentially broken disk, but you can try running "journalctl -f" over SSH and waiting for the next crash; maybe something is printed there that doesn't make it into the persisted logs on the disk.
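For example, from another machine (the hostname is a placeholder):
Code:
# follow node A's journal live, so the output survives even if the node
# resets before it can sync its logs to disk
ssh root@node-a 'journalctl -f'

# or kernel messages only
ssh root@node-a 'journalctl -f -k'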
 
OK, I made a very interesting observation.
The crashes/reboots every few minutes only occur when no display is connected!

When a display is connected to the server, there are no crashes at all.

I verified this multiple times. How can that be? Especially since the only things I did with the server were upgrading it to Proxmox 8 and stopping all running containers. There's also a container which has access to the CPU's integrated graphics, but all containers are stopped.

How can a connected display lead to no crashes? The server was running fine for more than a year without a display.
 
Does anybody have any clue why the heck the node only keeps crashing when no display is connected? I have no idea why this is happening, and I see absolutely nothing in the logs that tells me why the crashes occur...
 
OK, after a lot of trying I was FINALLY able to get at least the kernel crash log with netconsole!
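For reference, the setup was roughly this (the IP addresses, ports, interface name and MAC below are placeholders, not my real values):
Code:
# on the crashing node: stream kernel messages over UDP to another machine
# format: local-port@local-ip/interface,remote-port@remote-ip/remote-mac
modprobe netconsole netconsole=6665@192.168.1.10/enp2s0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff

# on the receiving machine: listen for the UDP stream and log it
nc -u -l 6666 | tee netconsole.log    # some netcat variants need: nc -u -l -p 6666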
What does this tell me?


Code:
[  122.788239] mce: CPUs not responding to MCE broadcast (may include false positives): 0-3,5
[  122.788243] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[  123.830910] Shutting down cpus with NMI
[  123.841330] Kernel Offset: 0x5800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  123.842054] ------------[ cut here ]------------
[  123.842054] WARNING: CPU: 4 PID: 173 at arch/x86/kernel/fpu/core.c:60 irq_fpu_usable+0x42/0x50
[  123.842059] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel scsi_transport_iscsi nf_tables nvme_fabrics netconsole bonding tls softdog zfs(PO) sunrpc spl(O) vhost_net vhost vhost_iotlb tap snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic binfmt_misc nfnetlink_log nfnetlink kvmgt mdev snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi intel_rapl_msr soundwire_generic_allocation intel_rapl_common intel_uncore_frequency soundwire_bus intel_uncore_frequency_common intel_pmc_core_pltdrv intel_pmc_core snd_soc_core intel_vsec pmt_telemetry pmt_class snd_compress intel_tcc_cooling ac97_bus x86_pkg_temp_thermal snd_pcm_dmaengine intel_powerclamp coretemp snd_hda_intel snd_intel_dspcfg mei_pxp mei_hdcp snd_intel_sdw_acpi
[  123.842095]  kvm_intel crct10dif_pclmul polyval_clmulni snd_hda_codec polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel snd_hda_core crypto_simd cryptd snd_hwdep cmdlinepart snd_pcm spi_nor snd_timer rapl hp_wmi snd sparse_keymap mtd platform_profile intel_cstate ee1004 wmi_bmof pcspkr soundcore mei_me mei intel_pch_thermal input_leds acpi_pad joydev mac_hid i915 drm_buddy ttm drm_display_helper cec rc_core i2c_algo_bit kvm vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq hid_logitech_hidpp hid_logitech_dj hid_generic usbmouse usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvme xhci_pci spi_intel_pci xhci_pci_renesas nvme_core crc32_pclmul r8169 spi_intel ahci realtek i2c_i801 xhci_hcd nvme_auth i2c_smbus libahci video wmi
[  123.842136] CPU: 4 PID: 173 Comm: kworker/4:2 Tainted: P           O       6.8.12-5-pve #1
[  123.842138] Hardware name: HP HP ProDesk 400 G4 DM/83F3, BIOS Q23 Ver. 02.29.00 07/16/2024
[  123.842140] Workqueue: events output_poll_execute
[  123.842143] RIP: 0010:irq_fpu_usable+0x42/0x50
[  123.842145] Code: 65 8a 0d 99 d7 7b 79 31 c0 84 c9 75 13 b8 01 00 00 00 f7 c2 00 00 0f 00 74 06 80 e6 ff 0f 94 c0 5d 31 d2 31 c9 c3 cc cc cc cc <0f> 0b 31 c0 5d 31 d2 31 c9 c3 cc cc cc cc 90 90 90 90 90 90 90 90
[  123.842147] RSP: 0018:fffffe3c09bfe978 EFLAGS: 00010006
[  123.842148] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[  123.842149] RDX: 0000000080110004 RSI: 0000000000000000 RDI: 0000000000000003
[  123.842150] RBP: fffffe3c09bfe978 R08: 0000000000000000 R09: 0000000000000000
[  123.842151] R10: 0000000000000000 R11: 00000001b0472067 R12: fffffe3c09bfea18
[  123.842152] R13: fffffe3c09bfea20 R14: fffffe3c09bfea28 R15: fffffe3c09bfead0
[  123.842153] FS:  0000000000000000(0000) GS:ffff90dfd7600000(0000) knlGS:0000000000000000
[  123.842155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.842156] CR2: 000062496a186308 CR3: 00000001afc36002 CR4: 00000000003706f0
[  123.842157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.842158] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  123.842159] Call Trace:
[  123.842550]  intel_runtime_resume+0xc3/0x2e0 [i915]
[  123.842643]  pci_pm_runtime_resume+0xa0/0x100
[  123.842646]  __rpm_callback+0x4d/0x170
[  123.842648]  ? __rq_qos_issue+0x26/0x50
[  123.842652]  rpm_callback+0x6d/0x80
[  123.842654]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  123.843299] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel scsi_transport_iscsi nf_tables nvme_fabrics netconsole bonding tls softdog zfs(PO) sunrpc spl(O) vhost_net vhost vhost_iotlb tap snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic binfmt_misc nfnetlink_log nfnetlink kvmgt mdev snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi intel_rapl_msr soundwire_generic_allocation intel_rapl_common intel_uncore_frequency soundwire_bus intel_uncore_frequency_common intel_pmc_core_pltdrv intel_pmc_core snd_soc_core intel_vsec pmt_telemetry pmt_class snd_compress intel_tcc_cooling ac97_bus x86_pkg_temp_thermal snd_pcm_dmaengine intel_powerclamp coretemp snd_hda_intel snd_intel_dspcfg mei_pxp mei_hdcp snd_intel_sdw_acpi
[  123.843326]  kvm_intel crct10dif_pclmul polyval_clmulni snd_hda_codec polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel snd_hda_core crypto_simd cryptd snd_hwdep cmdlinepart snd_pcm spi_nor snd_timer rapl hp_wmi snd sparse_keymap mtd platform_profile intel_cstate ee1004 wmi_bmof pcspkr soundcore mei_me mei intel_pch_thermal input_leds acpi_pad joydev mac_hid i915 drm_buddy ttm drm_display_helper cec rc_core i2c_algo_bit kvm vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq hid_logitech_hidpp hid_logitech_dj hid_generic usbmouse usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvme xhci_pci spi_intel_pci xhci_pci_renesas nvme_core crc32_pclmul r8169 spi_intel ahci realtek i2c_i801 xhci_hcd nvme_auth i2c_smbus libahci video wmi
[  123.843359] CPU: 4 PID: 173 Comm: kworker/4:2 Tainted: P        W  O       6.8.12-5-pve #1
[  123.843360] Hardware name: HP HP ProDesk 400 G4 DM/83F3, BIOS Q23 Ver. 02.29.00 07/16/2024
[  123.843361] Workqueue: events output_poll_execute
[  123.843363] RIP: 0010:kernel_fpu_begin_mask+0xb5/0xd0
[  123.843365] Code: f8 c9 31 c0 31 ff c3 cc cc cc cc 48 8b 07 f6 c4 40 75 ba f0 80 4f 01 40 48 81 c7 00 25 00 00 e8 91 fe ff ff eb a7 db e3 eb c4 <0f> 0b e9 77 ff ff ff 0f 0b e9 7b ff ff ff e8 c8 32 0e 01 0f 1f 84
[  123.843367] RSP: 0018:fffffe3c09bfe988 EFLAGS: 00010046
[  123.843368] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[  123.843369] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[  123.843370] RBP: fffffe3c09bfe9a0 R08: 0000000000000000 R09: 0000000000000000
[  123.843370] R10: 0000000000000000 R11: 00000001b0472067 R12: fffffe3c09bfea18
[  123.843371] R13: fffffe3c09bfea20 R14: fffffe3c09bfea28 R15: fffffe3c09bfead0
[  123.843372] FS:  0000000000000000(0000) GS:ffff90dfd7600000(0000) knlGS:0000000000000000
[  123.843374] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.843375] CR2: 000062496a186308 CR3: 00000001afc36002 CR4: 00000000003706f0
[  123.843376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.843377] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  123.843378] Call Trace:
[  123.843378]  <#MC>
[  123.843379]  ? show_regs+0x6d/0x80
[  123.843382]  ? __warn+0x89/0x160
[  123.843384]  ? kernel_fpu_begin_mask+0xb5/0xd0
[  123.843386]  ? report_bug+0x17e/0x1b0
[  123.843388]  ? handle_bug+0x46/0x90
[  123.843390]  ? exc_invalid_op+0x18/0x80
[  123.843392]  ? asm_exc_invalid_op+0x1b/0x20
[  123.843636] RAX: 0000000000000000 RBX: ffff90dcd37b1cd0 RCX: 0000000000000000
[  123.843637] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90dcd37b1cd0
[  123.843638] RBP: ffffbb67803d7ad8 R08: 0000000000000000 R09: 0000000000000000
[  123.843638] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  123.843639] R13: ffff90dcd37b0000 R14: 0000000000000003 R15: ffffffff870df6a0
[  123.843640]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  123.843644]  </#MC>
[  123.843645]  <TASK>
[  123.843646]  intel_uncore_unclaimed_mmio+0x39/0x60 [i915]
[  123.843744]  intel_runtime_resume+0xc3/0x2e0 [i915]
[  123.843838]  pci_pm_runtime_resume+0xa0/0x100
[  123.843840]  __rpm_callback+0x4d/0x170
[  123.843842]  ? __rq_qos_issue+0x26/0x50
[  123.843845]  rpm_callback+0x6d/0x80
[  123.843847]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  123.843849]  rpm_resume+0x594/0x7e0
[  123.843851]  ? __blk_mq_get_tag+0x3b/0x110
[  123.843854]  __pm_runtime_resume+0x4e/0x80
[  123.843856]  __intel_runtime_pm_get+0x23/0xb0 [i915]
[  123.843953]  intel_runtime_pm_get+0x13/0x20 [i915]
[  123.844049]  intel_display_power_get+0x29/0x70 [i915]
[  123.844192]  intel_digital_port_connected+0x36/0xa0 [i915]
[  123.844321]  intel_dp_detect+0xbf/0x6e0 [i915]
[  123.844446]  ? ww_mutex_lock+0x19/0xa0
[  123.844450]  drm_helper_probe_detect_ctx+0x57/0x120
[  123.844454]  output_poll_execute+0x17a/0x280
 
This looks like it has something to do with power management.

ETA: Yeah, intel_runtime_resume() is in i915_drv.c and seems to be the handler for resuming the display from suspend. Is there perhaps a setting in the BIOS to tell it there is no display? Another thing to look at would be any sleep settings.
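One way to test that theory without BIOS changes might be to keep the iGPU resumed by disabling its runtime PM via sysfs (PCI address 0000:00:02.0 taken from the dmesg output above; diagnostic only, not persistent across reboots):
Code:
# force the i915 device to stay resumed (default is "auto")
echo on > /sys/bus/pci/devices/0000:00:02.0/power/control

# verify the current runtime PM status
cat /sys/bus/pci/devices/0000:00:02.0/power/runtime_status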
 
And that is a fantastic guess. I just solved it by adding i915.enable_dc=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, which disables the power saving of the i915 integrated graphics when no display is connected.
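In case it helps anyone else, the change looks roughly like this (the other options already in GRUB_CMDLINE_LINUX_DEFAULT will differ on your system):
Code:
# /etc/default/grub -- append i915.enable_dc=0 to the existing options
GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.enable_dc=0"

# then regenerate the GRUB config and reboot
update-grub
reboot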

Wow, that was one of my harder IT problems. What made it extra difficult was that nothing showed up in the logs, and connecting a display to debug didn't help either, because that made the problem go away... :D
 