One node in a cluster of 3 constantly restarting every ~2-5 minutes

HenryX

Member
Oct 30, 2022
Hi,

today I set up a new node in my proxmox cluster. Before that I had two nodes, now I have three nodes.
I moved all running containers from node A to the new node (Node C).

Now node A is constantly restarting and I have no clue why.
Can you help me out?

Here is the output of journalctl -b -1 -e. I uploaded it to Pastebin because the forum software would not let me post that many characters: https://pastebin.com/2S9N20Tm
 
Does nobody have an idea?

Here is also the output of journalctl -p err | tail -n 20:
Code:
root@pve:~# journalctl -p err |tail -n 20
Jan 10 00:45:05 pve smartd[714]: Device: /dev/nvme0, number of Error Log entries increased from 2924 to 2926
Jan 10 00:45:06 pve pmxcfs[959]: [quorum] crit: quorum_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [quorum] crit: can't initialize service
Jan 10 00:45:06 pve pmxcfs[959]: [confdb] crit: cmap_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [confdb] crit: can't initialize service
Jan 10 00:45:06 pve pmxcfs[959]: [dcdb] crit: cpg_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [dcdb] crit: can't initialize service
Jan 10 00:45:06 pve pmxcfs[959]: [status] crit: cpg_initialize failed: 2
Jan 10 00:45:06 pve pmxcfs[959]: [status] crit: can't initialize service
-- Boot 120b51980cf1430d972f6808090373e7 --
Jan 10 00:47:19 pve kernel: x86/cpu: SGX disabled by BIOS.
Jan 10 00:47:22 pve smartd[723]: Device: /dev/nvme0, number of Error Log entries increased from 2926 to 2928
Jan 10 00:47:23 pve pmxcfs[956]: [quorum] crit: quorum_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [quorum] crit: can't initialize service
Jan 10 00:47:23 pve pmxcfs[956]: [confdb] crit: cmap_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [confdb] crit: can't initialize service
Jan 10 00:47:23 pve pmxcfs[956]: [dcdb] crit: cpg_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [dcdb] crit: can't initialize service
Jan 10 00:47:23 pve pmxcfs[956]: [status] crit: cpg_initialize failed: 2
Jan 10 00:47:23 pve pmxcfs[956]: [status] crit: can't initialize service
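The smartd lines in this output show the NVMe error log counter rising by two on each boot (2924 to 2926, then 2926 to 2928). A minimal sketch for pulling those increments out of a longer journal follows; the function name `smartd_error_deltas` is hypothetical, and it assumes the smartd message format matches the lines shown here.

```shell
# Hypothetical helper: read journal text on stdin and print
# "<device> <old count> <new count> <delta>" for each smartd
# "Error Log entries increased" message.
smartd_error_deltas() {
  awk '/smartd\[[0-9]+\]: Device: .*number of Error Log entries increased from/ {
    dev = ""; from = ""; to = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "Device:") { dev = $(i + 1); sub(/,$/, "", dev) }  # strip trailing comma
      if ($i == "from") from = $(i + 1)
      if ($i == "to")   to   = $(i + 1)
    }
    print dev, from, to, to - from
  }'
}

# Example usage on the node itself:
#   journalctl -p err --no-pager | smartd_error_deltas
```

A steadily growing counter does not prove the drive is at fault, since some NVMe drives log entries for harmless unsupported commands. The entries themselves can be inspected with `smartctl -a /dev/nvme0` or, if nvme-cli is installed, `nvme error-log /dev/nvme0`.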


And the output of dmesg --level=err,warn:
Code:
[    0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[    0.000000] secureboot: Secure boot could not be determined (mode 0)
[    0.012784] secureboot: Secure boot could not be determined (mode 0)
[    0.135024] x86/cpu: SGX disabled by BIOS.
[    0.445331] hpet_acpi_add: no address or irqs in _CRS
[    0.477783] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[    0.477913] platform eisa.0: EISA: Cannot allocate resource for mainboard
[    0.477915] platform eisa.0: Cannot allocate resource for EISA slot 1
[    0.477918] platform eisa.0: Cannot allocate resource for EISA slot 2
[    0.477920] platform eisa.0: Cannot allocate resource for EISA slot 3
[    0.477922] platform eisa.0: Cannot allocate resource for EISA slot 4
[    0.477924] platform eisa.0: Cannot allocate resource for EISA slot 5
[    0.477926] platform eisa.0: Cannot allocate resource for EISA slot 6
[    0.477928] platform eisa.0: Cannot allocate resource for EISA slot 7
[    0.477930] platform eisa.0: Cannot allocate resource for EISA slot 8
[    0.806269] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    1.141012] acpi PNP0C14:02: duplicate WMI GUID 2B814318-4BE8-4707-9D84-A190A859B5D0 (first instance was on PNP0C14:00)
[    1.141022] acpi PNP0C14:02: duplicate WMI GUID 41227C2D-80E1-423F-8B8E-87E32755A0EB (first instance was on PNP0C14:00)
[    1.141025] wmi_bus wmi_bus-PNP0C14:02: WQZZ data block query control method not found
[    1.174320] r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
[    1.493834] ata2.00: supports DRM functions and may not be fully accessible
[    1.497311] ata2.00: supports DRM functions and may not be fully accessible
[    5.551022] systemd-journald[436]: File /var/log/journal/cc84ca0daa7b4d9e8f095ebff0d8c78c/system.journal corrupted or uncleanly shut down, renaming and replacing.
[    6.033842] hp_wmi: query 0x4 returned error 0x5
[    6.579609] i915 0000:00:02.0: Direct firmware load for i915/gvt/vid_0x8086_did_0x3e92_rid_0x00.golden_hw_state failed with error -2
[    6.590756] spl: loading out-of-tree module taints kernel.
[    6.624149] zfs: module license 'CDDL' taints kernel.
[    6.624153] Disabling lock debugging due to kernel taint
[    6.624178] zfs: module license taints kernel.


On top of that, the node does not seem to restart when I unplug the Ethernet cable.
 
Hi,
What do you mean by restarting? Is it shutting down and rebooting, or is it a reset without shutting down services first?
If it is the latter, do some hardware checks to identify the culprit.
 
I am not sure, but based on the systemd-journald line "File /var/log/journal/cc84ca0daa7b4d9e8f095ebff0d8c78c/system.journal corrupted or uncleanly shut down, renaming and replacing." I suspect a reset or crash without a proper shutdown.
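One way to tell a clean reboot from a hard reset is to look at the last lines of the previous boot's journal: an orderly shutdown normally leaves shutdown messages from systemd, while a reset ends abruptly. The sketch below is a rough heuristic only. The function name `previous_boot_ended_cleanly` is hypothetical, and the marker strings are an assumption about typical systemd output that may differ between versions.

```shell
# Hypothetical helper: read journal text on stdin and succeed (exit 0)
# if it contains a typical orderly-shutdown marker.
# The patterns are assumed from common systemd messages, not guaranteed.
previous_boot_ended_cleanly() {
  grep -Eq 'Journal stopped|systemd\[1\]: Shutting down|Reached target .*(Shutdown|Reboot|Power-Off)'
}

# Example usage on the node itself:
#   journalctl -b -1 -n 50 --no-pager | previous_boot_ended_cleanly \
#     && echo "shutdown markers found" \
#     || echo "no shutdown markers: likely a hard reset or crash"
```

If no markers are found across several boots, that supports the suspicion of a reset or crash rather than an orderly reboot.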

The strange thing is that it only seems to occur when the node has network access.

How exactly can I check my hardware? I think the only possible problems are the SSD or the RAM, but smartmontools says the SSD passed without problems.
 
