All PVE nodes in the datacenter rebooted in production!!

synergy-it

Hello,

I have 2 freshly installed Proxmox servers, and after 2 weeks of running fine we put them into production.
Since 17 Feb. 2023 they have been hosting 6 VMs.

Today both of them rebooted and took down all production VMs at ~17:03, and we are afraid this will happen again with this setup.
We don't understand why they crashed! Both of them!

We are not familiar with the Proxmox logs and would like to ask the experts here to help us analyse them. :)
Can a charitable soul help us?
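
In case it is useful, this is roughly how we pulled the excerpt below out of the journal (a sketch; the time window is just what we guessed around the crash):

# journal around the crash on node "yoda" (time window is an assumption)
journalctl --since "2023-02-20 16:55" --until "2023-02-20 17:10" > yoda-crash.log

# kernel messages only, same window
journalctl -k --since "2023-02-20 16:55" --until "2023-02-20 17:10"

# cluster / HA related units
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux --since "2023-02-20 16:55"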

Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:53 yoda ceph-mgr[1509]: 2023-02-20T17:02:53.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:53.028010+0100)
Feb 20 17:02:54 yoda ceph-mgr[1509]: 2023-02-20T17:02:54.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:54.028119+0100)
Feb 20 17:02:55 yoda ceph-mgr[1509]: 2023-02-20T17:02:55.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:55.028256+0100)
Feb 20 17:02:56 yoda ceph-mgr[1509]: 2023-02-20T17:02:56.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:56.028434+0100)
Feb 20 17:02:57 yoda ceph-mgr[1509]: 2023-02-20T17:02:57.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:57.028581+0100)
Feb 20 17:02:58 yoda ceph-mgr[1509]: 2023-02-20T17:02:58.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:58.028734+0100)
Feb 20 17:02:59 yoda ceph-mgr[1509]: 2023-02-20T17:02:59.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:59.028902+0100)
Feb 20 17:03:00 yoda ceph-mgr[1509]: 2023-02-20T17:03:00.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:00.029048+0100)
Feb 20 17:03:01 yoda ceph-mgr[1509]: 2023-02-20T17:03:01.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:01.029224+0100)
Feb 20 17:03:02 yoda ceph-mgr[1509]: 2023-02-20T17:03:02.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:02.029367+0100)
Feb 20 17:03:03 yoda ceph-mgr[1509]: 2023-02-20T17:03:03.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:03.029512+0100)
Feb 20 17:03:04 yoda ceph-mgr[1509]: 2023-02-20T17:03:04.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:04.029686+0100)
Feb 20 17:03:05 yoda ceph-mgr[1509]: 2023-02-20T17:03:05.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:05.029859+0100)
Feb 20 17:03:06 yoda ceph-mgr[1509]: 2023-02-20T17:03:06.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:06.030041+0100)
Feb 20 17:03:07 yoda ceph-mgr[1509]: 2023-02-20T17:03:07.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:07.030209+0100)
Feb 20 17:03:08 yoda ceph-mgr[1509]: 2023-02-20T17:03:08.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:08.030379+0100)
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] link: host: 2 link: 0 is down
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] host: host: 2 has no active links
Feb 20 17:03:42 yoda corosync[1511]: [TOTEM ] Token has not been received in 2250 ms
Feb 20 17:03:43 yoda corosync[1511]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Sync members[1]: 1
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Sync left[1]: 2
Feb 20 17:03:46 yoda corosync[1511]: [TOTEM ] A new membership (1.6c) was formed. Members left: 2
Feb 20 17:03:46 yoda corosync[1511]: [TOTEM ] Failed to receive the leave message. failed: 2
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] notice: members: 1/1485
Feb 20 17:03:46 yoda pmxcfs[1485]: [status] notice: members: 1/1485
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Members[1]: 1
Feb 20 17:03:46 yoda corosync[1511]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 20 17:03:46 yoda pmxcfs[1485]: [status] notice: node lost quorum
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] crit: received write while not quorate - trigger resync
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] crit: leaving CPG group
Feb 20 17:03:46 yoda pve-ha-lrm[1626]: lost lock 'ha_agent_yoda_lock - cfs lock update failed - Operation not permitted
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] notice: start cluster connection
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] crit: cpg_join failed: 14
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] crit: can't initialize service
Feb 20 17:03:48 yoda pve-ha-crm[1615]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Feb 20 17:03:51 yoda pve-ha-lrm[1626]: status change active => lost_agent_lock
Feb 20 17:03:53 yoda pmxcfs[1485]: [dcdb] notice: members: 1/1485
Feb 20 17:03:53 yoda pmxcfs[1485]: [dcdb] notice: all data is up to date
Feb 20 17:03:53 yoda pve-ha-crm[1615]: status change master => lost_manager_lock
Feb 20 17:03:53 yoda pve-ha-crm[1615]: watchdog closed (disabled)
Feb 20 17:03:53 yoda pve-ha-crm[1615]: status change lost_manager_lock => wait_for_quorum
Feb 20 17:04:04 yoda kernel: split_lock_warn: 14 callbacks suppressed
Feb 20 17:04:04 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0xfffff8019af8334c
Feb 20 17:04:09 yoda pvescheduler[2811971]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 20 17:04:09 yoda pvescheduler[2811970]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Feb 20 17:04:37 yoda watchdog-mux[1072]: client watchdog expired - disable watchdog updates
-- Reboot --
Feb 20 17:07:12 yoda kernel: Linux version 5.15.85-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) ()
Feb 20 17:07:12 yoda kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.85-1-pve root=/dev/mapper/pve-root ro quiet
Feb 20 17:07:12 yoda kernel: KERNEL supported cpus:
Feb 20 17:07:12 yoda kernel:   Intel GenuineIntel
Feb 20 17:07:12 yoda kernel:   AMD AuthenticAMD
Feb 20 17:07:12 yoda kernel:   Hygon HygonGenuine
Feb 20 17:07:12 yoda kernel:   Centaur CentaurHauls
Feb 20 17:07:12 yoda kernel:   zhaoxin Shanghai
Feb 20 17:07:12 yoda kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[5]: 832, xstate_sizes[5]: 64
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[6]: 896, xstate_sizes[6]: 512
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]: 8
Feb 20 17:07:12 yoda kernel: x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.
Feb 20 17:07:12 yoda kernel: signal: max sigframe size: 3632
Feb 20 17:07:12 yoda kernel: BIOS-provided physical RAM map:
Feb 20 17:07:12 yoda kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Feb 20 17:07:12 yoda kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
 
Hello,

Thanks for your reply, but that does not explain why both servers went down at the same time!
I know that if one server is down, there is no quorum left to move the VMs.
The problem I am facing is the whole cluster going down at once, and I don't understand why.
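
For what it is worth, this is how we are checking quorum and HA state on each node while investigating (just the standard PVE commands, as a sketch):

# corosync membership and quorum as seen from this node
pvecm status

# HA manager / resource state (the watchdog only fences when HA resources are configured)
ha-manager status

# which VMs are configured as HA resources
cat /etc/pve/ha/resources.cfg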

kr,
 
Thanks for your reply!

OK, so in your view, if both servers rebooted at the same moment, it may be because one went down for some reason (why? I can't find it in the logs) and the second one rebooted because it was the only cluster node left up.

I will add a server to the cluster for HA and cross my fingers!
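
Roughly what I plan to run on the new node to join it (a sketch; the IP is a placeholder for one of the existing nodes):

# on the new, third node: join the existing cluster
pvecm add 192.0.2.10

# afterwards, confirm 3 votes / quorate from any node
pvecm status

If a full third node turns out not to be possible, an external QDevice (pvecm qdevice setup <IP>) should also give us the third vote.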

thx