All PVE nodes in the datacenter rebooted in production!!

synergy-it

Hello,

I have 2 freshly installed Proxmox servers, and after 2 weeks of running fine we put them into production.
Since 17 Feb. 2023 they have been hosting 6 VMs.

Today both of them rebooted and took down all production VMs at ~17:03, and we are afraid this will happen again with this setup.
We don't understand why they crashed! Both of them!

We are not familiar with the Proxmox logs and would like to ask the experts here to help us analyse them. :)
Can a charitable soul help us?
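
In case it is useful, this is roughly how we pulled the excerpt below out of the journal (a sketch; the time window is just what we guessed around the crash):

# journal around the crash on node "yoda" (time window is an assumption)
journalctl --since "2023-02-20 16:55" --until "2023-02-20 17:10" > yoda-crash.log

# kernel messages only, same window
journalctl -k --since "2023-02-20 16:55" --until "2023-02-20 17:10"

# cluster / HA related units
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux --since "2023-02-20 16:55"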

Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:53 yoda ceph-mgr[1509]: 2023-02-20T17:02:53.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:53.028010+0100)
Feb 20 17:02:54 yoda ceph-mgr[1509]: 2023-02-20T17:02:54.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:54.028119+0100)
Feb 20 17:02:55 yoda ceph-mgr[1509]: 2023-02-20T17:02:55.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:55.028256+0100)
Feb 20 17:02:56 yoda ceph-mgr[1509]: 2023-02-20T17:02:56.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:56.028434+0100)
Feb 20 17:02:57 yoda ceph-mgr[1509]: 2023-02-20T17:02:57.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:57.028581+0100)
Feb 20 17:02:58 yoda ceph-mgr[1509]: 2023-02-20T17:02:58.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:58.028734+0100)
Feb 20 17:02:59 yoda ceph-mgr[1509]: 2023-02-20T17:02:59.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:59.028902+0100)
Feb 20 17:03:00 yoda ceph-mgr[1509]: 2023-02-20T17:03:00.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:00.029048+0100)
Feb 20 17:03:01 yoda ceph-mgr[1509]: 2023-02-20T17:03:01.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:01.029224+0100)
Feb 20 17:03:02 yoda ceph-mgr[1509]: 2023-02-20T17:03:02.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:02.029367+0100)
Feb 20 17:03:03 yoda ceph-mgr[1509]: 2023-02-20T17:03:03.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:03.029512+0100)
Feb 20 17:03:04 yoda ceph-mgr[1509]: 2023-02-20T17:03:04.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:04.029686+0100)
Feb 20 17:03:05 yoda ceph-mgr[1509]: 2023-02-20T17:03:05.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:05.029859+0100)
Feb 20 17:03:06 yoda ceph-mgr[1509]: 2023-02-20T17:03:06.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:06.030041+0100)
Feb 20 17:03:07 yoda ceph-mgr[1509]: 2023-02-20T17:03:07.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:07.030209+0100)
Feb 20 17:03:08 yoda ceph-mgr[1509]: 2023-02-20T17:03:08.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:08.030379+0100)
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] link: host: 2 link: 0 is down
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] host: host: 2 has no active links
Feb 20 17:03:42 yoda corosync[1511]: [TOTEM ] Token has not been received in 2250 ms
Feb 20 17:03:43 yoda corosync[1511]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Sync members[1]: 1
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Sync left[1]: 2
Feb 20 17:03:46 yoda corosync[1511]: [TOTEM ] A new membership (1.6c) was formed. Members left: 2
Feb 20 17:03:46 yoda corosync[1511]: [TOTEM ] Failed to receive the leave message. failed: 2
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] notice: members: 1/1485
Feb 20 17:03:46 yoda pmxcfs[1485]: [status] notice: members: 1/1485
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Members[1]: 1
Feb 20 17:03:46 yoda corosync[1511]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 20 17:03:46 yoda pmxcfs[1485]: [status] notice: node lost quorum
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] crit: received write while not quorate - trigger resync
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] crit: leaving CPG group
Feb 20 17:03:46 yoda pve-ha-lrm[1626]: lost lock 'ha_agent_yoda_lock - cfs lock update failed - Operation not permitted
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] notice: start cluster connection
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] crit: cpg_join failed: 14
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] crit: can't initialize service
Feb 20 17:03:48 yoda pve-ha-crm[1615]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Feb 20 17:03:51 yoda pve-ha-lrm[1626]: status change active => lost_agent_lock
Feb 20 17:03:53 yoda pmxcfs[1485]: [dcdb] notice: members: 1/1485
Feb 20 17:03:53 yoda pmxcfs[1485]: [dcdb] notice: all data is up to date
Feb 20 17:03:53 yoda pve-ha-crm[1615]: status change master => lost_manager_lock
Feb 20 17:03:53 yoda pve-ha-crm[1615]: watchdog closed (disabled)
Feb 20 17:03:53 yoda pve-ha-crm[1615]: status change lost_manager_lock => wait_for_quorum
Feb 20 17:04:04 yoda kernel: split_lock_warn: 14 callbacks suppressed
Feb 20 17:04:04 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0xfffff8019af8334c
Feb 20 17:04:09 yoda pvescheduler[2811971]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 20 17:04:09 yoda pvescheduler[2811970]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Feb 20 17:04:37 yoda watchdog-mux[1072]: client watchdog expired - disable watchdog updates
-- Reboot --
Feb 20 17:07:12 yoda kernel: Linux version 5.15.85-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) ()
Feb 20 17:07:12 yoda kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.85-1-pve root=/dev/mapper/pve-root ro quiet
Feb 20 17:07:12 yoda kernel: KERNEL supported cpus:
Feb 20 17:07:12 yoda kernel:   Intel GenuineIntel
Feb 20 17:07:12 yoda kernel:   AMD AuthenticAMD
Feb 20 17:07:12 yoda kernel:   Hygon HygonGenuine
Feb 20 17:07:12 yoda kernel:   Centaur CentaurHauls
Feb 20 17:07:12 yoda kernel:   zhaoxin Shanghai
Feb 20 17:07:12 yoda kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[5]: 832, xstate_sizes[5]: 64
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[6]: 896, xstate_sizes[6]: 512
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]: 8
Feb 20 17:07:12 yoda kernel: x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.
Feb 20 17:07:12 yoda kernel: signal: max sigframe size: 3632
Feb 20 17:07:12 yoda kernel: BIOS-provided physical RAM map:
Feb 20 17:07:12 yoda kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Feb 20 17:07:12 yoda kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
 
Hello,

Thanks for your reply, but that does not explain why both servers went down at the same time!
I know that if one server is down, there is no quorum left to move the VMs.
The problem I am facing is the whole cluster going down at once, and I don't understand why.
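
For what it is worth, this is how we are checking quorum and HA state on each node while investigating (just the standard PVE commands, as a sketch):

# corosync membership and quorum as seen from this node
pvecm status

# HA manager / resource state (the watchdog only fences when HA resources are configured)
ha-manager status

# which VMs are configured as HA resources
cat /etc/pve/ha/resources.cfg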

kr,
 
Thanks for your reply!

OK, so in your view, if both servers rebooted at the same moment, it may be because one went down for some reason (why? I can't find it in the logs) and the second one rebooted because it was the only cluster node left up.

I will add a server to the cluster for HA and cross my fingers!
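
Roughly what I plan to run on the new node to join it (a sketch; the IP is a placeholder for one of the existing nodes):

# on the new, third node: join the existing cluster
pvecm add 192.0.2.10

# afterwards, confirm 3 votes / quorate from any node
pvecm status

If a full third node turns out not to be possible, an external QDevice (pvecm qdevice setup <IP>) should also give us the third vote.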

thx