All PVE nodes of the datacenter rebooted in production!!

synergy-it

New Member
Hello,

I have two freshly installed Proxmox servers, and after two weeks of stable operation we put them into production.
Since 17 Feb. 2023 they have been running 6 VMs.

Today both of them rebooted at ~17:03, taking down all the production VMs, and we are afraid this will happen again with this setup.
We don't understand why they crashed, and why both at the same time!

We are not familiar with the Proxmox logs and would like to ask the experts here to help us analyse them. :)
Can a charitable soul help us?

Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:52 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0x7730dfa3
Feb 20 17:02:53 yoda ceph-mgr[1509]: 2023-02-20T17:02:53.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:53.028010+0100)
Feb 20 17:02:54 yoda ceph-mgr[1509]: 2023-02-20T17:02:54.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:54.028119+0100)
Feb 20 17:02:55 yoda ceph-mgr[1509]: 2023-02-20T17:02:55.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:55.028256+0100)
Feb 20 17:02:56 yoda ceph-mgr[1509]: 2023-02-20T17:02:56.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:56.028434+0100)
Feb 20 17:02:57 yoda ceph-mgr[1509]: 2023-02-20T17:02:57.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:57.028581+0100)
Feb 20 17:02:58 yoda ceph-mgr[1509]: 2023-02-20T17:02:58.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:58.028734+0100)
Feb 20 17:02:59 yoda ceph-mgr[1509]: 2023-02-20T17:02:59.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:02:59.028902+0100)
Feb 20 17:03:00 yoda ceph-mgr[1509]: 2023-02-20T17:03:00.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:00.029048+0100)
Feb 20 17:03:01 yoda ceph-mgr[1509]: 2023-02-20T17:03:01.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:01.029224+0100)
Feb 20 17:03:02 yoda ceph-mgr[1509]: 2023-02-20T17:03:02.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:02.029367+0100)
Feb 20 17:03:03 yoda ceph-mgr[1509]: 2023-02-20T17:03:03.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:03.029512+0100)
Feb 20 17:03:04 yoda ceph-mgr[1509]: 2023-02-20T17:03:04.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:04.029686+0100)
Feb 20 17:03:05 yoda ceph-mgr[1509]: 2023-02-20T17:03:05.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:05.029859+0100)
Feb 20 17:03:06 yoda ceph-mgr[1509]: 2023-02-20T17:03:06.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:06.030041+0100)
Feb 20 17:03:07 yoda ceph-mgr[1509]: 2023-02-20T17:03:07.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:07.030209+0100)
Feb 20 17:03:08 yoda ceph-mgr[1509]: 2023-02-20T17:03:08.026+0100 7ff7e4efa700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-02-20T16:03:08.030379+0100)
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] link: host: 2 link: 0 is down
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 20 17:03:41 yoda corosync[1511]: [KNET ] host: host: 2 has no active links
Feb 20 17:03:42 yoda corosync[1511]: [TOTEM ] Token has not been received in 2250 ms
Feb 20 17:03:43 yoda corosync[1511]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Sync members[1]: 1
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Sync left[1]: 2
Feb 20 17:03:46 yoda corosync[1511]: [TOTEM ] A new membership (1.6c) was formed. Members left: 2
Feb 20 17:03:46 yoda corosync[1511]: [TOTEM ] Failed to receive the leave message. failed: 2
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] notice: members: 1/1485
Feb 20 17:03:46 yoda pmxcfs[1485]: [status] notice: members: 1/1485
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 20 17:03:46 yoda corosync[1511]: [QUORUM] Members[1]: 1
Feb 20 17:03:46 yoda corosync[1511]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 20 17:03:46 yoda pmxcfs[1485]: [status] notice: node lost quorum
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] crit: received write while not quorate - trigger resync
Feb 20 17:03:46 yoda pmxcfs[1485]: [dcdb] crit: leaving CPG group
Feb 20 17:03:46 yoda pve-ha-lrm[1626]: lost lock 'ha_agent_yoda_lock - cfs lock update failed - Operation not permitted
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] notice: start cluster connection
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] crit: cpg_join failed: 14
Feb 20 17:03:47 yoda pmxcfs[1485]: [dcdb] crit: can't initialize service
Feb 20 17:03:48 yoda pve-ha-crm[1615]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Feb 20 17:03:51 yoda pve-ha-lrm[1626]: status change active => lost_agent_lock
Feb 20 17:03:53 yoda pmxcfs[1485]: [dcdb] notice: members: 1/1485
Feb 20 17:03:53 yoda pmxcfs[1485]: [dcdb] notice: all data is up to date
Feb 20 17:03:53 yoda pve-ha-crm[1615]: status change master => lost_manager_lock
Feb 20 17:03:53 yoda pve-ha-crm[1615]: watchdog closed (disabled)
Feb 20 17:03:53 yoda pve-ha-crm[1615]: status change lost_manager_lock => wait_for_quorum
Feb 20 17:04:04 yoda kernel: split_lock_warn: 14 callbacks suppressed
Feb 20 17:04:04 yoda kernel: x86/split lock detection: #AC: CPU 0/KVM/2613595 took a split_lock trap at address: 0xfffff8019af8334c
Feb 20 17:04:09 yoda pvescheduler[2811971]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 20 17:04:09 yoda pvescheduler[2811970]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Feb 20 17:04:37 yoda watchdog-mux[1072]: client watchdog expired - disable watchdog updates
-- Reboot --
Feb 20 17:07:12 yoda kernel: Linux version 5.15.85-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) ()
Feb 20 17:07:12 yoda kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.85-1-pve root=/dev/mapper/pve-root ro quiet
Feb 20 17:07:12 yoda kernel: KERNEL supported cpus:
Feb 20 17:07:12 yoda kernel: Intel GenuineIntel
Feb 20 17:07:12 yoda kernel: AMD AuthenticAMD
Feb 20 17:07:12 yoda kernel: Hygon HygonGenuine
Feb 20 17:07:12 yoda kernel: Centaur CentaurHauls
Feb 20 17:07:12 yoda kernel: zhaoxin Shanghai
Feb 20 17:07:12 yoda kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Feb 20 17:07:12 yoda kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[5]: 832, xstate_sizes[5]: 64
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[6]: 896, xstate_sizes[6]: 512
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
Feb 20 17:07:12 yoda kernel: x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]: 8
Feb 20 17:07:12 yoda kernel: x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.
Feb 20 17:07:12 yoda kernel: signal: max sigframe size: 3632
Feb 20 17:07:12 yoda kernel: BIOS-provided physical RAM map:
Feb 20 17:07:12 yoda kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Feb 20 17:07:12 yoda kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
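
In case someone wants to compare both nodes, this is the command I can run on each of them to pull out the same time window (the list of units is just my guess at the relevant services, and the times are local):

journalctl --since "2023-02-20 17:00:00" --until "2023-02-20 17:10:00" -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux

From what I can see in the excerpt above, the corosync link to the other node goes down at 17:03:41, this node loses quorum, and the watchdog expires at 17:04:37 right before the reboot.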
 
Hello,

Thanks for your reply, but that doesn't explain why both servers went down at the same time!
I know that if one server goes down, there is no quorum left to move the VMs.
The problem I'm facing is that the whole cluster went down for some reason, and I don't understand why.
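
For reference, these are the standard commands I plan to keep an eye on on the surviving node (nothing specific to my setup):

pvecm status        # cluster membership, expected votes, whether the node is quorate
ha-manager status   # state of the HA manager and of the HA-managed resources

If I read my log correctly, once quorum was lost the HA services could no longer renew their locks and the watchdog eventually expired.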

kr,
 
Thanks for your reply!

OK, so if I understand you correctly: the two servers did not reboot for the same reason. One went down for some reason (why? I can't find it in the logs), and the second then rebooted because it was the only cluster node left up.

I will add a server for HA and cross my fingers!
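
In case it helps someone else with a 2-node setup, this is roughly what I intend to run to add an external quorum device (the IP is just an example for the third machine that will host the qdevice):

# on the third machine (a plain Debian box is enough)
apt install corosync-qnetd

# on the PVE nodes
apt install corosync-qdevice
pvecm qdevice setup 192.0.2.10

With the qdevice the cluster should keep quorum when one node fails, so the surviving node would no longer fence itself.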

thx
 
