Proxmox mystery reboot

linuxteam

Nov 29, 2024
Hello,

We have a Proxmox cluster of 4 physical servers running PVE 8.2.2. We observed an incident in which all four servers mysteriously rebooted at the same time.

I have checked the Proxmox syslog and related logs but could not find any reason for it.

Please let me know whether this is a bug in this version or whether I am missing something.

Thanks and Regards
LinuxTeam
 
Hello,

Please take a look at the fencing section of our documentation [1].

If a node loses corosync quorum, it will fence itself. If the entire network used by corosync fails, the entire cluster will fence. This is why it is important to have a stable, dedicated, low-latency network for corosync and to add extra links that corosync can switch over to when there are issues. Please see our documentation about corosync redundancy [2].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
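
To make the quorum rule concrete, here is a minimal Python sketch of the majority arithmetic. It assumes one vote per node, no QDevice, and default votequorum settings; the function names are made up for illustration and are not part of any Proxmox or corosync tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def has_quorum(members_present: int, expected_votes: int) -> bool:
    """True if the visible members hold a strict majority (one vote each, assumed)."""
    return members_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # For a 4-node cluster, show which partition sizes stay quorate.
    for present in range(1, 5):
        print(f"{present} of 4 nodes visible -> quorate: {has_quorum(present, 4)}")
```

Under these assumptions a 4-node cluster needs 3 votes, so a 2/2 split leaves neither half quorate and both halves can fence.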
 
Hi Max,

I found the log below in the journalctl output. Could this be the reason all the cluster servers restarted?

Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[2]: 1 3
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7eb) was formed. Members
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:22 proxmox-corp4 watchdog-mux[2630]: client watchdog expired - disable watchdog updates
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[2]: 1 3
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7ef) was formed. Members
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[4]: 1 2 3 4
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Sync joined[2]: 2 4
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7f3) was formed. Members joined: 2 4
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: members: 1/797133, 2/1170960, 3/4047203, 4/525954
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: starting data syncronisation
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: members: 1/797133, 2/1170960, 3/4047203, 4/525954
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: starting data syncronisation
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] This node is within the primary component and will provide service.
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Members[4]: 1 2 3 4
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: node has quorum
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: received sync request (epoch 1/797133/0000010C)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] crit: ignore sync request from wrong member 2/1170960
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: received sync request (epoch 2/1170960/000000F7)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: received sync request (epoch 1/797133/000000F4)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] crit: ignore sync request from wrong member 2/1170960
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: received sync request (epoch 2/1170960/000000F5)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: received all states
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: leader is 1/797133
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: synced members: 1/797133, 2/1170960, 3/4047203, 4/525954
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: all data is up to date
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: dfsm_deliver_queue: queue length 6
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: received all states
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: all data is up to date
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: dfsm_deliver_queue: queue length 44
Nov 26 21:18:25 proxmox-corp4 pvestatd[3098]: storage 'SLVtone' is not online
Nov 26 21:18:26 proxmox-corp4 pve-ha-lrm[3151]: successfully acquired lock 'ha_agent_proxmox-corp4_lock'
Nov 26 21:18:26 proxmox-corp4 watchdog-mux[2630]: exit watchdog-mux with active connections
Nov 26 21:18:26 proxmox-corp4 pve-ha-lrm[3151]: status change lost_agent_lock => active
Nov 26 21:18:26 proxmox-corp4 systemd-journald[1755]: Received client request to sync journal.
Nov 26 21:18:26 proxmox-corp4 kernel: watchdog: watchdog0: watchdog did not stop!
Nov 26 21:18:26 proxmox-corp4 systemd[1]: watchdog-mux.service: Deactivated successfully.
Nov 26 21:18:26 proxmox-corp4 systemd[1]: watchdog-mux.service: Consumed 1min 5.551s CPU time.
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: successfully acquired lock 'ha_manager_lock'
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: ERROR: unable to open watchdog socket - No such file or directory
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: server received shutdown request
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: server stopped
Nov 26 21:18:26 proxmox-corp4 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/EXCEPTION
Nov 26 21:18:26 proxmox-corp4 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'.
Nov 26 21:18:26 proxmox-corp4 systemd[1]: pve-ha-crm.service: Consumed 10min 2.405s CPU time.
 

It seems the hosts couldn't see each other (only 2 of 4 nodes were members, so no quorum, hence the reboot).
Check whether you had a network problem at that time (network saturation, perhaps?). Are you using a dedicated NIC for the cluster? If not, be careful with backups, for example: they can use all the network bandwidth.
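
To line up membership changes with the watchdog messages, a small Python sketch like the one below can help. It is only an illustration: it assumes the default short journal format shown in the log above, one vote per node, and a 4-node cluster, and the names `timeline`, `MEMBERS_RE` and `WATCHDOG_RE` are made up for this example.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. "... corosync[4689]: [QUORUM] Members[2]: 1 3"
MEMBERS_RE = re.compile(
    r"^(\w{3}\s+\d+ [\d:]{8}) \S+ corosync\[\d+\]:\s+\[QUORUM\] Members\[(\d+)\]:\s*([\d ]*)$"
)
# Matches any message logged by watchdog-mux
WATCHDOG_RE = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) \S+ watchdog-mux\[\d+\]:\s+(.*)$")


def timeline(lines: Iterable[str], total_nodes: int = 4) -> List[Tuple[str, str]]:
    """Return (timestamp, description) pairs for quorum and watchdog events."""
    needed = total_nodes // 2 + 1  # strict majority, one vote per node (assumed)
    events = []
    for raw in lines:
        line = raw.rstrip()
        m = MEMBERS_RE.match(line)
        if m:
            count = int(m.group(2))
            state = "quorate" if count >= needed else "not quorate"
            ids = " ".join(m.group(3).split())
            events.append((m.group(1), f"members {ids}: {count} of {total_nodes}, {state}"))
            continue
        w = WATCHDOG_RE.match(line)
        if w:
            events.append((w.group(1), f"watchdog-mux: {w.group(2)}"))
    return events


# Three lines copied from the log posted in this thread.
SAMPLE = [
    "Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3",
    "Nov 26 21:18:22 proxmox-corp4 watchdog-mux[2630]: client watchdog expired - disable watchdog updates",
    "Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Members[4]: 1 2 3 4",
]

if __name__ == "__main__":
    for ts, what in timeline(SAMPLE):
        print(ts, what)
```

To run it on a full journal, replace `SAMPLE` with the lines of a saved log file. The output only shows when this node saw fewer than 3 members and when the watchdog client expired; it does not say why the corosync links dropped, so the network still has to be checked separately.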