Proxmox mystery reboot

linuxteam

Nov 29, 2024
Hello,

We have a Proxmox cluster of four physical servers running PVE 8.2.2. We observed an incident in which all four servers mysteriously rebooted at the same time.

I have checked the Proxmox syslog and related logs but could not find any reason for it.

Please let me know whether this is a bug in this version or whether I am missing something.

Thanks and Regards
LinuxTeam
 
Hello,

Please take a look at the fencing section of our documentation [1].

If a node loses corosync quorum, it will fence itself. If the entire network used by corosync fails, the entire cluster will fence. This is why it is important to have a stable, dedicated, low-latency network for corosync, and to add extra links to corosync so it can switch over if there are issues. Please see our documentation about corosync redundancy [2].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
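
To make the quorum rule concrete, here is a minimal Python sketch of the majority check. It assumes one vote per node and the default majority rule, with no QDevice or special votequorum options; the function names are made up for illustration and are not part of any Proxmox or corosync tool.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """True if the partition this node sees holds a majority of votes."""
    return visible_votes >= votes_needed(total_votes)


if __name__ == "__main__":
    total = 4  # assumption: 4 nodes, one vote each
    for visible in range(total + 1):
        state = "quorate" if has_quorum(visible, total) else "NOT quorate"
        print(f"{visible}/{total} nodes visible: {state}")
```

With four nodes, three votes are needed, so a 2/2 split leaves both halves without quorum. If HA is active on all nodes, every node can then fence itself at about the same time.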
 
Hi Max,

I found the log below in the journalctl output. Could this be the reason all the cluster servers restarted?

Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[2]: 1 3
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7eb) was formed. Members
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:22 proxmox-corp4 watchdog-mux[2630]: client watchdog expired - disable watchdog updates
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[2]: 1 3
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7ef) was formed. Members
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[4]: 1 2 3 4
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Sync joined[2]: 2 4
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7f3) was formed. Members joined: 2 4
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: members: 1/797133, 2/1170960, 3/4047203, 4/525954
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: starting data syncronisation
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: members: 1/797133, 2/1170960, 3/4047203, 4/525954
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: starting data syncronisation
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] This node is within the primary component and will provide service.
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Members[4]: 1 2 3 4
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: node has quorum
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: received sync request (epoch 1/797133/0000010C)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] crit: ignore sync request from wrong member 2/1170960
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: received sync request (epoch 2/1170960/000000F7)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: received sync request (epoch 1/797133/000000F4)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] crit: ignore sync request from wrong member 2/1170960
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: received sync request (epoch 2/1170960/000000F5)
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: received all states
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: leader is 1/797133
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: synced members: 1/797133, 2/1170960, 3/4047203, 4/525954
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: all data is up to date
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [dcdb] notice: dfsm_deliver_queue: queue length 6
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: received all states
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: all data is up to date
Nov 26 21:18:24 proxmox-corp4 pmxcfs[4047203]: [status] notice: dfsm_deliver_queue: queue length 44
Nov 26 21:18:25 proxmox-corp4 pvestatd[3098]: storage 'SLVtone' is not online
Nov 26 21:18:26 proxmox-corp4 pve-ha-lrm[3151]: successfully acquired lock 'ha_agent_proxmox-corp4_lock'
Nov 26 21:18:26 proxmox-corp4 watchdog-mux[2630]: exit watchdog-mux with active connections
Nov 26 21:18:26 proxmox-corp4 pve-ha-lrm[3151]: status change lost_agent_lock => active
Nov 26 21:18:26 proxmox-corp4 systemd-journald[1755]: Received client request to sync journal.
Nov 26 21:18:26 proxmox-corp4 kernel: watchdog: watchdog0: watchdog did not stop!
Nov 26 21:18:26 proxmox-corp4 systemd[1]: watchdog-mux.service: Deactivated successfully.
Nov 26 21:18:26 proxmox-corp4 systemd[1]: watchdog-mux.service: Consumed 1min 5.551s CPU time.
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: successfully acquired lock 'ha_manager_lock'
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: ERROR: unable to open watchdog socket - No such file or directory
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: server received shutdown request
Nov 26 21:18:26 proxmox-corp4 pve-ha-crm[3141]: server stopped
Nov 26 21:18:26 proxmox-corp4 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/EXCEPTION
Nov 26 21:18:26 proxmox-corp4 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'.
Nov 26 21:18:26 proxmox-corp4 systemd[1]: pve-ha-crm.service: Consumed 10min 2.405s CPU time.
 
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[2]: 1 3
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7eb) was formed. Members
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3
Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:22 proxmox-corp4 watchdog-mux[2630]: client watchdog expired - disable watchdog updates
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[2]: 1 3
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [TOTEM ] A new membership (1.7ef) was formed. Members
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3
Nov 26 21:18:23 proxmox-corp4 corosync[4689]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Sync members[4]: 1 2 3 4

It seems that the hosts couldn't see each other (2 nodes out of 4, so no quorum, so reboot).
Check whether you had a network problem at that time (network saturation, perhaps?). Are you using a dedicated NIC for the cluster? If not, be careful with backups, for example: they can use all the network bandwidth.
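
If it helps with going through the journal on the other nodes, here is a rough Python sketch that picks out the corosync membership changes and watchdog-mux messages and marks whether each membership was quorate. The regexes are only based on the line format in the excerpt above, and the `summarize` helper and the 4-node total are assumptions for illustration, not an official tool.

```python
import re

# Patterns derived from the journal excerpt in this thread (assumed format).
MEMBERS_RE = re.compile(r"\[QUORUM\] Members\[(\d+)\]:")
WATCHDOG_RE = re.compile(r"watchdog-mux\[\d+\]: (.+)")

# Three lines copied from the log posted above.
SAMPLE = [
    "Nov 26 21:18:18 proxmox-corp4 corosync[4689]: [QUORUM] Members[2]: 1 3",
    "Nov 26 21:18:22 proxmox-corp4 watchdog-mux[2630]: client watchdog expired - disable watchdog updates",
    "Nov 26 21:18:24 proxmox-corp4 corosync[4689]: [QUORUM] Members[4]: 1 2 3 4",
]


def summarize(lines, total_nodes):
    """Return (timestamp, description) tuples for quorum and watchdog events."""
    needed = total_nodes // 2 + 1  # majority, assuming one vote per node
    events = []
    for line in lines:
        timestamp = " ".join(line.split()[:3])  # e.g. "Nov 26 21:18:18"
        members = MEMBERS_RE.search(line)
        if members:
            count = int(members.group(1))
            status = "quorate" if count >= needed else "no quorum"
            events.append((timestamp, f"members={count} ({status})"))
            continue
        watchdog = WATCHDOG_RE.search(line)
        if watchdog:
            events.append((timestamp, f"watchdog: {watchdog.group(1)}"))
    return events


if __name__ == "__main__":
    for timestamp, description in summarize(SAMPLE, total_nodes=4):
        print(timestamp, description)
```

Running it over the full journal from each node should show when the membership dropped to two and how long it stayed there before the watchdog expired.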
 
