Proxmox VE cluster experiences unexpected restart.

sungl

New Member
Oct 18, 2023
My environment is Proxmox VE version 7.4-3, composed of a cluster of 36 servers. On October 16, 2023, in the afternoon, the physical machines triggered a restart, with 27 nodes restarting. This is very strange, and I would appreciate help identifying the cause of the restarts. Thank you. The log for node 1 is attached.
 

We'd need more information about your network and corosync setup, as well as the full logs (journal) covering the time period in question.

From the log you posted:

Code:
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] link: host: 36 link: 0 is down
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] host: host: 36 (passive) best link: 0 (pri: 1)
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] host: host: 36 has no active links
Oct 16 17:50:05 node1 corosync[3756]:   [TOTEM ] Token has not been received in 18825 ms 
Oct 16 17:50:11 node1 corosync[3756]:   [TOTEM ] A processor failed, forming new configuration: token timed out (25100ms), waiting 30120ms for consensus.
Oct 16 17:50:13 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:34 node1 pvedaemon[880512]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:34 node1 pveproxy[1020215]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 380.
Oct 16 17:50:38 node1 pveproxy[4911]: worker 1015570 finished
Oct 16 17:50:38 node1 pveproxy[4911]: starting 1 worker(s)
Oct 16 17:50:38 node1 pveproxy[4911]: worker 1022437 started
Oct 16 17:50:38 node1 pveproxy[1022437]: Clearing outdated entries from certificate cache
Oct 16 17:50:39 node1 pveproxy[1022436]: got inotify poll request in wrong process - disabling inotify
Oct 16 17:50:40 node1 watchdog-mux[2664]: client watchdog expired - disable watchdog updates
Oct 16 17:50:40 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:40 node1 pvedaemon[900688]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync left[1]: 36
Oct 16 17:50:41 node1 corosync[3756]:   [TOTEM ] A new membership (1.24f) was formed. Members left: 36
Oct 16 17:50:41 node1 corosync[3756]:   [TOTEM ] Failed to receive the leave message. failed: 36
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: members: 1/3637, 2/3475, 3/3293, 4/3355, 5/3769, 6/2936, 7/3004, 8/1151, 9/2615, 10/1666, 11/1208, 12/4838, 13/2408, 14/3435, 15/3322, 16/3526, 17/4100, 18/1172, 19/8776, 20/8900, 21/9111, 22/9357, 23/9356, 24/10090, 25/10616, 26/13931, 27/13718, 28/6665, 29/1694094, 30/1693344, 31/1499748, 32/1503354, 33/1501822, 34/1502402, 35/1247331
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: starting data syncronisation
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Oct 16 17:50:41 node1 corosync[3756]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: cpg_send_message retried 1 times
Oct 16 17:50:41 node1 pmxcfs[3637]: [status] notice: members: 1/3637, 2/3475, 3/3293, 4/3355, 5/3769, 6/2936, 7/3004, 8/1151, 9/2615, 10/1666, 11/1208, 12/4838, 13/2408, 14/3435, 15/3322, 16/3526, 17/4100, 18/1172, 19/8776, 20/8900, 21/9111, 22/9357, 23/9356, 24/10090, 25/10616, 26/13931, 27/13718, 28/6665, 29/1694094, 30/1693344, 31/1499748, 32/1503354, 33/1501822, 34/1502402, 35/1247331
Oct 16 17:50:41 node1 pmxcfs[3637]: [status] notice: starting data syncronisation
Oct 16 17:50:42 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:44 node1 pvedaemon[900688]: <root@pam> successful auth for user 'root@pam'

So this particular node noticed one node going down (at 17:49:56), and it took almost a minute to re-establish cluster communication (until 17:50:41). That usually indicates a severe network problem or a flawed network architecture. In between, the watchdog expired and caused the node to fence itself.
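For what it's worth, the timeout values in that log match what corosync 3 computes for a 36-node cluster from its defaults. The sketch below just redoes that arithmetic; the token (3000 ms) and token_coefficient (650 ms) values, the 75% token_warning threshold, and the 1.2 × token consensus timeout are the documented corosync 3 defaults and are an assumption here, since we haven't seen your corosync.conf yet.

Code:
# Sanity check of the timeouts in the posted log, assuming corosync 3 defaults
# (token = 3000 ms, token_coefficient = 650 ms). A custom corosync.conf may differ.
def corosync_timeouts(nodes, token=3000, token_coefficient=650):
    # the effective token timeout grows with the cluster size
    runtime_token = token + (nodes - 2) * token_coefficient
    # the "Token has not been received" warning fires at 75% of that (token_warning)
    warn_at = runtime_token * 3 // 4
    # the consensus timeout defaults to 1.2 * the effective token timeout
    consensus = runtime_token * 12 // 10
    return runtime_token, warn_at, consensus

print(corosync_timeouts(36))
# (25100, 18825, 30120) -> matches "token timed out (25100ms)",
# "Token has not been received in 18825 ms" and "waiting 30120ms for consensus"

The token_coefficient exists precisely so the timeout scales with cluster size, so with 36 nodes corosync already waits about 25 seconds before it even starts forming a new membership.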

Note that 36 nodes is quite a lot, so both your network architecture and your node resources need to be up to the task for the cluster to be stable.
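If you want a rough first check of the cluster network while you collect the full journal, here is a small Python sketch that pings every other node and flags high or missing round-trip times. It is purely illustrative: the 10.0.0.x addresses are placeholders for your ring0 addresses, the 5 ms cut-off is arbitrary, and ICMP RTT says nothing about jitter or packet loss, which corosync is just as sensitive to.

Code:
#!/usr/bin/env python3
# Rough check of round-trip latency to every other node on the corosync network.
# Illustrative only: the address list is a placeholder, and ICMP RTT ignores
# jitter and packet loss, which also matter for corosync stability.
import re
import subprocess
from typing import Optional

NODES = [f"10.0.0.{i}" for i in range(1, 37)]  # placeholder ring0 addresses
THRESHOLD_MS = 5.0  # arbitrary cut-off for "worth a closer look"

def avg_rtt_ms(host: str, count: int = 5) -> Optional[float]:
    """Return the average ping RTT to host in ms, or None if unreachable."""
    try:
        out = subprocess.run(
            ["ping", "-c", str(count), "-W", "1", host],
            capture_output=True, text=True, timeout=count * 2 + 5,
        ).stdout
    except subprocess.TimeoutExpired:
        return None
    # Linux ping summary: "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    return float(match.group(1)) if match else None

for node in NODES:
    rtt = avg_rtt_ms(node)
    if rtt is None:
        print(f"{node}: unreachable")
    elif rtt > THRESHOLD_MS:
        print(f"{node}: {rtt:.2f} ms  <-- high for a corosync link")
    else:
        print(f"{node}: {rtt:.2f} ms")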
 
Thank you for your response. Currently, I can only provide the corosync configuration. The configuration file is attached.
 

