Proxmox VE cluster experiences unexpected restarts.

sungl

New Member
Oct 18, 2023
My environment is Proxmox VE 7.4-3, running as a cluster of 36 servers. On the afternoon of October 16, 2023, the physical machines restarted unexpectedly; 27 of the 36 nodes rebooted. This is very strange, and I would appreciate help identifying the reason for the restarts. Thank you. The log for node 1 is attached.
 

Attachments

We'd need more information about your network and corosync setup, as well as the full logs (journal) covering the time period in question.
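
For example, something like this (the time window is only illustrative, widen it as needed), run on each node, should capture the relevant period:

Code:
journalctl --since "2023-10-16 17:30" --until "2023-10-16 18:10" > /tmp/$(hostname)-journal.log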

From the log you posted:

Code:
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] link: host: 36 link: 0 is down
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] host: host: 36 (passive) best link: 0 (pri: 1)
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] host: host: 36 has no active links
Oct 16 17:50:05 node1 corosync[3756]:   [TOTEM ] Token has not been received in 18825 ms 
Oct 16 17:50:11 node1 corosync[3756]:   [TOTEM ] A processor failed, forming new configuration: token timed out (25100ms), waiting 30120ms for consensus.
Oct 16 17:50:13 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:34 node1 pvedaemon[880512]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:34 node1 pveproxy[1020215]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 380.
Oct 16 17:50:38 node1 pveproxy[4911]: worker 1015570 finished
Oct 16 17:50:38 node1 pveproxy[4911]: starting 1 worker(s)
Oct 16 17:50:38 node1 pveproxy[4911]: worker 1022437 started
Oct 16 17:50:38 node1 pveproxy[1022437]: Clearing outdated entries from certificate cache
Oct 16 17:50:39 node1 pveproxy[1022436]: got inotify poll request in wrong process - disabling inotify
Oct 16 17:50:40 node1 watchdog-mux[2664]: client watchdog expired - disable watchdog updates
Oct 16 17:50:40 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:40 node1 pvedaemon[900688]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync members[35]: 25 26 27 28 29 30 31 32 33 34 35
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync left[1]: 36
Oct 16 17:50:41 node1 corosync[3756]:   [TOTEM ] A new membership (1.24f) was formed. Members left: 36
Oct 16 17:50:41 node1 corosync[3756]:   [TOTEM ] Failed to receive the leave message. failed: 36
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: members: 1/3637, 2/3475, 3/3293, 4/3355, 5/3769, 6/2936, 7/3004, 8/1151, 9/2615, 10/1666, 11/1208, 12/4838, 13/2408, 14/3435, 15/3322, 16/3526, 17/4100, 18/1172, 19/8776, 20/8900, 21/9111, 22/9357, 23/9356, 24/10090, 25/10616, 26/13931, 27/13718, 28/6665, 29/1694094, 30/1693344, 31/1499748, 32/1503354, 33/1501822, 34/1502402, 35/1247331
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: starting data syncronisation
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Members[35]: 25 26 27 28 29 30 31 32 33 34 35
Oct 16 17:50:41 node1 corosync[3756]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: cpg_send_message retried 1 times
Oct 16 17:50:41 node1 pmxcfs[3637]: [status] notice: members: 1/3637, 2/3475, 3/3293, 4/3355, 5/3769, 6/2936, 7/3004, 8/1151, 9/2615, 10/1666, 11/1208, 12/4838, 13/2408, 14/3435, 15/3322, 16/3526, 17/4100, 18/1172, 19/8776, 20/8900, 21/9111, 22/9357, 23/9356, 24/10090, 25/10616, 26/13931, 27/13718, 28/6665, 29/1694094, 30/1693344, 31/1499748, 32/1503354, 33/1501822, 34/1502402, 35/1247331
Oct 16 17:50:41 node1 pmxcfs[3637]: [status] notice: starting data syncronisation
Oct 16 17:50:42 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:44 node1 pvedaemon[900688]: <root@pam> successful auth for user 'root@pam'

So this particular node noticed one node going down (at 17:49:56), and it took almost a minute to re-establish cluster communication (until 17:50:41). That usually indicates a severe network problem or an issue with the cluster network architecture. In between, the HA watchdog expired and caused the node to fence itself.
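
If it helps, the link and quorum state on the affected nodes can be inspected with the standard tools (just a sketch, run on any node; adjust the timestamp to your incident):

Code:
pvecm status                 # quorum / membership as seen by this node
corosync-cfgtool -s          # status of the knet links to the other nodes
journalctl -u corosync -u watchdog-mux --since "2023-10-16 17:40"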

Note that 36 nodes is quite a lot, so both your network architecture and your node resources need to be up to the task for the cluster to be stable.
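
As a rough sketch only (the names and addresses below are made up, not taken from your cluster), a redundant second corosync link per node in corosync.conf lets knet fail over if one network has problems:

Code:
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1   # existing dedicated corosync network
    ring1_addr: 10.1.0.1   # second, independent network for redundancy
  }
  # ... one entry per node, each with its own ring0_addr and ring1_addr
}

On a PVE cluster such changes should be made via /etc/pve/corosync.conf (with config_version increased) so they are distributed to all nodes.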
 
Thank you for your response. Currently, I can only provide the corosync configuration. The configuration file is attached.
 

Attachments