Proxmox VE cluster experiences unexpected restarts.

sungl

New Member
Oct 18, 2023
My environment is Proxmox VE 7.4-3, running as a cluster of 36 servers. On the afternoon of October 16, 2023, the physical machines restarted unexpectedly; 27 of the 36 nodes rebooted. This is very strange, and I would appreciate help identifying the reason for the restarts. Thank you. The log for node 1 is attached.
 

Attachments

We'd need more information about your network and corosync setup, as well as the full logs (journal) covering the time period in question.
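
For example, something like this (the time window is only illustrative, widen it as needed), run on each node, should capture the relevant period:

Code:
journalctl --since "2023-10-16 17:30" --until "2023-10-16 18:10" > /tmp/$(hostname)-journal.log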

From the log you posted:

Code:
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] link: host: 36 link: 0 is down
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] host: host: 36 (passive) best link: 0 (pri: 1)
Oct 16 17:49:56 node1 corosync[3756]:   [KNET  ] host: host: 36 has no active links
Oct 16 17:50:05 node1 corosync[3756]:   [TOTEM ] Token has not been received in 18825 ms 
Oct 16 17:50:11 node1 corosync[3756]:   [TOTEM ] A processor failed, forming new configuration: token timed out (25100ms), waiting 30120ms for consensus.
Oct 16 17:50:13 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:34 node1 pvedaemon[880512]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:34 node1 pveproxy[1020215]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 380.
Oct 16 17:50:38 node1 pveproxy[4911]: worker 1015570 finished
Oct 16 17:50:38 node1 pveproxy[4911]: starting 1 worker(s)
Oct 16 17:50:38 node1 pveproxy[4911]: worker 1022437 started
Oct 16 17:50:38 node1 pveproxy[1022437]: Clearing outdated entries from certificate cache
Oct 16 17:50:39 node1 pveproxy[1022436]: got inotify poll request in wrong process - disabling inotify
Oct 16 17:50:40 node1 watchdog-mux[2664]: client watchdog expired - disable watchdog updates
Oct 16 17:50:40 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:40 node1 pvedaemon[900688]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync members[35]: 25 26 27 28 29 30 31 32 33 34 35
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Sync left[1]: 36
Oct 16 17:50:41 node1 corosync[3756]:   [TOTEM ] A new membership (1.24f) was formed. Members left: 36
Oct 16 17:50:41 node1 corosync[3756]:   [TOTEM ] Failed to receive the leave message. failed: 36
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: members: 1/3637, 2/3475, 3/3293, 4/3355, 5/3769, 6/2936, 7/3004, 8/1151, 9/2615, 10/1666, 11/1208, 12/4838, 13/2408, 14/3435, 15/3322, 16/3526, 17/4100, 18/1172, 19/8776, 20/8900, 21/9111, 22/9357, 23/9356, 24/10090, 25/10616, 26/13931, 27/13718, 28/6665, 29/1694094, 30/1693344, 31/1499748, 32/1503354, 33/1501822, 34/1502402, 35/1247331
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: starting data syncronisation
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\
Oct 16 17:50:41 node1 corosync[3756]:   [QUORUM] Members[35]: 25 26 27 28 29 30 31 32 33 34 35
Oct 16 17:50:41 node1 corosync[3756]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 17:50:41 node1 pmxcfs[3637]: [dcdb] notice: cpg_send_message retried 1 times
Oct 16 17:50:41 node1 pmxcfs[3637]: [status] notice: members: 1/3637, 2/3475, 3/3293, 4/3355, 5/3769, 6/2936, 7/3004, 8/1151, 9/2615, 10/1666, 11/1208, 12/4838, 13/2408, 14/3435, 15/3322, 16/3526, 17/4100, 18/1172, 19/8776, 20/8900, 21/9111, 22/9357, 23/9356, 24/10090, 25/10616, 26/13931, 27/13718, 28/6665, 29/1694094, 30/1693344, 31/1499748, 32/1503354, 33/1501822, 34/1502402, 35/1247331
Oct 16 17:50:41 node1 pmxcfs[3637]: [status] notice: starting data syncronisation
Oct 16 17:50:42 node1 pvedaemon[917133]: <root@pam> successful auth for user 'root@pam'
Oct 16 17:50:44 node1 pvedaemon[900688]: <root@pam> successful auth for user 'root@pam'

So this particular node noticed one node going down (at 17:49:56), and it took almost a minute to re-establish cluster communication (until 17:50:41). That usually indicates a severe network problem or an issue with the cluster network architecture. In between, the HA watchdog expired and caused the node to fence itself.
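
If it helps, the link and quorum state on the affected nodes can be inspected with the standard tools (just a sketch, run on any node; adjust the timestamp to your incident):

Code:
pvecm status                 # quorum / membership as seen by this node
corosync-cfgtool -s          # status of the knet links to the other nodes
journalctl -u corosync -u watchdog-mux --since "2023-10-16 17:40"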

Note that 36 nodes is quite a lot, so both your network architecture and your node resources need to be up to the task for the cluster to be stable.
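
As a rough sketch only (the names and addresses below are made up, not taken from your cluster), a redundant second corosync link per node in corosync.conf lets knet fail over if one network has problems:

Code:
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1   # existing dedicated corosync network
    ring1_addr: 10.1.0.1   # second, independent network for redundancy
  }
  # ... one entry per node, each with its own ring0_addr and ring1_addr
}

On a PVE cluster such changes should be made via /etc/pve/corosync.conf (with config_version increased) so they are distributed to all nodes.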
 
Thank you for your response. Currently, I can only provide the corosync configuration. The configuration file is attached.
 

Attachments