[SOLVED] Large delay on "pvecm status", webui unresponsive, node failed to rejoin cluster

Hello,

The story:
Node hard crashes due to a motherboard-related failure.
We set up a new node and move all LXC container and VM configs with "mv /etc/pve/nodes/{old-node}/lxc/* /etc/pve/nodes/{new-node}/lxc" and "mv /etc/pve/nodes/{old-node}/qemu-server/* /etc/pve/nodes/{new-node}/qemu-server".
Everything is fine at this point: the web UI is responsive and pvecm status returns quickly, as expected.
The hardware vendor does maintenance on the crashed server; all checks come back green.
We turn the server back on.
Now pvecm status takes 10+ seconds and shows the following:

On a node that hasn't crashed:
Code:
Cluster information
-------------------
Name:             cluster
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Sep 17 09:32:01 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2e357
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.5 (local)
0x00000002          1 10.10.10.6
0x00000003          1 10.10.10.1

And on the node that crashed:

Code:
Cluster information
-------------------
Name:             cluster
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Sep 17 08:46:09 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.2f8f8
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.10.10.12 (local)

For the record, I ran systemctl status corosync to check whether corosync was running on the restored node; here's the output:

Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-09-17 08:18:32 CEST; 25min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 4319 (corosync)
    Tasks: 9 (limit: 9830)
   Memory: 150.5M
   CGroup: /system.slice/corosync.service
           └─4319 /usr/sbin/corosync -f

Sep 17 08:43:51 shigi corosync[4319]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 17 08:43:53 shigi corosync[4319]:   [TOTEM ] A new membership (4.2f73b) was formed. Members
Sep 17 08:43:53 shigi corosync[4319]:   [QUORUM] Members[1]: 4
Sep 17 08:43:53 shigi corosync[4319]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 17 08:43:56 shigi corosync[4319]:   [TOTEM ] A new membership (4.2f743) was formed. Members
Sep 17 08:43:56 shigi corosync[4319]:   [QUORUM] Members[1]: 4
Sep 17 08:43:56 shigi corosync[4319]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 17 08:43:58 shigi corosync[4319]:   [TOTEM ] A new membership (4.2f74b) was formed. Members
Sep 17 08:43:58 shigi corosync[4319]:   [QUORUM] Members[1]: 4
Sep 17 08:43:58 shigi corosync[4319]:   [MAIN  ] Completed service synchronization, ready to provide service.

And here's the output of the same command on one of the healthy nodes:

Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-07-13 18:55:30 CEST; 2 months 4 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 5386 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 253.4M
   CGroup: /system.slice/corosync.service
           └─5386 /usr/sbin/corosync -f

wrz 17 08:47:01 tonbo corosync[5386]:   [TOTEM ] A new membership (1.2f9a8) was formed. Members
wrz 17 08:47:03 tonbo corosync[5386]:   [TOTEM ] Token has not been received in 1726 ms
wrz 17 08:47:03 tonbo corosync[5386]:   [TOTEM ] A new membership (1.2f9b0) was formed. Members
wrz 17 08:47:05 tonbo corosync[5386]:   [TOTEM ] Token has not been received in 1727 ms
wrz 17 08:47:06 tonbo corosync[5386]:   [TOTEM ] A new membership (1.2f9b8) was formed. Members
wrz 17 08:47:07 tonbo corosync[5386]:   [TOTEM ] Token has not been received in 1727 ms
wrz 17 08:47:09 tonbo corosync[5386]:   [KNET  ] link: host: 4 link: 0 is down
wrz 17 08:47:09 tonbo corosync[5386]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
wrz 17 08:47:09 tonbo corosync[5386]:   [KNET  ] host: host: 4 has no active links
wrz 17 08:47:11 tonbo corosync[5386]:   [TOTEM ] A new membership (1.2f9c0) was formed. Members

Now the whole cluster looks unresponsive :(
 
Got it,

Shutting down the corosync service on all nodes, stopping the pve-cluster service, lazily unmounting /etc/pve on the affected machines, and then restarting pve-cluster fixed the issue that appeared after reconnecting the borked node.
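In case anyone hits the same thing, this is roughly the sequence of commands (a sketch only; the lazy umount is only needed on nodes where /etc/pve is stuck, and I'm assuming corosync gets started again at the end):

Code:
# on every node: stop corosync first so no node keeps forming broken memberships
systemctl stop corosync

# on the affected nodes: stop pve-cluster and lazily unmount the stuck /etc/pve
systemctl stop pve-cluster
umount -l /etc/pve

# bring the stack back up (pve-cluster runs pmxcfs, which remounts /etc/pve)
systemctl start pve-cluster
systemctl start corosync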

I'll have to check whether this issue was caused by the new node taking the ID of an old node.
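A quick way to check that is to compare the node IDs in the corosync config with what the cluster currently reports (a sketch; /etc/pve/corosync.conf is the cluster-wide copy, /etc/corosync/corosync.conf the local one):

Code:
# list node names and IDs from the cluster-wide and the local corosync config
grep -E 'name|nodeid' /etc/pve/corosync.conf
grep -E 'name|nodeid' /etc/corosync/corosync.conf

# and what corosync currently reports as members
pvecm nodes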
 
