Hello,
The story:
A node hard-crashes due to a motherboard-related failure
We set up a new node and move all LXC containers and VMs with "mv /etc/pve/nodes/{old-node}/lxc/* /etc/pve/nodes/{new-node}/lxc" and "mv /etc/pve/nodes/{old-node}/qemu-server/* /etc/pve/nodes/{new-node}/qemu-server" (see the sketch right after this list)
Everything is fine at this point: the web UI is responsive and pvecm status returns quickly, as expected
The hardware vendor does maintenance on the server; all checks come back green
We turn the server back on
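For reference, here is a cleaned-up sketch of that config move; {old-node} and {new-node} are placeholders for the actual node directories under /etc/pve/nodes:
Code:
# Move guest configs from the dead node's directory to the new node's,
# inside the pmxcfs-backed /etc/pve ({old-node}/{new-node} are placeholders)
mv /etc/pve/nodes/{old-node}/lxc/* /etc/pve/nodes/{new-node}/lxc/
mv /etc/pve/nodes/{old-node}/qemu-server/* /etc/pve/nodes/{new-node}/qemu-server/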
Now pvecm status takes 10+ seconds and shows the following:
On a node that hasn't crashed:
Code:
Cluster information
-------------------
Name: cluster
Config Version: 6
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Sep 17 09:32:01 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2e357
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate
Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.5 (local)
0x00000002          1 10.10.10.6
0x00000003          1 10.10.10.1
And on the node that crashed (note it still shows Config Version 5, versus 6 on the healthy nodes, and sees only itself as a member):
Code:
Cluster information
-------------------
Name: cluster
Config Version: 5
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Sep 17 08:46:09 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000004
Ring ID: 4.2f8f8
Quorate: No
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:
Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.10.10.12 (local)
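In case it helps with diagnosis: the config_version the nodes disagree on sits in the totem section of corosync.conf, so it can be compared on each node, e.g.:
Code:
# Cluster-wide config (synced by pmxcfs) vs. the local copy corosync actually loaded
grep config_version /etc/pve/corosync.conf
grep config_version /etc/corosync/corosync.conf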
For the record, I ran systemctl status corosync to check whether corosync was running on the restored node; here's the output of the command:
Code:
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-09-17 08:18:32 CEST; 25min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 4319 (corosync)
Tasks: 9 (limit: 9830)
Memory: 150.5M
CGroup: /system.slice/corosync.service
└─4319 /usr/sbin/corosync -f
Sep 17 08:43:51 shigi corosync[4319]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 17 08:43:53 shigi corosync[4319]: [TOTEM ] A new membership (4.2f73b) was formed. Members
Sep 17 08:43:53 shigi corosync[4319]: [QUORUM] Members[1]: 4
Sep 17 08:43:53 shigi corosync[4319]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 17 08:43:56 shigi corosync[4319]: [TOTEM ] A new membership (4.2f743) was formed. Members
Sep 17 08:43:56 shigi corosync[4319]: [QUORUM] Members[1]: 4
Sep 17 08:43:56 shigi corosync[4319]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 17 08:43:58 shigi corosync[4319]: [TOTEM ] A new membership (4.2f74b) was formed. Members
Sep 17 08:43:58 shigi corosync[4319]: [QUORUM] Members[1]: 4
Sep 17 08:43:58 shigi corosync[4319]: [MAIN ] Completed service synchronization, ready to provide service.
And here's the output on one of the alive nodes:
Code:
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-07-13 18:55:30 CEST; 2 months 4 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 5386 (corosync)
Tasks: 9 (limit: 4915)
Memory: 253.4M
CGroup: /system.slice/corosync.service
└─5386 /usr/sbin/corosync -f
Sep 17 08:47:01 tonbo corosync[5386]: [TOTEM ] A new membership (1.2f9a8) was formed. Members
Sep 17 08:47:03 tonbo corosync[5386]: [TOTEM ] Token has not been received in 1726 ms
Sep 17 08:47:03 tonbo corosync[5386]: [TOTEM ] A new membership (1.2f9b0) was formed. Members
Sep 17 08:47:05 tonbo corosync[5386]: [TOTEM ] Token has not been received in 1727 ms
Sep 17 08:47:06 tonbo corosync[5386]: [TOTEM ] A new membership (1.2f9b8) was formed. Members
Sep 17 08:47:07 tonbo corosync[5386]: [TOTEM ] Token has not been received in 1727 ms
Sep 17 08:47:09 tonbo corosync[5386]: [KNET  ] link: host: 4 link: 0 is down
Sep 17 08:47:09 tonbo corosync[5386]: [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 17 08:47:09 tonbo corosync[5386]: [KNET  ] host: host: 4 has no active links
Sep 17 08:47:11 tonbo corosync[5386]: [TOTEM ] A new membership (1.2f9c0) was formed. Members
Now the whole cluster looks unresponsive