Unexpected node reboot

bjsko

This issue is somewhat related to this thread, but not really. Bear with me.

Setup:
A large (too large, and currently in the process of being split) PVE cluster containing 35 nodes running pve 7.4-16, with the kernel pinned to 5.13.19-15.
3 corosync rings as in the related thread

No HA-enabled VMs/containers in the cluster.

Issue:
One PVE node (11) had a memory DIMM error that caused it to reboot.
This node leaves the cluster:
Code:
Feb  7 09:11:19 pve270 corosync[2052]:   [QUORUM] Sync left[1]: 11
Feb  7 09:11:19 pve270 corosync[2052]:   [TOTEM ] A new membership (1.227f) was formed. Members left: 11

And then a second node reboots:
Code:
Feb  7 09:11:25 pve270 pvescheduler[2283914]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Feb  7 09:11:30 pve270 pve-ha-lrm[2116]: loop take too long (62 seconds)
...
Feb  7 09:12:28 pve270 corosync[2052]:   [QUORUM] Sync left[1]: 26
Feb  7 09:12:28 pve270 corosync[2052]:   [TOTEM ] A new membership (1.2283) was formed. Members left: 26

I suspect this node (26) was the quorum master. In the related thread we saw that the quorum master would fence itself even if it had no HA-enabled services. We also got it confirmed that when there are no HA-enabled services in the cluster, the watchdogs are not armed and no fencing should occur.

I am just puzzled why node 26 chose to reboot. All nodes are stating "loop take too long". There are no apparent hardware issues on node 26.
My theory is that node 26 was the quorum master and rebooted for some reason, but there is no trace of watchdog or fencing messages in the logs, which I guess makes sense if the assumption regarding no armed watchdogs is correct.

Any insights or theories most welcome. I can of course provide more logs if required.

BR
Bjørn
 
There's no "quorum master" for corosync (that your log snippets are from), this is HA stack term and you can check what is what by ha-manager status.

But if you have no HA enabled, nothing should be fencing and hence rebooting. It would be better if you could show more of the logs, not just snippets, especially what is in the logs of the rebooted machine right before the end.
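For example, something along these lines would cover the interesting services on the rebooted node (just a sketch using the standard unit names; adjust the time window to your incident):
Code:
# Collect corosync / HA / watchdog / pmxcfs history around the incident.
journalctl -u corosync -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux -u pve-cluster \
    --since "2024-02-07 09:00" --until "2024-02-07 09:20" > /tmp/pve-incident.log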
 
Sorry for the delay, I have been away.

Attached are two log files:
1) cluster-member.log This log file is from one unaffected node in the cluster
2) reboot-node.log This is from node 26, the one that unexpectedly(?) rebooted

Many thanks

Bjørn
 

Attachments

  • cluster-member.log (14.9 KB)
  • reboot-node.log (271.8 KB)
Ok, let's see:
Code:
ha-manager status
ha-manager config

Code:
root@pve271:~# ha-manager status
quorum OK
master pve271 (idle, Wed Feb  7 09:11:25 2024) - static load CRS
lrm pve201 (idle, Mon Feb 12 19:19:37 2024)
lrm pve202 (idle, Mon Feb 12 19:19:38 2024)
lrm pve203 (idle, Mon Feb 12 19:19:37 2024)
lrm pve204 (idle, Mon Feb 12 19:19:38 2024)
lrm pve205 (idle, Mon Feb 12 19:19:39 2024)
lrm pve206 (idle, Mon Feb 12 19:19:36 2024)
lrm pve207 (idle, Mon Feb 12 19:19:34 2024)
lrm pve208 (idle, Mon Feb 12 19:19:34 2024)
lrm pve209 (idle, Mon Feb 12 19:19:36 2024)
lrm pve210 (idle, Mon Feb 12 19:19:38 2024)
lrm pve211 (idle, Mon Feb 12 19:19:34 2024)
lrm pve212 (idle, Mon Feb 12 19:19:36 2024)
lrm pve213 (idle, Mon Feb 12 19:19:36 2024)
lrm pve214 (idle, Mon Feb 12 19:19:38 2024)
lrm pve215 (idle, Mon Feb 12 19:19:39 2024)
lrm pve216 (idle, Mon Feb 12 19:19:34 2024)
lrm pve217 (idle, Mon Feb 12 19:19:37 2024)
lrm pve240 (idle, Mon Feb 12 19:19:34 2024)
lrm pve241 (idle, Mon Feb 12 19:19:37 2024)
lrm pve242 (idle, Mon Feb 12 19:19:36 2024)
lrm pve243 (idle, Mon Feb 12 19:19:34 2024)
lrm pve244 (idle, Mon Feb 12 19:19:36 2024)
lrm pve245 (idle, Mon Feb 12 19:19:38 2024)
lrm pve246 (idle, Mon Feb 12 19:19:34 2024)
lrm pve247 (idle, Mon Feb 12 19:19:35 2024)
lrm pve248 (idle, Mon Feb 12 19:19:34 2024)
lrm pve249 (idle, Mon Feb 12 19:19:34 2024)
lrm pve250 (idle, Mon Feb 12 19:19:35 2024)
lrm pve251 (idle, Mon Feb 12 19:19:38 2024)
lrm pve270 (idle, Mon Feb 12 19:19:38 2024)
lrm pve271 (idle, Mon Feb 12 19:19:36 2024)
lrm pve272 (idle, Mon Feb 12 19:19:37 2024)
lrm pve273 (idle, Mon Feb 12 19:19:34 2024)
lrm pve290 (idle, Mon Feb 12 19:19:34 2024)
lrm pve291 (idle, Mon Feb 12 19:19:37 2024)

root@pve271:~# ha-manager config
 

So this is weird, just to sum up, the logs provided were:

cluster-member = pve270 (no issues with it, logging cluster activity)
reboot-node = pve271 = node 26 (happens to be HA master)

There was also hw-fail-node (node 11), which rebooted due to a hardware failure.

The HA master reboot-node happened to reboot after the hw-fail-node did, but there are no HA services configured. Its pre-reboot log entries:

Code:
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 0 is down
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 1 is down
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 2 is down
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 has no active links
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 has no active links
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 has no active links
Feb 07 09:10:43 pve271 corosync[2057]:   [TOTEM ] Token has not been received in 18337 ms
Feb 07 09:11:17 pve271 watchdog-mux[1403]: client watchdog expired - disable watchdog updates
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Sync members[34]: 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25\
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Sync members[34]: 26 27 28 29 30 31 32 33 34 35
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Sync left[1]: 11
Feb 07 09:11:19 pve271 corosync[2057]:   [TOTEM ] A new membership (1.227f) was formed. Members left: 11
Feb 07 09:11:19 pve271 corosync[2057]:   [TOTEM ] Failed to receive the leave message. failed: 11
Feb 07 09:11:19 pve271 pmxcfs[1923]: [dcdb] notice: members: 1/1943, 2/1926, 3/1942, 4/1959, 5/2001, 6/1955, 7/3058, 8/3092, 9/3106, 10/1950, 12/2884, 13/3064, 14/1953, 15/1974, 16/1950, 17/1977, 18/1964, 19/2919, 20/2951, 21/2920, 22/1944, 23/2904, 24/2923, 25/2894, 26/1923, 27/1919, 28/2328, 29/2336, 30/1289, 31/2305, 32/2331, 33/1274, 34/4741, 35/9149
Feb 07 09:11:19 pve271 pmxcfs[1923]: [dcdb] notice: starting data syncronisation
Feb 07 09:11:19 pve271 pmxcfs[1923]: [status] notice: members: 1/1943, 2/1926, 3/1942, 4/1959, 5/2001, 6/1955, 7/3058, 8/3092, 9/3106, 10/1950, 12/2884, 13/3064, 14/1953, 15/1974, 16/1950, 17/1977, 18/1964, 19/2919, 20/2951, 21/2920, 22/1944, 23/2904, 24/2923, 25/2894, 26/1923, 27/1919, 28/2328, 29/2336, 30/1289, 31/2305, 32/2331, 33/1274, 34/4741, 35/9149
Feb 07 09:11:19 pve271 pmxcfs[1923]: [status] notice: starting data syncronisation
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Members[34]: 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25\
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Members[34]: 26 27 28 29 30 31 32 33 34 35
Feb 07 09:11:19 pve271 corosync[2057]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 07 09:11:22 pve271 pmxcfs[1923]: [dcdb] notice: received sync request (epoch 1/1943/0000004A)
Feb 07 09:11:22 pve271 pmxcfs[1923]: [status] notice: received sync request (epoch 1/1943/0000004A)
Feb 07 09:11:24 pve271 pmxcfs[1923]: [dcdb] notice: received all states
Feb 07 09:11:24 pve271 pmxcfs[1923]: [dcdb] notice: leader is 1/1943
Feb 07 09:11:24 pve271 pmxcfs[1923]: [dcdb] notice: synced members: 1/1943, 2/1926, 3/1942, 4/1959, 5/2001, 6/1955, 7/3058, 8/3092, 9/3106, 10/1950, 12/2884, 13/3064, 14/1953, 15/1974, 16/1950, 17/1977, 18/1964, 19/2919, 20/2951, 21/2920, 22/1944, 23/2904, 24/2923, 25/2894, 26/1923, 27/1919, 28/2328, 29/2336, 30/1289, 31/2305, 32/2331, 33/1274, 34/4741, 35/9149
Feb 07 09:11:24 pve271 pmxcfs[1923]: [dcdb] notice: all data is up to date
Feb 07 09:11:24 pve271 pmxcfs[1923]: [dcdb] notice: dfsm_deliver_queue: queue length 69
Feb 07 09:11:24 pve271 watchdog-mux[1403]: exit watchdog-mux with active connections
Feb 07 09:11:24 pve271 pvescheduler[3236141]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Feb 07 09:11:24 pve271 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 07 09:11:24 pve271 systemd[1]: watchdog-mux.service: Succeeded.
Feb 07 09:11:24 pve271 systemd[1]: watchdog-mux.service: Consumed 4min 22.932s CPU time.
Feb 07 09:11:24 pve271 pmxcfs[1923]: [status] notice: received all states
-- Boot a1b3a6fb387e4d478c680a25c37eb542 --

The softdog there had started:
Code:
Feb 07 09:15:49 pve271 kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
Feb 07 09:15:49 pve271 kernel: softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0

Did I get the nodes and sequence correct? I know I am just repeating it, but I wanted to be sure.
 
Yes, that pretty much sums it all up. I am not 100% sure that pve271 (node 26) was HA master when it unexpectedly rebooted, but it is now, as you can see. In any case, there were no HA services configured in the cluster.

The cluster used to have HA services enabled earlier, but I removed them some time ago. It shouldn't be required to restart pve-ha-{lrm,crm} when removing (all) HA services, should it?

From the logs I sent earlier:

On pve271 (node 26), this is logged after node 11 (the HW-fail one) leaves the cluster:
Code:
Feb 07 09:11:24 pve271 watchdog-mux[1403]: exit watchdog-mux with active connections
Feb 07 09:11:24 pve271 pvescheduler[3236141]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Feb 07 09:11:24 pve271 kernel: watchdog: watchdog0: watchdog did not stop!
That is the last thing it logs before rebooting.

On cluster-member pve270 this is logged before node 26 leaves the cluster:
Code:
Feb 07 09:11:24 pve270 pvescheduler[2283915]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Feb 07 09:11:24 pve270 pmxcfs[1944]: [status] notice: received all states
Feb 07 09:11:24 pve270 pmxcfs[1944]: [status] notice: all data is up to date
Feb 07 09:11:24 pve270 pmxcfs[1944]: [status] notice: dfsm_deliver_queue: queue length 14778
Feb 07 09:11:24 pve270 pmxcfs[1944]: [status] notice: received log
Feb 07 09:11:24 pve270 pmxcfs[1944]: [main] notice: ignore insert of duplicate cluster log
Feb 07 09:11:24 pve270 pmxcfs[1944]: [status] notice: received log
Feb 07 09:11:24 pve270 pmxcfs[1944]: [main] notice: ignore insert of duplicate cluster log
Feb 07 09:11:25 pve270 pvescheduler[2283914]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Feb 07 09:11:30 pve270 pve-ha-lrm[2116]: loop take too long (62 seconds)

Do
Feb 07 09:11:24 pve271 watchdog-mux[1403]: exit watchdog-mux with active connections
and
Feb 07 09:11:24 pve271 kernel: watchdog: watchdog0: watchdog did not stop!
give any clues, or is that expected during an (un)expected reboot (i.e. just a consequence)?

Is
Feb 07 09:11:30 pve270 pve-ha-lrm[2116]: loop take too long (62 seconds)
a problem that could cause something like this?

Many thanks for your kind help

/bjørn
 

Thanks for confirming; ok, so you used to have HA services before. When you look at the timing of those logs, the issue is that you have:

Feb 07 09:11:17 pve271 watchdog-mux[1403]: client watchdog expired - disable watchdog updates

Everything after is a consequence in my opinion. But that simply should not be happening in a non-HA scenario.

Could you post the full journalctl -b -1 (or whatever boot it was that captured this) of pve271 (reboot-node = node 26) so that we can see what was going on before 9:10?
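In case it helps, a small sketch (plain journalctl, nothing PVE-specific) for picking the right boot and exporting it:
Code:
# List recorded boots with their time ranges, then dump the previous boot to a file.
journalctl --list-boots
journalctl -b -1 > pve271-boot-1.log
gzip pve271-boot-1.log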
 
It shouldn't be required to restart pve-ha-{lrm,crm} when removing (all) HA services, should it?

Just so you know I have not ignored your question: I do not know at this point, so either bear with me till tomorrow or wait for someone who knows it off the cuff. But I do not see why this node would need to fence even with the watchdog active, so that's another problem.
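In the meantime, a small sketch of how you could check which HA daemon armed the watchdog on that node; these are the standard commands and unit names, and the grep simply looks for the watchdog messages the CRM/LRM write themselves:
Code:
# Who is currently HA master, and which LRMs are active/idle?
ha-manager status
# When did the CRM or LRM on this node arm (or close) its watchdog?
journalctl -u pve-ha-crm -u pve-ha-lrm | grep -i watchdog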
 

The gzipped output of journalctl -b -1 is too large to upload, so I shared it from here: https://drive.google.com/file/d/1AMzIl2IHuAfnpYDT3OHGkvGhmi5tgQnI/view?usp=sharing

/bjørn
 
Hello again Bjørn, and thanks for providing the whole thing. I think you were onto something all along with your prior research.

I think I know why your watchdog was active on the node. What I do not know, however, is why exactly it triggered.

From the logs you provided regarding the reboot-node = pve271 = node 26, which indeed happened to be HA master at the time:

pve-ha-crm (insignificant lines removed to keep it sensible):
Code:
Oct 02 18:54:28 pve271 pve-ha-crm[2118]: starting server
Oct 02 18:54:28 pve271 pve-ha-crm[2118]: status change startup => wait_for_quorum
Oct 02 18:54:53 pve271 pve-ha-crm[2118]: status change wait_for_quorum => slave
Oct 03 14:12:13 pve271 pve-ha-crm[2118]: successfully acquired lock 'ha_manager_lock'
Oct 03 14:12:13 pve271 pve-ha-crm[2118]: watchdog active
Oct 03 14:12:13 pve271 pve-ha-crm[2118]: status change slave => master
Oct 03 14:12:13 pve271 pve-ha-crm[2118]: using scheduler mode 'static'

[... lots of crm testing till Nov 07 ...]

Nov 13 09:50:19 pve271 pve-ha-crm[2118]: removing stale service 'vm:2484' (no config)

[... and many more removed, presumably leaving the ha config empty followed by only ...]

Dec 20 10:41:33 pve271 pve-ha-crm[2118]: node 'pve273': state changed from 'unknown' => 'online'
Jan 30 15:33:27 pve271 pve-ha-crm[2118]: node 'pve206': state changed from 'online' => 'unknown'
Jan 30 15:51:20 pve271 pve-ha-crm[2118]: node 'pve206': state changed from 'unknown' => 'online'
Feb 06 16:13:35 pve271 pve-ha-crm[2118]: node 'pve206': state changed from 'online' => 'unknown'
Feb 06 16:43:45 pve271 pve-ha-crm[2118]: node 'pve206': state changed from 'unknown' => 'online'

[... and then Feb 07 reboot but nothing for pve-ha-crm ...]

Coincidentally, this node was never hosting any resources because:

pve-ha-lrm (full content for the same period):
Code:
Oct 02 18:54:29 pve271 pve-ha-lrm[2130]: starting server
Oct 02 18:54:29 pve271 pve-ha-lrm[2130]: status change startup => wait_for_agent_lock

Now about the reboot:

watchdog (full content for the same period):
Code:
Oct 02 18:54:21 pve271 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Oct 02 18:54:22 pve271 systemd[1]: Started Proxmox VE watchdog multiplexer.
Oct 02 18:54:22 pve271 watchdog-mux[1403]: Watchdog driver 'Software Watchdog', version 0
Oct 02 18:54:25 pve271 corosync[2057]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Oct 02 18:54:26 pve271 corosync[2057]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 03 14:12:13 pve271 pve-ha-crm[2118]: watchdog active
Feb 07 09:11:17 pve271 watchdog-mux[1403]: client watchdog expired - disable watchdog updates
Feb 07 09:11:24 pve271 watchdog-mux[1403]: exit watchdog-mux with active connections
Feb 07 09:11:24 pve271 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 07 09:11:24 pve271 systemd[1]: watchdog-mux.service: Succeeded.
Feb 07 09:11:24 pve271 systemd[1]: watchdog-mux.service: Consumed 4min 22.932s CPU time.

So the watchdog was activated because the node was the CRM, i.e. the master. This is normal for HA scenarios: the watchdog is active on the nodes where the resources are and on the CRM node. Even though you removed all HA resources from the cluster, this watchdog was sticking around.

The LRM nodes go from active to idle after 60 rounds times 10 seconds (roughly 10 minutes) of having no HA resources on them, at which point they actually deactivate the watchdog, but this did not happen here. For one, your CRM node was not also an active LRM. Also, if you had looked at ha-manager status before that reboot, you would have found that all the nodes were idle in terms of LRM, but this node was still "active" as CRM, i.e. master.
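If you do not want to leave the watchdog armed on the current master until this is fixed, a possible way to release it would be a graceful restart of the CRM. This is only a sketch based on my understanding that a clean restart closes its watchdog connection; I have not verified it on a node in this exact state, so please try it on a less critical node first:
Code:
# Check whether the watchdog is armed on this node (marker file present = armed).
ls -l /run/watchdog-mux.active
# Gracefully restart the CRM; with an empty HA config it should come back idle
# and, as far as I understand, not re-arm the watchdog. Verify before/after.
systemctl restart pve-ha-crm
ls -l /run/watchdog-mux.active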

I consider this a bug and will file it later on and include a link here. But unfortunately that does not address the other issue... (overflowing into the next post)
 
The watchdog triggered because of something, but unfortunately I can't guess what it might have been.

corosync (full content for Feb 06 and Feb 07 up to the reboot):
Code:
Feb 06 16:13:31 pve271 corosync[2057]:   [CFG   ] Node 6 was shut down by sysadmin
Feb 06 16:13:31 pve271 corosync[2057]:   [QUORUM] Sync members[34]: 1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 2\
Feb 06 16:13:31 pve271 corosync[2057]:   [QUORUM] Sync members[34]: 25 26 27 28 29 30 31 32 33 34 35
Feb 06 16:13:31 pve271 corosync[2057]:   [QUORUM] Sync left[1]: 6
Feb 06 16:13:31 pve271 corosync[2057]:   [TOTEM ] A new membership (1.2276) was formed. Members left: 6
Feb 06 16:13:31 pve271 corosync[2057]:   [QUORUM] Members[34]: 1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 2\
Feb 06 16:13:31 pve271 corosync[2057]:   [QUORUM] Members[34]: 25 26 27 28 29 30 31 32 33 34 35
Feb 06 16:13:31 pve271 corosync[2057]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 06 16:13:33 pve271 corosync[2057]:   [KNET  ] link: host: 6 link: 2 is down
Feb 06 16:13:33 pve271 corosync[2057]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 30)
Feb 06 16:13:34 pve271 corosync[2057]:   [KNET  ] link: host: 6 link: 0 is down
Feb 06 16:13:34 pve271 corosync[2057]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 10)
Feb 06 16:13:38 pve271 corosync[2057]:   [KNET  ] link: host: 6 link: 1 is down
Feb 06 16:13:38 pve271 corosync[2057]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 10)
Feb 06 16:13:38 pve271 corosync[2057]:   [KNET  ] host: host: 6 has no active links
Feb 06 16:23:28 pve271 corosync[2057]:   [KNET  ] link: host: 4 link: 0 is down
Feb 06 16:23:28 pve271 corosync[2057]:   [KNET  ] host: host: 4 (passive) best link: 2 (pri: 40)
Feb 06 16:24:04 pve271 corosync[2057]:   [KNET  ] rx: host: 4 link: 0 is up
Feb 06 16:24:04 pve271 corosync[2057]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Feb 06 16:24:04 pve271 corosync[2057]:   [KNET  ] host: host: 4 (passive) best link: 2 (pri: 40)
Feb 06 16:24:04 pve271 corosync[2057]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] rx: host: 6 link: 0 is up
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] rx: host: 6 link: 1 is up
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] link: Resetting MTU for link 1 because host 6 joined
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] rx: host: 6 link: 2 is up
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] link: Resetting MTU for link 2 because host 6 joined
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] host: host: 6 (passive) best link: 2 (pri: 40)
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] host: host: 6 (passive) best link: 2 (pri: 40)
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] host: host: 6 (passive) best link: 2 (pri: 40)
Feb 06 16:43:38 pve271 corosync[2057]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Feb 06 16:43:42 pve271 corosync[2057]:   [QUORUM] Sync members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\
Feb 06 16:43:42 pve271 corosync[2057]:   [QUORUM] Sync members[35]: 25 26 27 28 29 30 31 32 33 34 35
Feb 06 16:43:42 pve271 corosync[2057]:   [QUORUM] Sync joined[1]: 6
Feb 06 16:43:42 pve271 corosync[2057]:   [TOTEM ] A new membership (1.227b) was formed. Members joined: 6
Feb 06 16:43:42 pve271 corosync[2057]:   [QUORUM] Members[35]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\
Feb 06 16:43:42 pve271 corosync[2057]:   [QUORUM] Members[35]: 25 26 27 28 29 30 31 32 33 34 35
Feb 06 16:43:42 pve271 corosync[2057]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 07 09:08:11 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 0 is down
Feb 07 09:08:11 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:08:29 pve271 corosync[2057]:   [KNET  ] rx: host: 11 link: 0 is up
Feb 07 09:08:29 pve271 corosync[2057]:   [KNET  ] link: Resetting MTU for link 0 because host 11 joined
Feb 07 09:08:29 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:08:32 pve271 corosync[2057]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Feb 07 09:08:53 pve271 corosync[2057]:   [TOTEM ] Retransmit List: 24184f 241850 241851 241852 241853 241854 241855 241856 241857 241858 241859 24185a 24185c
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 0 is down
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 1 is down
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] link: host: 11 link: 2 is down
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 has no active links
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 has no active links
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 (passive) best link: 2 (pri: 40)
Feb 07 09:10:31 pve271 corosync[2057]:   [KNET  ] host: host: 11 has no active links
Feb 07 09:10:43 pve271 corosync[2057]:   [TOTEM ] Token has not been received in 18337 ms
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Sync members[34]: 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25\
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Sync members[34]: 26 27 28 29 30 31 32 33 34 35
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Sync left[1]: 11
Feb 07 09:11:19 pve271 corosync[2057]:   [TOTEM ] A new membership (1.227f) was formed. Members left: 11
Feb 07 09:11:19 pve271 corosync[2057]:   [TOTEM ] Failed to receive the leave message. failed: 11
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Members[34]: 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25\
Feb 07 09:11:19 pve271 corosync[2057]:   [QUORUM] Members[34]: 26 27 28 29 30 31 32 33 34 35
Feb 07 09:11:19 pve271 corosync[2057]:   [MAIN  ] Completed service synchronization, ready to provide service.

Based on the watchdog entries, everything was doomed to reboot after:
Feb 07 09:11:17 pve271 watchdog-mux[1403]: client watchdog expired - disable watchdog updates

But other than:
Feb 07 09:10:43 pve271 corosync[2057]: [TOTEM ] Token has not been received in 18337 ms

I cannot see anything suspicious there. I suspect the fact that your node rebooted relates back to the issue referenced in your OP, where you were discussing corosync and they would rather have you split the cluster, or it would need more log digging (from all the other nodes).

I think finding the HA bug is good enough for this thread. FYI, after the reboot, since your HA services are idle, the watchdog is indeed not active either.

If you want to be sure about any particular node, you can look for /run/watchdog-mux.active on that machine.
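For example, a quick loop over a few nodes (the hostnames here are just placeholders; this assumes root SSH between cluster nodes, which a PVE cluster normally has set up):
Code:
# Report for each node whether the watchdog-mux marker file exists.
for node in pve270 pve271 pve272; do
    ssh root@"$node" 'if [ -e /run/watchdog-mux.active ]; then echo "$(hostname): watchdog armed"; else echo "$(hostname): watchdog not armed"; fi'
done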
 
Thanks a lot for your time invested in this issue and for a super bug report. Hopefully someone will pick it up. I have learnt a lot from this troubleshooting!

If you want to be sure about any particular node, you can look for /run/watchdog-mux.active on that machine.
Thanks for this one as well. I ran through my clusters now and currently it makes sense everywhere.

BR
Bjørn
 