Frequent Watchdog reboots

ashab.tariq

New Member
Jul 22, 2024
I am relatively new to Proxmox and have a cluster running with 3 nodes. Everything is currently working fine: the cluster is up and HA is running fine. The issue I currently face is that if the cluster link goes down for, say, more than 10 s, the watchdog kicks in and reboots the server; this causes the cluster to go down and the other nodes become inaccessible.

Solutions I am seeking:
  1. Is there a way to disable the watchdog service or increase its timeout so that it never reboots the server?
  2. Add a redundant link to the existing cluster so that even if one link goes down, the cluster remains up.
Thank you; this is my first post in any forum, so please correct me if anything is against the rules.
 

You contradict yourself a bit here - if you have the said situation, your HA is not exactly running fine. Also, it should not be the link going down for just 10 seconds - how do you know / measure that?

The watchdog is unfortunately (with hardcoded values) part of the HA stack; you can disable HA. You could theoretically have that watchdog never armed, but then how would you expect HA to work?

Also note, there's at least one known bug in the HA stack to be aware of if you disable it:
https://bugzilla.proxmox.com/show_bug.cgi?id=5243

Anyhow, what you are looking for is:
https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
 
Hello @esi_y,

By HA I mean VMs do start migrating if one of the nodes goes down. But the issue is that the watchdog restarts all the nodes: I have 3 nodes using the same cluster link, and when it goes down for a brief moment (I have log entries on the iDRAC console), within 10 s all of the servers are rebooted.

Thank you for sharing the relevant resources, I'll check them out.
 

Hey!

I did not mean to mock you, but literally having a setup where enabling HA causes the cluster to go on a reboot is ... not good. Specifically, what I meant was that with the same setup but HA disabled, you would be more stable (because there would be no reboots).

You may wish to read more on the fencing of nodes here:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing

The other thing - I think you have a bigger problem, for two reasons.

I don't know the exact limits by heart, but I think the reboot should not be happening within at least a minute of the cluster becoming inquorate; it's definitely not 10 seconds. I would guess there's a connectivity problem before you even see the link go down (presumably you see a NIC going down in iDRAC).

The second thing is that you should not have all nodes rebooting at all. At worst you should see only the one node with the NIC problem rebooting, because the other two nodes should keep quorum without any problem.

I think it would be important to first troubleshoot what's wrong with those corosync links you have there.

You may want to share your journal entries, similar to what was asked for in this thread:
https://forum.proxmox.com/threads/a...s-have-connectivity-issue.141864/#post-635891
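As a starting point for that troubleshooting, the quorum and link state can be checked on each node; a minimal sketch, assuming a standard PVE/corosync (knet) setup:

Code:
# membership and quorum as the local node sees it
pvecm status

# per-link status of the local node's corosync/knet links
corosync-cfgtool -s

If that output already disagrees between the nodes (or shows a link flapping), it narrows things down before even digging through the journal.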
 
For your question about adding a second link, by the way: this IS possible, but only either when you are creating the cluster (too late now) or by editing the config.
The information about editing the config can be found here:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_adding_redundant_links_to_an_existing_cluster
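Roughly, per that chapter, the edit boils down to giving every node a second address and declaring a second link in the totem section, then bumping config_version. A minimal sketch with made-up values (10.10.10.x for link 0, 10.20.20.x for the new link 1, placeholder node name and version):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    # existing link 0
    ring0_addr: 10.10.10.1
    # new redundant link 1
    ring1_addr: 10.20.20.1
  }
  # ... add ring1_addr to the other two nodes as well ...
}

totem {
  # must be increased on every edit
  config_version: 4
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

Follow the edit procedure from the docs (don't edit the live file carelessly), so that a typo does not take corosync down cluster-wide.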

Also, in general it is advised that (at least the primary) corosync link is used for only that: no VM traffic, no storage traffic, not even Proxmox management traffic if you can help it (of course you could reach Proxmox on this link/IP, but treat it as a backup method).

Personally, I prefer to also just use on-board adapters for this, plus force the adapters to have a specific name based on their MAC address, to make sure updates don't rename them.
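For the name-pinning, one way to do that (assuming PVE's systemd/udev-based Debian base; the MAC address and name below are made up) is a .link file:

Code:
# /etc/systemd/network/10-corosync0.link
# match the on-board adapter by its MAC address
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

# give it a stable name to reference in /etc/network/interfaces
[Link]
Name=corosync0

Depending on the setup, the initramfs may need refreshing (update-initramfs -u) and the node a reboot before the rename takes effect.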
 
Also, in general it is advised that (at least the primary) corosync link is used for only that: no VM traffic, no storage traffic, not even Proxmox management traffic if you can help it (of course you could reach Proxmox on this link/IP, but treat it as a backup method).

I might be wrong, but if all 3 NICs (on the 3 nodes) go down, it does sound more like the switching equipment has an issue rather than just link saturation.
 
Indeed, if the switch goes down and because of that the link on all devices gets lost, that would explain this behavior. In that case, the first advice would be to either replace the switch or at least move the links to a different (separate) switch.
 
Yes, you can defang the watchdog so it doesn't reboot the host it's running on:

https://forum.proxmox.com/threads/n...cant-turn-on-vms-on-others.146944/post-663622
I really can only discourage you from doing this! Make sure your Corosync is set up in a stable way with multiple links that are unlikely to completely fail at the same time.

Should only one node lose the corosync connection, it is expected that it will do a hard reset if it cannot reestablish it within one minute by letting the watchdog expire. This is to make sure that the guests on it are definitely powered off. Otherwise, if the watchdog runs out, but the host keeps running, your chances of having corrupt disk images are rather high since you will have two instances of the same guest running once the remaining nodes in the quorate cluster power on / recover the HA guests.
 
Yes, you can defang the watchdog so it doesn't reboot the host it's running on:

https://forum.proxmox.com/threads/n...cant-turn-on-vms-on-others.146944/post-663622

That's a fairly interesting way to do it! I used to just blacklist it.
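For what it's worth, "blacklisting it" here refers to the softdog kernel module; a rough, unverified sketch of what that generally looks like (not a recommendation, given the warnings in this thread, and note that a blacklist alone does not stop an explicit modprobe by name):

Code:
# /etc/modprobe.d/disable-softdog.conf
# stop automatic loading of the softdog module
blacklist softdog
# also block explicit "modprobe softdog" attempts
install softdog /bin/false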

I really can only discourage you from doing this!

Of course you do. :D But e.g. for debugging it is helpful. And I found my earlier post from when I was troubleshooting something similar [1].

Should only one node lose the corosync connection, it is expected that it will do a hard reset if it cannot reestablish it within one minute by letting the watchdog expire.

So this is wrong - I was indeed wrong myself, as I realized when reading [1] back. It is indeed 10 seconds in case e.g. something else also goes wrong with the services resetting it. I am surprised no one wants to see the logs first to find out what actually caused it to fail resetting the softdog. Could it possibly be that the system fails to reset the softdog and that in turn causes the reboot (which in sequence makes the NIC appear down, etc.) instead? We do not know. All we know is that the OP is not hitting the 60 sec counter.

Otherwise, if the watchdog runs out, but the host keeps running, your chances of having corrupt disk images are rather high since you will have two instances of the same guest running once the remaining nodes in the quorate cluster power on / recover the HA guests.

I just want to agree with this so that it does not seem like I am suggesting running HA without self-fencing. But it also depends on the storage solution (about which we know nothing from the OP).

[1] https://forum.proxmox.com/threads/getting-rid-of-watchdog-emergency-node-reboot.136789/#post-635602
 
Of course you do. :D But e.g. for debugging it is helpful. And I found my earlier post from when I was troubleshooting something similar [1].
I need to mention it so people know not to mess around with it, or at least know the ramifications. I already fear the support ticket or forum thread where someone has corrupted guest disk images, and only after some extensive troubleshooting do we realize that they disabled/disarmed the watchdog. ;-)

For the actual issue at hand here, it would be interesting, if it is happening on a more or less regular basis, to disarm HA and observe the Corosync logs.

To disarm HA, set all HA guests to ignore. Then after 10 minutes, the LRMs should switch back to idle mode. Once in idle mode, they will not let the watchdog expire, even if the cluster connection is lost.
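For reference, the same can be done from the CLI; a sketch assuming a guest with the HA resource ID vm:100 (placeholder):

Code:
# set each HA-managed guest to the "ignored" request state
ha-manager set vm:100 --state ignored

# later, confirm the LRMs report "idle" again
ha-manager status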

If the Corosync logs show that the connections to the other nodes are lost, and they all match timewise on all hosts, it is time to investigate what is causing this. The switch was already mentioned.
If they still hard reset, it might be that the nodes freeze or are under such high load that the watchdog expires. Though I would be very surprised if that is the cause, since it seems to be affecting all nodes at the same time, AFAIU.
 
I already fear the support ticket or forum thread where someone has corrupted guest disk images, and only after some extensive troubleshooting do we realize that they disabled/disarmed the watchdog. ;-)

I understand that; on the other hand, I also believe it's good to tell people why something is the way it is. Which you did, but a lot of the time it's just written off as "unsupported" etc.

For the actual issue at hand here, it would be interesting, if it is happening on a more or less regular basis, to disarm HA and observe the Corosync logs.

I would actually think the logs (if they got flushed onto the drive) from what has been happening now would be even more insightful. It would be fairly funny to see nothing wrong with corosync with HA off and be at a loss, because maybe it is the stuck machine failing to reset the softdog within 10 sec that reboots, and only then is the corosync link lost.

@ashab.tariq e.g. journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux --since=-1week (or similar)

To disarm HA, set all HA guests to ignore. Then after 10 minutes, the LRMs should switch back to idle mode. Once in idle mode, they will not let the watchdog expire, even if the cluster connection is lost.

May I just take the opportunity to bring up that probably no one from PVE staff ever read bug report #5234, which causes exactly the kind of issues (even after the 10 minutes) that are hard to fathom when you are actually troubleshooting yet another problem? I.e. the CRM (even when "idle") will unfortunately reboot one last time after this due to the bug.

But I also suspect the most likely cause here is the switching/routing, since this should not be happening on a 3-node setup in the fashion described - the only confusing part is the 10 sec timing ...
 
May I just take the opportunity to bring up that probably no one from PVE staff ever read bug report #5234, which causes exactly the kind of issues (even after the 10 minutes) that are hard to fathom when you are actually troubleshooting yet another problem? I.e. the CRM (even when "idle") will unfortunately reboot one last time after this due to the bug.

Oops, 52*4*3, actually. I mean, it can't be that hard to fix. :)
 
