Random reboots on nodes

klutc

New Member
Jan 15, 2024
Hello,
first of all, sorry for my bad English.
I have a strange issue and I can't figure out why it is happening. I have a cluster with 5 nodes using HA; Ceph is not enabled. Everything was working perfectly fine until I added the fifth node. I know it is probably related to corosync latency, but I cannot change the network to reduce latency.
Regardless, the problem starts when one of the nodes (especially the last one I already mentioned) is rebooted. When I reboot it, some other nodes randomly get rebooted too. I was thinking it was related to the timezone, or to the systemd-timesyncd service, which was not installed by default. All nodes have 1 vote by default.

The whole scenario:
1. I reboot node 5; while it is rebooting, it logically shows up on the other nodes as red with an X.
2. When the server is back, I see it in HA, and then randomly on the other nodes (and also on the rebooted one) the HA status becomes "old timestamp - dead?" and then active/idle again.
3. All nodes become active/idle, but 1-2 minutes after that some other node reboots by itself.
This issue started appearing after I added the last node.

Sometimes nodes are rebooted randomly, probably when the network is a little more loaded.

My first question is: what could the issue be? I bet it is related to the latency, which is sometimes high on 2 of the nodes, but I cannot do anything about it. It is just like that :(
My second question: how can I prevent that rebooting? Can I somehow configure HA not to reboot other hosts when one of them has an unhealthy network or is being rebooted? What concerns me more is why it happens while a host is rebooting. I was blaming the date and time settings and the systemd-timesyncd service, but it is installed now and it happened again.

It is not a problem for this setup if the network is unavailable from time to time. For this reason each node has 2 NICs in a bond. I just want to prevent the other nodes from rebooting when one node is not in good condition (rebooting, network issues, etc.)

More info about the nodes:
Virtual Environment 8.1.3
Proxmox is installed on top of a Debian 12 netinst.
All packages are up-to-date on all nodes.

Some stuff from syslog that can be helpful:

Jan 15 13:09:26 node5 corosync[1983]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 4 link 0 but the other node is not acknowledging packets of this size.
Jan 15 13:09:26 node5 corosync[1983]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.

Jan 15 10:07:30 node5 corosync[1983]: [KNET ] link: host: 3 link: 0 is down
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] link: host: 1 link: 0 is down
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 3 has no active links
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 1 has no active links
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] rx: host: 3 link: 0 is up
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] rx: host: 1 link: 0 is up
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 15 10:08:53 node5 corosync[1983]: [TOTEM ] Retransmit List: 785d

Jan 15 09:23:42 node5 corosync[1983]: [CPG ] *** 0x560bf2c35010 can't mcast to group pve_dcdb_v1 state:1, error:12
Jan 15 09:23:42 node5 corosync[1983]: [CPG ] *** 0x560bf2c35010 can't mcast to group pve_dcdb_v1 state:1, error:12
Jan 15 09:23:43 node5 pmxcfs[1776]: [dcdb] notice: start cluster connection
Jan 15 09:23:43 node5 pmxcfs[1776]: [dcdb] crit: cpg_join failed: 14
Jan 15 09:23:43 node5 pmxcfs[1776]: [dcdb] crit: can't initialize service
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pve-ha-lrm[2153]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node5/lrm_status.tmp.2153' - Device or resource busy
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9

Thanks a lot!
 
I have a strange issue and I can't figure out why it is happening. I have a cluster with 5 nodes using HA; Ceph is not enabled.

Can you afford to turn off HA for the time being? You should still be able to see the network errors, but without it going on a rebooting spree.
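In case it helps, a rough sketch of what that could look like from the CLI (vm:100 is just a placeholder ID, adjust to your resources; if I recall correctly, the node's watchdog is only disarmed some time after the LRM has gone idle):

ha-manager status          # list HA-managed resources and their current state
ha-manager remove vm:100   # remove a guest from HA management (the guest itself keeps running)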

Everything was working perfectly fine until I added the fifth node. I know it is probably related to corosync latency, but I cannot change the network to reduce latency.

Can you describe the network setup across the 5 nodes in detail?


The whole scenario:
1. I reboot node 5; while it is rebooting, it logically shows up on the other nodes as red with an X.
2. When the server is back, I see it in HA, and then randomly on the other nodes (and also on the rebooted one) the HA status becomes "old timestamp - dead?" and then active/idle again.
3. All nodes become active/idle, but 1-2 minutes after that some other node reboots by itself.

Is it always "some other" node? Nothing deterministic that can be observed about the "other" node? E.g. always the one that most HA VMs migrated to?

Sometimes nodes are rebooted randomly, probably when the network is a little more loaded.

My first question is: what could the issue be? I bet it is related to the latency, which is sometimes high on 2 of the nodes, but I cannot do anything about it. It is just like that :(

Did you consider using a QDevice with this setup? It does not have the same network requirements as a regular node.
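For reference, a minimal sketch of how that would be set up (assuming a small external Debian machine at 192.0.2.10 - a made-up address - acts as the QDevice):

apt install corosync-qnetd      # on the external QDevice machine
apt install corosync-qdevice    # on every cluster node
pvecm qdevice setup 192.0.2.10  # on one cluster node, to register the QDevice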

My second question: how can I prevent that rebooting? Can I somehow configure HA

It is not a problem for this setup if the network is unavailable from time to time. For this reason each node has 2 NICs in a bond.

I suspect the answer is to fix the network setup (if you intend to use HA); the bonding is probably not helping in this scenario either.
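What I would aim for, rather than relying on the bond, is a dedicated NIC for corosync plus a second, independent link it can fail over to. A rough sketch of what the nodelist in /etc/pve/corosync.conf could look like (addresses are made up, and config_version has to be bumped whenever you edit the file):

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # existing cluster network
    ring1_addr: 10.10.20.1   # second link on its own NIC/switch
  }
  # ... same pattern for the other four nodes ...
}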

I just want to prevent the other nodes from rebooting when one node is not in good condition (rebooting, network issues, etc.)

You have basically discovered the pitfalls of the HA stack in a less-than-perfect environment.

Jan 15 13:09:26 node5 corosync[1983]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 4 link 0 but the other node is not acknowledging packets of this size.
Jan 15 13:09:26 node5 corosync[1983]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.

I wonder if this is a false lead, but could it be as simple as the MTU needing to be set lower?
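An easy way to test that theory is a do-not-fragment ping between the nodes with a payload sized for the expected MTU (1472 = 1500 minus 28 bytes of IP/ICMP headers; replace the address with one of your nodes):

ping -M do -s 1472 -c 4 10.10.10.2
# if this fails while a smaller payload (e.g. -s 1300) works, something on the path has a lower MTU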
 
Hello.

When you have a cluster with HA-enabled guests, it is extremely important to be absolutely sure the HA-enabled guests are shut down before they are restored on another node; otherwise you would have two guests accessing the same storage, which is a recipe for data corruption.

To ensure this, the kernel softdog has a 60-second timeout which will reboot the node if it does not have Corosync quorum for the entire timeout, so that the VMs can be recovered on a node that has quorum.

If your setup has an unstable connection or an incorrect network config, this will look from the outside as if the nodes (or even the entire cluster) are rebooting. Corosync requires low latency and a dedicated connection; any high-bandwidth process like Ceph or backups might saturate a 25G network and increase its latency to the point where Corosync deems the network unusable.
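If you want to see how close to that edge a setup is, it helps to watch quorum and the knet link state while the network is under load, e.g. (run on any node):

pvecm status               # quorum information and current membership
corosync-cfgtool -s        # state of each knet link as seen from this node
journalctl -u corosync -f  # follow link down/up and retransmit messages live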
 
Hello,
indeed, this was the issue. I also had different MTUs between the nodes, and this was causing some packet loss. I just split them into separate clusters (in my case that is quite OK). If Proxmox ever has something like a Multi Datacenter Management solution, that would be great. Thanks a lot for the support!
 
Corosync requires low latency

A bit of a nitpick here, but corosync itself does not; it is the PVE stack that does, by using corosync. Given that all of this comes up on the forum all too often, it might be a good idea to check for latency at least at the point when one wants to add the first HA service. If you prefer people not to run into these scenarios at all, it might even be good to check every time a new node is added.
 
Please open a ticket in our bugzilla instance if you think this is a good idea.
 
Hi all,
I have a 2-node Proxmox cluster. I was copying a large number of files (3 TB) from a VM on one node to a Proxmox container running on the other node, connected to a USB drive. The USB drive is a backup of my library, and I was rsyncing the backup with the source. After about an hour, all of a sudden, both Proxmox machines rebooted.

I was under the impression that if there is a corosync issue between the machines, Proxmox itself would not reboot; it would just shut down the VMs to avoid conflicts between the two nodes. This thread seems to indicate Proxmox itself could reboot as well; is that correct?

I had a look at the log files, and there wasn't anything of substance there to show why both nodes rebooted. They just did. So I'm trying to figure out what could have caused it. I first suspected a hardware issue, e.g. the external USB HD, but it would be very unlikely for something like that to also bring down the other node.

Thank you all.
 
Hi all,
I have a 2-node Proxmox cluster. I was copying a large number of files (3 TB) from a VM on one node to a Proxmox container running on the other node, connected to a USB drive. The USB drive is a backup of my library, and I was rsyncing the backup with the source. After about an hour, all of a sudden, both Proxmox machines rebooted.

I was under the impression that if there is a corosync issue between the machines, Proxmox itself would not reboot; it would just shut down the VMs to avoid conflicts between the two nodes. This thread seems to indicate Proxmox itself could reboot as well; is that correct?

Are you using High Availability?

I had a look at the log files, and there wasn't anything of substance there to show why both nodes rebooted. They just did. So I'm trying to figure out what could have caused it. I first suspected a hardware issue, e.g. the external USB HD, but it would be very unlikely for something like that to also bring down the other node.

If you feel like troubleshooting it a bit further, ideally open a separate thread (drop a link to the new one here); lots of these "random reboots" topics actually have nothing in common in the end. Include the last ~100 lines of the system log prior to the reboots from BOTH nodes for a start.
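For reference, if the journal is persistent, the lines right before the unexpected reboot can be pulled from the previous boot like this:

journalctl -b -1 -n 100    # last 100 entries of the previous boot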
 
Are you using High Availability?



If you feel like troubleshooting it a bit further, ideally open a separate thread (drop a link to the new one here); lots of these "random reboots" topics actually have nothing in common in the end. Include the last ~100 lines of the system log prior to the reboots from BOTH nodes for a start.
Hi there,
Thank you for your reply, and I didn't mean to hijack the thread. I'll open another thread for this.
The main thing I wanted to do here is confirm whether it was possible for the HA subsystem to restart the whole server, as I thought it only shut down the VMs. And yes, I'm using HA between the two Proxmox servers plus a corosync arbiter.
Thank you
 
Hi there,
Thank you for your reply, and I didn't mean to hijack the thread. I'll open another thread for this.

I am neither a (self-proclaimed) moderator nor the OP; I just personally get lost if I reply to threads like this, because the title is non-specific and they often group unrelated items.

The main thing I wanted to do here is confirm whether it was possible for the HA subsystem to restart the whole server, as I thought it only shut down the VMs.

The short answer is that HA can cause host reboots, but the logs would make it possible to confirm whether that was indeed the reason in that specific case.

And yes, I'm using HA between the two Proxmox servers plus a corosync arbiter.

No worries, post a link here to the other thread, so everyone interested gets a notification and finds it easily.

EDIT: You may want to read this separate piece explaining watchdog behaviour on PVE nodes:
https://forum.proxmox.com/threads/high-availability-watchdog-reboots.154580/
 